Training a LoRA using Tensor in 2026


Updated:

A LoRA (Low-Rank Adaptation) is a method of adding a small change to the model weights in order to re-create a set of images.
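To make that "small change to the weights" concrete, here is a minimal numpy sketch of the low-rank update idea. The names and sizes are illustrative only, not any trainer's actual API:

```python
import numpy as np

# Instead of updating a full weight matrix W, LoRA learns two small matrices
# A (r x d_in) and B (d_out x r) with rank r much smaller than d_in, d_out:
#     W' = W + scale * (B @ A)
d_in, d_out, rank = 768, 768, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))          # frozen base weights
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, rank))                 # zero-initialised: W' == W at the start

def lora_forward(x, scale=1.0):
    """Apply the base layer plus the low-rank adaptation."""
    return x @ W.T + scale * (x @ A.T @ B.T)

x = rng.normal(size=(1, d_in))
# With B all zeros the adapter contributes nothing, so output matches the base layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

The point of the sketch: the adapter `B @ A` has the same shape as `W` but only `r * (d_in + d_out)` trainable numbers, which is why LoRA files are so small compared to full checkpoints.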

Creating LoRAs has been around since the SD1.5 days; the models themselves, however, have improved since then.

The key change is the use of natural-language text encoders such as Qwen and T5.

Additionally, Tensor now allows captioning images using very capable LLMs. You can find the Gemini and Qwen image captioners in the LoRA trainer.

The custom prompt 'itemize the image', followed by an optional example prompt, breaks the image down into its components. This creates a very long image description (try it!), but one that still fits within the 512-token limit of the T5 or Qwen encoder.
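As a rough sanity check on caption length, you can count and clip tokens before training. Real T5/Qwen tokenizers split text into subwords, so the whitespace split below undercounts; this is an illustrative sketch, not the trainer's actual tokenizer:

```python
# Naive caption-length check against an assumed 512-token budget.
# A real subword tokenizer would report MORE tokens than a whitespace split,
# so treat this as an optimistic lower bound.
MAX_TOKENS = 512  # assumed encoder context limit from the article

def rough_token_count(caption: str) -> int:
    """Approximate token count by splitting on whitespace."""
    return len(caption.split())

def truncate_caption(caption: str, limit: int = MAX_TOKENS) -> str:
    """Keep only the first `limit` whitespace-separated tokens."""
    return " ".join(caption.split()[:limit])

caption = "1girl, red jacket, " + "detail " * 600   # deliberately over-long
short = truncate_caption(caption)                   # clipped to the budget
```

In practice you would run the encoder's own tokenizer to get exact counts; the idea is simply to verify an itemized caption fits the budget before training.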

//---//

Why are natural-language text encoders important for LoRA training? Recall from an earlier article that the prompt text you feed the model is a 'soundwave': a noise built up from the sum of sine waves at descending frequencies and given amplitudes.

The position a token (word) has in the written text sets the frequency of its sine wave.

The amplitude of the sine wave is the token (word) itself.

For the CLIP_L model used in SD1.5, each token is a set of 768 decimal values (float32) that make up the 'identity' of a word when training a model.

Consequently, you need 768 'soundwaves' to create a text encoding for CLIP_L, one for each dimension of the token vector.

For the natural-language models, the dimension is far higher than 768, and you need as many 'soundwaves', or 'sums of sine waves', to build the text encoding.
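This "one sine wave per dimension, descending frequencies" picture corresponds closely to the classic sinusoidal positional encoding from the original Transformer. A small numpy sketch, with the dimension set to 768 purely to match CLIP_L:

```python
import numpy as np

def sinusoidal_positions(n_tokens: int, dim: int = 768) -> np.ndarray:
    """Classic Transformer positional encoding: each token position is
    described by sine/cosine waves whose frequency falls as the dimension
    index rises -- one 'wave' per dimension, as in the soundwave picture."""
    pos = np.arange(n_tokens)[:, None]        # token positions, shape (n_tokens, 1)
    i = np.arange(dim // 2)[None, :]          # dimension pairs, shape (1, dim/2)
    freq = 1.0 / (10000 ** (2 * i / dim))     # descending frequencies
    angles = pos * freq
    pe = np.zeros((n_tokens, dim))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

pe = sinusoidal_positions(77, 768)  # 77 tokens, CLIP_L-sized 768 dimensions
```

Note this encodes only the *position* part of the metaphor; the token's own embedding (the 'amplitude') is added on top of it in a real text encoder.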

For simplicity, look at a single dimension and assume the text encoding is just a single soundwave. Imagine a noise plotted on a graph; that's your prompt.

For natural-language text encoders, which are trained on an extensive amount of written text, there is no limit to how you can write your 'request' for an image.

With an itemized caption, your LoRA training can train on many aspects of the image at once.

The longer the prompt within the 512-token encoding, the broader the range of frequencies in your 'sound', and the 'harder' you can train the LoRA.

At the end of the day, LoRA training is about associating text with the image. From what we have learned so far, the text you train on is not a 'single thing' but a 'sum of frequencies' of concepts that train independently from one another.

This is similar to how a soundwave that is 'rock music' in the lower frequency spectrum can play 'synthwave' in the higher frequency spectrum. Being at different frequencies, the concepts in your prompt exist separately from one another.

//---//

But if a text encoding can hold concepts in parallel, is it possible to do the same with an image?

Yes. By creating 'image collages' of patterns, you can train multiple concepts at the same time.

Example of a LoRA training frame. The background is #181818 gray, the exact same color as the background in the Tensor gallery. This creates a cool optical illusion in the final result.
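A collage frame like this can be mocked up programmatically. A minimal numpy sketch, where the patch content and placement are made up for illustration (#181818 is 24, 24, 24 in RGB):

```python
import numpy as np

# 1024x1024 canvas filled with the #181818 gray recommended above.
canvas = np.full((1024, 1024, 3), 0x18, dtype=np.uint8)

def paste(canvas: np.ndarray, patch: np.ndarray, top: int, left: int) -> np.ndarray:
    """Copy a smaller image patch onto the canvas at (top, left)."""
    h, w = patch.shape[:2]
    canvas[top:top + h, left:left + w] = patch
    return canvas

# Stand-in for a real image crop; in practice you would load cropped photos.
patch = np.full((256, 256, 3), 200, dtype=np.uint8)
paste(canvas, patch, 100, 100)
```

An image editor works just as well, of course; the point is only that the background stays exactly #181818 and each pattern occupies a fixed region of the 1024x1024 frame.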

Key points when creating collage training frames:

1) The size of an image pattern relative to the width/height of the 1024x1024 frame is fixed! The AI model cannot enlarge or shrink a pattern that it trains on in LoRA training.

Therefore, you should scale all body images so that the 'head' and 'feet' fill the full 1024px image height for 'full body' training.
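The scaling rule is simple arithmetic. A small helper, purely illustrative, that computes the size a crop should be resized to so its height fills the 1024px frame while keeping the aspect ratio:

```python
# Scale a subject crop so the head-to-feet span fills the frame height.
FRAME = 1024  # frame size assumed from the article

def scale_to_height(width: int, height: int, target: int = FRAME) -> tuple:
    """Return a new (width, height) with height == target, aspect ratio kept."""
    s = target / height
    return round(width * s), target

# e.g. a 600x1500 standing-figure photo becomes 410x1024
print(scale_to_height(600, 1500))
```

For the half-height character patterns mentioned later, you would scale to `target=FRAME // 2` instead, so heads and bodies stay a consistent size across training frames.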

2) The checkpoint model itself is not a 'single thing'. It is divided into layers, like stations at a car factory, each layer responsible for building up the 'car' step by step until the finished product (image) rolls out at the end.

One important aspect in creating this 'car' in the checkpoint model is what I call 'the shape layer'.

Concept artists will tell you that what makes any design stand out is the silhouette, and for LoRA training this holds true as well.

First and foremost, you should choose your LoRA training material based on subjects that have a 'cool shape'.

The reason for this is that AI models have been found to gobble up 'low-frequency noise', the 'stuff within shapes', very easily in training.

If you have trained your AI model to create a shape, chances are it already has plenty of capability to make that shape look cool.

Therefore, focus on the edges, outlines, and color differences that train shapes in your LoRA training.

3) Training text. This is often overlooked in LoRA training. Adding small labels of nonsense text, captioned as such, keeps the model's ability to generate fine print and the like at no extra cost in space. You can find fun text labels on Pinterest.

Example of a training image with a text label. Note that the girl is intentionally scaled to roughly one half the height of the 1024x1024 image; keeping a consistent pattern size for heads, hands, and bodies is very important.

Additionally, one can make an image appear large and voluminous by adding depth training into the image, as shown above. For extra effect, add subjects into the foreground, blurred out slightly.

This blur effect can be created using an Edit model.

Example of adding a blurred-out foreground shot to create the illusion of depth in a training image. Note how this immediately adds more space and life to the scene.
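If you want to approximate the effect without an Edit model, even a crude box blur shows the idea. A self-contained numpy sketch (the kernel size is an arbitrary choice, and a real editor's Gaussian blur will look smoother):

```python
import numpy as np

def box_blur(img: np.ndarray, k: int = 9) -> np.ndarray:
    """Blur a 2-D grayscale image by averaging over a k x k window."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    # Sum each shifted copy of the image, then divide by the window area.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0      # sharp square standing in for a foreground subject
soft = box_blur(img)         # edges spread outward, reading as out-of-focus
```

Applied only to the foreground layer before compositing, this reproduces the shallow depth-of-field look that makes the scene feel larger.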
