LoRA Training for Stable Diffusion 3.5



The full article can be found here: Stable Diffusion 3.5 Large Fine-tuning Tutorial

Images should be cropped into these aspect ratios:

(1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216), (1344, 768), (768, 1344), (1472, 704)

If you need help automatically pre-cropping your images, this is a lightweight, barebones [script](https://github.com/kasukanra/autogen_local_LLM/blob/main/detect_utils.py) I wrote to do it. It will find the best crop depending on:

1. Is there a human face in the image? If so, the crop is oriented around that region of the image.

2. If no human face is detected, the crop uses a saliency map, which detects the most interesting region of the image. The best crop is then extracted, centered on that region.
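To make the cropping idea concrete, here is a minimal sketch of the last step only: picking the closest target shape and cutting a crop around a focus point. It assumes the pairs in the list are (width, height), and it takes the focus point as an input rather than detecting a face or computing a saliency map. The names `closest_bucket` and `crop_box` are hypothetical and are not taken from the linked script, which may work differently.

```python
# Target shapes from this section, assumed here to be (width, height).
BUCKETS = [
    (1024, 1024), (1152, 896), (896, 1152), (1216, 832),
    (832, 1216), (1344, 768), (768, 1344), (1472, 704),
]


def closest_bucket(width: int, height: int) -> tuple[int, int]:
    """Return the bucket whose aspect ratio is nearest to the image's."""
    ratio = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))


def crop_box(width: int, height: int, focus_x: int, focus_y: int) -> tuple[int, int, int, int]:
    """Largest crop with the bucket's aspect ratio, centered on the focus
    point as far as the image borders allow. Returns (left, top, right, bottom)."""
    bucket_w, bucket_h = closest_bucket(width, height)
    target = bucket_w / bucket_h
    if width / height > target:
        # Image is too wide: keep full height, trim the width.
        crop_h = height
        crop_w = min(width, round(height * target))
    else:
        # Image is too tall (or already matches): keep full width, trim the height.
        crop_w = width
        crop_h = min(height, round(width / target))
    # Clamp so the crop never leaves the image.
    left = min(max(focus_x - crop_w // 2, 0), width - crop_w)
    top = min(max(focus_y - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h


if __name__ == "__main__":
    # A 1920x1080 image with the focus point (e.g. a detected face) near the left edge.
    print(closest_bucket(1920, 1080))
    print(crop_box(1920, 1080, focus_x=200, focus_y=540))
```

The resulting box would then be resized to the chosen bucket's dimensions.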

Here are some examples of what my captions look like:

k4s4, a close up portrait view of a young man with green eyes and short dark hair, looking at the viewer with a slight smile, visible ears, wearing a dark jacket, hair bangs, a green and orange background
k4s4, a rear view of a woman wearing a red hood and faded skirt holding a staff in each hand and steering a small boat with small white wings and large white sail towards a city with tall structures, blue sky with white clouds, cropped

If you don't have your own fine-tuning dataset, feel free to use this dataset of paintings by John Singer Sargent (downloaded from WikiArt and auto-captioned) or a synthetic pixel art dataset.

I’ll be showing results from several fine-tuned LoRA models trained on datasets of varying size, to show that the settings I chose generalize well enough to be a good starting point for LoRA fine-tuning.

repeats duplicates your images and captions (optionally rotating the images, changing the hue/saturation, etc.) to help generalize the style into the model and prevent overfitting. While SimpleTuner supports caption dropout (randomly dropping captions a specified percentage of the time), it doesn’t support shuffling tokens (tokens are kind of like words in the caption) as of this moment. You can, however, simulate the behavior of kohya’s sd-scripts, where tokens are shuffled while the first n tokens are kept in their starting positions. Doing so helps the model not get too fixated on extraneous tokens.

Steps calculation

Max training steps can be calculated based on a simple mathematical equation (for a single concept):

Max training steps = (Number of samples × Number of repeats / Batch size) × Epochs

There are four variables here:

  • Batch size: The number of samples processed in one iteration.

  • Number of samples: Total number of samples in your dataset.

  • Number of repeats: How many times you repeat the dataset within one epoch.

  • Epochs: The number of times the entire dataset is processed.

There are 476 images in the fantasy art dataset, on top of which come the 5 repeats from multidatabackend.json. I chose a train_batch_size of 6 for two reasons:

  1. This value would let me see the progress bar update every second or two.

  2. It’s large enough to process 6 samples in one iteration, which helps the model generalize during training.

If I wanted around 30 epochs, then the final calculation would be this:

(476 × 5 / 6) × 30 = 11,900

476 × 5 / 6 represents the number of steps per epoch, which is about 396.

As such, I rounded this value up to 400 for CHECKPOINTING_STEPS.
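The single-concept calculation can be sketched in a few lines. This assumes the formula is samples × repeats ÷ batch size × epochs, which reproduces the 396 and 11,900 figures quoted here. The function names are hypothetical, and integer arithmetic is used to avoid floating-point rounding surprises.

```python
def steps_per_epoch(num_samples: int, repeats: int, batch_size: int) -> int:
    """Whole optimizer steps needed to see every (repeated) sample once."""
    return (num_samples * repeats) // batch_size


def max_train_steps(num_samples: int, repeats: int, batch_size: int, epochs: int) -> int:
    """Total steps for the requested number of epochs (single concept)."""
    return (num_samples * repeats * epochs) // batch_size


if __name__ == "__main__":
    # Values from the fantasy art example: 476 images, 5 repeats, batch size 6, 30 epochs.
    print(steps_per_epoch(476, 5, 6))      # 396
    print(max_train_steps(476, 5, 6, 30))  # 11900
```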

⚠️ Although I calculated 11,900 for MAX_NUM_STEPS, I set it to 24,000 in the end. I wanted to see more samples from the LoRA training, so anything after the original 11,900 would give me a good gauge of whether I was overtraining or not. I simply doubled the total steps (11,900 × 2 = 23,800), then rounded up.

CHECKPOINTING_STEPS represents how often you want to save a model checkpoint. Setting it to 400 is pretty close to one epoch for me, so that seemed fine.

CHECKPOINTING_LIMIT is how many checkpoints you want to save before overwriting the earlier ones. In my case, I wanted to keep all of the checkpoints, so I set the limit to a high number like 60.
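The relationship between these three values is plain arithmetic, sketched below. The only assumption is that "rounded up" means rounding up to the next multiple of 1,000. The variable names are for illustration only.

```python
import math

calculated_steps = 11_900   # from the single-concept step calculation
checkpointing_steps = 400   # roughly one epoch (396 steps, rounded up)

# Double the calculated steps, then round up to the next multiple of 1,000 (assumed granularity).
doubled = calculated_steps * 2
max_num_steps = math.ceil(doubled / 1000) * 1000

# Number of checkpoints written over the whole run; the limit must be at
# least this large if no checkpoint should ever be overwritten.
total_checkpoints = max_num_steps // checkpointing_steps

print(doubled, max_num_steps, total_checkpoints)  # 23800 24000 60
```

With these numbers, a limit of 60 is exactly enough to keep every checkpoint.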

Multiple concepts

The above example is trained on a single concept with one unifying trigger word at the beginning: k4s4. However, if your dataset has multiple concepts/trigger words, then your step calculation could look something like this:

2 concepts [a, b]

Lastly, for learning rate, I set it to 1.5e-3 as any higher would cause the gradient to explode like so:

The other relevant settings are related to LoRA.

{
  "--lora_rank": 768,
  "--lora_alpha": 768,
  "--lora_type": "standard"
}

Personally, I received very satisfactory results using a higher LoRA rank and alpha. You can watch the more recent videos on my YouTube channel for a more precise heuristic breakdown of how image fidelity increases the higher you raise the LoRA rank (in my opinion).

Anyway, if you don’t have the VRAM, storage capacity, or time to go so high, you can choose a lower value such as 256 or 128.
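To get a feel for why rank drives storage and memory, here is a rough sketch of how a standard LoRA adapter grows with rank. It assumes the usual formulation, where a frozen weight of shape (d_out × d_in) gets two trainable matrices of shapes (rank × d_in) and (d_out × rank), scaled by alpha / rank. The layer width of 2048 is a hypothetical placeholder, not a dimension taken from Stable Diffusion 3.5.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one standard LoRA adapter:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update in the usual formulation."""
    return alpha / rank


if __name__ == "__main__":
    HYPOTHETICAL_WIDTH = 2048  # placeholder layer size, for illustration only
    for rank in (768, 256, 128):
        params = lora_extra_params(HYPOTHETICAL_WIDTH, HYPOTHETICAL_WIDTH, rank)
        # alpha is set equal to rank, as in the config above, so the scale stays 1.0
        print(rank, params, lora_scale(alpha=rank, rank=rank))
```

The added parameter count grows linearly with rank, so going from 768 to 128 cuts the adapter to one sixth of its size per adapted layer.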

As for lora_type, I’m just going with the tried and true standard. There is another option for the lycoris type of LoRA, but it’s still very experimental and not well explored. I have done a deep dive into lycoris myself, but I haven’t found the settings that produce acceptable results.

Custom config.json miscellaneous

There are some extra settings that you can change for quality of life.

{
  "--validation_prompt": "k4s4, a waist up view of a beautiful blonde woman, green eyes",
  "--validation_guidance": 7.5,
  "--validation_steps": 200,
  "--validation_num_inference_steps": 30,
  "--validation_negative_prompt": "blurry, cropped, ugly",
  "--validation_seed": 42,
  "--lr_scheduler": "cosine",
  "--lr_warmup_steps": 2400
}

These are pretty self-explanatory:

"--validation_prompt"

The prompt that you want to use to generate validation images. This is your positive prompt.

"--validation_negative_prompt"

Negative prompt.

"--validation_guidance"

Classifier free guidance (CFG) scale.

"--validation_num_inference_steps"

The number of sampling steps to use.

"--validation_seed"

Seed value when generating validation images.

"--lr_warmup_steps"

If you don’t set it, SimpleTuner defaults the warmup to 10% of the total training steps behind the scenes, and that’s a value I use often. So, I hard-coded it in (24,000 × 0.1 = 2,400). Feel free to change this.

"--validation_steps"

The frequency at which you want to generate validation images is set with "--validation_steps". I set mine to 200, which is half of 400 (the number of steps in an epoch for my fantasy art example dataset). This means that I generate a validation image every half epoch. I suggest generating validation images at least every half epoch as a sanity check. If you don’t, you might not catch errors as quickly as you otherwise could.

Lastly, there are "--lr_scheduler" and "--lr_warmup_steps".

I went with a cosine scheduler. This is what it will look like:
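As a rough numerical sketch of that shape, the function below assumes a linear warmup followed by a single cosine decay to zero. This is the common formulation, and SimpleTuner's actual scheduler may differ in detail. The function name is hypothetical; the defaults are the learning rate, warmup, and total steps used in this article.

```python
import math


def cosine_lr_with_warmup(
    step: int,
    base_lr: float = 1.5e-3,
    warmup_steps: int = 2_400,
    total_steps: int = 24_000,
) -> float:
    """Learning rate at `step`: linear ramp from 0 to base_lr during warmup,
    then half a cosine wave from base_lr down to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Sample the curve at the start, mid-warmup, peak, midpoint of decay, and end.
    for step in (0, 1_200, 2_400, 13_200, 24_000):
        print(f"step {step:>6}: lr = {cosine_lr_with_warmup(step):.6f}")
```

The learning rate climbs to its peak of 1.5e-3 at step 2,400, then falls smoothly and reaches zero at step 24,000.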

### Memory usage

If you aren’t training the text encoders (we aren’t), `SimpleTuner` saves us about `10.4 GB` of VRAM.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/316002db-297b-45a9-b919-cec6b311c773/image.png)

With the settings of `batch size` of `6` and a `lora rank/alpha` of `768`, the training consumes about `32 GB` of VRAM.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/c2aac70a-8c65-4f6f-b602-487f24de4bd2/image.png)

Understandably, this is out of the range of consumer `24 GB` VRAM GPUs. As such, I tried to decrease the memory costs by using a `batch size` of `1` and `lora rank/alpha` of `128`.

Tentatively, I was able to bring the cost down to around `19.65 GB` of VRAM.

However, when running inference for the validation prompts, it spikes up to around `23.37 GB` of VRAM.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/0c5240d6-6f71-404e-bea7-b18cc35ee5ad/image.png)

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/026be306-8331-45a2-9c02-541005f2cdfd/image.png)

To be safe, you might have to decrease the `lora rank/alpha` even further to `64`. If so, you’ll consume around `18.83 GB` of VRAM during training.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/5edcaaf9-bf0d-4db0-a183-cfab44963b8e/image.png)

During validation inference, it will go up to around `21.50 GB` of VRAM. This seems safe enough.

![image.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/4e8dae13-2612-4518-91a4-53485ccdba7c/bd41ce4e-a0db-443b-b3d2-63eac136779d/image.png)
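The measurements above can be collected into a small lookup to check which configuration fits a given GPU. This is only a sketch over the numbers reported in this section. The `headroom_gb` safety margin and the helper names are assumptions, and no validation peak was reported for the `batch size` `6` run, so only its training figure is used.

```python
# Approximate VRAM figures (GB) reported in this section.
MEASURED_RUNS = [
    {"batch_size": 6, "rank": 768, "train_gb": 32.0, "validation_gb": None},
    {"batch_size": 1, "rank": 128, "train_gb": 19.65, "validation_gb": 23.37},
    {"batch_size": 1, "rank": 64, "train_gb": 18.83, "validation_gb": 21.50},
]


def peak_gb(run: dict) -> float:
    """Highest reported usage for a run, across training and validation."""
    values = [v for v in (run["train_gb"], run["validation_gb"]) if v is not None]
    return max(values)


def runs_that_fit(vram_gb: float, headroom_gb: float = 1.0) -> list[dict]:
    """Runs whose reported peak, plus an assumed safety margin, fits in VRAM."""
    return [run for run in MEASURED_RUNS if peak_gb(run) + headroom_gb <= vram_gb]


if __name__ == "__main__":
    for run in runs_that_fit(24.0):
        print(f"batch size {run['batch_size']}, rank {run['rank']}: peak {peak_gb(run)} GB")
```

With a 1 GB margin on a `24 GB` card, only the rank `64` run passes, which matches the "to be safe" recommendation above.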

If you do decide to go with the higher-spec training with a `batch size` of `6` and `lora rank/alpha` of `768`, you can use the `DeepSpeed` config I provided [above](https://www.notion.so/Stable-Diffusion-3-5-Large-Fine-tuning-Tutorial-11a61cdcd1968027a15bdbd7c40be8c6?pvs=21) if your GPU VRAM is insufficient and you have enough CPU RAM.
