TensorArt Studios applied DMD2 distillation on top of Animagine XL 3.1. This approach generates a high-quality anime image in about one second using only four denoising steps, significantly improving generation efficiency, and the result can be stacked with other LoRAs. Each image currently consumes only 0.2 computing power, helping you achieve freedom in generating images (please use the recommended generation parameters below).
lora strength: 0.8~1
sampler: lcm
scheduler: simple
steps: 4~5
cfg: 1.0
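The parameters above map onto a diffusers pipeline roughly as follows. This is a minimal sketch, not the card's official usage: the checkpoint id and LoRA path are placeholders, and it assumes the release can be loaded as an SDXL checkpoint plus LoRA with an LCM-style scheduler.

```python
# Sketch only: the ids below are placeholders, not real release names.
RECOMMENDED = {
    "num_inference_steps": 4,   # steps: 4~5
    "guidance_scale": 1.0,      # cfg: 1.0
}

def build_pipeline(base="animagine-xl-3.1",                 # placeholder checkpoint id
                   dmd2_lora="dmd2-distill.safetensors"):   # placeholder LoRA path
    """Assemble an SDXL pipeline with an LCM sampler for few-step inference."""
    import torch
    from diffusers import StableDiffusionXLPipeline, LCMScheduler

    pipe = StableDiffusionXLPipeline.from_pretrained(base, torch_dtype=torch.float16)
    # "sampler: lcm" in ComfyUI corresponds to an LCM-style scheduler here.
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    # "lora strength: 0.8~1"
    pipe.load_lora_weights(dmd2_lora)
    pipe.fuse_lora(lora_scale=0.9)
    return pipe
```

Generation would then call `pipe(prompt, **RECOMMENDED)`.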
----------------------------------------------------------------------------------------------------------
Based on the open-source project DMD2 (https://github.com/tianweiy/DMD2, NeurIPS 2024). During training, it no longer requires that a single noise sample map to one particular image; instead, a batch of noise samples generates a batch of images, and training shrinks the gap (KL divergence) between the teacher's and the student's output distributions. The generator is updated using the difference between a real score function and a fake score function, so that the generated images are both realistic and diverse.
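As a toy illustration of the distribution-matching update (a sketch with hand-picked Gaussian scores, not the DMD2 code): for a 1-D teacher distribution N(2, 1) and a student currently producing N(0, 1), the real-minus-fake score difference pulls every sample in the batch toward the teacher.

```python
import numpy as np

def dmd_generator_grad(x_fake, real_score, fake_score):
    """Distribution-matching direction: the generator is pushed along
    s_real(x) - s_fake(x), which in expectation descends the KL divergence
    between the student's output distribution and the teacher's."""
    return real_score(x_fake) - fake_score(x_fake)

rng = np.random.default_rng(0)
real_score = lambda x: (2.0 - x)   # score of the teacher N(2, 1)
fake_score = lambda x: (0.0 - x)   # score of the current student N(0, 1)
batch = rng.normal(0.0, 1.0, size=(64,))   # a batch of student samples
grad = dmd_generator_grad(batch, real_score, fake_score)
# Every sample is pulled toward the teacher mean: the direction is +2 everywhere.
```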
Total training cost: 3× L20 for 3 days, or 7× RTX 4080S-32G. For compute or training support, please contact the TensorArt team.
We made the following main improvements on top of the original implementation:
Switched the optimizer from Adam to 8-bit Adam (adam8bit), greatly reducing the original experiment's VRAM footprint and making training feasible.
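A rough back-of-the-envelope for why this matters (the parameter count is approximate, and 8-bit Adam as implemented e.g. in bitsandbytes' `Adam8bit` quantizes both moment buffers to 8 bits):

```python
def adam_state_bytes(n_params, bits_per_moment=32):
    """Adam keeps two moment buffers per parameter (m and v)."""
    return 2 * n_params * bits_per_moment // 8

sdxl_unet = 2_600_000_000                     # ~2.6B UNet parameters, approximate
fp32_state = adam_state_bytes(sdxl_unet, 32)  # ~20.8 GB of optimizer state
int8_state = adam_state_bytes(sdxl_unet, 8)   # ~5.2 GB, a 4x reduction
```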
Expanded the range of trainable resolutions. Originally, the input image resolution was limited to 1024x1024; we modified the relevant code to encode the crop information and timestep and pass them to the model, enabling training on images with more varied resolutions.
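SDXL-style size/crop micro-conditioning embeds each conditioning scalar with the same sinusoidal basis used for the timestep; a sketch (the dimension and value layout here are illustrative, not the exact training code):

```python
import numpy as np

def sinusoidal_embed(values, dim=256):
    """Embed each scalar conditioning value (sizes, crop offsets) with the
    usual sin/cos frequency basis, as in SDXL's size/crop micro-conditioning."""
    values = np.asarray(values, dtype=np.float64)
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = values[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)  # (n, dim)

# (orig_h, orig_w, crop_top, crop_left, target_h, target_w) for one sample,
# flattened into a single conditioning vector fed alongside the timestep.
cond = sinusoidal_embed([832, 1216, 0, 0, 832, 1216]).reshape(-1)
```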
Adopted a new unsupervised aesthetic scorer: CLIP-IQA (https://github.com/IceClear/CLIP-IQA, AAAI 2023) scores are used to filter out low-quality images. CLIP-IQA judges image quality by contrasting each image against paired positive/negative descriptive prompts such as "high quality" / "low quality" using a pre-trained CLIP model.
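The antonym-prompt scoring can be sketched with random stand-in features (real CLIP-IQA uses CLIP's image and text encoders; the temperature and prompts here are illustrative):

```python
import numpy as np

def clip_iqa_score(img_feat, good_feat, bad_feat, temperature=100.0):
    """CLIP-IQA-style antonym scoring: softmax over the cosine similarities to
    a positive/negative prompt pair; the positive probability in [0, 1] is
    treated as the quality score."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = temperature * np.array([cos(img_feat, good_feat),
                                     cos(img_feat, bad_feat)])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[0]

rng = np.random.default_rng(0)
g, b = rng.normal(size=512), rng.normal(size=512)   # stand-in text features
img = g + 0.1 * rng.normal(size=512)                # image close to "good"
score = clip_iqa_score(img, g, b)
# Images scoring below a chosen threshold would be dropped from training.
```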
Rebuilt the dataloader. Data loading previously used LMDB, a memory-mapped key-value store; images first had to be converted to latents with the VAE, a cumbersome process with a large storage footprint. After switching to the webdataset library, VAE encoding is deferred and samples are loaded lazily, which removes the preprocessing step, in principle scales to much larger datasets, and speeds up training from 12.5 to 8.5 s/iteration. During loading, the data is also filtered against a negative-label set ["bad", "nsfw", "text", "negative", ...], discarding samples unsuitable for training.
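The tag filter itself is a simple predicate; a sketch (tag names follow the card's list, with the "..." standing for further labels; in webdataset such a predicate could be applied via the pipeline's `select` stage):

```python
NEGATIVE_TAGS = {"bad", "nsfw", "text", "negative"}  # subset of the card's list

def keep_sample(tags, negative=NEGATIVE_TAGS):
    """Drop any sample whose tag set intersects the negative-label blacklist."""
    return not (set(tags) & set(negative))

dataset = [
    {"tags": ["1girl", "smile"]},
    {"tags": ["text", "watermark"]},
]
kept = [s for s in dataset if keep_sample(s["tags"])]  # keeps only the first
```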
Staged (segmented) training, keeping the learning rate moderate at each stage to prevent the model from overfitting quickly.
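Staged training can be expressed as a piecewise-constant schedule; a sketch (the stage boundaries and learning rates below are illustrative, not the actual training values):

```python
def staged_lr(step, stages=((2_000, 1e-5), (6_000, 5e-6), (float("inf"), 1e-6))):
    """Piecewise-constant schedule: train in segments, lowering the learning
    rate each segment so the distilled student does not overfit early."""
    for boundary, lr in stages:
        if step < boundary:
            return lr
    return stages[-1][1]
```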
The prompt dataset has grown to 400k entries, with prompts over 2,000 words long added; the paired text-image dataset now totals 100k pairs.