LyCORIS adapters are LoRA-like matrix-decomposition adapters that modify the cross-attention layers of the UNet. In the same spirit of maximizing matrix rank while minimizing parameter count, LoKr (low-rank adaptation with Kronecker product) is another viable option that employs Kronecker products for the matrix decomposition. The appeal of Kronecker products lies in the multiplicative nature of their ranks, which allows the adapter to move beyond the limitations of the low-rank assumption.
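As a rough sketch (not the LyCORIS implementation itself), the LoKr idea is that the weight update is a Kronecker product of two small factors, and the rank of a Kronecker product is the product of the factors' ranks; the dimensions below are illustrative:

```python
import torch

# Illustrative LoKr-style update: dW = w1 (a x b) kron w2 (c x d) -> (a*c, b*d).
# rank(dW) = rank(w1) * rank(w2), so ranks multiply instead of being capped
# at a single small LoRA rank.
out_dim, in_dim = 320, 768           # example cross-attention projection shape
factor = 8                           # matches the "Factor 8" setting used below

w1 = torch.randn(factor, factor) * 0.01                        # small factor (8 x 8)
w2 = torch.randn(out_dim // factor, in_dim // factor) * 0.01   # large factor (40 x 96)

delta_w = torch.kron(w1, w2)         # full-size update, shape (320, 768)
print(delta_w.shape, torch.linalg.matrix_rank(delta_w))  # rank up to 8 * 40 = 320
```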
This project focuses primarily on unknown-concept fine-tuning with a small dataset (130 images) that spans 3DCG and digital illustration compositions. Contrary to most guides, the alpha version studies a no-captioning training scheme.
Hypothesis
Only a set of unlabeled images in the general domain is required to train a text-to-image generator. First, the embedding of each image in the unified language-vision embedding space is extracted with the CLIP image encoder. Next, the image is converted into a sequence of discrete tokens. Once training is complete, the model can generate coherent image tokens from the text embedding produced by the CLIP text encoder for an input prompt. The dataset also includes a regularization set composed of 90 composite image grids.
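As a sketch of the first step (embedding each unlabeled image into CLIP's joint language-vision space) using the Hugging Face transformers CLIP classes; the checkpoint name and file path are only examples:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; swap in the CLIP encoder that matches your base model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dataset/3dcg_sample.png")          # hypothetical dataset image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_emb = model.get_image_features(**inputs)     # embedding in the joint space

image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)  # unit-normalize
```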
Experimental details, alpha version:
Uncaptioned (increasing) dataset,
Dim 3, Alpha 2, Factor 8
TE LR: 0.00004, UNet LR: 0.0001
Experimental details for beta version:
Uncaptioned (static) dataset,
Dim 3, Alpha 2, Factor 8
TE LR: 0.00002, UNet LR: 0.00005
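For reference, the two runs expressed as a kohya-style LyCORIS (LoKr) configuration sketch; the key names follow kohya-ss sd-scripts / LyCORIS conventions but should be verified against the versions you have installed:

```python
# Hyperparameters from the alpha and beta runs above, as a config sketch.
alpha_run = {
    "network_module": "lycoris.kohya",          # LyCORIS adapter backend
    "network_dim": 3,                            # Dim 3
    "network_alpha": 2,                          # Alpha 2
    "network_args": ["algo=lokr", "factor=8"],   # LoKr with Kronecker factor 8
    "text_encoder_lr": 4e-5,                     # TE LR 0.00004
    "unet_lr": 1e-4,                             # UNet LR 0.0001
}

beta_run = {
    **alpha_run,
    "text_encoder_lr": 2e-5,                     # TE LR 0.00002
    "unet_lr": 5e-5,                             # UNet LR 0.00005
}
```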
This project is inspired by the following works:
Findings
The phrase "A 3DCG digital illustration masterpiece" somehow improves the output; I don't know why.
FAQ
What is this?
An endeavor to improve AI-generated 3DCG imagery; hence, this project focuses on CGI and digital illustration prompts.
Should you use the latest version?
No. The latest version is just a continuation of the training; that doesn't mean it is better, only that it is a more trained version kept for research purposes.
What parameters should you use?
Stable version: alpha ALT {v1.0 - v0.5}, optimal {v0.9 - v0.7}
ㅤ🟢 weight = 0.35 - 0.45
ㅤ🟡 weight = 0.45 - 0.75
ㅤ🔴 weight = 0.75 +
ㅤ⚠️ guidance = 3.0 - 3.5
note: less trained versions may support higher weights.
Stable version: alpha {v1.0 - v0.5}, optimal {v0.9 - v0.7}
ㅤ🟢 weight = 0.45 - 0.55
ㅤ🟡 weight = 0.60 - 0.70
ㅤ🔴 weight = 0.70 +
ㅤ⚠️ guidance = 3.0 - 3.5
note: less trained versions may support higher weights.
🚧◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣◥◣🚧
1. Input Layer (Prompt Input)
The input prompt (which could be a text description, image features, or other data) is first processed by an embedding layer or some encoding mechanism to convert the prompt into a numerical representation that the network can understand. This happens at the stage where we see img, timestamps, grid_index, and potentially PE (Positional Encoding).
These inputs could represent different aspects of the data:
img: Could be image features or pixel data if the input is partially an image.
timestamps and grid_index: These might encode spatial or temporal information relevant to how the prompt aligns with the data.
PE (Positional Encoding): Common in transformer architectures, this helps the model understand where in the sequence (or the image space) certain features lie.
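To make the PE input concrete, here is a minimal standard sinusoidal positional-encoding sketch (the diagram's exact PE variant may differ):

```python
import math
import torch

def sinusoidal_pe(num_positions: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape (num_positions, dim)."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# e.g. a 32x32 grid of image tokens flattened into a sequence of 1024
tokens = torch.randn(1, 32 * 32, 256)
tokens = tokens + sinusoidal_pe(32 * 32, 256).unsqueeze(0)  # inject position info
```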
2. Embedding and Modulation (MLP Emb & Modulation)
The MLP Emb. block transforms the encoded prompt into a higher-dimensional space, making it easier for the network to extract useful features from the data. This embedding step could take the prompt’s semantics and break it down into numerical representations of abstract features (e.g., “a red sunset” might break down into numerical vectors representing “color,” “time of day,” etc.).
Modulation adjusts the image feature maps or embeddings dynamically based on the input prompt. This helps the network apply specific features or adjustments, such as style, color, or texture, in alignment with the user prompt. It ensures that the network is conditioned on the input prompt as it processes the data.
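A minimal sketch of this kind of prompt-conditioned modulation (an AdaLN/FiLM-style scale and shift predicted from the conditioning embedding; all names here are illustrative):

```python
import torch
import torch.nn as nn

class Modulation(nn.Module):
    """Predict a scale and shift from the conditioning embedding and apply
    them to normalized features (FiLM / AdaLN-style sketch)."""
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # broadcast (batch, feat_dim) over the token dimension of x
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

mod = Modulation(cond_dim=768, feat_dim=256)
x = torch.randn(2, 1024, 256)        # image tokens
cond = torch.randn(2, 768)           # pooled prompt embedding
out = mod(x, cond)                   # prompt-conditioned features, same shape
```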
3. DoubleStream Block (Early Feature Extraction)
The DoubleStream Block likely handles initial feature extraction and processing. It may be responsible for processing different aspects of the image or prompt in parallel streams.
One stream might focus on spatial features (e.g., edges, shapes), while the other might focus on texture or color.
These streams help the model capture multiple facets of the input data simultaneously, which is useful for generating coherent and detailed images from a prompt.
After this block, there’s a Cat (concatenation) operation, which fuses the outputs of the double streams, bringing together all the extracted features for further processing.
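A minimal sketch of the two-parallel-streams-plus-Cat structure (the real DoubleStream block is more involved; this only illustrates the fusion pattern):

```python
import torch
import torch.nn as nn

class DoubleStreamSketch(nn.Module):
    """Two parallel feature streams fused by concatenation (Cat)."""
    def __init__(self, dim: int):
        super().__init__()
        self.stream_a = nn.Sequential(nn.Linear(dim, dim), nn.GELU())  # e.g. spatial path
        self.stream_b = nn.Sequential(nn.Linear(dim, dim), nn.GELU())  # e.g. texture/color path
        self.fuse = nn.Linear(2 * dim, dim)                            # applied after Cat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.stream_a(x), self.stream_b(x)
        return self.fuse(torch.cat([a, b], dim=-1))   # Cat, then project back

out = DoubleStreamSketch(dim=256)(torch.randn(2, 1024, 256))  # same shape as input
```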
4. SingleStream Block (Detailed Feature Processing)
The SingleStream Block refines and processes the combined features. By now, the network has an intermediate representation of the prompt and its related features, and this block helps smooth out inconsistencies or add more nuance to the data.
The Conv1D layers within the SingleStream Block suggest that sequential or spatial information is being processed. For instance, in image generation, this could correspond to generating finer details along pixel sequences (or in time series, if applicable).
LayerNorm and GELU (Gaussian Error Linear Unit) ensure stability and efficiency during training, helping the network learn better representations without becoming unstable.
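A sketch of a SingleStream-style refinement step built from the named components (LayerNorm, Conv1D over the token sequence, GELU); the ordering and the residual connection are assumptions:

```python
import torch
import torch.nn as nn

class SingleStreamSketch(nn.Module):
    """LayerNorm -> Conv1D along the token sequence -> GELU, with a residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        h = self.norm(x).transpose(1, 2)    # Conv1d expects (batch, dim, seq)
        h = self.act(self.conv(h)).transpose(1, 2)
        return x + h                        # residual keeps training stable

out = SingleStreamSketch(dim=256)(torch.randn(2, 1024, 256))
```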
5. Attention Mechanism (QKNorm and CA Blocks)
The QKNorm block indicates that some form of attention mechanism might be at play here. In typical models like transformers, Q-K-V (Query-Key-Value) attention is used to focus the model on important parts of the input while ignoring irrelevant details.
For image generation, this could mean paying special attention to certain parts of the image that are highly relevant to the prompt. For example, if the prompt is “a red car in a green field,” the network might attend to the car and the field more than the sky.
CA blocks (most likely cross-attention) in this context could be combining information from different stages or attention heads, allowing the network to integrate insights from multiple parts of the image.
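A sketch of attention with normalized queries and keys (the QKNorm idea: normalizing Q and K keeps the attention logits in a stable range); the fixed temperature here stands in for what is often a learnable scale:

```python
import torch
import torch.nn.functional as F

def qknorm_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with L2-normalized queries and keys (QKNorm sketch)."""
    q = F.normalize(q, dim=-1)          # unit-length queries
    k = F.normalize(k, dim=-1)          # unit-length keys
    scale = q.shape[-1] ** 0.5          # fixed temperature; often learnable in practice
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

q = k = v = torch.randn(2, 8, 1024, 64)   # (batch, heads, tokens, head_dim)
out = qknorm_attention(q, k, v)
```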
6. Final Layers (LastLayer)
The LastLayer is where all the processed information is aggregated and passed through final transformations to produce the final image.
After the detailed features have been refined, combined, and processed by the earlier blocks, the output might pass through fully connected layers or another type of decoder that translates these processed features back into the pixel space, creating the final image based on the input prompt.
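As a sketch, such a last layer can be a normalization followed by a linear projection from the model dimension back to per-patch output values; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class LastLayerSketch(nn.Module):
    """Project refined tokens back to per-patch output values."""
    def __init__(self, dim: int, patch_size: int, out_channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, patch_size * patch_size * out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (batch, tokens, dim)
        return self.proj(self.norm(x))                    # (batch, tokens, p*p*c)

head = LastLayerSketch(dim=256, patch_size=2, out_channels=4)
patch_out = head(torch.randn(2, 1024, 256))   # per-patch latent values
```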
7. Output (Generated Image)
The network outputs an image that reflects the prompt given. If this architecture is designed for image generation, the layers transform the abstract features (derived from the input prompt) into a coherent and detailed image.
Depending on the task, this final output could be high-dimensional (like an image matrix) or could involve probabilities and further post-processing to map latent features back to the pixel space.
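A sketch of that post-processing: rearranging per-patch outputs back into a 2-D latent grid (an "unpatchify" step), which a VAE or decoder would then map to pixel space; shapes are illustrative:

```python
import torch

def unpatchify(patch_out: torch.Tensor, grid: int, patch: int, channels: int) -> torch.Tensor:
    """Rearrange (batch, grid*grid, patch*patch*channels) tokens into (batch, channels, H, W)."""
    b = patch_out.shape[0]
    x = patch_out.reshape(b, grid, grid, patch, patch, channels)
    x = x.permute(0, 5, 1, 3, 2, 4)          # (b, c, grid_h, patch_h, grid_w, patch_w)
    return x.reshape(b, channels, grid * patch, grid * patch)

latent = unpatchify(torch.randn(2, 32 * 32, 2 * 2 * 4), grid=32, patch=2, channels=4)
print(latent.shape)   # torch.Size([2, 4, 64, 64]); a VAE decoder maps this to pixels
```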