Cross Attention - (Work in progress)
Revisiting the earlier article on what prompts actually are: https://tensor.art/articles/913912699476253681

We know that the prompt is a 'sound' in which concepts are overlaid like sine waves. Its amplitude is decided by the token word, drawn from a vocabulary of ~50K items: https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/tokenizer/vocab.json
And its frequency is decided by the token's position in the prompt.

Just as two very distinct sounds can be overlaid into one sound, two very different prompts can be overlaid into one using the very generous 512 positions available in the T5 encoder.

//--//

Did you know that each word in your prompt is automatically assigned a place in the image?

The 'sound wave' text encoding is converted into Key, Query and Value matrices, where:

Query Matrix = the image generated thus far, 'the pixels we have painted in the image from earlier steps'.
When training an image model, an existing image (the Query matrix) has noise added to it, and the AI model is trained to 'fill in the gaps' using a written prompt. Noise is added during training until the AI model can create an image from a written text prompt using nothing but 100% noise in its Query matrix (image) as its starting point.

Key Matrix = X_desired, 'the thing to generate', calculated as

X_desired = CFG * x_conditional + (1 - CFG) * x_unconditional

where
x_unconditional = improvisation by the model, 'guess the gaps of this image based on adjacent pixels in the image generated thus far',
CFG is a decimal number, and
x_conditional = (1 + Guidance) * x_positive - Guidance * x_negative
where
Guidance is a decimal number,
x_positive = the 'sound wave' text encoding of the written prompt, and
x_negative = the 'sound wave' text encoding of the written negative prompt.

Value Matrix = 'the area to paint on the image at this given step'.
This matrix assigns a probability to each section of the image, which can be visualized as a heat map for a given input image. (A short code sketch of these formulas and of the attention step follows the list below.)

Using cross attention knowledge in prompts:

1) When prompting concepts, keep in mind where in the image each concept applies, using this prior knowledge of cross attention.

2) Forcing the same cross attention placement, e.g. 'the middle is green. the middle is a curtain. the middle is a high resolution photo.', is an effective way of blending concepts together.

3) Alternating the placement of foreign concepts, e.g. 'this is a single photo. the left is a cgi render of samus aran from metroid. the right is a real photo of a football stadium full of people. the left is blah blah. the right is blub blub', can be an effective way to blend highly foreign concepts into a single image by assigning them separate placements in the prompt.
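To make the above concrete, here is a minimal sketch in NumPy of the two guidance formulas and of a scaled dot-product cross attention step in which the image queries attend over the text-token keys and values. All names, shapes and default values are illustrative assumptions for this article, not the internal implementation of FLUX or Chroma.

```python
import numpy as np

def combine_guidance(x_positive, x_negative, x_unconditional, guidance=4.0, cfg=1.0):
    """Combine the encodings/predictions using the two formulas above."""
    # x_conditional = (1 + Guidance) * x_positive - Guidance * x_negative
    x_conditional = (1.0 + guidance) * x_positive - guidance * x_negative
    # X_desired = CFG * x_conditional + (1 - CFG) * x_unconditional
    return cfg * x_conditional + (1.0 - cfg) * x_unconditional

def cross_attention(image_queries, text_keys, text_values):
    """Scaled dot-product attention: each image patch attends over the prompt tokens.

    image_queries: (num_patches, d)  projections of the image generated thus far
    text_keys:     (num_tokens, d)   projections of the text encoding
    text_values:   (num_tokens, d)   projections of the text encoding
    """
    d = image_queries.shape[-1]
    scores = image_queries @ text_keys.T / np.sqrt(d)   # (patches, tokens)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the tokens
    attended = weights @ text_values                    # (patches, d)
    return attended, weights

# Toy usage: 64 image patches, 512 T5 token positions, 16-dim projections.
rng = np.random.default_rng(0)
q = rng.standard_normal((64, 16))
k = rng.standard_normal((512, 16))
v = rng.standard_normal((512, 16))
out, heat = cross_attention(q, k, v)  # heat[:, t] = where token t lands in the image
```

Reshaping one token's column of `heat` back onto the latent grid gives a per-token heat map, i.e. where in the image that word is being applied at the current step.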
About the Chroma Modulation Layer:

Chroma is a unique model in that it can faithfully recreate photoreal, anime, furry, video footage and 3DCG concepts alike. Chroma is based on the FLUX Schnell model, a 12B parameter model of which 3.3B parameters make up the modulation layer. The modulation layer handles prompt adherence in the text-to-image model. In Chroma, the modulation layer is approximated down, making Chroma a ~9B parameter model in contrast to the original 12B parameter model.

Benefits:
Smaller size means a lower VRAM cost when running Chroma compared to FLUX.
Smaller size means Chroma can learn concepts more easily in training than base FLUX.

Drawbacks:
Replacing the modulation layer in the FLUX.1 Schnell AI model with pruned versions, such as in distilled variants like Chroma, can introduce several drawbacks:

1. Reduced Prompt Adherence: The modulation layer, with its ~3.3 billion parameters, is critical for integrating text prompt information (via T5 and CLIP embeddings) into the diffusion process. Pruning it can weaken the model's ability to accurately interpret and reflect complex or nuanced prompts, leading to outputs that may deviate from the intended description.

2. Loss of Output Quality: While pruning preserves much of the visual fidelity, there can be a noticeable drop in fine-grained detail, color accuracy, or overall coherence, especially for intricate or stylistically specific generations. The full modulation layer enables richer feature modulation, which pruned versions may struggle to replicate.

3. Limited Generalization: Pruned models are often optimized for specific use cases or hardware constraints. This can reduce their versatility across diverse prompts or artistic styles compared to the original FLUX Schnell, which leverages the full modulation layer for broader applicability.

4. Potential Artifacts: Aggressive pruning may introduce visual artifacts or inconsistencies, such as unnatural textures or distortions, particularly in edge cases where the model encounters prompts outside its optimized scope.

5. Trade-off in Robustness: The distillation process (e.g., latent adversarial diffusion) paired with pruning can make the model less robust to variations in input conditions or noise schedules, potentially affecting stability during generation, especially in low-step (1–4 step) inference.

While pruned versions like Chroma (with ~8.8–8.9 billion parameters) offer faster inference and lower resource demands, these benefits come at the cost of some degradation in the model's ability to handle complex prompts and maintain top-tier output quality. The exact impact depends on the pruning method and the specific use case.

(to be continued later)