Cross Attention - (Work in progress)
Revisiting the earlier article on what prompts actually are: https://tensor.art/articles/913912699476253681

We know that the prompt is a 'sound' in which concepts are overlaid like sine waves. Its amplitude is decided by the token word, drawn from a vocabulary of ~50K items: https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/tokenizer/vocab.json
And its frequency is decided by the token's position in the prompt.

Just as two very distinct sounds can be overlaid into one sound, two very different prompts can be overlaid into one using the very generous 512 positions available in the T5 encoder.

//--//

Did you know that each word in your prompt is automatically assigned a place in the image?

The 'sound wave' text encoding is converted into Key, Query and Value matrices, where:

Query Matrix = the image generated thus far, 'the pixels we have painted in the image from earlier steps'.
When training an image model, an existing image (the Query matrix) has noise added to it, and the AI model is trained to 'fill in the gaps' using a written prompt. Noise is added during training until the AI model can create an image from a written text prompt using nothing but 100% noise in its Query matrix (image) as its starting point.

Key Matrix = X_desired, 'the thing to generate', calculated as

X_desired = CFG * x_conditional + (1 - CFG) * x_unconditional

where
x_unconditional = improvisation by the model, 'guess the gaps of this image based on adjacent pixels in the image generated thus far',
CFG is a decimal number, and
x_conditional = (1 + Guidance) * x_positive - Guidance * x_negative
where
Guidance is a decimal number,
x_positive = the 'sound wave' text encoding of the written prompt, and
x_negative = the 'sound wave' text encoding of the written negative prompt.

Value Matrix = 'the area to paint on the image at this given step'.
This matrix assigns a probability to each section of the image, which can be visualized as a heat map for a given input image. (A short code sketch of these formulas and of the attention step follows the list below.)

Using cross attention knowledge in prompts:

1) When prompting concepts, keep in mind where in the image each concept applies, using this prior knowledge of cross attention.

2) Forcing the same cross attention placement, e.g. 'the middle is green. the middle is a curtain. the middle is a high resolution photo.', is an effective way of blending concepts together.

3) Alternating the placement of foreign concepts, e.g. 'this is a single photo. the left is a cgi render of samus aran from metroid. the right is a real photo of a football stadium full of people. the left is blah blah. the right is blub blub', can be an effective way to blend highly foreign concepts into a single image by assigning them separate placements in the prompt.
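To make the above concrete, here is a minimal sketch in NumPy of the two guidance formulas and of a scaled dot-product cross attention step in which the image queries attend over the text-token keys and values. All names, shapes and default values are illustrative assumptions for this article, not the internal implementation of FLUX or Chroma.

```python
import numpy as np

def combine_guidance(x_positive, x_negative, x_unconditional, guidance=4.0, cfg=1.0):
    """Combine the encodings/predictions using the two formulas above."""
    # x_conditional = (1 + Guidance) * x_positive - Guidance * x_negative
    x_conditional = (1.0 + guidance) * x_positive - guidance * x_negative
    # X_desired = CFG * x_conditional + (1 - CFG) * x_unconditional
    return cfg * x_conditional + (1.0 - cfg) * x_unconditional

def cross_attention(image_queries, text_keys, text_values):
    """Scaled dot-product attention: each image patch attends over the prompt tokens.

    image_queries: (num_patches, d)  projections of the image generated thus far
    text_keys:     (num_tokens, d)   projections of the text encoding
    text_values:   (num_tokens, d)   projections of the text encoding
    """
    d = image_queries.shape[-1]
    scores = image_queries @ text_keys.T / np.sqrt(d)   # (patches, tokens)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the tokens
    attended = weights @ text_values                    # (patches, d)
    return attended, weights

# Toy usage: 64 image patches, 512 T5 token positions, 16-dim projections.
rng = np.random.default_rng(0)
q = rng.standard_normal((64, 16))
k = rng.standard_normal((512, 16))
v = rng.standard_normal((512, 16))
out, heat = cross_attention(q, k, v)  # heat[:, t] = where token t lands in the image
```

Reshaping one token's column of `heat` back onto the latent grid gives a per-token heat map, i.e. where in the image that word is being applied at the current step.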
About the Chroma Modulation Layer:

Chroma is a unique model in that it can faithfully recreate photoreal, anime, furry, video footage and 3DCG concepts alike. Chroma is based on the FLUX Schnell model, a 12B parameter model of which 3.3B parameters make up the modulation layer. The modulation layer handles prompt adherence in the text-to-image model. In Chroma, the modulation layer is approximated down, making Chroma a ~9B parameter model in contrast to the original 12B parameter model.

Benefits:
Smaller size means a lower VRAM cost when running Chroma compared to FLUX.
Smaller size means Chroma can learn concepts more easily in training than base FLUX.

Drawbacks:
Replacing the modulation layer in the FLUX.1 Schnell AI model with pruned versions, such as in distilled variants like Chroma, can introduce several drawbacks:

1. Reduced Prompt Adherence: The modulation layer, with its ~3.3 billion parameters, is critical for integrating text prompt information (via T5 and CLIP embeddings) into the diffusion process. Pruning it can weaken the model's ability to accurately interpret and reflect complex or nuanced prompts, leading to outputs that may deviate from the intended description.

2. Loss of Output Quality: While pruning preserves much of the visual fidelity, there can be a noticeable drop in fine-grained detail, color accuracy, or overall coherence, especially for intricate or stylistically specific generations. The full modulation layer enables richer feature modulation, which pruned versions may struggle to replicate.

3. Limited Generalization: Pruned models are often optimized for specific use cases or hardware constraints. This can reduce their versatility across diverse prompts or artistic styles compared to the original FLUX Schnell, which leverages the full modulation layer for broader applicability.

4. Potential Artifacts: Aggressive pruning may introduce visual artifacts or inconsistencies, such as unnatural textures or distortions, particularly in edge cases where the model encounters prompts outside its optimized scope.

5. Trade-off in Robustness: The distillation process (e.g., latent adversarial diffusion) paired with pruning can make the model less robust to variations in input conditions or noise schedules, potentially affecting stability during generation, especially in low-step (1–4 step) inference.

While pruned versions like Chroma (with ~8.8–8.9 billion parameters) offer faster inference and lower resource demands, these benefits come at the cost of some degradation in the model's ability to handle complex prompts and maintain top-tier output quality. The exact impact depends on the pruning method and the specific use case.

(to be continued later)