Adcom

🤘 All LoRAs free to download
Discord for DMs: discord.gg/exBKyyrbtG

Articles

Prompt size - things to be aware of

Pony models use a score system. A quirk in training means you have to preface the prompt with 'score_9 score_8 score_7 ' to reach the training data for the model. This might be fixed in the latest Pony model; check the docs. The prefix does not have to be verbatim, a rough 'vibe' of it is enough. Illustrious models do not have this issue.

See the 'prompt as soundwaves' article: https://tensor.art/articles/913912699476253681

To mix / enforce concepts, it's best to repeat things at different points in the prompt, with a good amount of other text in between.

CLIP has a context size of 75 tokens. You can count tokens here (see the cover photo for this article): https://sd-tokenizer.rocker.boo/

Exceeding the 75-token limit (up to 150 tokens) will create two encoding vectors A and B, and the average (A+B)/2 will be used as input. You can also blend within each encoding, as long as there is a decent amount of separation; see the article above for further details on why that works.

//--//

For SDXL especially, avoid prompt lengths slightly above 75 tokens, or slightly above 150 tokens. That pairs a nearly empty encoding vector B with the full 75-token chunk A, and the final encoding (A+B)/2 just adds A to an almost empty vector B, which is essentially the same as prompting at half strength (A:0.5).

//--//

To fix this: repeat the prompt again so its length lands slightly below 150 tokens, or reduce the word count so the prompt is below 75 tokens. Capital letters A-Z and symbols each count as single tokens. You can browse all of the token vectors here: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/tokenizer/vocab.json
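To see where a prompt lands relative to these 75-token chunk boundaries, here is a minimal sketch using the Hugging Face transformers tokenizer from the SDXL repo linked above. The example prompt and the 'nearly empty chunk' warning threshold are my own illustrative choices, not values from any documentation:

```python
# Sketch: count CLIP tokens and flag prompts that leave chunk B nearly empty,
# since (A+B)/2 then dilutes the prompt as described above.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)

prompt = "a photo of a banana on a table"
# add_special_tokens=False so BOS/EOS markers don't inflate the count
n_tokens = len(tokenizer(prompt, add_special_tokens=False)["input_ids"])

chunks, remainder = divmod(n_tokens, 75)
print(f"{n_tokens} tokens -> {chunks} full chunk(s) + {remainder} tokens")
if chunks >= 1 and 0 < remainder <= 10:  # illustrative cutoff
    print("Warning: chunk B is nearly empty; (A+B)/2 will halve the prompt strength.")
```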
Cross Attention - (Work in progress)

Revisiting the earlier article on what prompts actually are: https://tensor.art/articles/913912699476253681

We know that the prompt is a 'sound' where concepts are overlaid as sine waves. Its amplitude is decided by the token word, out of a vocabulary of ~50K items: https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/tokenizer/vocab.json And its frequency by the token's position in the prompt. Just as two very distinctly different sounds can be overlaid into one sound, two very different prompts can be overlaid into one using the very generous 512 positions available in the T5 encoder.

//--//

Did you know that each word in your prompt is automatically assigned a place in the image? The 'sound wave' text encoding is converted to Key, Query and Value matrices, where:

Query matrix = the image generated thus far, 'the pixels we have painted in the image from earlier steps'. When training an image model, an existing image (the Query matrix) has noise added to it, and the AI model is trained to 'fill in the gaps' using a written prompt. Noise is added in training until the AI model can create an image from a written text prompt using nothing but 100% noise in its Query matrix (image) as its starting point.

Key matrix = X_desired, 'the thing to generate', calculated as

X_desired = CFG * x_conditional + (1 - CFG) * x_unconditional

where x_unconditional = improvisation in the model ('guess the gaps of this image based on adjacent pixels in the image generated thus far'), CFG is a decimal number, and

x_conditional = (1 + Guidance) * x_positive - Guidance * x_negative

where Guidance is a decimal number, x_positive = the 'sound wave' text encoding of the written prompt, and x_negative = the 'sound wave' text encoding of the written negative prompt. (A code sketch of these two equations follows at the end of this section.)

Value matrix = 'the area to paint on the image at this given step'. This matrix assigns a probability to each section of the image, which can be visualized as a heat map for a given input image.

Using cross attention knowledge in prompts:

1) When prompting concepts, keep in mind where in the image each concept applies, using this prior knowledge of cross attention.

2) Forcing the same cross attention placement, e.g. 'the middle is green. the middle is a curtain. the middle is a high resolution photo.', is an effective way of blending concepts together.

3) Alternating placement of foreign concepts, e.g. 'this is a single photo. the left is a cgi render of samus aran from metroid. the right is a real photo of a football stadium full of people. the left is blah blah. the right is blub blub', can be an effective way to blend highly foreign concepts into a single image by assigning them separate placements in the prompt.
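Here is a minimal numpy sketch of the two guidance equations above. The function name and the random stand-in vectors are my own; in a real pipeline x_pos / x_neg would be T5 text encodings and x_uncond would come from the model itself:

```python
# Sketch of the article's guidance equations with toy vectors.
import numpy as np

def combine_guidance(x_pos, x_neg, x_uncond, guidance=5.0, cfg=1.0):
    # x_conditional = (1 + Guidance) * x_positive - Guidance * x_negative
    x_cond = (1 + guidance) * x_pos - guidance * x_neg
    # X_desired = CFG * x_conditional + (1 - CFG) * x_unconditional
    # Note: at CFG = 1 the unconditional term drops out entirely, matching
    # the 'CFG = 1' settings recommended for Chroma later in these articles.
    return cfg * x_cond + (1 - cfg) * x_uncond

rng = np.random.default_rng(0)
x_pos, x_neg, x_uncond = rng.normal(size=(3, 8))
print(combine_guidance(x_pos, x_neg, x_uncond, guidance=5.0, cfg=1.0))
```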
About the Chroma modulation layer:

Chroma is a unique model in that it can faithfully recreate photoreal, anime, furry, video footage and 3DCG concepts alike. Chroma is based on the FLUX Schnell model, a 12B parameter model of which ~3.3B parameters make up the modulation layer. The modulation layer handles prompt adherence in the text-to-image model. In Chroma, the modulation layer is approximated down, making Chroma a 9B parameter model in contrast to the original 12B parameter model.

Benefits: the smaller size means lower VRAM cost when running Chroma compared to FLUX, and it means Chroma can learn concepts more easily in training than base FLUX.

Drawbacks: replacing the modulation layer in the FLUX.1 Schnell model with pruned versions, such as in distilled variants like Chroma, can introduce several drawbacks:

1. Reduced prompt adherence: the modulation layer, with its ~3.3 billion parameters, is critical for integrating text prompt information (via T5 and CLIP embeddings) into the diffusion process. Pruning it can weaken the model's ability to accurately interpret and reflect complex or nuanced prompts, leading to outputs that may deviate from the intended description.

2. Loss of output quality: while pruning preserves much of the visual fidelity, there can be a noticeable drop in fine-grained detail, color accuracy, or overall coherence, especially for intricate or stylistically specific generations. The full modulation layer enhances feature modulation for richer outputs, which pruned versions may struggle to replicate.

3. Limited generalization: pruned models are often optimized for specific use cases or hardware constraints. This can reduce their versatility across diverse prompts or artistic styles compared to the original FLUX Schnell, which leverages the full modulation layer for broader applicability.

4. Potential artifacts: aggressive pruning may introduce visual artifacts or inconsistencies, such as unnatural textures or distortions, particularly in edge cases where the model encounters prompts outside its optimized scope.

5. Trade-off in robustness: the distillation process (e.g. latent adversarial diffusion) paired with pruning can make the model less robust to variations in input conditions or noise schedules, potentially affecting stability during generation, especially in low-step (1-4 step) inference.

While pruned versions like Chroma (with ~8.8-8.9 billion parameters) offer faster inference and lower resource demands, these benefits come at the cost of some degradation in the model's ability to handle complex prompts and maintain top-tier output quality. The exact impact depends on the pruning method and the specific use case.

(to be continued later)
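Until then, here is a toy sketch of the scaled-dot-product cross attention described earlier in this article. Shapes and names are illustrative only: Q stands in for image patches, K and V for the prompt encoding, and each row of the returned weights is the per-token 'heat map' mentioned above:

```python
# Toy cross attention: which prompt token 'paints' which image patch.
import numpy as np

def cross_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (patches, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ V, weights                      # features + heat map

rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 16))   # 64 image patches, 16-dim features
K = rng.normal(size=(12, 16))   # 12 prompt tokens
V = rng.normal(size=(12, 16))
out, heatmap = cross_attention(Q, K, V)
print(out.shape, heatmap.shape)  # (64, 16) (64, 12)
```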
The vector behind your prompt

What are weighted prompts?

When setting a weight in a prompt, e.g. '(banana:0.3)', you are scaling down the magnitude of the token vector's sine wave component by a factor of 0.3. Weighted prompts are NOT the intended design for the text encoder, and you are discouraged from relying on them for results!

A better strategy than weighted prompts: use the knowledge from this article that prompts are sine waves at descending frequencies by position in the prompt. The better strategy is to instead REPEAT the key concepts at different positions in the prompt! Repeating the words at different positions ensures the soundwave carries the concept both in the low frequency sine wave range and the higher frequency sine wave range. Inversely, foreign concepts in training data will blend more easily the closer these concepts sit in the positional training data.

A quirk with the T5 is that blank space " " is a strong discriminator between concepts. Removing the blank space " " separator between concepts increases the likelihood of a good concept blend in the output encoding, e.g. writing "carbananatree" instead of "car banana tree".

What is Guidance? And what is a negative prompt?

The final prompt encoding vector 'pos_prompt' is subtracted by the negative prompt 'neg_prompt' using the equation

conditional = guidance_scale * pos_prompt - neg_prompt

Most ComfyUI setups use the CFG system shown on the left side of the diagram in the original post (the guidance scale 'alpha' is the symbol Phi in that diagram). However, there is an unseen parameter here called the 'CFG parameter', which sets the ratio between conditional generation ('make this thing!') and unconditional generation ('fill in the gaps in this image based on adjacent pixels!'). If the CFG parameter is called 'x', then the AI model's output for our prompt, 'result', will be

result = (1 - x) * unconditional + x * conditional

The vector 'result' is decoded by the variational autoencoder (VAE) into the 'desired image'. The 'variational' part of the variational autoencoder is the reason why you get different image output for different seeds. This 'desired image' is then recreated, step by step, by the sampler (actually an algorithm for a differential equation solver) over N steps. For example, this is the Heun sampler's differential equation solver: https://en.m.wikipedia.org/wiki/Heun%27s_method (a small sketch of a Heun step follows below).

More generation steps = better result, right?

In this example I run Chroma Heun, https://tensor.art/models/895078034153975868?source_id=njuypVDrnECwpvAqYH718xIg , at 5 steps, 10 steps and 15 steps. While this demonstration is not conclusive, for Chroma models in particular the 'aesthetic quality' improves at LOWER step counts, with the ideal at ~10 steps (0.4 Tensor credits per image). However, text legibility is better at 15 steps (0.6 Tensor credits per image). The composition in the image is unchanged at 5 steps (0.2 Tensor credits per image).
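Since the 'Heun' sampler name comes from that solver, here is a minimal sketch of a single Heun step on a toy ODE. `f` stands in for the derivative the denoiser provides during sampling; the example equation y' = -y is purely illustrative:

```python
# One step of Heun's method: an Euler predictor followed by a slope-averaging
# corrector, which is why it costs roughly two model calls per step.
def heun_step(f, t, y, h):
    k1 = f(t, y)                    # predictor: slope at the start
    k2 = f(t + h, y + h * k1)       # corrector: slope at the predicted point
    return y + h * (k1 + k2) / 2    # average the two slopes

# Toy usage: solve y' = -y from y(0) = 1 over five steps.
y, t, h = 1.0, 0.0, 0.2
for _ in range(5):
    y = heun_step(lambda t, y: -y, t, y, h)
    t += h
print(round(y, 4))  # ~0.3707, close to the exact exp(-1) = 0.3679
```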
What is a prompt actually?

Your prompt is actually a soundwave. The prompt is several soundwaves in fact, but let's assume it's a single 'noise' built from a set of sine waves at different amplitudes. What decides the amplitude of the sine waves? The token vector sets the amplitude. What decides the frequencies of the sine waves? The position the word has in the prompt!

Token vectors are like the alphabet A-Z, where each letter has its own amplitude characteristic, except this alphabet has been expanded to include common English words, making the 'alphabet' about ~50K components in size. For example, the word 'banana' is its own token vector in this token vector alphabet, as are 'blueberry', the letter 'A', the number '8' and the emoji '👍'. You can test how your text is tokenized into token vectors here: https://sd-tokenizer.rocker.boo/ You can browse the vocab.json token vector 'alphabet' here: https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/tokenizer/vocab.json Hence every sentence or word combination you can think of is converted into its respective 'soundwave'.

To determine the frequencies of the sine waves in the positional encodings, we need to examine how positional encodings are constructed in transformer models. Transformers do not inherently capture sequence order, so positional encodings incorporate the position of each token in the input sequence. For sinusoidal positional encodings, the frequencies of the sine (and cosine) waves are determined by a formula that assigns a unique encoding to each position based on its position index and the dimensionality of the embedding:

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

So the frequency of dimension pair i is 1 / 10000^(2i / d_model): the first dimensions are fast 'high notes' that distinguish neighboring positions, while the last dimensions are slow 'low notes' that encode coarse position. For the T5 encoder used here, there are 512 available positions, as noted above. A sketch of computing these frequencies follows below.
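Here is a short numpy sketch of that formula and the per-dimension frequencies it implies. The d_model value is an illustrative choice and the 512-position range matches the T5 budget mentioned in these articles; the 10000 base is the classic transformer recipe:

```python
# Sketch: build sinusoidal positional encodings and print their frequencies.
import numpy as np

def positional_encoding(n_positions=512, d_model=64):
    pos = np.arange(n_positions)[:, None]         # token positions
    i = np.arange(d_model // 2)[None, :]          # dimension pair index
    freqs = 1.0 / (10000 ** (2 * i / d_model))    # frequency per pair
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe, freqs.ravel()

pe, freqs = positional_encoding()
print(freqs[:3], freqs[-3:])  # fast 'high notes' first, slow 'low notes' last
```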
Photoreal in Chroma - Things you can do

Pixelprose is (likely) part of Chroma's photoreal set: https://huggingface.co/datasets/lodestones/pixelprose

Since you are using training data: clear the negatives completely. I'm using Chroma V49 at Heun 10 steps with the Beta scheduler.

CC12M Dataset (Chroma training data)

Excerpts from CC12M in the pixelprose 'vlm_caption' field (without negatives):

PROMPT: a group of people on a cruise ship. There are approximately 25 people in the image. They are all wearing casual clothes and are standing around the pool on the ship. There is one person in the center of the image who is dressed up in a costume. They are wearing a pink and green tutu, a lei, and a large pair of sunglasses. They are also holding a tambourine. All of the people in the image are smiling and appear to be enjoying themselves. The background of the image is a blue sky with white clouds. The floor is made of wood and there are several chairs and tables around the pool. The image is a photograph. It is taken from a low angle and the people in the image are all in focus. The colors in the image are vibrant and the lighting is bright.
NEG: (none)

//----//

PROMPT: A post-apocalyptic woman holding a crossbow. She is crouched on a pile of rubble. She is wearing a tattered gray cloak and a pair of goggles. Her face is dirty and she has a scar on her left cheek. Her hair is long and white. She is holding the crossbow in her right hand and it is pointed at the viewer. She has a knife in her left hand. The knife has a long, curved blade. The background is a blur of gray rubble. The image is in a realistic style and the woman's expression is one of determination.
NEG: (none)

(For photoreal one might add to this prompt, or specify it using the 'aesthetic' tag.)

Trying again (with fixes):

PROMPT: A post-apocalyptic real photo aesthetic woman holding a crossbow. She is crouched on a pile of rubble. She is wearing a tattered gray cloak and a pair of goggles. Her face is dirty and she has a scar on her left cheek. Her hair is long and white. She is holding the crossbow in her right hand and it is pointed at the viewer. She has a knife in her left hand. The knife has a long, curved blade. The background is a blur of gray rubble. The image is in a realistic style and the woman's expression is one of determination.
NEG: fantasy_illustration gray_illustration

(Negatives are tokenized one by one, separated by whitespace, hence the underscore '_'.)

//----//

PROMPT: A scene from the movie Planet of the Apes, where a group of monkeys are driving cars on a bridge. In the foreground, a monkey is standing on the roof of a car, while another is sitting in the driver's seat. In the background, several other monkeys are driving cars, and one is standing on the roof of a car, holding a gun. The background is a destroyed city.
NEG: (none)

//----//

PROMPT: A man and a woman walking and talking. The man is on the left side of the image, and the woman is on the right side. They are both smiling. The man is wearing a dark blue suit jacket, pants, and shoes. The woman is wearing a white dress and matching shoes with a red clutch in her right hand. They are walking on a stone path lined with trees and grass on either side. In the background, there is a building with large windows. The image is a photograph taken from a slightly elevated angle.
NEG: (none)

//----//

Redcaps Dataset (Chroma training data)

A peculiar set within pixelprose is the Redcaps set: https://redcaps.xyz/

TLDR: prompt like a reddit title without negatives, get photoreal results. Refer to redcaps.xyz for examples.

Prompts from redcaps without negatives:

PROMPT: leaves in an alley
NEG: (none)

PROMPT: i swear, his color just shines in the mornings.
NEG: (none)

PROMPT: advice for a new owner? canon t7i, 24mm , f8./200s, 100 iso , r/beardeddragons , spiro the dragon
NEG: (none)

The reason `canon t7i, 24mm , f8./200s, 100 iso` works is that these are actual titles people use at r/amateurphotography (weirdos), and that's part of the redcaps set, which is why such nonsense terminology can be useful in Chroma.

Finally, photoreal NSFW: we don't know what photoreal NSFW sets are used, but writing prompts like a th0t on r/gonewild works for photoreal.

PROMPT: elf girl fundays. just got this high collared black bodysuit off amazon. gorgeous green background. Here is my white bed. real photo aesthetic. showing off my braids and nerd glasses. any love for an eighteen blonde elf ….🤔💕(f) ?
NEG: onlyfans_footage casual_illustration

Similarly, I reckon writing pr0n video titles ought to work well for photorealistic NSFW. Feel free to match the CC12M against the collection of NSFW story excerpts 1-30, with 1K paragraphs in each generator: https://perchance.org/fusion-t2i-nsfw-stories-1

Batch encoding size for the T5 is 512 tokens. Verify the size here: https://sd-tokenizer.rocker.boo/ (or programmatically; see the sketch after this section). I'll leave that as something people can try for themselves with the above tips as a guide.

Getty Images

Getty Images hosts captions for their photos: https://www.gettyimages.com Copy-paste for easy photoreal results.

PROMPT: 2012 Monaco Grand Prix - Saturday 2012 Monaco Grand Prix - Saturday Monte Carlo, Monaco 26th May 2012 Force India girls. Photo by Andrew Ferraro/LAT Images
NEG: television_screen plastic_wig gray_3D_blur

PROMPT: Mel C performs at the V99 festival in Chelmsford on August 21st 1999 CHELMSFORD, ENGLAND - AUGUST 21: ormer spice girl Melanie Chisholm performs her first major solo gig at the V99 festival in Chelmsford on August 21, 1999. (Photo by Dave Hogan/Getty Images)
NEG: television_screen plastic_wig gray_3D_blur

Fangrowth Generator

For NSFW try this generator: https://www.fangrowth.io/onlyfans-caption-generator/ It works well in combination with: https://perchance.org/fusion-t2i-phototitle-1

For example, the tag 'Amateur' =>

I can’t think of a few things we could do to make this pool more fun
I don’t even know why I put a bathing suit on ;)
Everything about this moment felt right
Jiggly in all the best places
Now I’m a tanned milf lol

//---//

Finally, the conclusion I draw from Gonkee's video on embeddings in SD models: https://youtu.be/sFztPP9qPRc?si=dckBPPpLeUMAoTnl Repetition of concepts at various places in the prompt is better than adding weights, as syntax like '(blah blah :1.2)' was never the intended use for the FLUX / Chroma models.
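As a programmatic alternative to the tokenizer page above, this sketch counts T5 tokens against the 512-token budget with the transformers library. I'm assuming the `tokenizer_2` subfolder of the FLUX.1-dev repo holds its T5 tokenizer in the diffusers layout; the repo is gated, so you may need to accept its license first (requires `transformers` and `sentencepiece`):

```python
# Sketch: check a prompt against the T5 encoder's 512-token budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="tokenizer_2"
)

prompt = "leaves in an alley"
n = len(tokenizer(prompt)["input_ids"])
print(f"{n} tokens out of a 512-token T5 budget")
```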
Chroma Models - The Stuff I Know

Chroma is a finetune of FLUX Schnell created by Lodestones. HF repo: https://huggingface.co/lodestones/Chroma

Allegedly a technical report on Chroma may be released in the future. Don't hold your breath though: from my own personal experience and others', Lodestone is not keen on explaining exactly what he is doing or what he is planning to do with the Chroma models. TLDR: there is no documentation for Chroma; we just have to figure it out ourselves. I'm writing this guide despite having nothing close to factual information with regards to exact training data, recommended use and background information on the Chroma models. Aside from the total lack of documentation, the Chroma models are an excellent upgrade to the base FLUX model, and Lodestone deserves full credit for his efforts. The cost of training the Chroma models is allegedly (at present) over 200K USD in GPU time.

Model architecture for Chroma

The key feature is that this model has been pruned from the FLUX Schnell model, i.e. the architecture is different. (The original post shows the keys of the .safetensors files for FLUX Dev fp8 (B), FLUX KREA (K) and FLUX Chroma (C) side by side; see the key-listing sketch at the end of this section.) As such, don't expect good results from running a FLUX Dev trained LoRA on Chroma. Another minor change in architecture is the removal of the CLIP_L encoder: Chroma relies solely on the T5 encoder.

Architecture (sub-models)

The Chroma models have different versions:

Chroma V49 - the latest trained checkpoint of Chroma. Unlike V48, it is assumed Chroma V49 has undergone 'hi-res training' like the V50, but this is not confirmed due to the lack of documentation. https://tensor.art/models/895059076168345887

Chroma V50 Annealed - a checkpoint merge of the last 10 Chroma checkpoints, V39-V49, which has then undergone 'hi-res training'. https://tensor.art/models/895041239169116458 'Annealed', I have been told on Discord, means that the model has undergone one final round of training through all 5 million images in the training data at a very low learning rate. Plans are to make the V50 Annealed the 'official' FLUX Chroma model under the name 'Chroma1-HD'.

Chroma V50 - a bulls1t checkpoint merge created to secure funding for training the other checkpoint models. Don't use it.

Chroma V50 Heun - an 'accidental' checkpoint off-shoot that arose when training the Chroma model. It works surprisingly well for photorealism with the 'Heun' or 'Euler' sampler and the 'Beta' scheduler at 10 steps, 1 CFG, hence the model name. https://tensor.art/models/895078034153975868

Chroma V46 Flash - another 'accidental' offshoot in training that boasts the highest output stability of all the Chroma checkpoints. Try the Euler sampler with the SGM Uniform scheduler at 10 steps, 1 CFG. An excellent model! https://tensor.art/models/889032308265331973

What model should I use for LoRA training? Either V49 or V50 Annealed are excellent choices in my opinion. The V49 and V50 Annealed models can both run at 10 steps with the Beta scheduler at CFG = 1 and guidance scale = 5, at a cost of 0.4 credits per image generation here on Tensor.

Training

The Chroma model can do anime, furry and photorealistic content alike, including NSFW, using both natural language captions and danbooru tags. The training data has been captioned using the Google Gemma 12B model.
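If you want to verify the architecture differences yourself, here is a hedged sketch that lists the tensor names inside two local .safetensors files, in the spirit of the key comparison above. The file paths are placeholders for checkpoints you have downloaded (requires the `safetensors` package):

```python
# Sketch: compare checkpoint architectures by their tensor key names.
from safetensors import safe_open

def list_keys(path, limit=10):
    with safe_open(path, framework="pt") as f:
        keys = sorted(f.keys())
    print(f"{path}: {len(keys)} tensors")
    for k in keys[:limit]:
        print(" ", k)
    return set(keys)

flux_keys = list_keys("flux1-schnell.safetensors")   # placeholder path
chroma_keys = list_keys("chroma-v49.safetensors")    # placeholder path
print("keys only in FLUX:", len(flux_keys - chroma_keys))
```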
A repo assembled by me has a collection of training text-image pairs used to train Chroma, stored as parquet files accessible via a Jupyter notebook in the same repo: https://huggingface.co/datasets/codeShare/chroma_prompts/blob/main/parquet_explorer.ipynb You'll need to download a parquet file to your Google Drive to read the prompts (a minimal pandas sketch follows at the end of this section). (The original post shows example output from the E621 set.)

Lodestones' repos (⬆️ items from these sets are included in my chroma_prompts repo for ease of use):
https://huggingface.co/datasets/lodestones/pixelprose
https://huggingface.co/datasets/lodestones/e621-captions/tree/main

Tip: ask GROK on Twitter for Google Colab code to read items in these sets.

//---//

The Redcaps dataset

A peculiar thing is that Chroma is trained on the redcaps dataset: redcaps.xyz These are text-image pairs where the image is an image found on reddit and the text prompt is the title of the reddit post! If you want a fun time prompting Chroma, copy-paste a reddit title either off the redcaps.xyz page or from the chroma_prompts repo parquet files, and see for yourself.

Example of a redcaps prompt: I found this blue thing in my backyard. Can someone tell me what it is?

The 'aesthetic' tags

The pixelprose dataset used to train Chroma has an 'aesthetic' score assigned to each image as a float value. This value has been rounded down to 'aesthetic 1, aesthetic 2, ..., aesthetic 10'. Additionally, all AI images used to train Chroma have been tagged as 'aesthetic 11'. (more later)

Anime test

PROMPT: what is the aesthetic 0 style type of art? anime screencap with a title in red text Fox-like girl holding a wrench and a knife, dressed in futuristic armor, looking fierce with yellow eyes. Her outfit is a dark green cropped jacket and a skirt-like bottom. \: title the aesthetic 0 style poster "Aesthetic ZERO"

Captioning

The Gemma 12B model was used to caption Chroma prompts; however, this model does not run on a free-tier T4 Colab GPU like the well-established Joycaptions. To mitigate this, I'm training the Gemma 4B model to specialize in captioning images in the same format as the Chroma training data. More info on the project here: https://huggingface.co/codeShare/flux_chroma_image_captioner

Finding prompts

I recommend you visit the AI generator at perchance.org for Chroma prompts. They have had the Chroma model in their T2I generator for a while, and there are lots of users posting to the galleries. It's hard to browse old posts on perchance, so it will do you well to 'rescue' some prompts and post them here to Tensor Art.

Resolutions

Refer to the standard values for Chroma and SDXL models.

//---//
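Here is the minimal sketch mentioned above for reading prompts out of one of those parquet files with pandas. The filename is a placeholder for whichever parquet file you grab from the chroma_prompts repo (requires `pandas` and `pyarrow`):

```python
# Sketch: inspect a downloaded parquet file of Chroma text-image pairs.
import pandas as pd

df = pd.read_parquet("e621.parquet")   # placeholder filename
print(df.columns.tolist())             # inspect available caption fields
print(df.iloc[0])                      # first text-image training pair
```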