Beginners' Basics By Beginner (PART 2)
This part assumes you have read Part 1.
Background on Models & Model Training for Image Generation
One way to think of an AI model is that it is a program that resulted from other program(s) using a lot of data to learn patterns and correlations. This process is called “training”. For image generation, training creates the basis for your text to be translated into the various image elements. There are different programs and ways to train, and TA has its own as well. For beginners we won't go into actually creating models, just enough image-model basics to help us generate images better.
Quality & Scope
Firstly, the model creator chooses the input images. Generally speaking, the more the better. The quality of the input images – clarity, resolution, how representative they are, etc. – all affects the end model. Since model creation is often an ad hoc effort shaped by each creator's constraints and preferences, models vary greatly in quality and scope. Quality covers things like clarity, flexibility in subject/object/environment, and fewer flaws (deformed, distorted, outright wrong results, etc.). Scope is what the model can produce in terms of object/subject/colors/mood/genre and so on.
How it impacts us:
Company-produced models (usually base models) tend to be more dependable, so base models are your default first choice unless a checkpoint offers something worthwhile.
Freelance or hobby creators build a “reputation” through their work, which helps us decide whether to try their future models. If they have previously done good work and good samples are visible, a new model lacking samples may be worth trying on reputation alone. Whether you want to risk some credits on untested creators and models comes down to how attractive the promised offering is to you. Sometimes a model is “printed” from elsewhere, meaning it was copied over, so the source site may have samples.
LoRAs provide add-on features, for example extra “special effects” for your images. If you do not need those effects, you do not need the LoRA. Stick with what you need so there are fewer chances for flaws. In Part 1 we mentioned that LoRAs can interfere with one another, so adding LoRAs you don't need simply adds chances for problems. More LoRAs may mean more add-on features, but it can also mean more parts failing to work together.
Cleaning and Organizing Data
The creators also need to remove duplicates, avoid over-representation or skewing of the data, etc. After that, the images need to be “labelled & tagged”. This is sort of like describing them, so that a prompt can later pull from the model to help generate the image (a rough sketch follows below). Of course there is more to it than what was just described, but it is good enough for a basic understanding.
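For those curious, here is a very rough Python sketch of what the “clean, then label & tag” step could look like. The folder name, file type, the tags, and the one-text-file-per-image convention are all assumptions for illustration only; real training tools have their own formats, and this is not necessarily how TA or any particular trainer works.

# Hypothetical sketch: drop exact-duplicate images, then write one tag file per image.
import hashlib
from pathlib import Path

dataset = Path("training_images")      # assumed folder of input images
seen_hashes = set()

for img in sorted(dataset.glob("*.png")):
    digest = hashlib.sha256(img.read_bytes()).hexdigest()
    if digest in seen_hashes:
        img.unlink()                   # exact duplicate: remove it to avoid skewing the data
        continue
    seen_hashes.add(digest)
    # "Label & tag": a short description the trainer can associate with this image.
    tags = "woman, red dress, garden, daylight"   # placeholder tags, normally written per image
    img.with_suffix(".txt").write_text(tags)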
How it impacts us:
Everyone works differently, in quality and in quantity, and that affects the data sources, the inputs and how they are treated. Everyone also thinks differently, so the “label and tag” choices will differ. Therefore, a prompt may work for one LoRA but not another, some prompted words work more often than others on the same model, and some prompts work for some models and not others. This means that when we learn to generate images, the lessons cannot be applied blindly across models. Each model has its particularities that must be learned. Since time and effort are limited, you need to discover your few preferred models and spend most of your budgeted time and effort on those, while occasionally setting aside a little time to have a “look see” at what other models can do.
Training a Model
The actual crunching is done by a program on a computer; there is more than one option/method, and each produces models that behave differently. One aspect of this “training” is that the AI processes the inputs in such a way that far more outputs become possible than what was put in. A simple way to think of it is that the AI is mixing, matching and coming up with new variations and permutations (a toy illustration follows below).
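To make the “mixing and matching” idea concrete, here is a toy Python illustration. The concept lists are made up, and a real model is far more complicated than simple combinations; this only shows how a few learned elements already allow many more outputs than were put in.

# Toy illustration only: a handful of learned elements combine into many outputs.
# A real image model does NOT literally work this way; this just shows the permutation idea.
from itertools import product

subjects = ["cat", "woman", "robot"]
styles = ["watercolor", "photo", "oil painting"]
settings = ["garden", "city street", "beach"]

combos = list(product(subjects, styles, settings))
print(len(combos))    # 27 possible combinations from only 9 learned elements
print(combos[0])      # ('cat', 'watercolor', 'garden')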
How it impacts us:
The AI is not a thinking person; it is not thinking at all. It is a program, so it can make the kinds of mistakes a program can make, which include hallucinations and flaws. Hallucinations are when wrong, misleading or odd results come out; for images, a person with three arms, distortions and deformities are examples. These problems are not necessarily purely because of the training, but for simplicity we treat them as such for beginner basics.
So when we prompt models, we may feel that many detailed prompts are good. For the purpose of getting as close as possible to what we want, more prompting does help. However, more prompts also mean more complexity, because more variables are involved, and that increases the likelihood of strange or bad results. Hence some users' preference for simple, short prompts. Another way to think of it: when you prompt, you need to know what is most important to you in the image. In a manner of speaking, you need to ration your words so that the key aspects are covered, and leave the rest to the model. As AI improves we may be able to prompt more, and maybe even have a responsive AI that asks questions and clarifies with you, but for now too many words is a bad thing. All the extra words use up the “attention” of the AI and give more chances for flaws.

I prefer short, concise keywords that contain the critical elements of what is needed – for example, “woman, red dress, garden, sunset, soft light” rather than a full paragraph describing the scene. Filler words like “is”, “are”, “that” and so on carry no meaning for the image generated. The model is not human and does not pick up meaning from natural language; it picks up the keywords and the context of the keywords. This whole paragraph is unfortunately contrary to what some people suggest when they say to write a lot and use plenty of descriptive words, as if you were writing a scene in a novel. I happen to disagree with that method. It's up to you to decide what you believe – look at what others have produced, or test it out for yourself.
There are actually far more aspects to training a model, but this basic background knowledge should be enough for a beginner who simply wants to generate images.