NatViS: Natural Vision


NatViS (Natural Vision) is a photorealistic full-parameter fine-tune of SDXL that uses natural language prompting to generate high-quality SFW/NSFW images. It was trained on 1M+ image-caption pairs from a dataset that has been expanded and refined for over a year.
Note: NatViS is still being trained. V1 (epoch 68) wrapped up training on July 19th, 2024.

I’ve never been a fan of e-begging; however, SDXL fine-tunes at this scale are becoming expensive to train. So I will begrudgingly ask: if you like what I do and would like to support my models, consider donating on Ko-Fi 💗
I will begin posting updates, answering questions, taking feedback, and releasing early-access (NOT EXCLUSIVE) models to supporters.


Questions/Feedback/Support

Visit my thread on the Unstable Diffusion Discord

Buy me a coffee ❤

https://ko-fi.com/ndimensional

All donations will be used to fund the creation of new Stable Diffusion fine-tunes and open-source AI tools.


Usage Tips

Note: These are simply recommendations, feel free to experiment.

Prompting

NatViS leverages SDXL’s bigG text-encoder to allow for Natural Language prompting.

What is Natural Language Prompting?
Since the release of Stable Diffusion v1.4, people have become accustomed to comma-delimited lists of visually descriptive tags/phrases. This was a necessity for early Stable Diffusion models due to their architecture and choice of text-encoder. With SDXL’s dual text-encoder/tokenizer architecture, we are able to write more naturally descriptive prompts.

Simply describe the image you want to generate, just as you would describe the image to a person.

For example:
Comma-delimited list: a woman, standing, outdoors, sun beams, dappled light, apple tree, wearing denim jeans, flannel shirt, brown hair, long hair, looking at viewer, highest quality, atmospheric, 35mm, masterpiece

Natural Language: A masterpiece, 35mm-style photo of a woman with long brown hair, standing outdoors in dappled sunlight beneath an apple tree. She wears denim jeans and a flannel shirt, gazing directly at the viewer with an atmospheric quality.

Note: This is just an example to highlight how to write a natural language prompt. For better examples, see the sample images.

Will NatViS Understand Everything I tell it?
Absolutely not.
There are various limitations in both the architecture and the amount of data I’m able to fine-tune as one person, so there will be instances where the model simply won’t generate what you want. Often you can experiment with different wording, the placement of tokens (i.e., moving a sentence or individual token closer to the start or end of the prompt), or removing potentially conflicting tokens, etc. There really is no definitive solution I can offer, as it varies from prompt to prompt. Unfortunately, there will be times when no solution/workaround is successful.

Can I still use Tags?
Short answer: Yes
SDXL’s dual text-encoder/tokenizer architecture can process tokens/sequences with both encoders in parallel, meaning you don’t have to use natural language prompting exclusively.


Note: Since the training data was captioned purely with natural language descriptions, not all of the common descriptive tags people are familiar with will be understood by the model, especially Booru-style tags.

I found a hybrid system works well, as seen in many of the sample images.


For example:
Say you tried your natural language prompt, but want to make the results a bit more cinematic. Instead of modifying the entire prompt, you can simply append cinematic lighting, harmonious, film still, etc. to the end of your prompt.
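As a concrete sketch of this hybrid approach, the helper below appends comma-delimited style tags to a natural language prompt. The function name and structure are mine for illustration only; they are not part of any Webui or tool.

```python
def add_style_tags(prompt: str, tags: list[str]) -> str:
    """Append comma-delimited style tags to a natural-language prompt."""
    base = prompt.rstrip().rstrip(".")
    return base + ". " + ", ".join(tags)

nl_prompt = ("A masterpiece, 35mm-style photo of a woman with long brown hair, "
             "standing outdoors in dappled sunlight beneath an apple tree.")
hybrid = add_style_tags(nl_prompt, ["cinematic lighting", "harmonious", "film still"])
```

This keeps the natural language description intact while letting the tag portion steer the overall look.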

Quality Tags/Classifiers? (score_up_x)
Blasphemy.
You can use quality ranks/classifiers if you want, but they were not part of the training data.

Negative Prompt
As with other SDXL models, use tags separated by commas and keep it short. Add/remove tokens from the negative prompt as needed.

Generation Parameters

CFG:

  • Recommended: 5-7

  • 7+ to enforce a specific style/medium

Sampler/Sampling Steps:
This can be quite subjective, so I will just share what I typically use instead of giving direct recommendations.

  • Sampler - DPM++ 2M SDE

  • Scheduler - Karras

  • Steps - 55

ADetailer: (Extension)
Again, subjective so I’ll just share my settings.

  • Model - mediapipe_face_full (use mediapipe for photorealism)

  • Confidence - 0.45

  • Everything else is default.
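For quick reference, the settings above can be collected into a single configuration mapping. The key names below are illustrative only; they do not correspond to any specific Webui or API field names.

```python
# Generation settings from the notes above (key names are illustrative).
natvis_settings = {
    "cfg_scale": 6,              # recommended 5-7; 7+ to enforce a style/medium
    "sampler": "DPM++ 2M SDE",
    "scheduler": "Karras",
    "steps": 55,
    "adetailer": {
        "model": "mediapipe_face_full",  # mediapipe for photorealism
        "confidence": 0.45,              # everything else left at default
    },
}
```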

CFG Rescale: (Extension)
I forgot that I had this installed, so I’m not quite sure whether it was enforcing the zero terminal SNR noise schedule or not. Since the parameter was null, it shouldn’t have been.

  • Phi - 0
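For the curious: CFG rescale interpolates between the plain classifier-free-guidance output and a version rescaled to match the standard deviation of the conditional prediction, so Phi = 0 returns the plain CFG output unchanged, which is why a null setting has no effect. A minimal sketch under those assumptions (not the extension’s actual code):

```python
import numpy as np

def rescale_cfg(pos: np.ndarray, cfg: np.ndarray, phi: float) -> np.ndarray:
    """Blend the plain CFG output with a variance-rescaled version.
    phi=0 leaves the CFG output untouched."""
    rescaled = cfg * (pos.std() / cfg.std())  # match conditional prediction's std
    return phi * rescaled + (1.0 - phi) * cfg

rng = np.random.default_rng(0)
pos = rng.normal(size=16)                     # conditional noise prediction
cfg = 7.0 * pos - 6.0 * rng.normal(size=16)   # plain CFG combination (scale 7)
out = rescale_cfg(pos, cfg, phi=0.0)          # identical to cfg when phi is 0
```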


Important

If you struggle to replicate the sample images, even with the exact seed and parameters, it’s likely because of the noise scheduler. I had enabled the fix for this in Webui, but have since reinstalled Webui and forgot to re-enable it. This only applies to V1 of NatViS.


Training Info

TO-DO
A full write-up will take a while, so in the meantime:
TL;DR: 1M+ images, processed/cleaned via a personal Dataset Toolkit I’m developing, and captioned via a Multimodal Large Language Model (MLLM) with a unified feature space (part of the Dataset Toolkit, not GPT). Training data, configs, and custom scripts will be made available and open-sourced when the final version is released. The Dataset Toolkit has no announced release date.


Check out my other models

SDXL Checkpoints: https://civitai.com/collections/966964

SDXL LoRAs: https://civitai.com/collections/966969

40K Series: https://civitai.com/collections/956187

SD1.5 Checkpoints: https://civitai.com/collections/966974

SD1.5 LoRAs: https://civitai.com/collections/966972

Version Detail

SDXL 1.0
What's New?

  • Uploaded NatViS v2.5.

  • Updates to the text-encoder(s) to reintroduce tag/Booru-style prompting capabilities that were broken in v2.0.

  • Included a subset of data from the new (improved) dataset, specifically image-caption pairs with short n' punchy captions. The new dataset (for future models/updates) includes more variation in caption styles, and all automation is manually verified by a human (i.e., me).

  • Introduced more analog photography and classic cinematic film image data to further the push for more authentic realism.

What's Next?

  • General: Review the SD3.5 license to see if it's worth touching. It's not terrible. Will start researching the model's architecture for fine-tuning/LoRA.

  • General: Release the Anti-Pony Alpha model (anime, digital illustrations). In advance: it's not nearly as robust as Pony. This is a test to see if there's enough interest in the idea to pursue crowdfunding for training. Trained with character knowledge and quality in mind, a novel Booru+ tagging system & natural language prompting, multiple styles/mediums, artist knowledge, no silly quality-ranking tags, and SDXL compatibility (i.e., not overfit and broken). More info will come out soon.

  • NatViS: Release Lightning variants for NatViS v2.5, done more effectively this time.

  • NatViS: Finally getting around to creating and releasing a PDF guide.

  • NatViS: Continue fine-tuning v3.0.

Project Permissions

    Use Permissions

  • Use in TENSOR Online

  • As an online training base model on TENSOR

  • Use without crediting me

  • Share merges of this model

  • Use different permissions on merges

    Commercial Use

  • Sell generated contents

  • Use on generation services

  • Sell this model or merges
