Update on pussyjob alpha - After testing, I'm not able to get the cum actually shooting, only post-cum shots. I realized that some of my prompt captions were messed up, so I am retraining now.
She's grinding on you... try not to blow your load!
Wan T2V, 14B
Prompting:
A woman is straddling a man, engaging in sexual intercourse. The man is laying down on a bed. The woman is grinding on top of him. The scene is shot from the man's point of view.
Prompting for v2 - alpha pussyjob:
A woman is straddling a man on the floor. Both of them are naked. The woman is rubbing her ***** on the man's penis. The man's penis shoots a stream of cum as he ejaculates. The scene is a medium shot from the man's point of view. Realistic.
CFG: around 3.0 seems to work well
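For anyone running this outside ComfyUI, here's a rough sketch of what inference at CFG ~3.0 could look like via the diffusers WanPipeline (assuming diffusers >= 0.33 with Wan support; the repo id, resolution, step count, and LoRA filename below are placeholders, and a LoRA produced by diffusion-pipe may need format conversion before load_lora_weights accepts it):

```python
# Hedged sketch: Wan 2.1 T2V via diffusers, not necessarily the workflow used for the samples here.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"  # placeholder repo id
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # 14B in bf16 will not fit in 24GB VRAM without offloading

# Assumes the LoRA is in a diffusers-loadable format; diffusion-pipe output may need conversion.
pipe.load_lora_weights("pussyjob_alpha.safetensors")

prompt = (
    "A woman is straddling a man, engaging in sexual intercourse. The man is laying down on a bed. "
    "The woman is grinding on top of him. The scene is shot from the man's point of view."
)

video = pipe(
    prompt=prompt,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=3.0,       # the CFG ~3.0 recommended above
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```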
Training:
This LoRA was created using diffusion-pipe on a 4090, mostly with default settings. See the TOML files inline below.
I trained on 57 clips at 256x256, 24fps, 48 frames each. The clips were extracted from 6 longer, publicly available videos (720p-1080p each) using DaVinci Resolve (free). I trained for 60 epochs and tested; the results were OK but not spectacular. I then let it run overnight for a total of around 10 hours of training and landed at 160 epochs, which produced the results you see here. I captioned each clip using guidance from this article, and refined my understanding of captioning through discussions with its author (thanks @ComfyTinker!)
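If you'd rather script the clip extraction than use DaVinci Resolve, a rough ffmpeg-based sketch like the following would produce the same kind of 256x256, 24fps, 48-frame clips (source filenames and cut points below are placeholders, not the actual dataset):

```python
# Hedged sketch: cut fixed-length training clips with ffmpeg instead of DaVinci Resolve.
import os
import subprocess

# (source video, start time in seconds) pairs to cut clips from - placeholders
cuts = [
    ("source_video_01.mp4", 12.0),
    ("source_video_01.mp4", 47.5),
    ("source_video_02.mp4", 5.0),
]

os.makedirs("clips", exist_ok=True)

for i, (src, start) in enumerate(cuts):
    out = f"clips/clip_{i:03d}.mp4"
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start),   # seek to the start of the clip
            "-i", src,
            # scale the short side to 256, center-crop to a 256x256 square, resample to 24 fps
            "-vf", "scale=256:256:force_original_aspect_ratio=increase,crop=256:256,fps=24",
            "-frames:v", "48",   # 48 frames = 2 seconds at 24 fps
            "-an",               # drop audio
            out,
        ],
        check=True,
    )
```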
Captioning:
Each file was captioned manually with something like this example:
A woman is straddling a man, engaging in sexual intercourse. The man is laying down on the bed. The woman is grinding on top of him. Their faces are not visible. The woman and man are naked. She is fit and has large breasts. She has long, flowing, blonde hair. A green wall is visible in the background. The woman is lit from the side by bright, natural light. The scene is shot from the man's point of view. Realistic.
Notice that I did not use a made-up keyword! A major learning for me here was that with Wan/Hunyuan we're not training the text encoder (for Wan that's UMT5, not CLIP), so using made-up words results in a 'conceptual' LoRA (i.e. always applied, regardless of the prompt) rather than a targeted LoRA that responds to prompts. Because current training methods can't add new terms to the text encoder's vocabulary, it either effectively drops the made-up keyword or does something else unknown with it.
Other learnings: I had previously trained with ~50 clips at 128x128 and used the keyword gr1nd1ng. The female motion came out great, but the male was a jumbled mess, likely due to the low resolution.
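Since diffusion-pipe expects each clip to have a caption file with the same name and a .txt extension (see the dataset.toml comments below) and will quietly fall back to an empty caption if one is missing, a quick sanity check like this sketch can catch gaps before an overnight run (the directory path is the one from dataset.toml; the extension list is an assumption):

```python
# Hedged sketch: verify every training clip has a non-empty caption file next to it.
from pathlib import Path

dataset_dir = Path("/mnt/d/Projects/video-training/grinding/8-256px/")
video_exts = {".mp4", ".mov", ".mkv", ".webm"}  # assumed set of clip extensions

missing, empty = [], []
for clip in sorted(p for p in dataset_dir.iterdir() if p.suffix.lower() in video_exts):
    caption = clip.with_suffix(".txt")
    if not caption.exists():
        missing.append(clip.name)
    elif not caption.read_text(encoding="utf-8").strip():
        empty.append(clip.name)

print(f"clips without a caption file: {missing or 'none'}")
print(f"clips with an empty caption:  {empty or 'none'}")
```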
Feedback and questions welcome!
config.toml:
# Dataset config file.
output_dir = '/mnt/d/Projects/video-training/grinding/output'
dataset = '/mnt/d/Projects/video-training/grinding/dataset_256px.toml'

# Training settings
epochs = 200
micro_batch_size_per_gpu = 1
pipeline_stages = 1
gradient_accumulation_steps = 4
gradient_clipping = 1.0
warmup_steps = 100

# eval settings
eval_every_n_epochs = 1
eval_before_first_step = true
eval_micro_batch_size_per_gpu = 1
eval_gradient_accumulation_steps = 1

# misc settings
save_every_n_epochs = 10
checkpoint_every_n_epochs = 10
#checkpoint_every_n_minutes = 30
activation_checkpointing = true
partition_method = 'parameters'
save_dtype = 'bfloat16'
caching_batch_size = 1
steps_per_print = 1
video_clip_mode = 'single_middle'
blocks_to_swap = 15 # 10 was too low and caused too much swapping/slow training (180s/step vs 25s/step)

[model]
type = 'wan'
# 1.3B
#ckpt_path = '/mnt/d/software_tools/diffusion-pipe/models/wan/Wan2.1-T2V-1.3B'
# 14B
ckpt_path = '/mnt/d/software_tools/diffusion-pipe/models/wan/Wan2.1-T2V-14B'
transformer_path = '/mnt/d/software_tools/diffusion-pipe/models/wan/Wan2_1-T2V-14B_fp8_e5m2.safetensors' #kijai
vae_path = '/mnt/d/software_tools/diffusion-pipe/models/wan/Wan_2_1_VAE_bf16.safetensors' #kijai
llm_path = '/mnt/d/software_tools/diffusion-pipe/models/wan/umt5-xxl-enc-bf16.safetensors' #kijai
dtype = 'bfloat16'
timestep_sample_method = 'logit_normal'

[adapter]
type = 'lora'
rank = 32
dtype = 'bfloat16'

[optimizer]
type = 'adamw_optimi'
lr = 5e-5
betas = [0.9, 0.99]
weight_decay = 0.01
eps = 1e-8
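For a rough sense of scale: combining the 57 clips from the Training section with the batch settings above gives only a handful of optimizer steps per epoch. A back-of-the-envelope estimate (assuming one example per clip under video_clip_mode = 'single_middle', a single GPU, and a single resolution; exact counts depend on how diffusion-pipe actually buckets and batches):

```python
# Hedged sketch: rough optimizer-steps-per-epoch estimate from the numbers above.
import math

num_examples = 57               # clips in the dataset, one example each with 'single_middle'
micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 4
num_gpus = 1

steps_per_epoch = math.ceil(
    num_examples / (micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus)
)
print(steps_per_epoch)          # ~15 optimizer steps per epoch
print(steps_per_epoch * 160)    # ~2400 optimizer steps by epoch 160
```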
dataset.toml (stolen from hearmeman's runpod, but mostly default values from tdrussel):
# Resolutions to train on, given as the side length of a square image. You can have multiple sizes here.
# !!!WARNING!!!: this might work differently to how you think it does. Images are first grouped to aspect ratio
# buckets, then each image is resized to ALL of the areas specified by the resolutions list. This is a way to do
# multi-resolution training, i.e. training on multiple total pixel areas at once. Your dataset is effectively duplicated
# as many times as the length of this list.
# If you just want to use predetermined (width, height, frames) size buckets, see the example cosmos_dataset.toml
# file for how you can do that.
resolutions = [256]

# You can give resolutions as (width, height) pairs also. This doesn't do anything different, it's just
# another way of specifying the area(s) (i.e. total number of pixels) you want to train on.
# resolutions = [[1280, 720]]

# Enable aspect ratio bucketing. For the different AR buckets, the final size will be such that
# the areas match the resolutions you configured above.
enable_ar_bucket = true

# The aspect ratio and frame bucket settings may be specified for each [[directory]] entry as well.
# Directory-level settings will override top-level settings.

# Min and max aspect ratios, given as width/height ratio.
min_ar = 0.5
max_ar = 2.0
# Total number of aspect ratio buckets, evenly spaced (in log space) between min_ar and max_ar.
num_ar_buckets = 7

# Can manually specify ar_buckets instead of using the range-style config above.
# Each entry can be width/height ratio, or (width, height) pair. But you can't mix them, because of TOML.
# ar_buckets = [[512, 512], [448, 576]]
# ar_buckets = [1.0, 1.5]

# For video training, you need to configure frame buckets (similar to aspect ratio buckets). There will always
# be a frame bucket of 1 for images. Videos will be assigned to the longest frame bucket possible, such that the video
# is still greater than or equal to the frame bucket length.
# But videos are never assigned to the image frame bucket (1); if the video is very short it would just be dropped.
frame_buckets = [1, 16, 32, 48]
# If you have >24GB VRAM, or multiple GPUs and use pipeline parallelism, or lower the spatial resolution, you could maybe train with longer frame buckets
# frame_buckets = [1, 33, 65, 97]

[[directory]]
# Path to directory of images/videos, and corresponding caption files. The caption files should match the media file name, but with a .txt extension.
# A missing caption file will log a warning, but then just train using an empty caption.
path = '/mnt/d/Projects/video-training/grinding/8-256px/'

# You can do masked training, where the mask indicates which parts of the image to train on. The masking is done in the loss function. The mask directory should have mask
# images with the same names (ignoring the extension) as the training images. E.g. training image 1.jpg could have mask image 1.jpg, 1.png, etc. If a training image doesn't
# have a corresponding mask, a warning is printed but training proceeds with no mask for that image. In the mask, white means train on this, black means mask it out. Values
# in between black and white become a weight between 0 and 1, i.e. you can use a suitable value of grey for mask weight of 0.5. In actuality, only the R channel is extracted
# and converted to the mask weight.
# The mask_path can point to any directory containing mask images.
#mask_path = '/home/anon/data/images/grayscale/masks'

# How many repeats for 1 epoch. The dataset will act like it is duplicated this many times.
# The semantics of this are the same as sd-scripts: num_repeats=1 means one epoch is a single pass over all examples (no duplication).
num_repeats = 1

# Example of overriding some settings, and using ar_buckets to directly specify ARs.
# ar_buckets = [[448, 576]]
# resolutions = [[448, 576]]
# frame_buckets = [1]

# You can list multiple directories.
# If you have a video dataset as well remove the hashtag from the following 3 lines and set your repeats
# [[directory]]
# path = '/video_dataset_here'
# num_repeats = 5
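One detail from the frame-bucket comments above that is easy to miss: a clip goes into the longest video bucket it can fully cover, and anything shorter than the smallest video bucket is dropped rather than landing in the image bucket. A small illustration of that rule (a paraphrase of the documented behavior, not diffusion-pipe's actual code):

```python
# Hedged sketch: frame-bucket assignment as described in the dataset.toml comments above.
def assign_frame_bucket(num_frames: int, frame_buckets: list[int]) -> int | None:
    # Bucket 1 is reserved for still images, so only consider the video buckets.
    video_buckets = sorted(b for b in frame_buckets if b > 1)
    chosen = None
    for bucket in video_buckets:
        if num_frames >= bucket:   # the clip must be at least as long as the bucket
            chosen = bucket
    return chosen                  # None => clip is too short and gets dropped

buckets = [1, 16, 32, 48]
print(assign_frame_bucket(48, buckets))  # 48 (the clips in this dataset)
print(assign_frame_bucket(40, buckets))  # 32
print(assign_frame_bucket(10, buckets))  # None (too short, dropped)
```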