This was trained on WAN 14B I2V with diffusion-pipe, using 45 videos normalized to 480p / 24 FPS and trimmed to 3 seconds. It also works okay with the T2V model, and I've included some examples from that. My captioning approach:
Started with the 45 videos, each resized to 480p, trimmed to 3 seconds, and set to 24 FPS.
Ran each one through ComfyUI_Qwen2-VL-Instruct to generate a base video description. Unfortunately, this doesn't pick up on any NSFW bits, and it usually took a few tries on the same clip, since the LLM almost seemed "disgusted" by the suggestion. :D
Grabbed my "favorite" frame from each video and ran that through Joy Caption Two, then manually combined the Qwen description and the Joy Caption Two caption to make the final .txt file for that video.
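
For anyone who wants to reproduce the prep steps above, here is a rough Python sketch. It is not the exact tooling used here. It assumes ffmpeg does the resize/trim/resample, and that each caption lives in a .txt file next to its video with the same base name. The helper names `build_ffmpeg_cmd` and `write_caption` are made up for illustration, and the simple join in `write_caption` is only a starting point for the manual merge described above.

```python
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path, height: int = 480,
                     fps: int = 24, seconds: float = 3.0) -> list[str]:
    """Build an ffmpeg argument list that resizes, resamples, and trims one clip.

    Hypothetical helper: ffmpeg is an assumption, not the confirmed tool.
    """
    return [
        "ffmpeg", "-y",               # overwrite the output if it already exists
        "-i", str(src),
        "-t", str(seconds),           # keep only the first `seconds` of the clip
        "-vf", f"scale=-2:{height}",  # target height, width rounded to an even value
        "-r", str(fps),               # output frame rate
        "-an",                        # drop audio (assumption: not needed for training)
        str(dst),
    ]


def write_caption(video: Path, video_description: str, frame_caption: str) -> Path:
    """Write '<video stem>.txt' next to the clip, joining the two caption sources.

    Hypothetical helper: empty inputs are skipped, the rest are joined with a space.
    """
    parts = [p.strip() for p in (video_description, frame_caption) if p.strip()]
    caption_path = video.with_suffix(".txt")
    caption_path.write_text(" ".join(parts) + "\n", encoding="utf-8")
    return caption_path
```

To actually convert a clip, pass the list to `subprocess.run(build_ffmpeg_cmd(src, dst), check=True)`. The joined caption will still need a manual pass, since the two captioners tend to describe the same thing twice.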