Text to Video: The Tensor Basics
Disclaimer: This article links to images and videos, some of which the auto-mod may have tagged as “NSFW”. The thematic genre of the prompt used is “1980s sword-and-sorcery” in the artistic style of Frank Frazetta, Boris Vallejo, and Julie Bell, which does tend toward a certain insufficiently attired aesthetic. (Think chainmail bikini-clad warrior Red Sonja and scantily clad sorceresses.) So you may want to keep that in mind. If you don’t have “Mature Content” toggled on, any links to images tagged as “NSFW” may not work. Sorry about that. I did put “safe for work” in the prompt after the first few, but… better safe than sorry.
That said, this article is about using text to video, specifically comparing outputs. The best way to do that is to test one simple prompt in multiple models. I figured the best way to test the various models available here was to ask them to generate a person walking toward the camera. Too simple, you say? I agree. That’s why I asked for a sorceress. No point using credits if you’re not testing the model’s limits, right?
Of course, by now, most reading this probably think they have pretty much “mastered” the basics of text to image prompt crafting and figure text to video is just as simple. Is that you? I know it was me. And boy-howdy was I wrong!
If you want to try creating video from a prompt, or have already tried, maybe by copying and pasting an existing prompt like I started out doing, you will quickly discover that those prompts don’t quite work, or that the result doesn’t turn out as you expected, even after generating half a dozen test images. I thought that was good enough. It wasn’t. Happens to us all. Maybe it’s happened to you?
The problem is that video generators are costly to use. Typically they require far more credits than generating an image. Even at the “free” level, most other platforms only allow 2-3 generations per day. And, too often, that’s just not enough to learn much of anything by trial and error.
So, what to do?
Not give up. That’s where this article begins. With the basics. Rule #1: text to image prompts WILL NOT ALWAYS WORK for text to video generation. (More rules to follow in other articles, if there’s a need.) Practically speaking, this rule means that if you have a prompt you’d like to use, a prompt that works really well, you will need to tweak that prompt for use by txt2vid generators. It’s a pain, but not really that hard to do. In fact, once you learn the basics of what does and does not work, it gets easier. But, even after reading this article, it will take some trial and error. So don’t give up.
Let’s get started by looking at some basic test images using the same (or only slightly tweaked) prompt:
The images in the above links use a prompt similar to, but not identical to, the sample prompt we’ll be playing with later in this article. Also, they had multiple LoRAs applied to create a specific artistic aesthetic. LoRAs won’t be available for most txt2vid models here. Something to keep in mind when crafting a prompt. Go ahead. Take a second to re-read the prompts if you want. Okay. Great. Notice anything about them? You’re right. They’re relatively simple, if verbose, lacking the usual string of descriptors. And that’s a good thing.
Why?
If you read my previous articles on creating images, you know I noted how it is possible to generate an image with simple prompts like:
Cinematic, Hyper-realistic, dancing toads, chibi.
Hyper-realistic, Nicholas Cage, Superman punching Lex Luthor, Metropolis.
Comic Book Art, Maisie Williams, Raven casting spells, Teen Titans.
As it happens, the first two prompts might actually work, as they contain directions that are also actions, but that last one probably won’t work very well. A good text to video prompt should be like a mini script, providing camera directions and specifying actions and/or movements, like:
1) Camera pans on a cinematic action scene of dancing toads depicted in the chibi style.
2) Technicolor film scene, light grain, a dramatic cinematic scene of Nicholas Cage as Superman punching Idris Elba as Lex Luthor on a Metropolis skyscraper rooftop.
3) Comic Book action scene of Maisie Williams floating in the sky as Teen Titan Raven casting spells at a horde of charging DC super villains turned zombies.
Those should work. But will they? Without hot-linking offsite (and given the magic of reading an article long after it’s been written and edited multiple times), I can positively say…
1) Yes, I got dancing frogs. The prompt needs more description, as the result was very basic.
2) Sort of. I did get Nicholas Cage, actually two of him. Fighting himself. He knocked himself off the rooftop into the horizon. (Kind of a cheat. I used director mode and inserted camera direction prompts.) So a technical fail. But hilarious.
If you try this prompt, maybe using a different actor will generate better results. You might also try describing Lex Luthor’s appearance to better distinguish him from Superman, or just choose a different villain.
3) Not at all. It did look like a comic book panel, sort of, if traced by a 5-year-old. Hardly any actual animation. Very badly drawn. Horrible looking, actually. Don’t waste your time or credits trying it. Use this prompt instead:
Dynamic cinematic action scene of Maisie Williams as the Teen Titan Raven. Raven is floating above a horde of zombies, clawed hands reaching up toward her. As Raven hovers above the milling zombies, who resemble characters from the DC Comics universe, tendrils of black mystical energy leap from her fingertips, dramatically destroying the zombies that her dark magic touches.
That is the type of generic prompt that works best. By which I mean it provides a solid foundation to build on. You can insert camera directions, easily expand upon it, and it actually works as written without additional descriptors. At least my one test of it did.
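If it helps to see that “mini script” structure laid out explicitly, here is a small, purely illustrative Python sketch. The function and field names are my own invention, not anything Tensor.art or any txt2vid model requires; it just makes the pieces a good video prompt tends to contain visible: camera direction, style, subject, and action.

```python
# Purely illustrative: no platform requires prompts to be built this way.
# It just makes the "mini script" pieces (camera, style, subject, action) explicit.
def build_video_prompt(style, subject, action, camera=""):
    """Assemble a txt2vid prompt: camera direction first, then style, subject, action."""
    parts = [camera, style, subject, action]
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p.strip())

prompt = build_video_prompt(
    camera="The camera slowly tilts upward",
    style="Dynamic cinematic action scene",
    subject="Maisie Williams as the Teen Titan Raven, floating above a horde of zombies",
    action="Tendrils of black mystical energy leap from her fingertips, "
           "destroying the zombies her dark magic touches",
)
print(prompt)
```

The point isn’t the code; it’s that every slot is an action or a direction, not another string of adjectives.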
But what about those prompts from the beginning of the article?
Good question. Thanks for remembering.
Check out the prompt of this image (typos and all): https://tensor.art/images/829469892941049736?post_id=829469892936855433&source_id=njeyo1nrnEW3oPYsaX309xkk
Below is a simplified version of the prompt used to generate the image linked above. It will work on both txt2img and txt2vid platforms, with varying degrees of success depending on the chosen model. Since most of what was pruned was descriptive fluff, in other words description bloat, this also serves as a quick lesson in prompt craft. Remember: try to keep your prompt concise and simple. Unless you’re on a platform that gives you more than a 5-second video output; then have at it.
Here's the shortened prompt:
Hyper-realistic, creative, Frank Frazetta and Boris Vallejo inspired cinematic action scene of the beautiful Milla Jovovich as an insufficiently attired sorceress stalking through a torch-lit dungeon catacomb toward the camera. Her radiant statuesque form moving with elegant precision, right hand holding a gleaming wizard's staff capped with a silver skull. As Milla walks she looks around, waving her staff defensively, as if expecting a lurking monster to leap out of the shadows. From her wary expression Milla is ready to cast a deadly spell at any attacker. Her smile is pure malice, her face exquisitely detailed with symmetrical, well-defined features: high cheekbones, full lips, and piercing, otherworldly eyes with golden iridescent flecks. A shimmering halo of magical energy surrounds her as a magic shield. Her long, lustrous windswept hair whips around her, each strand meticulously rendered, subtly shifting as if each were a writhing serpent.
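If you’d rather spend local GPU time than site credits while iterating on a prompt like this, you can sanity-check it against an open text-to-video model first. Below is a minimal sketch using the Hugging Face diffusers library with the open ModelScope text-to-video checkpoint. To be clear, this is not how Tensor.art runs its models; the checkpoint and parameters are just one workable combination, and it assumes a CUDA GPU with enough VRAM and a recent diffusers release.

```python
# Minimal local txt2vid test run (assumes a CUDA GPU; this is NOT how
# Tensor.art generates video, just a cheap way to sanity-check a prompt).
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",  # open ModelScope text-to-video model
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")

prompt = (
    "Frank Frazetta and Boris Vallejo inspired cinematic action scene of a "
    "sorceress stalking through a torch-lit dungeon catacomb toward the camera, "
    "waving a silver-skull-capped wizard's staff defensively, safe for work"
)

# 16 frames is roughly a two-second clip at the default frame rate.
frames = pipe(prompt, num_inference_steps=25, num_frames=16).frames[0]
export_to_video(frames, "sorceress_test.mp4")
```

Expect much rougher output than the hosted models here; the point is only to see whether the action and camera language in your prompt actually reads as motion before you spend credits on it.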
Okay. Here we are. We have a prompt to play with. Now what?
(Continues in Part 2)
Part 2: https://tensor.art/articles/829676356309759027
Continues in Part 3 (Hunyuan vs SkyReels)