Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
🎉 Key Features
Direct-Align: We introduce a new sampling strategy for diffusion fine-tuning that can effectively restore highly noisy images, making optimization more stable and less computationally demanding, especially during the initial timesteps.
Faster Training: By rolling out only a single image and optimizing directly with analytical gradients (a key distinction from GRPO), our method achieves significant performance improvements in under 10 minutes of training. To further accelerate the process, our method supports replacing online rollouts entirely with a small dataset of real images; we find that fewer than 1500 images are sufficient for effective training.
Free of Reward Hacking: We have improved the training strategy for methods that backpropagate directly on the reward signal (such as ReFL and DRaFT). Moreover, we directly regularize the model using negative rewards, without the need for KL divergence or a separate reward system. In our experiments, this approach achieves comparable performance across multiple different rewards, improving perceptual quality without suffering from reward hacking issues, such as overfitting to color or oversaturation preferences.
Potential for Controllable Fine-tuning: For the first time in online RL, we incorporate dynamically controllable text conditions, enabling on-the-fly adjustment of reward preference towards styles within the scope of the reward model.
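As a rough illustration of the negative-reward and text-conditioning ideas in the list above, the sketch below scores one image against a positive and a negative prompt and takes the difference. It is a toy under stated assumptions, not the released training code: the dot-product reward, the three-dimensional embeddings, and the names `dot` and `relative_reward` are all hypothetical.

```python
# Toy sketch of a text-conditioned relative reward (hypothetical, not the released code).
# Assumption: the reward is a dot product between an image embedding and a text embedding.

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def relative_reward(image_emb, pos_text_emb, neg_text_emb):
    """Reward under the positive prompt minus reward under the negative prompt."""
    return dot(image_emb, pos_text_emb) - dot(image_emb, neg_text_emb)

# Hypothetical embeddings: the first two dimensions are shared by both prompts,
# the last dimension carries the positive or negative style word.
image_emb = [0.5, 0.25, 1.0]
pos_text_emb = [1.0, 1.0, 1.0]    # shared content + positive style
neg_text_emb = [1.0, 1.0, -1.0]   # shared content + negative style

r_pos = dot(image_emb, pos_text_emb)
r_neg = dot(image_emb, neg_text_emb)
relative = relative_reward(image_emb, pos_text_emb, neg_text_emb)
```

In this toy setup the component shared by both prompts cancels in the difference, so only the style dimension contributes to `relative`. That cancellation is one plausible reading of why a negative reward can act as a regularizer; the real reward models and prompts may behave differently.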
🔥 News
[2025.9.12]: 🎉 We released the complete training code. We also share tips and experiences to help you train your models. You’re welcome to discuss and ask questions in the issues! 💬✨
[2025.9.12]: 🎉 We provide a standard workflow—feel free to use it in ComfyUI.
[2025.9.8]: 🎉 We released the paper, checkpoint, and inference code.
Abstract
Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable rewards. However, these methods face two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive and restricts optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models to achieve the desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any time step via interpolation, leveraging the fact that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
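The interpolation idea behind Direct-Align can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a linear interpolation of the form x_t = (1 - sigma) * x_0 + sigma * noise, uses flat lists in place of image tensors, and leaves out the model entirely. The names `add_noise` and `recover_image` are hypothetical.

```python
import random

# Minimal sketch (assumed linear schedule, no model involved):
#   x_t = (1 - sigma) * x_0 + sigma * noise
# If the injected noise is fixed in advance, x_0 follows in closed form.

def add_noise(image, noise, sigma):
    """Interpolate between a clean image and a known noise sample."""
    return [(1.0 - sigma) * x + sigma * e for x, e in zip(image, noise)]

def recover_image(noisy, noise, sigma):
    """Invert the interpolation using the same known noise sample."""
    if sigma >= 1.0:
        raise ValueError("sigma must be < 1: at sigma == 1 no image signal remains")
    return [(xt - sigma * e) / (1.0 - sigma) for xt, e in zip(noisy, noise)]

rng = random.Random(0)
image = [rng.random() for _ in range(8)]          # stand-in for pixel or latent values
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]   # the predefined noise sample

noisy = add_noise(image, noise, sigma=0.95)       # a highly noisy state
restored = recover_image(noisy, noise, sigma=0.95)
max_error = max(abs(a - b) for a, b in zip(image, restored))
```

Because the noise is known, a single algebraic step recovers the image even from a state that is 95% noise, with no multistep denoising. That is the property the abstract relies on for optimizing early timesteps cheaply.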
Acknowledgement
We sincerely appreciate contributions from the research community to this project. Below are quantized versions developed by fellow researchers.
8bit(fp8_e4m3fn/Q8_0) version by wikeeyang: https://huggingface.co/wikeeyang/SRPO-Refine-Quantized-v1.0
bf16 version by rockerBOO: https://huggingface.co/rockerBOO/flux.1-dev-SRPO
GGUF version by befox: https://huggingface.co/befox/SRPO-GGUF



