Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling

TL;DR: Long video generation with multi-event dynamics in a tuning-free manner


1 KAIST       2 Adobe Research



arXiv    code




Abstract

While recent advancements in text-to-video diffusion models enable high-quality short video generation from a single prompt, generating real-world long videos in a single pass remains challenging due to limited data and high computational costs. To address this, several works propose tuning-free approaches, i.e., extending existing models for long video generation, specifically using multiple prompts to allow for dynamic and controlled content changes. However, these methods primarily focus on ensuring smooth transitions between adjacent frames, often leading to content drift and a gradual loss of semantic coherence over longer sequences. To address this issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse and optimization-based sampling, which ensure seamless local transitions and enforce global coherence, respectively. However, directly alternating between these two sampling strategies misaligns their denoising trajectories, disrupting prompt guidance and introducing unintended content changes, as they operate independently. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence, outperforming previous approaches both quantitatively and qualitatively.



Key observations

Temporal co-denoising is a promising approach for seamlessly connecting short video clips into a longer sequence. However, existing methods suffer from either divergence or excessive convergence. In contrast, SynCoS strikes a balance between them.
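Conceptually, temporal co-denoising splits the long video into overlapping chunks, denoises each chunk with the short-video model, and fuses the overlapping predictions into one long sequence. Below is a minimal sketch of that fusion step, assuming [C, F, H, W] latents and a fixed chunk length; the function name and argument layout are our assumptions, not the paper's interface.

```python
import torch

def fuse_overlapping_chunks(chunk_preds, chunk_starts, chunk_len, total_len):
    """Fuse per-chunk predictions by averaging overlapping frames
    (the basic temporal co-denoising operation).

    chunk_preds:  list of tensors, each of shape [C, chunk_len, H, W]
    chunk_starts: starting frame index of each chunk in the long video
    """
    C, _, H, W = chunk_preds[0].shape
    fused = torch.zeros(C, total_len, H, W)
    counts = torch.zeros(total_len)
    for pred, start in zip(chunk_preds, chunk_starts):
        fused[:, start:start + chunk_len] += pred
        counts[start:start + chunk_len] += 1
    # Assuming every frame is covered by at least one chunk, average the contributions.
    return fused / counts.view(1, -1, 1, 1)
```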


t-SNE visualization of CLIP features

t-SNE visualization of CLIP features for the predicted video frames, x0|t, at each timestep under different sampling strategies. Faded colors indicate earlier timesteps (t ≈ 1000), while vivid colors indicate later timesteps (t ≈ 0), illustrating how the feature trajectories evolve over time (top to bottom). A sketch of this feature-extraction pipeline follows the list below.

1. Temporal co-denoising with DDIM (green dots): smooth transitions but divergent denoising paths.
2. Temporal co-denoising with CSD (red dots): global coherence but overly collapsed into noise-like artifacts.
3. Temporal co-denoising with SynCoS (blue dots): balanced, capturing both global coherence and local smoothness.
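The page does not specify the exact CLIP variant or t-SNE settings, so the following is only a plausible sketch of the visualization pipeline, using openai/clip-vit-base-patch32 via Hugging Face transformers and scikit-learn's TSNE; the function name and the perplexity value are our choices.

```python
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

def embed_and_project(frames, perplexity=5):
    """frames: list of PIL images (decoded x0|t predictions across timesteps).
    Returns 2-D t-SNE coordinates of their CLIP image features.
    Note: perplexity must be smaller than the number of frames."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs).detach().numpy()
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(feats)
```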


  • Temporal co-denoising with DDIM

  • Temporal co-denoising with CSD

  • Temporal co-denoising with SynCoS

Global prompt: A handsome young man sits at a wooden table, enjoying a moment of relaxation with a cup of coffee.
Local prompt 1: He takes a slow sip enjoying his drink.
Local prompt 2: A beautiful young lady sits beside him.



Overview

Our tuning-free inference framework, Synchronized Coupled Sampling (SynCoS)



1. Perform temporal co-denoising with DDIM and apply fusion for local smoothness. At this stage, instead of fully reverting each chunk to its previous timestep, SynCoS proceeds only until it obtains the data-space samples through the DDIM update, then fuses them to ensure smooth transitions between overlapping chunks.
2. Refine the locally fused output to enforce global coherence. By intervening in the DDIM update and recasting this stage as a refinement task, SynCoS effectively achieves global synchronization across all video chunks.
3. Resume temporal co-denoising with DDIM using the locally and globally refined output. SynCoS reverts the refined, fused output to the previous timestep (a code sketch of this three-stage loop follows).
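The sketch below outlines one SynCoS timestep under these three stages, assuming a DDIM schedule and a noise-prediction model; every name, shape, and the refinement step size are our assumptions rather than the released implementation. The key design point is visible in stage 2: refinement happens at the same grounded timestep t and re-noises with one fixed baseline noise, which keeps the reverse and optimization-based samplers on a single, aligned denoising trajectory.

```python
import torch

@torch.no_grad()
def syncos_step(x_t, t, t_prev, eps_model, chunks, alphas_cumprod, base_noise, lr=0.1):
    """One SynCoS timestep over the full long-video latent x_t of shape [C, F, H, W].

    eps_model(latent, t) -> predicted noise for a chunk; chunks is a list of
    (start, end) frame ranges with overlaps; base_noise is the fixed baseline
    noise (same shape as x_t) shared by every stage and timestep.
    """
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    counts = torch.zeros(x_t.shape[1], device=x_t.device)

    # Stage 1: per-chunk DDIM prediction of x0, fused over overlaps (local smoothness).
    x0 = torch.zeros_like(x_t)
    for s, e in chunks:
        eps = eps_model(x_t[:, s:e], t)
        x0[:, s:e] += (x_t[:, s:e] - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        counts[s:e] += 1
    x0 = x0 / counts.view(1, -1, 1, 1)

    # Stage 2: optimization-based refinement of the fused x0 at the same grounded
    # timestep t, re-noised with the fixed baseline noise (global coherence).
    grad = torch.zeros_like(x0)
    for s, e in chunks:
        x_re = a_t.sqrt() * x0[:, s:e] + (1 - a_t).sqrt() * base_noise[:, s:e]
        grad[:, s:e] += eps_model(x_re, t) - base_noise[:, s:e]
    x0 = x0 - lr * grad / counts.view(1, -1, 1, 1)

    # Stage 3: resume DDIM, reverting the refined x0 to the previous timestep.
    eps_hat = (x_t - a_t.sqrt() * x0) / (1 - a_t).sqrt()
    return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps_hat
```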



Comparisons with baselines

SynCoS outperforms baselines in long-range temporal coherence while avoiding noise-like artifacts and severe content overlap between frames.


On CogVideoX-2B


Global prompt: An astronaut in a white spacesuit with reflective visor is seen in the surreal landscape of Mars.
Local prompt 1: The astronaut walks carefully through a small puddle of liquid on the Martian surface, the red terrain stretching out behind, as ripples form with each step.
Local prompt 2: The astronaut stands still, watching fireworks explode in the alien sky, their vibrant colors contrasting against the red Martian horizon and rocky landscape.


On Open-Sora Plan (v1.3)


Global prompt: A camel in a snowy environment, its presence standing out against the cold and wintry landscape.
Local prompt 1: The camel is running across the snow field.
Local prompt 2: The camel stands still on the snow field.