ECCV 2026  ·  Manifold Reward for Text-to-Video

Your Data Manifold is Secretly a Reward Model: Shell-LCC for Text-to-Video Generation

Shihao Zhang¹, Yunzhi Li¹, Yuguang Yan², Junzhe Zhang¹, Wei Zhao¹, Bohan Wang¹, Hanwang Zhang¹
¹Huawei Central Research Institute    ²Guangdong University of Technology

A cost-free reward, hidden in your data

We argue that your data manifold is secretly a reward model. By modeling the manifold of high-quality SFT video patches and pulling generated latents onto it, Shell-LCC yields dense, differentiable, nearly free reward signals — no human labels, no external reward model.

Qualitative comparison
From top to bottom: SFT baseline, +LCC, +Shell-LCC (ours), and +DPO. Shell-LCC restores high-frequency details (red), recovers realistic micro-textures instead of the over-smoothed “plastic” look (yellow), and synthesizes complex scenes with intricate local detail (blue) — while better preserving dense structure than DPO and avoiding the mean regression of LCC.

Abstract

Recent text-to-video (T2V) diffusion models rely heavily on auxiliary reward signals (e.g., via reward models or DPO) to align generated content with human aesthetics and improve realism. These signals, however, incur substantial computational overhead, require costly human annotations, and often yield limited improvement in fine-grained local details. In this paper, we argue that your data manifold is secretly a reward model. By explicitly modeling the manifold structure of high-quality Supervised Fine-Tuning (SFT) data and encouraging video latents to lie on this manifold, we derive dense, differentiable, and nearly cost-free reward signals that significantly improve video quality, particularly in mitigating low-level distortions. Our modeling builds upon Local Coordinate Coding (LCC), which captures the ‘skeleton’ of the manifold. However, directly applying LCC suffers from mean regression, pulling latents toward the geometric mean and losing high-frequency details. We therefore extend it to Shell Local Coordinate Coding (Shell-LCC), which models the manifold ‘surface’ as an isotropic shell to align with the true high-density region. Experiments demonstrate that our approach improves realism, enhances high-frequency details, reduces over-smoothing artifacts, and alleviates motion blur.

Why a shell, not a point?

In high dimensions, probability mass concentrates on a thin shell (Gaussian Annulus Theorem), not at the mean. Standard LCC reconstructs toward the local mean — an empty, low-density core that looks blurry. Shell-LCC instead pulls generated latents onto the high-density shell, preserving sharpness.
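The concentration claim can be checked numerically. The sketch below is only an illustration on standard Gaussian samples, not on video latents: it draws points from \(\mathcal{N}(0, I_d)\) and reports the mean and spread of \(\|x\|/\sqrt{d}\). The function name `norm_stats` and the sample sizes are assumptions made for this example.

```python
import math
import random


def norm_stats(dim, n_samples=500, seed=0):
    """Mean and std of ||x|| / sqrt(dim) for x drawn from N(0, I_dim)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_samples):
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
        ratios.append(math.sqrt(squared_norm / dim))
    mean = sum(ratios) / n_samples
    var = sum((r - mean) ** 2 for r in ratios) / n_samples
    return mean, math.sqrt(var)


# As the dimension grows, the normalized radius tightens around 1:
# samples sit on a thin shell of radius about sqrt(dim), far from the mean.
for dim in (2, 32, 512):
    mean, std = norm_stats(dim)
    print(f"dim={dim:4d}  mean={mean:.3f}  std={std:.3f}")
```

In low dimensions the radii are widely spread, so points near the mean are common; in high dimensions the spread of the normalized radius shrinks roughly like \(1/\sqrt{2d}\), so almost no sample lies near the mean itself.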

Radial reconstruction reveals shell geometry
Radial reconstruction. Starting from the LCC reconstruction \(\hat z\) (local mean) and moving outward, decoded videos transition from mean-like blur, to sharp realistic structure on the shell, to distortion when pushed too far — directly revealing the shell-shaped latent manifold.
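A radial sweep of this kind can be sketched as follows. The source does not specify the direction or step sizes, so this example assumes the sweep follows the residual \(z - \hat z\), scaled by a factor \(s\); `radial_sweep` and the toy 2-D vectors are hypothetical, and decoding each point to video is left out.

```python
import math


def radial_sweep(z, z_hat, scales):
    """Return (scale, point, radius) for points z_hat + s * (z - z_hat).

    s = 0 gives the local-mean reconstruction z_hat, s = 1 gives the
    original latent z, and s > 1 pushes past it.
    """
    residual = [zi - hi for zi, hi in zip(z, z_hat)]
    results = []
    for s in scales:
        point = [hi + s * ri for hi, ri in zip(z_hat, residual)]
        radius = math.dist(point, z_hat)  # distance from the local mean
        results.append((s, point, radius))
    return results


# Toy example: each point would be decoded and inspected in practice.
for s, point, radius in radial_sweep([3.0, 4.0], [0.0, 0.0], [0.0, 0.5, 1.0, 2.0]):
    print(f"scale={s:.1f}  radius={radius:.2f}  point={point}")
```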
Manifold distance distribution
Manifold distance \(R_{dist}\). Real (GT) latents concentrate at a stable non-zero radius (0.880±0.076), while generated latents drift outward (0.918±0.083) — making \(R_{dist}\) a discriminative, differentiable reward that penalizes generative distortion.
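To make the idea concrete, here is a minimal sketch of a distance-to-reconstruction score and a shell-style penalty. It is not the paper's implementation: the local reconstruction uses inverse-distance weights over the k nearest anchors as a stand-in for the regularized least-squares coding of LCC, the distance is left unnormalized, and the squared penalty around a target radius is an assumed form. All function names are hypothetical.

```python
import math


def local_reconstruction(z, anchors, k=3, eps=1e-8):
    """Approximate z by a convex combination of its k nearest anchors.

    Simplification: weights are inverse-distance and sum to one; LCC
    itself solves a locality-regularized least-squares problem.
    """
    nearest = sorted(anchors, key=lambda a: math.dist(z, a))[:k]
    raw = [1.0 / (math.dist(z, a) + eps) for a in nearest]
    total = sum(raw)
    weights = [w / total for w in raw]
    return [sum(w * a[i] for w, a in zip(weights, nearest)) for i in range(len(z))]


def manifold_distance(z, anchors, k=3):
    """Distance between a latent and its local reconstruction."""
    return math.dist(z, local_reconstruction(z, anchors, k))


def shell_penalty(r_dist, target_radius):
    """Assumed penalty: squared deviation from the real-data radius."""
    return (r_dist - target_radius) ** 2


# Using the mean radii reported above: GT 0.880, generated 0.918.
print(shell_penalty(0.918, 0.880))  # small positive penalty for the drift
print(shell_penalty(0.880, 0.880))  # zero penalty on the shell
```

For training, the same computation would need to be written in an autodiff framework so that the penalty can be backpropagated to the generator; the pure-Python version only illustrates the quantities involved.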

Video comparisons — Wan-T2V-1.3B

Left: baseline. Right: + Shell-LCC. Same prompt, same seed. The fine-tuned Wan2.1-T2V-1.3B checkpoint is released on Hugging Face.

“Campfire at night in a snowy forest, with a starry sky in the background.”

Baseline
+ Shell-LCC

“Slow-motion close-up: a galloping horse kicks up dust across an open plain, under dramatic rim lighting. Cinematic, highly detailed, photorealistic.”

Baseline
+ Shell-LCC

“Two students meet in the campus plaza, exchanging a handshake then a warm embrace; golden autumn light, sycamore leaves drifting, cinematic warm tones.”

Baseline
+ Shell-LCC

“A close-up of a cat grooming itself with its tongue, detailed fur and whiskers, warm light.”

Baseline
+ Shell-LCC

Scales to larger T2V models

The same Shell-LCC manifold transfers across model scales without retraining: on a 4.5B T2V model it sharpens fine detail while preserving composition, and on Wan2.1-T2V-14B it lifts overall visual quality and prompt fidelity.

4.5B — “Two students meet in the campus plaza, exchanging a handshake then a warm embrace; golden autumn light, cinematic warm tones.”

Baseline
+ Shell-LCC

4.5B — “Through an open kitchen window a young mother stirs soup; outside, a boy swings happily in the backyard as swallows cross the orange sunset — a split-screen of warm indoor and outdoor scenes.”

Baseline
+ Shell-LCC

Wan2.1-T2V-14B — “On a dim center stage, a magician in a black tailcoat waves his wand; the wand traces an arc of light and a white dove flies out of a silk top hat, feathers glittering in the spotlight.”

Baseline
+ Shell-LCC

Generalizes across open-source models

Shell-LCC sharpens fine structures and suppresses over-smoothing on both Wan-T2V-1.3B and UltraWan-T2V-1.3B.

Open-source qualitative comparison
From top to bottom: Wan-T2V-1.3B, Wan-T2V-1.3B + Shell-LCC, UltraWan-T2V-1.3B, and UltraWan-T2V-1.3B + Shell-LCC, for “A 3D model of an 1800s Victorian house”. Shell-LCC sharpens window lattices and facade details while suppressing baseline over-smoothing.
| Model | Aesthetic Q. | Imaging Q. | Overall Consist. | Motion Smooth. | Subject Consist. |
|---|---|---|---|---|---|
| Wan-T2V-1.3B | 0.6244 | 0.6629 | 0.2271 | 0.9848 | 0.9654 |
|   + Shell-LCC | 0.6253 | 0.7037 (+4.1) | 0.2292 | 0.9871 | 0.9715 |
| UltraWan-T2V-1.3B | 0.5744 | 0.6756 | 0.2288 | 0.9657 | 0.9225 |
|   + Shell-LCC | 0.6299 (+5.6) | 0.7396 (+6.4) | 0.2236 | 0.9918 | 0.9658 |

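VBench dimensions; higher is better. Shell-LCC sharply improves Imaging Quality (low-level fidelity) while leaving aesthetics, semantic alignment, and temporal consistency essentially intact: the only decrease is a marginal dip in Overall Consistency on UltraWan-T2V-1.3B (0.2288 to 0.2236).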
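The gains annotated next to some scores (+4.1, +5.6, +6.4) appear to be absolute differences expressed in points, that is, the score difference multiplied by 100. That convention is an inference from the numbers rather than something the page states; the short check below recomputes the differences under that assumption, with `gain_in_points` as a hypothetical helper.

```python
# VBench scores copied from the table: (baseline, + Shell-LCC).
SCORES = {
    ("Wan-T2V-1.3B", "Imaging Q."): (0.6629, 0.7037),
    ("UltraWan-T2V-1.3B", "Aesthetic Q."): (0.5744, 0.6299),
    ("UltraWan-T2V-1.3B", "Imaging Q."): (0.6756, 0.7396),
    ("UltraWan-T2V-1.3B", "Overall Consist."): (0.2288, 0.2236),
}


def gain_in_points(baseline, improved):
    """Absolute gain on a 0-100 scale (assumed convention)."""
    return 100.0 * (improved - baseline)


for (model, dimension), (base, ours) in SCORES.items():
    print(f"{model:18s} {dimension:16s} {gain_in_points(base, ours):+.2f}")
```

The recomputed gains are about +4.08, +5.55, and +6.40 points, which match the annotations after rounding, and the Overall Consistency change on UltraWan is about -0.52 points.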

Controlled deblurring

Shell-LCC progressively removes motion blur, but over-optimization re-induces mean regression and collapse — so the reward is applied with early stopping.

Training dynamics
Motion deblurring across iterations. Shell-LCC reduces baseline motion blur, but prolonged training (e.g., iter. 4999) regresses toward the mean \(\hat z\), highlighting the trade-off between removing out-of-manifold blur and preserving genuine high-frequency detail.
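The page does not say which stopping criterion is used. One common option, shown here purely as an assumption, is patience-based early stopping on a held-out quality score evaluated at periodic checkpoints; `early_stop_index`, the `patience` value, and the example scores are all hypothetical.

```python
def early_stop_index(scores, patience=2):
    """Return the index of the checkpoint to keep.

    Scans checkpoint scores in order (higher is better) and stops once
    `patience` consecutive evaluations fail to beat the best so far.
    """
    best_idx, best_score = 0, scores[0]
    for i in range(1, len(scores)):
        if scores[i] > best_score:
            best_idx, best_score = i, scores[i]
        elif i - best_idx >= patience:
            break  # quality has stopped improving; keep the best checkpoint
    return best_idx


# Made-up validation scores: quality rises, then degrades with over-optimization.
checkpoint_scores = [0.66, 0.70, 0.72, 0.71, 0.69, 0.65]
print(early_stop_index(checkpoint_scores))
```

The design choice here is to keep the best checkpoint rather than the last one, which matches the observation that prolonged optimization drifts back toward the mean.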

BibTeX

@inproceedings{zhang2026shelllcc,
  title     = {Your Data Manifold is Secretly a Reward Model:
               Shell-LCC for Text-to-Video Generation},
  author    = {Zhang, Shihao and Li, Yunzhi and Yan, Yuguang and
               Zhang, Junzhe and Zhao, Wei and Wang, Bohan and Zhang, Hanwang},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}