Shell-LCC: Your Data Manifold is Secretly a Reward Model

A cost-free reward, hidden in your data

We argue that your data manifold is secretly a reward model. By modeling the manifold of high-quality SFT video patches and pulling generated latents onto it, Shell-LCC yields dense, differentiable, nearly free reward signals — no human labels, no external reward model.

No human labels or external reward models: the reward comes from the SFT data itself.

2.5M manifold parameters: 0.2% of Wan-T2V-1.3B, near-zero training overhead.

+4.1 VBench Imaging Quality on Wan-T2V-1.3B (66.3 → 70.4), with aesthetics and semantics intact.

100 × 5-s clips are enough video to train the manifold: 90.1% distinguishability, vs. 91.4% with the full 2,500-clip set.

Qualitative comparison — From top to bottom: **SFT baseline**, **+LCC**, **+Shell-LCC (ours)**, and **+DPO**. Shell-LCC restores high-frequency details (red), recovers realistic micro-textures instead of the over-smoothed “plastic” look (yellow), and synthesizes complex scenes with intricate local detail (blue) — while better preserving dense structure than DPO and avoiding the mean regression of LCC.

Abstract

Recent text-to-video (T2V) diffusion models rely heavily on auxiliary reward signals (e.g., via reward models or DPO) to align generated content with human aesthetics and improve realism. These signals, however, incur substantial computational overhead, require costly human annotations, and often yield limited improvement in fine-grained local details. In this paper, we argue that your data manifold is secretly a reward model. By explicitly modeling the manifold structure of high-quality Supervised Fine-Tuning (SFT) data and encouraging video latents to lie on this manifold, we derive dense, differentiable, and nearly cost-free reward signals that significantly improve video quality, particularly in mitigating low-level distortions. Our modeling builds upon Local Coordinate Coding (LCC), which captures the ‘skeleton’ of the manifold. However, directly applying LCC suffers from mean regression, pulling latents toward the local mean and losing high-frequency details. We therefore extend it to Shell Local Coordinate Coding (Shell-LCC), which models the manifold ‘surface’ as an isotropic shell to align with the true high-density region. Experiments demonstrate that our approach improves realism, enhances high-frequency details, reduces over-smoothing artifacts, and alleviates motion blur.

🎯 Manifold as reward

The intrinsic manifold of SFT data serves as a cost-free reward model: built at the spatio-temporal patch level, it yields dense, differentiable rewards without the annotation and compute costs of RLHF.

🛡️ Shell-LCC

Standard LCC suffers from a provable mean-regression bias. Shell-LCC instead models the data surface as an isotropic shell (point-to-surface alignment), preserving high-frequency structural detail.

📈 Works across models

Improves realism and fine-grained imaging quality on a proprietary 4.5B model, Wan‑T2V‑1.3B and UltraWan — without sacrificing semantics or temporal consistency, and orthogonal to DPO.

Method: from data manifold to dense reward

1. Extract dense patches. Encode SFT videos with the frozen 3D VAE and flatten the latent volume into spatio-temporal patches; a single 5-second clip already yields 230,400 patches.
2. Learn the skeleton (LCC). Approximate the patch manifold with M=4096 learnable anchors and an amortized coordinate predictor, so that each patch is a sparse, local linear combination of nearby anchors.
3. Learn the surface (Shell-LCC). Calibrate each latent dimension with a learnable scale σ (a diagonal Mahalanobis metric) and constrain the normalized residual to a unit shell, with a log-σ regularizer; together these form the negative log-likelihood of a shell-structured Gaussian.
4. Freeze & reward. Freeze the manifold and use the manifold distance \(R_{dist}\) of generated patches as a dense, differentiable reward: fine-tuning the T2V model to keep latents on the shell removes blur, noise and motion artifacts while keeping legitimate high-frequency detail.
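The four steps above can be sketched in NumPy under some loudly stated assumptions. The sketch treats the channel vector at each spatio-temporal location as one patch, swaps the amortized coordinate predictor for a softmax over the `k` nearest anchors, normalizes the residual norm by \(\sqrt{d}\) so the shell sits at radius 1, and uses a squared deviation from that radius plus a log-σ term as a stand-in for the shell-structured Gaussian likelihood. The function names and the parameters `k`, `tau` and `lam` are illustrative and not taken from the released code.

```python
import numpy as np

def flatten_patches(latent):
    """(C, T, H, W) latent volume -> (T*H*W, C) matrix, one row per location.
    Assumption: a 'patch' is the channel vector at one spatio-temporal site."""
    c = latent.shape[0]
    return latent.reshape(c, -1).T

def local_coords(z, anchors, k=8, tau=1.0):
    """Convex weights over the k nearest anchors (softmax of -squared distance).
    Assumption: stands in for the learned, amortized coordinate predictor."""
    d2 = ((z[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)    # (N, M)
    idx = np.argsort(d2, axis=1)[:, :k]                           # (N, k)
    near = np.take_along_axis(d2, idx, axis=1)
    w = np.exp(-(near - near.min(axis=1, keepdims=True)) / tau)   # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return idx, w

def shell_distance(z, anchors, sigma, k=8, tau=1.0):
    """Per-patch manifold distance: scaled residual norm, divided by sqrt(d)."""
    idx, w = local_coords(z, anchors, k, tau)
    z_hat = (w[:, :, None] * anchors[idx]).sum(axis=1)            # LCC reconstruction
    residual = (z - z_hat) / sigma                                # diagonal Mahalanobis
    return np.linalg.norm(residual, axis=1) / np.sqrt(z.shape[1])

def shell_loss(z, anchors, sigma, lam=1e-3):
    """Assumed training loss: pull the radius to 1, regularize log-sigma."""
    r_dist = shell_distance(z, anchors, sigma)
    return ((r_dist - 1.0) ** 2).mean() + lam * np.log(sigma).sum()
```

In this sketch the frozen `shell_distance` would play the role of \(R_{dist}\): a generator fine-tuned against it is penalized per patch, which is what makes the reward dense. A differentiable framework would be needed for actual training; NumPy is used here only to show the shapes and the arithmetic.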

Why a shell, not a point?

Two provable failure modes of naive manifold rewards motivate the shell.

Mean-regression theorem

For convex local reconstruction, the LCC objective decomposes into a reconstruction pull plus a local-variance pull toward the anchor centroid — so the optimum systematically shrinks toward the local mean. Enforcing plain LCC reconstruction therefore over-smooths: blurred textures, lost detail.
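One way to see this decomposition, sketched under assumptions: write the convex coordinates as \(\gamma_j\), the anchors as \(c_j\), and take a squared-distance locality term with weight \(\mu\). The symbols \(\gamma_j\), \(c_j\), \(\mu\) and the squared form are illustrative choices, not necessarily the paper's exact objective. For \(\gamma_j \ge 0\), \(\sum_j \gamma_j = 1\) and \(\hat z = \sum_j \gamma_j c_j\), expanding around \(\hat z\) gives

\[
\sum_j \gamma_j \lVert z - c_j \rVert^2 = \lVert z - \hat z \rVert^2 + \sum_j \gamma_j \lVert c_j - \hat z \rVert^2 ,
\]

because the cross term \(2\,(z - \hat z)^\top \sum_j \gamma_j (\hat z - c_j)\) vanishes. An LCC-style objective with this locality term then becomes

\[
\lVert z - \hat z \rVert^2 + \mu \sum_j \gamma_j \lVert z - c_j \rVert^2 = (1 + \mu)\,\lVert z - \hat z \rVert^2 + \mu \sum_j \gamma_j \lVert c_j - \hat z \rVert^2 .
\]

The first term is the reconstruction pull toward \(\hat z\); the second is the \(\gamma\)-weighted variance of the selected anchors around their centroid \(\hat z\), which is the local-variance term named above.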

The empty core (Gaussian Annulus)

In high dimensions, probability mass concentrates on a thin shell of radius \(\approx\sqrt{d}\); the region near the mean holds exponentially small mass. The LCC reconstruction sits exactly in this hollow, low-density core — the worst possible reward target.
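The concentration claim is easy to check numerically. The sketch below samples standard Gaussians in a few dimensions and reports the normalized radius \(\lVert x\rVert/\sqrt{d}\) together with the fraction of samples that fall inside half the shell radius. It uses an isotropic standard normal as a toy stand-in for the latent distribution; `norm_stats` and the 0.5 threshold are illustrative choices.

```python
import numpy as np

def norm_stats(d, n=10000, seed=0):
    """Mean and std of ||x|| / sqrt(d), plus the share of samples with radius < 0.5."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    r = np.linalg.norm(x, axis=1) / np.sqrt(d)   # normalized radius
    return r.mean(), r.std(), (r < 0.5).mean()

for d in (2, 16, 256):
    mean_r, std_r, core = norm_stats(d)
    print(f"d={d:4d}  mean={mean_r:.3f}  std={std_r:.3f}  inside half-radius={core:.4f}")
```

As `d` grows, the mean normalized radius approaches 1, its spread shrinks, and the inner region empties out, which is the hollow core that a mean-seeking reconstruction target lands in.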

Radial reconstruction reveals shell geometry — **Radial reconstruction.** Starting from the LCC reconstruction \(\hat z\) (local mean) and moving outward, decoded videos transition from mean-like blur, to sharp realistic structure on the shell, to distortion when pushed too far — directly revealing the shell-shaped latent manifold.

Manifold distance distribution — **Manifold distance \(R_{dist}\).** Real (GT) latents concentrate at a stable non-zero radius (0.880±0.076), while generated latents drift outward (0.918±0.083) — making \(R_{dist}\) a discriminative, differentiable reward that penalizes generative distortion.

Manifold distinguishability: % of pairs with \(R_{dist}(z_{gt}) < R_{dist}(z_{gen})\)
| Manifold | Acc (%) |
| --- | --- |
| LCC (ep. 200) | 92.7 |
| Reconstruction only | 83.2 |
| Shell-LCC | 91.4 |

Data efficiency: videos used to train the manifold
| #Videos | 100 | 1,000 | 2,500 (full) |
| --- | --- | --- | --- |
| Acc (%) | 90.1 | 91.4 | 91.4 |

Dropping the locality constraint collapses distinguishability (92.7% → 83.2%, a 9.5-point drop): pure reconstruction degenerates into an identity map with no geometry. Shell-LCC trades a sliver of trivial accuracy (92.7% → 91.4%, accuracy that plain LCC wins by exploiting high-variance dimensions) for equal sensitivity to fine-grained, low-variance detail. And because every 5-second clip contributes 230,400 patches, 100 videos (about 23 million patches) already reach 90.1%. The manifold is thus extremely data-efficient, and it is robust to hyperparameters (91.1–91.8% across M, τ₁, τ₂ settings).
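The distinguishability score reported above is a pairwise accuracy, and a minimal sketch of it fits in a few lines. The sketch assumes that each real latent is paired with exactly one generated latent and that both have already been scored with \(R_{dist}\); how the pairs are formed is not specified here, and the function name is hypothetical.

```python
import numpy as np

def distinguishability(r_gt, r_gen):
    """Percentage of pairs where the real sample is closer to the shell
    (smaller R_dist) than its generated counterpart."""
    r_gt = np.asarray(r_gt, dtype=float)
    r_gen = np.asarray(r_gen, dtype=float)
    if r_gt.shape != r_gen.shape:
        raise ValueError("expected one generated score per real score")
    return 100.0 * np.mean(r_gt < r_gen)

# Toy scores, not taken from the experiments: two of three pairs are ordered correctly.
print(distinguishability([0.80, 0.90, 1.00], [0.90, 0.85, 1.10]))
```

A score of 50% would mean the manifold distance cannot separate real from generated latents at all, which is the reference point for reading the accuracies in the tables.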

Video comparisons — Wan-T2V-1.3B

Left: baseline. Right: + Shell-LCC. Same prompt, same seed. The fine-tuned Wan2.1-T2V-1.3B checkpoint is released on Hugging Face.

“Campfire at night in a snowy forest, with a starry sky in the background.”

Baseline

+ Shell-LCC

“Slow-motion close-up: a galloping horse kicks up dust across an open plain, under dramatic rim lighting. Cinematic, highly detailed, photorealistic.”

Baseline

+ Shell-LCC

“Two students meet in the campus plaza, exchanging a handshake then a warm embrace; golden autumn light, sycamore leaves drifting, cinematic warm tones.”

Baseline

+ Shell-LCC

“A close-up of a cat grooming itself with its tongue, detailed fur and whiskers, warm light.”

Baseline

+ Shell-LCC

Scales to larger T2V models

The same Shell-LCC manifold transfers across model scales without retraining: on a 4.5B T2V model it sharpens fine detail while preserving composition, and on Wan2.1-T2V-14B it lifts overall visual quality and prompt fidelity.

4.5B — “Two students meet in the campus plaza, exchanging a handshake then a warm embrace; golden autumn light, cinematic warm tones.”

Baseline

+ Shell-LCC

4.5B — “Through an open kitchen window a young mother stirs soup; outside, a boy swings happily in the backyard as swallows cross the orange sunset — a split-screen of warm indoor and outdoor scenes.”

Baseline

+ Shell-LCC

Wan2.1-T2V-14B — “On a dim center stage, a magician in a black tailcoat waves his wand; the wand traces an arc of light and a white dove flies out of a silk top hat, feathers glittering in the spotlight.”

Baseline

+ Shell-LCC

Wan2.1-T2V-14B — “A baker pulls a tray of golden bread from a stone oven, lit by dramatic rim lighting. Shallow depth of field macro. Ultra-detailed, realistic motion blur.”

Baseline

+ Shell-LCC

Results: better imaging quality, nothing traded away

On a proprietary 4.5B model and two open-source backbones, Shell-LCC lifts VBench Imaging Quality while aesthetics, semantics and temporal consistency hold steady. Trading these away for a gain elsewhere is the failure mode of preference-based baselines.

Open-source qualitative comparison — From top to bottom: Wan-T2V-1.3B, Wan-T2V-1.3B + Shell-LCC, UltraWan-T2V-1.3B, and UltraWan-T2V-1.3B + Shell-LCC, for “A 3D model of an 1800s Victorian house”. Shell-LCC sharpens window lattices and facade details while suppressing baseline over-smoothing.

| Model | Aesthetic Quality | Imaging Quality | Overall Consistency | Motion Smoothness | Subject Consistency |
| --- | --- | --- | --- | --- | --- |
| Our 4.5B SFT baseline | 67.24 | 75.09 | 26.37 | 98.91 | 96.80 |
| + DPO | 67.84 | 73.97 (−1.1) | 26.77 | 98.40 | 96.37 |
| + Shell-LCC (ours) | 67.35 | 76.31 (+1.2) | 26.70 | 99.00 | 96.82 |
| Wan-T2V-1.3B | 62.44 | 66.29 | 22.71 | 98.48 | 96.54 |
| + Shell-LCC (ours) | 62.53 | 70.37 (+4.1) | 22.92 | 98.71 | 97.15 |
| UltraWan-T2V-1.3B | 57.44 | 67.56 | 22.88 | 96.57 | 92.25 |
| + Shell-LCC (ours) | 62.99 (+5.6) | 73.96 (+6.4) | 22.36 | 99.18 | 96.58 |

VBench scores in % (higher is better); deltas in parentheses are percentage points vs. the corresponding baseline. Shell-LCC sharply improves Imaging Quality (low-level fidelity) without trading off aesthetics, semantic alignment, or temporal consistency.

Shell-LCC vs. DPO — complementary, not competing. They operate at different granularities: DPO uses video-level preferences for global alignment and gains aesthetics at the cost of imaging quality (+0.60 / −1.12); Shell-LCC uses dense patch-level geometry and targets exactly those low-level distortions (+1.22 on our 4.5B model, +4.08 on Wan-T2V-1.3B) without touching content. In an independent human A/B test, Shell-LCC scores 108 vs 100 on realism with semantic alignment unchanged (101 vs 100).

Shell-LCC vs DPO, zoomed comparison — **Shell-LCC vs. DPO on the same prompt** (crops from the qualitative comparison at the top of the page). DPO restyles the scene but keeps the over-smoothed “plastic” skin texture (yellow boxes); Shell-LCC recovers realistic micro-texture and crisp background detail (red / blue boxes) while preserving the layout.

Controlled deblurring

Shell-LCC progressively removes motion blur, but over-optimization re-induces mean regression and collapse — so the reward is applied with early stopping.

Training dynamics — Motion deblurring across iterations. Shell-LCC reduces baseline motion blur, but prolonged training (e.g., iter. 4999) regresses toward the mean \(\hat z\), highlighting the trade-off between removing out-of-manifold blur and preserving genuine high-frequency detail.

Low-level distortions such as motion blur are entangled with genuine high-frequency information: over-aggressive optimization discards both and re-induces mean regression. We currently mitigate this with early stopping; combining the patch-level reward with preference-based methods (DPO) and scaling to larger backbones are natural next steps.

Over-optimization, live on Wan2.1-T2V-14B — “A close-up of a cat grooming itself with its tongue, detailed fur and whiskers, warm light.” As reward optimization proceeds (step 10 → 40), fur and whisker detail on the subject keeps increasing — while reward hacking gradually stamps high-frequency noise onto the flat background (watch the clean orange wall turn grainy). The reward cannot tell legitimate detail from synthetic texture, which is exactly why early stopping matters.

Baseline

Step 10

Step 20

Step 40 — background noise

| Reward step | Sharpness (lap, ×baseline) | High-freq energy (hf, ×baseline) | Content change (change) |
| --- | --- | --- | --- |
| 10 | 2.38 | 1.76 | 0.16 |
| 20 | 3.97 | 2.40 | 0.17 |
| 40 | 6.06 | 2.98 | 0.18 |

Medians over 8 evaluation prompts. lap = Laplacian-variance ratio vs. the baseline (>1 = sharper edges/textures); hf = fraction of FFT energy above ¼ Nyquist, vs. baseline (>1 = more fine texture); change = mean pixel difference to the same-seed baseline video (how much the content moved). The numbers confirm a real, growing detail gain on 14B — but note they keep rising through step 40: high-frequency metrics cannot tell legitimate detail from reward-hacked texture (the background noise above scores as “detail” too). Numbers select the candidate; frames make the call.
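For readers who want to reproduce metrics of this kind, here is a minimal per-frame sketch. It assumes grayscale frames as 2-D arrays, a 4-neighbour Laplacian, a radial frequency cutoff at ¼ Nyquist, and a mean absolute pixel difference for content change. The exact kernel, color handling and aggregation over frames and prompts used for the table are not specified here, so every function name and default below is hypothetical.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the image interior."""
    g = np.asarray(gray, dtype=float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def high_freq_fraction(gray, cutoff=0.25):
    """Fraction of FFT energy at radial frequency above cutoff * Nyquist."""
    g = np.asarray(gray, dtype=float)
    power = np.abs(np.fft.fft2(g - g.mean())) ** 2     # drop the DC term
    fy = np.fft.fftfreq(g.shape[0])[:, None]           # cycles/pixel, Nyquist = 0.5
    fx = np.fft.fftfreq(g.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2) / 0.5          # 1.0 corresponds to Nyquist
    total = power.sum()
    return float(power[radius > cutoff].sum() / total) if total > 0 else 0.0

def ratio_vs_baseline(metric, frame, baseline_frame):
    """Metric on the rewarded frame divided by the same metric on the baseline.
    Assumes the baseline value is non-zero."""
    return metric(frame) / metric(baseline_frame)

def content_change(frame, baseline_frame):
    """Mean absolute pixel difference to the same-seed baseline frame."""
    a = np.asarray(frame, dtype=float)
    b = np.asarray(baseline_frame, dtype=float)
    return float(np.abs(a - b).mean())
```

Note that adding pure noise to a frame raises both ratios in this sketch, which illustrates the caveat above: these numbers measure high-frequency content, not whether that content is legitimate.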
