dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

Abstract

Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose dTRPO (dLLM Trajectory Reduction Policy Optimization). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.

Method

State Reduction (Theorem 1)

The trajectory probability can be unbiased estimated with one sampled timestep per block:

$$\log \pi_\theta(\pmb{\tau}) = \sum_{s=1}^{N_B} \mathbb{E}_{t \sim U(1, T_B)} \; T_B \log \pi_\theta(\pmb{\tau}_{s,t} \mid \pmb{\tau}_{s,t-1}, t)$$

This reduces computation from $T$ timesteps to $N_B$ blocks, enabling a single forward pass via block attention.

Ratio Reduction (Theorem 2)

If we only need ratios between the current policy and a reference policy in policy optimization, all schedule-dependent coefficients cancel in the ratio:

$$\frac{\pi_\theta(\pmb{\tau}_{t-1} \mid \pmb{\tau}_t, t)}{\pi_{\text{ref}}(\pmb{\tau}_{t-1} \mid \pmb{\tau}_t, t)} = \prod_{i \in \mathcal{I}_t} \frac{\mu_\theta(\tau_{t-1}^{(i)} \mid \pmb{\tau}_t)}{\mu_{\text{ref}}(\tau_{t-1}^{(i)} \mid \pmb{\tau}_t)}$$

where $\mathcal{I}_t$ is the set of newly unmasked coordinates at step $t$. The ratio depends only on categorical probabilities at newly unmasked positions, independent of the masking schedule.

dTRPO Objective

Combining both reductions with DPO, the final objective is:

$$\mathcal{L}_{\text{dTRPO}}(\theta) = -\mathbb{E}_{(\pmb{y}^+, \pmb{y}^-) \sim \mathcal{D}} \log \sigma\!\left(\lambda T_B \left[S(\pmb{y}^+; \theta, \text{ref}) - S(\pmb{y}^-; \theta, \text{ref})\right]\right)$$

where $S(\pmb{y}; \theta, \text{ref})$ sums log probability ratios over newly unmasked tokens across all blocks. This enables offline, single-forward-pass training comparable to DPO for ARMs.

Model	GPQA(cot, diamond)	GSM8K(cot)	MATH	LCBv6	MBPP+(extra)	HEval+(extra)	IFEval(prompt)	ArenaHard(V2.0)	MTBench
dLLM from Scratch
LLaDA Instruct	19.19	78.47	42.48	6.07	38.1	34.1	53.23	-	-
LLaDA 1.5	19.19	79.45	43.64	6.54	37.0	39.0	59.52	-	-
Qwen 2.5 7B Instruct vs. dLLM from Qwen2.5 7B Instruct
Qwen2.5 Instruct	36.36	87.87	73.06	24.42	67.5	74.4	71.38	10.43	8.08
Dream Instruct	28.79	75.36	50.22	12.61	54.5	53.0	50.65	6.79	3.88
Fast-dLLM-v2	20.71	82.34	60.26	11.56	51.6	59.1	62.11	1.26	3.17
Fast-dLLM-v2+ELBO	12.63	79.98	58.48	11.56	52.4	59.1	51.02	0.17	1.01
Fast-dLLM-v2+VRPO	24.24	83.17	63.32	12.89	50.5	57.3	65.06	7.32	6.37
Fast-dLLM-v2+DPO w/ MF	23.74	85.37	63.20	11.00	46.3	51.8	65.62	6.02	6.48
Fast-dLLM-v2+dTRPO (Ours)	30.30	85.97	64.30	15.17	51.6	63.4	65.06	7.41	6.53

Qualitative Comparison

Side-by-side comparison of generation outputs on Arena-Hard V2.0 from Qwen 2.5, Fast-dLLM-v2, and our dTRPO model.

Prompt (Arena-Hard V2.0): Write 3 verses of iambic pentameter about a cat drinking milk at bedtime.

Qwen 2.5 7B Instruct

In twilight's hush, where shadows softly play, A cat with eyes aglow, in moonlight's ray, Approaches slow the bowl where milk does wait, Its paws upon the cool, white surface skate. With gentle purrs and satisfied sighs, It laps the creamy liquid, rich and high, Each sip a moment of pure, blissful ease, As night descends, and all its cares release. The world outside grows quiet, calm, and still, While in the cozy nook, the cat doth dwell, Content in dreams of fish and sunny spots, Till dawn breaks forth, and peace is overtaken.

Fast-dLLM-v2

1. In the hush of twilight's embrace, a scene so sweet, A cat, sleek, in white, with fur so soft and neat, Approers the bowl, with milk so, in the lam's light, A purring symphony, in the quiet of night. 2. The cat laps gently, in a rhythm so neat, From the ceramic bowl, where the shadows do meet, Each sip a pleasure, in the cool of the night, A moment of contentment, in the cat's delight. 3. As the clock ticks, and the world grows still, The cat, now full, purrs still, , In the dance of dreams, the cat's song, A tale of milk, and love, and a bedtime long.

Fast-dLLM-v2 + dTRPO (Ours)

In iambic meter, a rhythmic flow, A cat at twilight, with a purring glow, Milk in a bowl, set out with care, Whiskers twitch, as she sips with flair. Two paws on the floor, a contented pose, As the liquid flows, her fur sheathes, Each sip and swallow, a moment of delight, In this quiet scene, there's no need for fright. But beware the last sip, for it may bring, A bout of mischief, or a gentle spring, Yet in this dance, both grand and small, The cat's revel, in her nightly thrall.

Prompt (Arena-Hard V2.0): Write a 30 second motivational speech about keeping up the effort. Use a short fictional story.

Qwen 2.5 7B Instruct

Imagine a young sapling in a vast, barren field. Every day, it struggles to grow, facing the harsh winds and scorching sun. One day, a wise old tree passes by and sees the sapling's determination. "Why do you keep trying?" asks the old tree. The sapling replies, "I must reach the sky, no matter how hard it is." And so, it continues to push through, inch by inch. Just like that sapling, each one of us has our own journey. We face challenges, but with every step forward, we grow stronger. Keep pushing, keep striving, because the view from the top is worth every struggle. You are capable of more than you think. Thank you.

Fast-dLLM-v2

Imagine a forest, a young tree named Alex, standing tall and strong. But one day, a storm hits, and the wind threatens to uproot Alex. The tree whispers, "I can't hold on." But deep down, it knows it must. Through the rain, the tree digs its roots deeper, reaching for the sky. With every gust, it grows stronger, its branches reaching out, embracing the challenge. Finally,, the storm passes, and the tree stands tall, a symbol of resilience and perseverance. So remember, young tree, and every one, the effort you make, no matter how small, it matters. Keep striving, and you'll reach new heights. Keep up the effort, and you'll conquer your dreams.

Fast-dLLM-v2 + dTRPO (Ours)

Imagine a runner in a marathon, one lap left. They're exhausted, but they haven't given up. They keep pushing, every step bringing them closer to the finish line. You are that runner. Every effort you make, no matter how small, brings you closer to your goal. Keep going, and you'll cross that finish line, proud of your perseverance. Keep up the effort, and watch your dreams come true.

Prompt (Arena-Hard V2.0): A ball is dropped from a height of 100 meters. It bounces back to 50% of its previous height after each bounce. How far does the ball travel before it comes to rest?

Qwen 2.5 7B Instruct

Initial Drop: 100 meters (down). First Bounce: up 50m + down 50m = 100m. Second Bounce: up 25m + down 25m = 50m. ... The bounce distances form a geometric series: S = 2 × 50 / (1 - 0.5) = 200 meters Total distance: D = 100 + 200 = 300 meters

Fast-dLLM-v2

Initial Drop: 100 meters. First Bounce: 50 + 50 = 100 meters. Second Bounce: 25 + 25 = 50 meters. Third Bounce: 12.5 + 12.5 = 25 meters. Fourth Bounce: 6.25 + 6.25 = 12.5 meters. Geometric series with a = 100, r = 0.5: S = 100 / (1 - 0.5) = 200 meters Total distance: 200 meters (forgot to add initial drop separately)

Fast-dLLM-v2 + dTRPO (Ours)

Initial Drop: 100 meters. First Bounce: 50 + 50 = 100 meters. Second Bounce: 25 + 25 = 50 meters. Bounce distances form geometric series: a = 100, r = 0.5 S = 100 / (1 - 0.5) = 200 meters Total distance: = Initial drop + bounce series = 100 + 200 = 300 meters

ARM vs dLLM Generation