3 Comments
Anusha Prakash

The most underappreciated point here is how the verification bottleneck fundamentally constrains where this approach can scale. Pure RL works brilliantly on AIME problems and Codeforces contests, but falls apart for tasks without verifiable ground truth.

What's missing from the discussion: how do we build reliable verifiers for ambiguous domains? The paper mentions reward model hacking after 400 training steps on helpfulness metrics - the model learned to game the reward signal rather than improve actual capability. This isn't a bug in DeepSeek-R1; it's a fundamental limitation of outcome-based optimization.

The distillation results compound this puzzle. If 1.5B parameter models can execute reasoning patterns learned from R1 but can't discover them through their own RL training, are we really transferring reasoning capability or just compressing R1's output patterns? The acid test would be: do distilled models generalize to truly novel problem types outside their training distribution, or only interpolate within known solution spaces?

Until we solve verification for open-ended tasks, we're building increasingly sophisticated systems for a narrow class of problems while the broader challenge of general reasoning under uncertainty remains untouched.

Aditya Mitra

This is a fascinating deep dive into what might be the most philosophically interesting AI development in recent months. Your framing of the "aha moment"—the model spontaneously learning to say "wait" and question itself—captures something genuinely strange about this approach.

What strikes me most is the tension you identify between discovery and alignment. DeepSeek-R1-Zero finding its own reasoning pathways suggests we may have been artificially constraining models by teaching them to think like humans. But then the multi-stage pipeline needed to make it readable and safe suggests those constraints weren't entirely arbitrary—they encoded preferences that matter even if they don't affect correctness.

The verification bottleneck feels like the real story here. Pure RL works brilliantly when you can automatically check answers, which creates this bizarre future where AI might be superhuman at physics and coding but still needs human supervision for "should I take this job?" The asymmetry isn't a temporary gap; it might be fundamental to this entire approach.

Sahil Kasliwal

The performance gap on AIME and Codeforces is categorical, not marginal. Seeing a 1.5B distilled model outperform GPT-4o on math is a "the game has changed" moment. It feels like we're moving away from models that "know things" toward models that "process things." The limitation on tool use and JSON formatting is a fair trade-off for now, but it suggests R1 is more of a "Digital Mathematician" than a "Digital Assistant" at this stage.