Communication-efficient GRPO one-pager

Bigger K makes learning stale. Q still compresses.

K=20 collapses because the anchor gradient is older and more off-policy. Q is not the main failure: after its first update, Q error stays low.

Main claim

K hurts by aging the learning signal. At K=20, the anchor gradient comes from older weights and older policy samples. The merger then reuses that stale direction long enough for wrong-sign bias to accumulate.

Q is only an activation compression basis. It can stay good while the policy gets worse.

K=5: recovers. Fresher anchor gradient; validation ends near 0.735.

K=20: collapses. Staler anchor gradient; validation falls to 0.444.

Q error: low. High at bootstrap, then ≈0.04 after first update.

1. K increases staleness

The anchor computes M at old weights, then the live model applies it later. Larger K means more drift.
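
As a rough illustration of the drift, here is a minimal sketch on a toy problem, not the actual training setup. It assumes plain gradient descent on a hypothetical two-parameter diagonal quadratic (the names curvature, lr, and staleness are invented for this example) and compares the gradient at the K-step-old anchor weights with the gradient at the live weights.

```python
import numpy as np

# Hypothetical toy loss: 0.5 * sum(curvature * theta**2), minimised by plain GD.
curvature = np.array([1.0, 0.1])
lr = 0.5
theta0 = np.array([1.0, 1.0])

def grad(theta):
    return curvature * theta

def weights_at(step):
    # Closed form of `step` GD updates on a diagonal quadratic.
    return (1.0 - lr * curvature) ** step * theta0

def staleness(t, k):
    """Weight drift and gradient cosine between step t and the anchor at t - k."""
    live, anchor = weights_at(t), weights_at(t - k)
    g_live, g_stale = grad(live), grad(anchor)
    drift = np.linalg.norm(live - anchor)
    cosine = g_live @ g_stale / (np.linalg.norm(g_live) * np.linalg.norm(g_stale))
    return drift, cosine

drift5, cos5 = staleness(20, 5)
drift20, cos20 = staleness(20, 20)
print(f"K=5:  drift={drift5:.3f}  cosine={cos5:.3f}")
print(f"K=20: drift={drift20:.3f}  cosine={cos20:.3f}")
```

In this toy, the K=5 anchor gradient stays almost parallel to the live gradient, while the K=20 anchor gradient has a cosine of roughly 0.1 with it. The numbers say nothing about the real runs; they only show the mechanism.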

2. K increases off-policyness

GRPO samples come from the policy. A stale anchor learns from samples drawn from pi(theta_{t-K}), not from the current policy pi(theta_t).
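
To make the off-policy gap concrete, the sketch below uses a hypothetical 4-action softmax policy whose logits drift by a fixed amount per step (direction and step_size are assumptions for illustration). It measures the KL divergence from the stale sampling policy pi(theta_{t-K}) to the live policy pi(theta_t) for a few values of K.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Hypothetical logit drift per training step.
direction = np.array([1.0, 0.0, -0.5, -0.5])
step_size = 0.05

def policy_at(step):
    return softmax(step * step_size * direction)

t = 40
# KL(stale sampling policy || live policy) for several anchor ages K.
kl_by_k = {k: kl(policy_at(t - k), policy_at(t)) for k in (1, 5, 20)}
for k, value in kl_by_k.items():
    print(f"K={k:2d}  KL={value:.4f}")
```

Under this assumed steady drift, the divergence grows with K, so the K=20 anchor learns from samples that are further from what the live policy would generate.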

3. Q tracks activations, not reward

PowerSGD Q tracks activation geometry. It does not know answer quality, reward, KL, or length.
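
The sketch below shows why a basis like Q can stay healthy regardless of policy quality. It assumes a PowerSGD-style power-iteration update on a synthetic, nearly low-rank activation matrix; the exact update rule used in these runs is not specified here, and the names orthonormalize and recon_error are hypothetical. The error function takes only activations and Q, so reward, KL, and length never enter it.

```python
import numpy as np

def orthonormalize(m):
    q, _ = np.linalg.qr(m)     # reduced QR: columns of q are orthonormal
    return q

def recon_error(x, q):
    """Relative error of projecting activations x onto the column space of q."""
    return np.linalg.norm(x - x @ q @ q.T) / np.linalg.norm(x)

rng = np.random.default_rng(0)
n, d, r = 256, 64, 4

# Synthetic activations: rank-r signal plus small noise (an assumption).
x = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))
x += 0.01 * rng.normal(size=(n, d))

q0 = orthonormalize(rng.normal(size=(d, r)))    # random bootstrap basis
err_bootstrap = recon_error(x, q0)

q1 = orthonormalize(x.T @ (x @ q0))             # one power-iteration update
err_after = recon_error(x, q1)

print(f"bootstrap error: {err_bootstrap:.3f}")
print(f"after one update: {err_after:.3f}")
```

The random bootstrap basis reconstructs poorly, and a single update aligned with the activations brings the error down sharply. Nothing in this computation can detect whether the policy producing those activations is getting better or worse.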

Why anchor gradients hurt

  • M can point in an old direction.
  • The merger copies stale signs.
  • The stale correction is reused, so bias builds.
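
The three points above can be sketched with a deliberately simple 1-D toy. It assumes the merger applies the sign of a gradient computed once at the anchor for K consecutive steps; that is a simplification for illustration, and reuse_stale_sign is a hypothetical helper, not the actual merge rule.

```python
def reuse_stale_sign(k, theta=1.0, step=0.125):
    """Toy 1-D loss 0.5 * theta**2; the gradient at theta is theta itself."""
    stale_sign = 1.0 if theta > 0 else -1.0    # sign taken once, at the anchor
    wrong_sign_steps = 0
    for _ in range(k):
        if theta * stale_sign < 0:             # live gradient now disagrees
            wrong_sign_steps += 1
        theta -= step * stale_sign             # reuse the stale direction anyway
    return theta, 0.5 * theta**2, wrong_sign_steps

theta5, loss5, wrong5 = reuse_stale_sign(5)
theta20, loss20, wrong20 = reuse_stale_sign(20)
print(f"K=5:  theta={theta5}  loss={loss5}  wrong-sign steps={wrong5}")
print(f"K=20: theta={theta20}  loss={loss20}  wrong-sign steps={wrong20}")
```

Starting from a loss of 0.5, five reused steps still help (loss 0.0703125, no wrong-sign steps). Twenty reused steps overshoot the minimum, take 11 steps against the live gradient, and end at a loss of 1.125, worse than the start.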

Why GRPO is sensitive

  • Rewards are sparse and sequence-level.
  • Longer answers can look reward-flat.
  • With no KL or entropy brake, length drift can grow unchecked.
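
A short sketch of the standard group-relative advantage used in GRPO shows how sparse, sequence-level rewards can go flat. The function name and the eps value are assumptions for illustration; the normalisation (reward minus group mean, divided by group standard deviation) is the usual GRPO form.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps) within one prompt group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Every sampled answer gets the same reward: the group carries no signal.
flat = group_advantages([0, 0, 0, 0])

# One answer succeeds: only now is there a direction to learn from.
mixed = group_advantages([1, 0, 0, 0])

print("flat group: ", flat)
print("mixed group:", mixed)
```

When every answer in a group earns the same reward, all advantages are zero, so nothing in the reward term pushes back on side effects such as growing response length.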

Why Q is not the collapse

  • Bootstrap Q error is high.
  • First Q update drops it sharply.
  • Late K=20 collapse only mildly raises it.

Figure: Observed learning. Score and validation comparison. K=5 (pns1le3x) recovers and ends near val 0.735. K=20 (fxo8chsv) drifts down and ends near val 0.444.

Figure: Q sanity check. Q telemetry shows Q reconstruction error dropping after the first update, clip fraction spikes at refresh steps, and K=20 response length instability late in training. Q error drops after the first anchor-owned correction and remains low. The policy collapses because stale anchor learning is bad, not because Q stopped compressing activations.

Bottom line: K=20 makes anchor learning stale and off-policy. Q still compresses, but Q cannot tell whether the policy is learning the right thing.