Comm-efficient GRPO one-pager

Bigger K makes learning stale. Q still compresses.

K=20 collapses because the anchor gradient is older and more off-policy. Q is not the main failure: after its first update, Q error stays low.

Main claim

K hurts by aging the learning signal. At K=20, the anchor gradient comes from older weights and older policy samples. The merger then reuses that stale direction long enough for wrong-sign bias to accumulate.

Q is only an activation compression basis. It can stay good while the policy gets worse.

K=5: recovers. Fresher anchor gradient; validation ends near 0.735.

K=20: collapses. Staler anchor gradient; validation falls to 0.444.

Q error: low. High at bootstrap, then ≈0.04 after first update.

1. K increases staleness

The anchor computes M at old weights, then the live model applies it later. Larger K means more drift.

2. K increases off-policyness

GRPO samples come from the policy. A stale anchor learns from samples drawn under pi(theta_{t-K}), not under today's policy.
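One way to put a number on off-policyness is the KL divergence from the sampling policy pi(theta_{t-K}) to the live policy, together with the per-action importance ratio. The sketch below is a toy with a three-action softmax policy. The constant per-step logit drift and the names `kl_after` and `drift_per_step` are assumptions for illustration, not values from the runs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def importance_ratios(live_logits, stale_logits):
    """Per-action ratio pi_live(a) / pi_stale(a)."""
    live = softmax(live_logits)
    stale = softmax(stale_logits)
    return [l / s for l, s in zip(live, stale)]

# Hypothetical setup: the anchor sampled from a uniform policy, and the live
# logits then drift by a fixed amount per step.
stale_logits = [0.0, 0.0, 0.0]
drift_per_step = [0.1, 0.0, -0.1]

def kl_after(k):
    """KL from the stale sampling policy to the live policy after k steps."""
    live_logits = [s + k * d for s, d in zip(stale_logits, drift_per_step)]
    return kl_divergence(softmax(stale_logits), softmax(live_logits))

kl_5 = kl_after(5)
kl_20 = kl_after(20)
print(f"KL after K=5: {kl_5:.4f}, after K=20: {kl_20:.4f}")
```

Under this toy drift the divergence at K=20 is larger than at K=5, and the importance ratios move further from 1. Real drift is not constant per step, so the absolute numbers carry no meaning for the actual runs.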

3. Q tracks activations, not reward

PowerSGD Q tracks activation geometry. It does not know answer quality, reward, KL, or length.
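A minimal sketch of why a low Q error says nothing about reward, assuming Q error means the relative residual after projecting activations onto an orthonormal basis. The exact metric used in the runs is not stated here, and `relative_reconstruction_error` is a hypothetical helper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relative_reconstruction_error(rows, basis):
    """Relative Frobenius residual of projecting each row onto span(basis).

    rows:  list of activation vectors.
    basis: list of orthonormal vectors (the columns of Q).
    """
    residual_sq = 0.0
    total_sq = 0.0
    for row in rows:
        recon = [0.0] * len(row)
        for q in basis:
            coeff = dot(row, q)
            recon = [r + coeff * qi for r, qi in zip(recon, q)]
        residual_sq += sum((a - r) ** 2 for a, r in zip(row, recon))
        total_sq += dot(row, row)
    return math.sqrt(residual_sq / total_sq)

# Toy activations that lie mostly along the first axis.
activations = [[3.0, 0.2], [-2.0, 0.1], [1.0, -0.2]]
good_basis = [[1.0, 0.0]]   # aligned with the dominant direction
bad_basis = [[0.0, 1.0]]    # aligned with the minor direction

err_good = relative_reconstruction_error(activations, good_basis)
err_bad = relative_reconstruction_error(activations, bad_basis)
print(f"aligned basis: {err_good:.3f}, misaligned basis: {err_bad:.3f}")
```

The function takes activations and a basis and nothing else. Reward, KL, and response length never enter the computation, so the same low error is reported whether the policy behind those activations is improving or collapsing.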

Why anchor gradients hurt

M can point in an old direction.
The merger copies stale signs.
The stale correction is reused, so bias builds.
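The three points above can be seen in a one-dimensional toy: minimise f(theta) = 0.5 * theta^2, but refresh the gradient only every K steps and reuse the held value in between. This is a sketch under stated assumptions, not the GRPO setup. The learning rate, the step count, and the function `run_with_held_gradient` are all hypothetical.

```python
def run_with_held_gradient(k, lr=0.15, steps=60, theta0=1.0):
    """Gradient descent on 0.5 * theta**2 with the gradient refreshed every k steps.

    Returns (final_theta, wrong_sign_steps), where wrong_sign_steps counts the
    steps on which the held gradient pointed opposite to the fresh gradient.
    """
    theta = theta0
    held = 0.0
    wrong_sign_steps = 0
    for step in range(steps):
        if step % k == 0:
            held = theta      # gradient at the anchor weights (refresh step)
        fresh = theta         # gradient at the live weights
        if held * fresh < 0:
            wrong_sign_steps += 1
        theta -= lr * held    # the live model applies the held gradient
    return theta, wrong_sign_steps

final_5, wrong_5 = run_with_held_gradient(k=5)
final_20, wrong_20 = run_with_held_gradient(k=20)
print(f"K=5:  final theta {final_5:.2e}, wrong-sign steps {wrong_5}")
print(f"K=20: final theta {final_20:.2e}, wrong-sign steps {wrong_20}")
```

With these toy settings K=5 contracts toward the minimum and the held gradient never disagrees in sign with the fresh one. At K=20 the held gradient keeps being applied after the live weights have crossed the minimum, so many steps push in the wrong direction and the iterate ends further from the minimum than it started.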

Why GRPO is sensitive

Rewards are sparse and sequence-level.
Longer answers can look reward-flat.
With no KL or entropy brake, length drift can grow unchecked.
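The sensitivity to sparse, sequence-level rewards follows from how GRPO scores samples: each reward is compared only with the other samples in its group. The sketch below uses the common mean-and-standard-deviation normalisation. The use of the population standard deviation and the `eps` value are assumptions, since implementations differ on both.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise sequence-level rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# One correct answer in the group: that sample gets a positive advantage.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every sample scores the same: every advantage is zero.
flat_group = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed:", [round(a, 3) for a in mixed_group])
print("flat: ", [round(a, 3) for a in flat_group])
```

A group whose rewards are all equal yields zero advantage for every sample, so it contributes no reward-driven gradient. If a policy drifts toward responses that all score alike, the reward term does nothing to pull it back.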

Why Q is not the collapse

Bootstrap Q error is high.
First Q update drops it sharply.
The late K=20 collapse raises it only mildly.

[Figure: observed score and validation comparison for K=5 and K=20.] **Observed learning.** K=5 (`pns1le3x`) recovers and ends near val 0.735. K=20 (`fxo8chsv`) drifts down and ends near val 0.444.

[Figure: Q telemetry, showing Q reconstruction error dropping after the first update, clip fraction spikes at refresh steps, and K=20 response length instability late in training.] **Q sanity check.** Q error drops after the first anchor-owned correction and remains low. The policy collapses because stale anchor learning is bad, not because Q stopped compressing activations.

Bottom line: K=20 makes anchor learning stale and off-policy. Q still compresses, but Q cannot tell whether the policy is learning the right thing.