[ THIS IS BEING VERIFIED BY ME ]

Train-Inference Mismatch under a PowerSGD + signed-EMA compression codec (concise view)

Qwen2.5-1.5B-Instruct · GSM8K · GRPO (no-KL, no-entropy) · captured at training step 20, before the PPO mini-batch updates.

Topic

Training is proceeding normally, but the train-inference mismatch introduced by the compression codec is very high. It is well outside the importance-ratio band that the open-source library verl treats as correctable (about 2x to 5x), and far above dense and literature runs (KL about 0.0004 to 0.01, versus about 4.1 here). The median per-token gap is 2.83 nats, and the log-prob correlation is only 0.13.

A normal 5x clip would drop about 61% of tokens from the loss, so plain truncated importance sampling (TIS) cannot absorb this mismatch without removing most of the trajectory. This compression-induced mismatch is its own problem, separate from anchor staleness.
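To illustrate why a 5x band removes so much, the sketch below computes the per-token ratio from two log-prob arrays and reports the share of tokens outside [1/C, C]. The function name band_stats is hypothetical and this is not verl's implementation. The toy inputs are the first eight tokens of row 833 from the per-prompt samples, so the resulting fraction describes only that slice, not the batch-level 61%.

```python
import numpy as np

def band_stats(logp_vllm, logp_train, clip=5.0):
    """Per-token ratio, upper-truncated weight, and fraction outside [1/clip, clip].

    Hypothetical helper for illustration only.
    """
    logp_vllm = np.asarray(logp_vllm, dtype=np.float64)
    logp_train = np.asarray(logp_train, dtype=np.float64)
    log_r = logp_train - logp_vllm          # log of r = p_train / p_vLLM
    r = np.exp(log_r)
    tis_weight = np.minimum(r, clip)        # truncation caps the weight from above only
    outside = np.abs(log_r) > np.log(clip)  # two-sided band test, done in log space
    return r, tis_weight, float(outside.mean())

# First eight tokens of row 833 (vLLM logp, FSDP logp).
logp_vllm = [-0.5539, -1.8226, -0.0041, -0.0068, -0.0000, -0.0069, -0.0001, -0.0020]
logp_train = [-1.8649, -3.5531, -3.0491, -4.2043, -4.7287, -9.3030, -1.9611, -1.1309]

r, w, frac_out = band_stats(logp_vllm, logp_train)
print(np.round(r, 3), frac_out)
```

Under truncation alone, a token with r far below 1 keeps its tiny weight and so contributes almost nothing to the loss. A two-sided mask on [1/C, C] excludes it outright, which is the case the band test above counts.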

Setup

Compressed run with the signed_ema merger. The numbers below compare vLLM's generation log-prob for each token against the FSDP trainer's recomputed log-prob with compression, at step 20.
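A minimal sketch of how summary numbers of this kind can be derived from the two log-prob arrays. The exact estimators used by tim_report.py are not stated here, so the choices below are assumptions: the KL is estimated as the mean of (vLLM logp - trainer logp) over tokens sampled by vLLM, and the gap statistics use the absolute difference. The function name mismatch_summary is hypothetical; the toy inputs are the first five tokens of row 592.

```python
import numpy as np

def mismatch_summary(logp_vllm, logp_train):
    """Summary statistics of the per-token log-prob gap (hypothetical helper)."""
    lv = np.asarray(logp_vllm, dtype=np.float64)
    lt = np.asarray(logp_train, dtype=np.float64)
    gap = lv - lt  # positive when the trainer assigns lower probability than vLLM
    return {
        # Sample estimate of KL(vLLM || train): tokens were drawn from vLLM.
        "kl_estimate": float(gap.mean()),
        "median_abs_gap": float(np.median(np.abs(gap))),
        "mean_abs_gap": float(np.abs(gap).mean()),
    }

# First five tokens of row 592 (vLLM logp, FSDP logp).
stats = mismatch_summary(
    [-2.3255, -0.0000, 0.0000, -0.0000, -0.0152],
    [-1.3812, -0.0189, -0.0549, -3.7765, -2.2131],
)
print(stats)
```

The mean absolute gap is presumably what the per-response headers report as mean|diff|.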

Distributions

Main plot. For each token, r = p_train / p_vLLM measures how many times more or less likely the trainer considers the token than the engine that generated it. r = 1 (black center line) is a perfect match; the closer to the center, the better. Mass left of 1/5x means the trainer rates generated tokens as much less likely than vLLM does. The shaded bands mark verl's correctable range (2x by default, up to 5x); most of the mass sits well outside them.

Figure: Per-token ratio r = p_train / p_vLLM (log x-axis). Black line r = 1 is a perfect match. Dotted orange is verl's default 2x band, dashed red the 5x band (top of verl's typical range). Most tokens sit far left, below 1/5x.
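In log space the band has a simple form, which makes the reported gap easier to read against it. The short derivation below uses only the definition of r. The last line assumes that the 2.83 nats median refers to the absolute log-prob gap and that the token sits on the typical side (trainer lower than vLLM).

```latex
\[
r = \frac{p_{\text{train}}}{p_{\text{vLLM}}} = e^{\Delta},
\qquad
\Delta = \log p_{\text{train}} - \log p_{\text{vLLM}}
\]
\[
\tfrac{1}{C} \le r \le C \iff |\Delta| \le \ln C,
\qquad \ln 2 \approx 0.693, \quad \ln 5 \approx 1.609
\]
\[
\Delta = -2.83 \;\Rightarrow\; r = e^{-2.83} \approx 0.059 \approx \tfrac{1}{17}
\]
```

So a token at the median gap is already more than three times further from r = 1 than the edge of the 5x band.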

The same thing in plain probability, |p_train - p_vLLM|. 0 is best (the engines agree); further right is worse. Most tokens are far from 0.

Figure: Per-token difference in the probability assigned to the sampled token.

vLLM vs train-engine log-probs

Each dot is one token: vLLM log-prob (x) vs trainer log-prob with compression (y). On the dashed y = x line is best (agreement); higher correlation is better. Most dots sit below the line (the trainer assigns lower probability); the correlation is only 0.13.

Figure: Each point is one response token: vLLM log-prob (x) vs train-engine-with-compression log-prob (y).
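The two quantities read off this scatter can be sketched as follows, assuming the reported correlation is a plain Pearson coefficient over response tokens (the report does not say which coefficient it uses). The function name agreement_stats is hypothetical. The toy inputs are the first six tokens of row 1004, so the result says nothing about the batch-level value of 0.13.

```python
import numpy as np

def agreement_stats(logp_vllm, logp_train):
    """Pearson correlation and share of points below the y = x line (hypothetical helper)."""
    x = np.asarray(logp_vllm, dtype=np.float64)
    y = np.asarray(logp_train, dtype=np.float64)
    pearson = float(np.corrcoef(x, y)[0, 1])  # off-diagonal entry of the 2x2 matrix
    frac_below = float((y < x).mean())        # trainer assigns lower probability than vLLM
    return pearson, frac_below

# First six tokens of row 1004 (vLLM logp, FSDP logp).
x = [-1.0209, -0.0008, -2.1349, -0.0001, -0.7362, -0.0133]
y = [-1.9090, -0.0241, -2.5461, -0.0027, -0.3854, -0.0415]

pearson, frac_below = agreement_stats(x, y)
print(round(pearson, 3), round(frac_below, 3))
```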

Per-prompt samples (prompt, vLLM output, per-token log-probs)

Right-most column is r = p_train/p_vLLM (1.0 is a perfect match); diff is FSDP logp minus vLLM logp, in nats. Shaded cells fall outside verl's [1/5, 5] band. Rows shaded red have |diff| > 0.5 nats. Only the first 50 tokens of each response are listed.
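As a reading aid, the derived columns of each table row follow from the two log-probs alone. The sketch below reproduces them for one token and applies the two shading rules described above. The function name token_row and the dictionary keys are hypothetical, not taken from tim_report.py.

```python
import math

LN5 = math.log(5.0)  # the [1/5, 5] band expressed as a log-space threshold

def token_row(logp_vllm, logp_fsdp):
    """Derive the table columns for one token (hypothetical helper)."""
    diff = logp_fsdp - logp_vllm
    return {
        "diff": diff,
        "p_vllm": math.exp(logp_vllm),
        "p_train": math.exp(logp_fsdp),
        "r": math.exp(diff),               # equals p_train / p_vllm
        "outside_band": abs(diff) > LN5,   # cell shaded: r outside [1/5, 5]
        "red_row": abs(diff) > 0.5,        # row shaded red: |diff| > 0.5 nats
    }

# Token 0 of row 833: 'To', vLLM logp -0.5539, FSDP logp -1.8649.
row = token_row(-0.5539, -1.8649)
print({k: (round(v, 4) if isinstance(v, float) else v) for k, v in row.items()})
```

For this token the row is red (|diff| is about 1.31 nats) while r, about 0.27, is still inside the 5x band.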

row 833 · uid e975aa99 · 126 tokens · mean|diff|=4.478 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Peter has a plan to take his family on a vacation to Germany. He needs $5,000 to cover all the spending for the travel, and he has $2,900 in savings right now. If he can save up $700 each month, how many months does he need to wait to reach his goal? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
To determine how many months Peter needs to save up to reach his goal, we can follow these steps:

1. Find the additional amount Peter needs to save: $5,000 (goal) - $2,900 (savings) = $2,100
2. Determine how much Peter saves each month: $700
3. Calculate the number of months needed by dividing the additional amount by the monthly savings: $2,100 ÷ $700 = 3

#### 3
Peter needs to wait 3 months to reach his goal.
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'To' | -0.5539 | -1.8649 | -1.3110 | 0.5747 | 0.1549 | 0.270
1 | ' determine' | -1.8226 | -3.5531 | -1.7305 | 0.1616 | 0.0286 | 0.177
2 | ' how' | -0.0041 | -3.0491 | -3.0449 | 0.9959 | 0.0474 | 0.048
3 | ' many' | -0.0068 | -4.2043 | -4.1975 | 0.9932 | 0.0149 | 0.015
4 | ' months' | -0.0000 | -4.7287 | -4.7286 | 1.0000 | 0.0088 | 8.8e-03
5 | ' Peter' | -0.0069 | -9.3030 | -9.2960 | 0.9931 | 0.0001 | 9.2e-05
6 | ' needs' | -0.0001 | -1.9611 | -1.9610 | 0.9999 | 0.1407 | 0.141
7 | ' to' | -0.0020 | -1.1309 | -1.1289 | 0.9980 | 0.3228 | 0.323
8 | ' save' | -0.4748 | -3.3493 | -2.8745 | 0.6220 | 0.0351 | 0.056
9 | ' up' | -1.7495 | -11.0092 | -9.2597 | 0.1739 | 0.0000 | 9.5e-05
10 | ' to' | -0.6844 | -4.7787 | -4.0944 | 0.5044 | 0.0084 | 0.017
11 | ' reach' | -0.3180 | -14.4767 | -14.1587 | 0.7276 | 0.0000 | 7.1e-07
12 | ' his' | -0.0144 | -7.8737 | -7.8593 | 0.9857 | 0.0004 | 3.9e-04
13 | ' goal' | -0.0345 | -8.3045 | -8.2700 | 0.9661 | 0.0002 | 2.6e-04
14 | ',' | -0.1299 | -1.8832 | -1.7533 | 0.8782 | 0.1521 | 0.173
15 | ' we' | -0.0603 | -2.9032 | -2.8429 | 0.9415 | 0.0548 | 0.058
16 | ' can' | -1.1635 | -9.7227 | -8.5592 | 0.3124 | 0.0001 | 1.9e-04
17 | ' follow' | -0.2063 | -9.0647 | -8.8584 | 0.8136 | 0.0001 | 1.4e-04
18 | ' these' | -0.0001 | -7.9082 | -7.9081 | 0.9999 | 0.0004 | 3.7e-04
19 | ' steps' | -0.0003 | -2.7164 | -2.7161 | 0.9997 | 0.0661 | 0.066
20 | ':\n\n' | -0.0038 | -0.6737 | -0.6699 | 0.9962 | 0.5098 | 0.512
21 | '1' | -0.0008 | -0.8880 | -0.8872 | 0.9992 | 0.4115 | 0.412
22 | '.' | -0.0000 | -0.0590 | -0.0589 | 1.0000 | 0.9427 | 0.943
23 | ' Find' | -3.1160 | -6.4585 | -3.3425 | 0.0443 | 0.0016 | 0.035
24 | ' the' | -0.7663 | -0.0138 | 0.7525 | 0.4647 | 0.9862 | 2.12
25 | ' additional' | -2.3393 | -7.5139 | -5.1746 | 0.0964 | 0.0005 | 5.7e-03
26 | ' amount' | -0.0103 | -0.6491 | -0.6388 | 0.9898 | 0.5225 | 0.528
27 | ' Peter' | -0.3436 | -10.9960 | -10.6523 | 0.7092 | 0.0000 | 2.4e-05
28 | ' needs' | -0.0067 | -2.4754 | -2.4687 | 0.9933 | 0.0841 | 0.085
29 | ' to' | -1.2565 | -2.1519 | -0.8954 | 0.2846 | 0.1163 | 0.408
30 | ' save' | -0.0110 | -6.1968 | -6.1858 | 0.9891 | 0.0020 | 2.1e-03
31 | ':' | -0.3907 | -3.1723 | -2.7816 | 0.6766 | 0.0419 | 0.062
32 | ' $' | -0.1500 | -2.5813 | -2.4313 | 0.8607 | 0.0757 | 0.088
33 | '5' | -0.0000 | -2.7727 | -2.7727 | 1.0000 | 0.0625 | 0.062
34 | ',' | -0.0086 | -10.1255 | -10.1169 | 0.9914 | 0.0000 | 4.0e-05
35 | '0' | -0.0006 | -0.0028 | -0.0023 | 0.9994 | 0.9972 | 0.998
36 | '0' | 0.0000 | -0.0009 | -0.0009 | 1.0000 | 0.9991 | 0.999
37 | '0' | -0.0000 | -4.2330 | -4.2330 | 1.0000 | 0.0145 | 0.015
38 | ' (' | -0.5897 | -2.5536 | -1.9639 | 0.5545 | 0.0778 | 0.140
39 | 'goal' | -0.5303 | -19.9256 | -19.3954 | 0.5885 | 0.0000 | 3.8e-09
40 | ')' | -0.0020 | -13.3964 | -13.3944 | 0.9980 | 0.0000 | 1.5e-06
41 | ' -' | -0.0002 | -4.0284 | -4.0282 | 0.9998 | 0.0178 | 0.018
42 | ' $' | -0.0000 | -2.3837 | -2.3837 | 1.0000 | 0.0922 | 0.092
43 | '2' | -0.0000 | -0.6027 | -0.6027 | 1.0000 | 0.5473 | 0.547
44 | ',' | -0.0000 | -10.1528 | -10.1528 | 1.0000 | 0.0000 | 3.9e-05
45 | '9' | -0.0000 | -8.8833 | -8.8833 | 1.0000 | 0.0001 | 1.4e-04
46 | '0' | 0.0000 | -0.0002 | -0.0002 | 1.0000 | 0.9998 | 1.000
47 | '0' | 0.0000 | -0.0145 | -0.0145 | 1.0000 | 0.9856 | 0.986
48 | ' (' | -0.0001 | -4.2891 | -4.2890 | 0.9999 | 0.0137 | 0.014
49 | 's' | -0.8368 | -13.5584 | -12.7217 | 0.4331 | 0.0000 | 3.0e-06
... 76 more tokens (see samples.json)
row 173 · uid e975aa99 · 161 tokens · mean|diff|=4.177 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Peter has a plan to take his family on a vacation to Germany. He needs $5,000 to cover all the spending for the travel, and he has $2,900 in savings right now. If he can save up $700 each month, how many months does he need to wait to reach his goal? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
We start by finding out how much more Peter needs to save. He has $2,900, and he needs $5,000 in total. 

Required saving = Total budget - Current savings
Required saving = $5,000 - $2,900
Required saving = $2,100

Next, we calculate how many months it will take him to save the remaining amount. He saves $700 each month.

Number of months = Required saving / Monthly savings
Number of months = $2,100 / $700
Number of months = 3 [Rounding up since even a month left will result in more saving]

#### 3
Peter needs to wait 3 months to reach his goal.
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'We' | -1.9054 | -3.8649 | -1.9594 | 0.1488 | 0.0210 | 0.141
1 | ' start' | -0.6447 | -2.8341 | -2.1894 | 0.5248 | 0.0588 | 0.112
2 | ' by' | -1.1387 | -9.0719 | -7.9332 | 0.3202 | 0.0001 | 3.6e-04
3 | ' finding' | -1.6309 | -11.8348 | -10.2038 | 0.1957 | 0.0000 | 3.7e-05
4 | ' out' | -0.5248 | -9.9319 | -9.4071 | 0.5917 | 0.0000 | 8.2e-05
5 | ' how' | -0.1430 | -2.8131 | -2.6701 | 0.8668 | 0.0600 | 0.069
6 | ' much' | -0.0004 | -7.0900 | -7.0896 | 0.9996 | 0.0008 | 8.3e-04
7 | ' more' | -0.0158 | -7.3799 | -7.3641 | 0.9843 | 0.0006 | 6.3e-04
8 | ' Peter' | -1.8277 | -9.3996 | -7.5720 | 0.1608 | 0.0001 | 5.1e-04
9 | ' needs' | -0.0007 | -2.2624 | -2.2617 | 0.9993 | 0.1041 | 0.104
10 | ' to' | -0.2704 | -1.2278 | -0.9574 | 0.7631 | 0.2929 | 0.384
11 | ' save' | -0.0291 | -3.4061 | -3.3771 | 0.9714 | 0.0332 | 0.034
12 | '.' | -2.0507 | -1.1579 | 0.8928 | 0.1286 | 0.3141 | 2.44
13 | ' He' | -0.6726 | -4.8858 | -4.2131 | 0.5104 | 0.0076 | 0.015
14 | ' has' | -3.4860 | -1.1707 | 2.3153 | 0.0306 | 0.3101 | 10.13
15 | ' $' | -0.0253 | -2.4965 | -2.4712 | 0.9750 | 0.0824 | 0.084
16 | '2' | -0.0004 | -0.3722 | -0.3719 | 0.9996 | 0.6892 | 0.689
17 | ',' | -0.0017 | -9.2614 | -9.2597 | 0.9983 | 0.0001 | 9.5e-05
18 | '9' | -0.0009 | -9.7537 | -9.7527 | 0.9991 | 0.0001 | 5.8e-05
19 | '0' | 0.0000 | -0.0001 | -0.0001 | 1.0000 | 0.9999 | 1.000
20 | '0' | -0.0000 | -0.0161 | -0.0161 | 1.0000 | 0.9841 | 0.984
21 | ',' | -4.4389 | -2.1233 | 2.3155 | 0.0118 | 0.1196 | 10.13
22 | ' and' | -0.0819 | -2.9686 | -2.8867 | 0.9214 | 0.0514 | 0.056
23 | ' he' | -1.4652 | -2.6859 | -1.2206 | 0.2310 | 0.0682 | 0.295
24 | ' needs' | -0.0484 | -2.5647 | -2.5163 | 0.9527 | 0.0769 | 0.081
25 | ' $' | -0.0386 | -2.2262 | -2.1876 | 0.9621 | 0.1079 | 0.112
26 | '5' | -0.0000 | -2.7808 | -2.7808 | 1.0000 | 0.0620 | 0.062
27 | ',' | -0.0000 | -9.8765 | -9.8765 | 1.0000 | 0.0001 | 5.1e-05
28 | '0' | -0.0001 | -0.0138 | -0.0136 | 0.9999 | 0.9863 | 0.986
29 | '0' | 0.0000 | -0.0081 | -0.0081 | 1.0000 | 0.9920 | 0.992
30 | '0' | -0.0000 | -5.4453 | -5.4453 | 1.0000 | 0.0043 | 4.3e-03
31 | ' in' | -1.7843 | -3.6392 | -1.8549 | 0.1679 | 0.0263 | 0.156
32 | ' total' | -0.0033 | -5.2710 | -5.2677 | 0.9967 | 0.0051 | 5.2e-03
33 | '.' | -0.5031 | -2.0845 | -1.5814 | 0.6047 | 0.1244 | 0.206
34 | ' \n\n' | -1.4786 | -3.3345 | -1.8559 | 0.2280 | 0.0356 | 0.156
35 | 'Required' | -7.4072 | -10.3198 | -2.9127 | 0.0006 | 0.0000 | 0.054
36 | ' saving' | -3.6222 | -9.1622 | -5.5400 | 0.0267 | 0.0001 | 3.9e-03
37 | ' =' | -0.2930 | -1.6706 | -1.3776 | 0.7460 | 0.1881 | 0.252
38 | ' Total' | -0.2876 | -5.7645 | -5.4769 | 0.7501 | 0.0031 | 4.2e-03
39 | ' budget' | -4.8631 | -5.5482 | -0.6851 | 0.0077 | 0.0039 | 0.504
40 | ' -' | -0.0245 | -5.3484 | -5.3239 | 0.9758 | 0.0048 | 4.9e-03
41 | ' Current' | -0.3755 | -9.2463 | -8.8708 | 0.6869 | 0.0001 | 1.4e-04
42 | ' savings' | -0.0006 | -3.3497 | -3.3491 | 0.9994 | 0.0351 | 0.035
43 | '\n' | -0.1546 | -3.1557 | -3.0012 | 0.8568 | 0.0426 | 0.050
44 | 'Required' | -0.0267 | -8.6083 | -8.5816 | 0.9737 | 0.0002 | 1.9e-04
45 | ' saving' | -0.0001 | -9.4081 | -9.4080 | 0.9999 | 0.0001 | 8.2e-05
46 | ' =' | -0.0000 | -0.2619 | -0.2619 | 1.0000 | 0.7696 | 0.770
47 | ' $' | -0.0142 | -2.2616 | -2.2474 | 0.9859 | 0.1042 | 0.106
48 | '5' | -0.0000 | -5.6656 | -5.6656 | 1.0000 | 0.0035 | 3.5e-03
49 | ',' | -0.0052 | -11.0013 | -10.9961 | 0.9948 | 0.0000 | 1.7e-05
... 111 more tokens (see samples.json)
row 592 · uid 9da91639 · 174 tokens · mean|diff|=4.239 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Sean and Sierra invited 200 guests to their wedding.  If 83% of the guests RSVP with a Yes response and 9% of the guests RSVP with a No response, how many guests did not respond at all? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
Step 1: Calculate the number of guests who RSVP'd with a Yes response.
83% of 200 guests = 0.83 * 200 = 166 guests

Step 2: Calculate the number of guests who RSVP'd with a No response.
9% of 200 guests = 0.09 * 200 = 18 guests

Step 3: Calculate the total number of guests who responded.
166 guests (yes) + 18 guests (no) = 184 guests responded

Step 4: Calculate the number of guests who did not respond at all.
200 guests invited - 184 guests responded = 16 guests did not respond

#### 16

The final answer is: #### 16
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'Step' | -2.3255 | -1.3812 | 0.9443 | 0.0977 | 0.2513 | 2.57
1 | ' ' | -0.0000 | -0.0189 | -0.0189 | 1.0000 | 0.9812 | 0.981
2 | '1' | 0.0000 | -0.0549 | -0.0549 | 1.0000 | 0.9466 | 0.947
3 | ':' | -0.0000 | -3.7765 | -3.7765 | 1.0000 | 0.0229 | 0.023
4 | ' Calculate' | -0.0152 | -2.2131 | -2.1979 | 0.9849 | 0.1094 | 0.111
5 | ' the' | -0.0005 | -0.0116 | -0.0111 | 0.9995 | 0.9884 | 0.989
6 | ' number' | -0.0054 | -1.2475 | -1.2421 | 0.9946 | 0.2872 | 0.289
7 | ' of' | -0.0000 | -0.0001 | -0.0001 | 1.0000 | 0.9999 | 1.000
8 | ' guests' | -0.0116 | -7.7084 | -7.6968 | 0.9884 | 0.0004 | 4.5e-04
9 | ' who' | -0.0057 | -8.7395 | -8.7339 | 0.9943 | 0.0002 | 1.6e-04
10 | ' RSVP' | -0.1947 | -13.0234 | -12.8287 | 0.8231 | 0.0000 | 2.7e-06
11 | "'d" | -2.1733 | -15.6278 | -13.4545 | 0.1138 | 0.0000 | 1.4e-06
12 | ' with' | -0.6462 | -7.1503 | -6.5041 | 0.5241 | 0.0008 | 1.5e-03
13 | ' a' | -0.1210 | -4.4255 | -4.3045 | 0.8860 | 0.0120 | 0.014
14 | ' Yes' | -0.0214 | -13.5368 | -13.5155 | 0.9789 | 0.0000 | 1.3e-06
15 | ' response' | -0.0222 | -14.2875 | -14.2654 | 0.9781 | 0.0000 | 6.4e-07
16 | '.\n' | -0.2189 | -1.8572 | -1.6383 | 0.8034 | 0.1561 | 0.194
17 | '8' | -0.4112 | -5.6874 | -5.2762 | 0.6629 | 0.0034 | 5.1e-03
18 | '3' | -0.0001 | -8.8874 | -8.8873 | 0.9999 | 0.0001 | 1.4e-04
19 | '%' | -0.0001 | -4.8037 | -4.8036 | 0.9999 | 0.0082 | 8.2e-03
20 | ' of' | -0.0002 | -0.3171 | -0.3170 | 0.9998 | 0.7282 | 0.728
21 | ' ' | -0.0005 | -4.1491 | -4.1486 | 0.9995 | 0.0158 | 0.016
22 | '2' | -0.0000 | -0.3502 | -0.3502 | 1.0000 | 0.7045 | 0.705
23 | '0' | -0.0000 | -0.0011 | -0.0011 | 1.0000 | 0.9989 | 0.999
24 | '0' | 0.0000 | -0.0007 | -0.0007 | 1.0000 | 0.9993 | 0.999
25 | ' guests' | -0.1067 | -9.3878 | -9.2811 | 0.8988 | 0.0001 | 9.3e-05
26 | ' =' | -0.1923 | -1.5660 | -1.3738 | 0.8251 | 0.2089 | 0.253
27 | ' ' | -0.0017 | -0.0022 | -0.0005 | 0.9983 | 0.9978 | 1.000
28 | '0' | -0.0266 | -7.6108 | -7.5842 | 0.9737 | 0.0005 | 5.1e-04
29 | '.' | 0.0000 | -8.4065 | -8.4065 | 1.0000 | 0.0002 | 2.2e-04
30 | '8' | -0.0000 | -6.4076 | -6.4075 | 1.0000 | 0.0016 | 1.6e-03
31 | '3' | -0.0000 | -9.1467 | -9.1467 | 1.0000 | 0.0001 | 1.1e-04
32 | ' *' | -0.0021 | -7.0452 | -7.0431 | 0.9979 | 0.0009 | 8.7e-04
33 | ' ' | 0.0000 | -0.0024 | -0.0024 | 1.0000 | 0.9976 | 0.998
34 | '2' | -0.0000 | -0.3017 | -0.3017 | 1.0000 | 0.7396 | 0.740
35 | '0' | 0.0000 | -0.0371 | -0.0371 | 1.0000 | 0.9636 | 0.964
36 | '0' | -0.0000 | -2.0405 | -2.0405 | 1.0000 | 0.1300 | 0.130
37 | ' =' | -0.0706 | -0.3756 | -0.3050 | 0.9318 | 0.6869 | 0.737
38 | ' ' | -0.0000 | -0.0004 | -0.0004 | 1.0000 | 0.9996 | 1.000
39 | '1' | -0.0000 | -1.2194 | -1.2194 | 1.0000 | 0.2954 | 0.295
40 | '6' | -0.0000 | -2.7427 | -2.7427 | 1.0000 | 0.0644 | 0.064
41 | '6' | -0.0000 | -6.3902 | -6.3902 | 1.0000 | 0.0017 | 1.7e-03
42 | ' guests' | -0.0189 | -11.9208 | -11.9019 | 0.9813 | 0.0000 | 6.8e-06
43 | '\n\n' | -0.0053 | -0.4357 | -0.4305 | 0.9948 | 0.6468 | 0.650
44 | 'Step' | -0.0000 | -4.2210 | -4.2210 | 1.0000 | 0.0147 | 0.015
45 | ' ' | -0.0000 | -0.0201 | -0.0201 | 1.0000 | 0.9801 | 0.980
46 | '2' | -0.0000 | -0.1663 | -0.1663 | 1.0000 | 0.8468 | 0.847
47 | ':' | 0.0000 | -3.0060 | -3.0060 | 1.0000 | 0.0495 | 0.049
48 | ' Calculate' | -0.0000 | -1.2081 | -1.2081 | 1.0000 | 0.2988 | 0.299
49 | ' the' | -0.0000 | -0.0210 | -0.0210 | 1.0000 | 0.9792 | 0.979
... 124 more tokens (see samples.json)
row 67 · uid 9da91639 · 149 tokens · mean|diff|=4.238 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Sean and Sierra invited 200 guests to their wedding.  If 83% of the guests RSVP with a Yes response and 9% of the guests RSVP with a No response, how many guests did not respond at all? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
First, calculate the number of guests who responded yes:
83% of 200 guests = 0.83 * 200 = 166 guests

Next, calculate the number of guests who responded no:
9% of 200 guests = 0.09 * 200 = 18 guests

Add the number of guests who responded yes and no to find the total who responded:
166 guests + 18 guests = 184 guests

Subtract the total number of initial guests from the number who responded to find those who did not respond:
200 guests - 184 guests = 16 guests

#### 16
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'First' | -0.4505 | -1.7562 | -1.3057 | 0.6373 | 0.1727 | 0.271
1 | ',' | -0.0004 | -0.0225 | -0.0221 | 0.9996 | 0.9777 | 0.978
2 | ' calculate' | -0.5992 | -2.1727 | -1.5736 | 0.5493 | 0.1139 | 0.207
3 | ' the' | -0.0006 | -0.0428 | -0.0423 | 0.9994 | 0.9581 | 0.959
4 | ' number' | -0.0356 | -1.6272 | -1.5916 | 0.9650 | 0.1965 | 0.204
5 | ' of' | -0.0000 | -0.0004 | -0.0004 | 1.0000 | 0.9996 | 1.000
6 | ' guests' | -0.0102 | -7.6619 | -7.6517 | 0.9898 | 0.0005 | 4.8e-04
7 | ' who' | -0.0047 | -8.6410 | -8.6363 | 0.9953 | 0.0002 | 1.8e-04
8 | ' responded' | -1.6168 | -13.7888 | -12.1720 | 0.1985 | 0.0000 | 5.2e-06
9 | ' yes' | -3.4887 | -17.2288 | -13.7401 | 0.0305 | 0.0000 | 1.1e-06
10 | ':\n' | -0.7896 | -3.4632 | -2.6736 | 0.4540 | 0.0313 | 0.069
11 | '8' | -0.6528 | -5.3122 | -4.6594 | 0.5206 | 0.0049 | 9.5e-03
12 | '3' | -0.0000 | -9.1355 | -9.1355 | 1.0000 | 0.0001 | 1.1e-04
13 | '%' | -0.0006 | -5.4343 | -5.4337 | 0.9994 | 0.0044 | 4.4e-03
14 | ' of' | -0.0005 | -0.5986 | -0.5981 | 0.9995 | 0.5496 | 0.550
15 | ' ' | -0.0000 | -3.4163 | -3.4163 | 1.0000 | 0.0328 | 0.033
16 | '2' | -0.0000 | -0.2405 | -0.2405 | 1.0000 | 0.7863 | 0.786
17 | '0' | 0.0000 | -0.0010 | -0.0010 | 1.0000 | 0.9990 | 0.999
18 | '0' | 0.0000 | -0.0007 | -0.0007 | 1.0000 | 0.9993 | 0.999
19 | ' guests' | -0.7046 | -10.1833 | -9.4788 | 0.4943 | 0.0000 | 7.6e-05
20 | ' =' | -0.0307 | -1.3147 | -1.2840 | 0.9698 | 0.2686 | 0.277
21 | ' ' | -0.0009 | -0.0024 | -0.0015 | 0.9991 | 0.9976 | 0.999
22 | '0' | -0.0077 | -7.6996 | -7.6919 | 0.9923 | 0.0005 | 4.6e-04
23 | '.' | 0.0000 | -8.4072 | -8.4072 | 1.0000 | 0.0002 | 2.2e-04
24 | '8' | -0.0000 | -6.5397 | -6.5397 | 1.0000 | 0.0014 | 1.4e-03
25 | '3' | -0.0000 | -9.1643 | -9.1643 | 1.0000 | 0.0001 | 1.0e-04
26 | ' *' | -0.0010 | -6.8446 | -6.8436 | 0.9990 | 0.0011 | 1.1e-03
27 | ' ' | 0.0000 | -0.0032 | -0.0032 | 1.0000 | 0.9968 | 0.997
28 | '2' | -0.0000 | -0.2983 | -0.2983 | 1.0000 | 0.7421 | 0.742
29 | '0' | 0.0000 | -0.0316 | -0.0316 | 1.0000 | 0.9689 | 0.969
30 | '0' | -0.0000 | -1.9899 | -1.9899 | 1.0000 | 0.1367 | 0.137
31 | ' =' | -0.0116 | -0.4193 | -0.4077 | 0.9884 | 0.6575 | 0.665
32 | ' ' | -0.0000 | -0.0006 | -0.0006 | 1.0000 | 0.9994 | 0.999
33 | '1' | -0.0000 | -1.2390 | -1.2390 | 1.0000 | 0.2897 | 0.290
34 | '6' | -0.0000 | -2.7359 | -2.7359 | 1.0000 | 0.0648 | 0.065
35 | '6' | -0.0000 | -6.5119 | -6.5119 | 1.0000 | 0.0015 | 1.5e-03
36 | ' guests' | -0.0247 | -12.0058 | -11.9811 | 0.9756 | 0.0000 | 6.3e-06
37 | '\n\n' | -0.0230 | -0.3714 | -0.3484 | 0.9772 | 0.6898 | 0.706
38 | 'Next' | -0.1463 | -6.4932 | -6.3469 | 0.8639 | 0.0015 | 1.8e-03
39 | ',' | -0.0000 | -0.0748 | -0.0747 | 1.0000 | 0.9279 | 0.928
40 | ' calculate' | -0.0012 | -1.8713 | -1.8702 | 0.9988 | 0.1539 | 0.154
41 | ' the' | -0.0001 | -0.0341 | -0.0341 | 0.9999 | 0.9664 | 0.967
42 | ' number' | -0.0002 | -1.5524 | -1.5522 | 0.9998 | 0.2117 | 0.212
43 | ' of' | -0.0001 | -0.0011 | -0.0010 | 0.9999 | 0.9989 | 0.999
44 | ' guests' | -0.0000 | -7.8068 | -7.8068 | 1.0000 | 0.0004 | 4.1e-04
45 | ' who' | -0.0001 | -8.4982 | -8.4981 | 0.9999 | 0.0002 | 2.0e-04
46 | ' responded' | -0.0028 | -11.9997 | -11.9969 | 0.9972 | 0.0000 | 6.2e-06
47 | ' no' | -0.0001 | -11.7483 | -11.7482 | 0.9999 | 0.0000 | 7.9e-06
48 | ':\n' | -0.0010 | -1.7106 | -1.7096 | 0.9990 | 0.1808 | 0.181
49 | '9' | -0.0000 | -6.2490 | -6.2490 | 1.0000 | 0.0019 | 1.9e-03
... 99 more tokens (see samples.json)
row 705 · uid 6e13868d · 151 tokens · mean|diff|=5.308 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
The largest frog can grow to weigh 10 times as much as the smallest frog. The largest frog weighs 120 pounds. How much more does the largest frog weigh than the smallest frog? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
To find out how much more the largest frog weighs than the smallest frog, we need to divide the weight of the largest frog by its weight being 10 times the smallest frog.

Largest frog weight = 120 pounds
Smallest frog weight = 120 pounds ÷ 10

Let's calculate the smallest frog's weight:
Smallest frog weight = 120 ÷ 10 = 12 pounds

Now, we subtract the smallest frog's weight from the largest frog's weight to find the difference:
Difference in weight = Largest frog weight - Smallest frog weight
Difference in weight = 120 - 12 = 108 pounds

#### 108
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'To' | -1.0691 | -1.9301 | -0.8611 | 0.3433 | 0.1451 | 0.423
1 | ' find' | -0.1006 | -2.3824 | -2.2818 | 0.9043 | 0.0923 | 0.102
2 | ' out' | -0.0467 | -10.7741 | -10.7274 | 0.9544 | 0.0000 | 2.2e-05
3 | ' how' | -0.0013 | -3.2887 | -3.2874 | 0.9987 | 0.0373 | 0.037
4 | ' much' | -0.0004 | -6.3073 | -6.3069 | 0.9996 | 0.0018 | 1.8e-03
5 | ' more' | -0.0095 | -6.4581 | -6.4487 | 0.9906 | 0.0016 | 1.6e-03
6 | ' the' | -0.0002 | -0.8568 | -0.8566 | 0.9998 | 0.4245 | 0.425
7 | ' largest' | -0.0009 | -10.0600 | -10.0592 | 0.9991 | 0.0000 | 4.3e-05
8 | ' frog' | -0.0003 | -10.1837 | -10.1834 | 0.9997 | 0.0000 | 3.8e-05
9 | ' weighs' | -0.0006 | -13.7774 | -13.7768 | 0.9994 | 0.0000 | 1.0e-06
10 | ' than' | -0.0143 | -13.3049 | -13.2906 | 0.9858 | 0.0000 | 1.7e-06
11 | ' the' | -0.0000 | -0.0365 | -0.0365 | 1.0000 | 0.9641 | 0.964
12 | ' smallest' | -0.0000 | -9.5825 | -9.5825 | 1.0000 | 0.0001 | 6.9e-05
13 | ' frog' | -0.0688 | -12.0658 | -11.9971 | 0.9335 | 0.0000 | 6.2e-06
14 | ',' | -0.0272 | -1.9186 | -1.8913 | 0.9731 | 0.1468 | 0.151
15 | ' we' | -0.0498 | -1.9336 | -1.8838 | 0.9515 | 0.1446 | 0.152
16 | ' need' | -1.4004 | -6.2406 | -4.8402 | 0.2465 | 0.0019 | 7.9e-03
17 | ' to' | -0.0007 | -8.5152 | -8.5144 | 0.9993 | 0.0002 | 2.0e-04
18 | ' divide' | -1.9924 | -5.4737 | -3.4813 | 0.1364 | 0.0042 | 0.031
19 | ' the' | -0.0044 | -0.0738 | -0.0694 | 0.9956 | 0.9289 | 0.933
20 | ' weight' | -0.3914 | -9.3660 | -8.9746 | 0.6761 | 0.0001 | 1.3e-04
21 | ' of' | -0.0030 | -0.2263 | -0.2233 | 0.9970 | 0.7975 | 0.800
22 | ' the' | -0.0001 | -0.0367 | -0.0367 | 0.9999 | 0.9639 | 0.964
23 | ' largest' | -0.0065 | -9.0084 | -9.0019 | 0.9935 | 0.0001 | 1.2e-04
24 | ' frog' | -0.0006 | -11.1142 | -11.1136 | 0.9994 | 0.0000 | 1.5e-05
25 | ' by' | -0.0187 | -7.7720 | -7.7533 | 0.9815 | 0.0004 | 4.3e-04
26 | ' its' | -1.5882 | -9.5723 | -7.9841 | 0.2043 | 0.0001 | 3.4e-04
27 | ' weight' | -0.7640 | -8.9020 | -8.1380 | 0.4658 | 0.0001 | 2.9e-04
28 | ' being' | -8.2587 | -7.9956 | 0.2632 | 0.0003 | 0.0003 | 1.30
29 | ' ' | -0.0488 | -0.9831 | -0.9343 | 0.9523 | 0.3742 | 0.393
30 | '1' | -0.0001 | -0.2543 | -0.2542 | 0.9999 | 0.7755 | 0.776
31 | '0' | -0.0028 | -0.8400 | -0.8372 | 0.9972 | 0.4317 | 0.433
32 | ' times' | -0.0013 | -4.7114 | -4.7101 | 0.9987 | 0.0090 | 9.0e-03
33 | ' the' | -0.7999 | -0.3637 | 0.4363 | 0.4494 | 0.6951 | 1.55
34 | ' smallest' | -0.2211 | -9.2701 | -9.0490 | 0.8016 | 0.0001 | 1.2e-04
35 | ' frog' | -0.1564 | -11.9589 | -11.8025 | 0.8552 | 0.0000 | 7.5e-06
36 | '.\n\n' | -1.2960 | -1.4481 | -0.1521 | 0.2736 | 0.2350 | 0.859
37 | 'L' | -2.5922 | -10.1524 | -7.5602 | 0.0749 | 0.0000 | 5.2e-04
38 | 'argest' | -0.0037 | -15.9276 | -15.9239 | 0.9963 | 0.0000 | 1.2e-07
39 | ' frog' | -0.0093 | -11.4455 | -11.4362 | 0.9907 | 0.0000 | 1.1e-05
40 | ' weight' | -0.7740 | -13.0171 | -12.2432 | 0.4612 | 0.0000 | 4.8e-06
41 | ' =' | -0.3599 | -0.3691 | -0.0092 | 0.6977 | 0.6914 | 0.991
42 | ' ' | -0.0338 | -0.0179 | 0.0159 | 0.9667 | 0.9822 | 1.02
43 | '1' | -0.0001 | -0.0477 | -0.0476 | 0.9999 | 0.9534 | 0.954
44 | '2' | -0.0054 | -2.0230 | -2.0177 | 0.9946 | 0.1323 | 0.133
45 | '0' | -0.0000 | -0.0010 | -0.0010 | 1.0000 | 0.9990 | 0.999
46 | ' pounds' | -0.0008 | -7.6527 | -7.6519 | 0.9992 | 0.0005 | 4.8e-04
47 | '\n' | -0.0529 | -0.9060 | -0.8530 | 0.9484 | 0.4041 | 0.426
48 | 'Sm' | -0.5606 | -11.5577 | -10.9971 | 0.5709 | 0.0000 | 1.7e-05
49 | 'allest' | -0.0007 | -15.0451 | -15.0444 | 0.9993 | 0.0000 | 2.9e-07
... 101 more tokens (see samples.json)
row 376 · uid 6e13868d · 337 tokens · mean|diff|=4.842 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
The largest frog can grow to weigh 10 times as much as the smallest frog. The largest frog weighs 120 pounds. How much more does the largest frog weigh than the smallest frog? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
To find out how much more the largest frog weighs than the smallest frog, we need to first determine the weight of the smallest frog.

1. The largest frog weighs 120 pounds.
2. The largest frog can grow to weigh 10 times as much as the smallest frog. 

To find the weight of the smallest frog, we divide the weight of the largest frog by 10:
Smallest frog weight = 120 pounds / 10

Now let's calculate it:

Smallest frog weight = 12 pounds / 10 = 12 pounds

Finally, to find out how much more the largest frog weighs than the smallest frog, we subtract the weight of the smallest frog from the weight of the largest frog:

Difference = Largest frog's weight - Smallest frog's weight
Difference = 120 pounds - 12 pounds

#### 108

However, I made a mistake in my initial reasoning. The weight of the smallest frog isn't just 12 pounds, but it's the weight of the smallest frog, which is much less than this value based on the 10 times relationship.

To correctly calculate the weight of the smallest frog, we should multiply the weight of the largest frog by 1/10:

Smallest frog weight = 120 pounds / 10 = 12 pounds

Now, let's correctly calculate the difference using 12 pounds as the weight of the smallest frog:

Difference = Largest frog's weight - Smallest frog's weight
Difference = 120 pounds - 12 pounds

#### 108
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'To' | -1.0891 | -1.9301 | -0.8410 | 0.3365 | 0.1451 | 0.431
1 | ' find' | -0.0893 | -2.3824 | -2.2930 | 0.9145 | 0.0923 | 0.101
2 | ' out' | -0.0561 | -10.7741 | -10.7180 | 0.9455 | 0.0000 | 2.2e-05
3 | ' how' | -0.0010 | -3.2887 | -3.2877 | 0.9990 | 0.0373 | 0.037
4 | ' much' | -0.0004 | -6.3073 | -6.3069 | 0.9996 | 0.0018 | 1.8e-03
5 | ' more' | -0.0097 | -6.4581 | -6.4485 | 0.9904 | 0.0016 | 1.6e-03
6 | ' the' | -0.0002 | -0.8568 | -0.8566 | 0.9998 | 0.4245 | 0.425
7 | ' largest' | -0.0009 | -10.0600 | -10.0592 | 0.9991 | 0.0000 | 4.3e-05
8 | ' frog' | -0.0003 | -10.1837 | -10.1834 | 0.9997 | 0.0000 | 3.8e-05
9 | ' weighs' | -0.0006 | -13.7774 | -13.7768 | 0.9994 | 0.0000 | 1.0e-06
10 | ' than' | -0.0143 | -13.3049 | -13.2906 | 0.9858 | 0.0000 | 1.7e-06
11 | ' the' | -0.0000 | -0.0365 | -0.0365 | 1.0000 | 0.9641 | 0.964
12 | ' smallest' | -0.0000 | -9.5825 | -9.5825 | 1.0000 | 0.0001 | 6.9e-05
13 | ' frog' | -0.0687 | -12.0658 | -11.9971 | 0.9336 | 0.0000 | 6.2e-06
14 | ',' | -0.0241 | -1.9186 | -1.8945 | 0.9762 | 0.1468 | 0.150
15 | ' we' | -0.0536 | -1.9336 | -1.8800 | 0.9478 | 0.1446 | 0.153
16 | ' need' | -1.4407 | -6.2406 | -4.7999 | 0.2368 | 0.0019 | 8.2e-03
17 | ' to' | -0.0006 | -8.5152 | -8.5146 | 0.9994 | 0.0002 | 2.0e-04
18 | ' first' | -1.5526 | -4.7237 | -3.1712 | 0.2117 | 0.0089 | 0.042
19 | ' determine' | -0.5707 | -3.6648 | -3.0941 | 0.5651 | 0.0256 | 0.045
20 | ' the' | -0.0348 | -0.0416 | -0.0068 | 0.9658 | 0.9592 | 0.993
21 | ' weight' | -0.0159 | -10.5655 | -10.5496 | 0.9842 | 0.0000 | 2.6e-05
22 | ' of' | -0.0013 | -0.2204 | -0.2192 | 0.9987 | 0.8022 | 0.803
23 | ' the' | -0.0001 | -0.0223 | -0.0222 | 0.9999 | 0.9779 | 0.978
24 | ' smallest' | -0.0005 | -9.4306 | -9.4301 | 0.9995 | 0.0001 | 8.0e-05
25 | ' frog' | -0.0001 | -10.8950 | -10.8949 | 0.9999 | 0.0000 | 1.9e-05
26 | '.\n\n' | -1.3582 | -1.9524 | -0.5942 | 0.2571 | 0.1419 | 0.552
27 | '1' | -1.6017 | -0.9273 | 0.6744 | 0.2016 | 0.3956 | 1.96
28 | '.' | -0.0037 | -0.1456 | -0.1418 | 0.9963 | 0.8645 | 0.868
29 | ' The' | -0.3951 | -0.7518 | -0.3566 | 0.6736 | 0.4715 | 0.700
30 | ' largest' | -0.1238 | -10.5936 | -10.4697 | 0.8835 | 0.0000 | 2.8e-05
31 | ' frog' | -0.0001 | -11.5054 | -11.5053 | 0.9999 | 0.0000 | 1.0e-05
32 | ' weighs' | -0.0447 | -10.5102 | -10.4655 | 0.9563 | 0.0000 | 2.9e-05
33 | ' ' | -0.0016 | -0.0263 | -0.0248 | 0.9984 | 0.9740 | 0.976
34 | '1' | -0.0001 | -0.1702 | -0.1701 | 0.9999 | 0.8435 | 0.844
35 | '2' | -0.1131 | -0.9061 | -0.7930 | 0.8931 | 0.4041 | 0.452
36 | '0' | -0.0000 | -0.0002 | -0.0002 | 1.0000 | 0.9998 | 1.000
37 | ' pounds' | -0.0000 | -4.5273 | -4.5273 | 1.0000 | 0.0108 | 0.011
38 | '.\n' | -0.0478 | -1.3257 | -1.2778 | 0.9533 | 0.2656 | 0.279
39 | '2' | -0.0001 | -1.9235 | -1.9234 | 0.9999 | 0.1461 | 0.146
40 | '.' | -0.0000 | -0.5815 | -0.5815 | 1.0000 | 0.5591 | 0.559
41 | ' The' | -0.1835 | -0.3392 | -0.1557 | 0.8324 | 0.7123 | 0.856
42 | ' largest' | -0.0104 | -10.2946 | -10.2842 | 0.9897 | 0.0000 | 3.4e-05
43 | ' frog' | -0.0003 | -11.3759 | -11.3756 | 0.9997 | 0.0000 | 1.1e-05
44 | ' can' | -0.9238 | -6.5918 | -5.6680 | 0.3970 | 0.0014 | 3.5e-03
45 | ' grow' | -0.0662 | -17.5500 | -17.4838 | 0.9360 | 0.0000 | 2.6e-08
46 | ' to' | -0.0091 | -7.6793 | -7.6703 | 0.9910 | 0.0005 | 4.7e-04
47 | ' weigh' | -0.0395 | -18.6062 | -18.5667 | 0.9613 | 0.0000 | 8.6e-09
48 | ' ' | -0.0005 | -0.0835 | -0.0830 | 0.9995 | 0.9199 | 0.920
49 | '1' | -0.0000 | -0.3328 | -0.3328 | 1.0000 | 0.7169 | 0.717
... 287 more tokens (see samples.json)
row 1004 · uid ca739be8 · 90 tokens · mean|diff|=7.111 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Janice gave all three dozens of pebbles from her trip to her friends. Each of her friends got 4 pebbles. How many friends received pebbles? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
First, let's calculate the total number of pebbles Janice had:
3 dozens x 12 pebbles/dozen = 36 pebbles

Now, we divide the total number of pebbles by the number of pebbles each friend got to find out how many friends received pebbles:
36 pebbles ÷ 4 pebbles/friend = 9 friends

#### 9
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'First' | -1.0209 | -1.9090 | -0.8881 | 0.3603 | 0.1482 | 0.411
1 | ',' | -0.0008 | -0.0241 | -0.0233 | 0.9992 | 0.9762 | 0.977
2 | ' let' | -2.1349 | -2.5461 | -0.4112 | 0.1183 | 0.0784 | 0.663
3 | "'s" | -0.0001 | -0.0027 | -0.0027 | 0.9999 | 0.9973 | 0.997
4 | ' calculate' | -0.7362 | -0.3854 | 0.3508 | 0.4789 | 0.6802 | 1.42
5 | ' the' | -0.0133 | -0.0415 | -0.0281 | 0.9868 | 0.9594 | 0.972
6 | ' total' | -0.0001 | -0.1198 | -0.1198 | 0.9999 | 0.8871 | 0.887
7 | ' number' | -0.0003 | -0.0572 | -0.0569 | 0.9997 | 0.9444 | 0.945
8 | ' of' | -0.0000 | -0.0001 | -0.0001 | 1.0000 | 0.9999 | 1.000
9 | ' pe' | -0.0008 | -6.9657 | -6.9649 | 0.9992 | 0.0009 | 9.4e-04
10 | 'bb' | -0.0000 | -16.4632 | -16.4631 | 1.0000 | 0.0000 | 7.1e-08
11 | 'les' | 0.0000 | -16.9375 | -16.9375 | 1.0000 | 0.0000 | 4.4e-08
12 | ' Jan' | -0.0692 | -13.3776 | -13.3085 | 0.9332 | 0.0000 | 1.7e-06
13 | 'ice' | -0.0000 | -16.5210 | -16.5210 | 1.0000 | 0.0000 | 6.7e-08
14 | ' had' | -0.5468 | -3.1910 | -2.6441 | 0.5788 | 0.0411 | 0.071
15 | ':\n' | -1.0805 | -2.1311 | -1.0506 | 0.3394 | 0.1187 | 0.350
16 | '3' | -0.1611 | -4.6353 | -4.4742 | 0.8512 | 0.0097 | 0.011
17 | ' dozens' | -0.0242 | -7.7812 | -7.7570 | 0.9761 | 0.0004 | 4.3e-04
18 | ' x' | -5.0102 | -5.5118 | -0.5016 | 0.0067 | 0.0040 | 0.606
19 | ' ' | -0.0001 | -0.0018 | -0.0018 | 0.9999 | 0.9982 | 0.998
20 | '1' | -0.0001 | -3.2402 | -3.2401 | 0.9999 | 0.0392 | 0.039
21 | '2' | -0.0000 | -0.2780 | -0.2780 | 1.0000 | 0.7573 | 0.757
22 | ' pe' | -0.0009 | -9.0770 | -9.0760 | 0.9991 | 0.0001 | 1.1e-04
23 | 'bb' | -0.0000 | -19.9604 | -19.9604 | 1.0000 | 0.0000 | 2.1e-09
24 | 'les' | 0.0000 | -18.9864 | -18.9864 | 1.0000 | 0.0000 | 5.7e-09
25 | '/do' | -0.1042 | -15.4583 | -15.3541 | 0.9010 | 0.0000 | 2.1e-07
26 | 'zen' | -0.0001 | -19.3883 | -19.3882 | 0.9999 | 0.0000 | 3.8e-09
27 | ' =' | -0.0001 | -0.6627 | -0.6626 | 0.9999 | 0.5155 | 0.516
28 | ' ' | -0.0000 | -0.0014 | -0.0014 | 1.0000 | 0.9986 | 0.999
29 | '3' | -0.0000 | -1.0521 | -1.0521 | 1.0000 | 0.3492 | 0.349
30 | '6' | -0.0001 | -2.8649 | -2.8649 | 0.9999 | 0.0570 | 0.057
31 | ' pe' | -0.0000 | -9.0531 | -9.0531 | 1.0000 | 0.0001 | 1.2e-04
32 | 'bb' | -0.0000 | -19.3729 | -19.3729 | 1.0000 | 0.0000 | 3.9e-09
33 | 'les' | 0.0000 | -19.6614 | -19.6614 | 1.0000 | 0.0000 | 2.9e-09
34 | '\n\n' | -0.0041 | -0.3951 | -0.3909 | 0.9959 | 0.6736 | 0.676
35 | 'Now' | -0.3783 | -1.8440 | -1.4658 | 0.6851 | 0.1582 | 0.231
36 | ',' | -0.0821 | -0.0714 | 0.0108 | 0.9212 | 0.9311 | 1.01
37 | ' we' | -1.6273 | -1.0264 | 0.6009 | 0.1965 | 0.3583 | 1.82
38 | ' divide' | -2.2202 | -5.2458 | -3.0256 | 0.1086 | 0.0053 | 0.049
39 | ' the' | -0.0616 | -0.0419 | 0.0197 | 0.9403 | 0.9590 | 1.02
40 | ' total' | -0.0002 | -0.1265 | -0.1263 | 0.9998 | 0.8811 | 0.881
41 | ' number' | -0.0138 | -0.1678 | -0.1540 | 0.9863 | 0.8455 | 0.857
42 | ' of' | -0.0002 | -0.0024 | -0.0022 | 0.9998 | 0.9976 | 0.998
43 | ' pe' | -0.0000 | -7.5535 | -7.5535 | 1.0000 | 0.0005 | 5.2e-04
44 | 'bb' | -0.0000 | -16.7424 | -16.7424 | 1.0000 | 0.0000 | 5.4e-08
45 | 'les' | 0.0000 | -16.9947 | -16.9947 | 1.0000 | 0.0000 | 4.2e-08
46 | ' by' | -0.0005 | -7.0740 | -7.0734 | 0.9995 | 0.0008 | 8.5e-04
47 | ' the' | -0.0080 | -0.5358 | -0.5278 | 0.9920 | 0.5852 | 0.590
48 | ' number' | -0.0057 | -2.2160 | -2.2103 | 0.9943 | 0.1090 | 0.110
49 | ' of' | -0.1682 | -0.0105 | 0.1577 | 0.8452 | 0.9896 | 1.17
... 40 more tokens (see samples.json)
row 34 · uid ca739be8 · 116 tokens · mean|diff|=6.725 · reward=1.00
Prompt
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Janice gave all three dozens of pebbles from her trip to her friends. Each of her friends got 4 pebbles. How many friends received pebbles? Let's think step by step and output the final answer after "####".<|im_end|>
<|im_start|>assistant
vLLM response
First, let's determine the total number of pebbles Janice had. Since she had three dozens, we calculate the total number of pebbles as follows:

3 dozens * 12 pebbles/dozen = 36 pebbles

Each friend received 4 pebbles, so to find out how many friends received pebbles, we divide the total number of pebbles by the number of pebbles each friend received:

36 pebbles / 4 pebbles per friend = 9 friends

#### 9
pos | token | vLLM logp | FSDP logp | diff | p_vLLM | p_train | r = p_train/p_vLLM
0 | 'First' | -1.0223 | -1.9090 | -0.8866 | 0.3598 | 0.1482 | 0.412
1 | ',' | -0.0012 | -0.0241 | -0.0229 | 0.9988 | 0.9762 | 0.977
2 | ' let' | -2.1716 | -2.5461 | -0.3745 | 0.1140 | 0.0784 | 0.688
3 | "'s" | -0.0001 | -0.0027 | -0.0027 | 0.9999 | 0.9973 | 0.997
4 | ' determine' | -1.3307 | -3.5104 | -2.1797 | 0.2643 | 0.0299 | 0.113
5 | ' the' | -0.0622 | -0.0569 | 0.0052 | 0.9397 | 0.9446 | 1.01
6 | ' total' | -0.0007 | -0.1336 | -0.1328 | 0.9993 | 0.8750 | 0.876
7 | ' number' | -0.0004 | -0.0459 | -0.0455 | 0.9996 | 0.9552 | 0.956
8 | ' of' | -0.0000 | -0.0001 | -0.0001 | 1.0000 | 0.9999 | 1.000
9 | ' pe' | -0.0020 | -6.9941 | -6.9921 | 0.9980 | 0.0009 | 9.2e-04
10 | 'bb' | -0.0001 | -16.4588 | -16.4587 | 0.9999 | 0.0000 | 7.1e-08
11 | 'les' | 0.0000 | -16.8583 | -16.8583 | 1.0000 | 0.0000 | 4.8e-08
12 | ' Jan' | -0.0341 | -13.1849 | -13.1508 | 0.9665 | 0.0000 | 1.9e-06
13 | 'ice' | -0.0000 | -16.9292 | -16.9292 | 1.0000 | 0.0000 | 4.4e-08
14 | ' had' | -0.3370 | -3.1847 | -2.8477 | 0.7139 | 0.0414 | 0.058
15 | '.' | -0.5390 | -1.1540 | -0.6150 | 0.5833 | 0.3154 | 0.541
16 | ' Since' | -0.7129 | -3.8398 | -3.1269 | 0.4902 | 0.0215 | 0.044
17 | ' she' | -0.1806 | -2.5916 | -2.4110 | 0.8348 | 0.0749 | 0.090
18 | ' had' | -0.3257 | -3.0678 | -2.7421 | 0.7220 | 0.0465 | 0.064
19 | ' three' | -0.0583 | -8.3890 | -8.3307 | 0.9434 | 0.0002 | 2.4e-04
20 | ' dozens' | -0.0142 | -5.5190 | -5.5048 | 0.9859 | 0.0040 | 4.1e-03
21 | ',' | -1.1469 | -1.7666 | -0.6198 | 0.3176 | 0.1709 | 0.538
22 | ' we' | -0.4253 | -3.8744 | -3.4492 | 0.6536 | 0.0208 | 0.032
23 | ' calculate' | -0.3094 | -0.5867 | -0.2773 | 0.7339 | 0.5562 | 0.758
24 | ' the' | -3.0911 | -0.0624 | 3.0287 | 0.0455 | 0.9395 | 20.67
25 | ' total' | -0.0183 | -0.1162 | -0.0980 | 0.9819 | 0.8903 | 0.907
26 | ' number' | -1.6160 | -0.1499 | 1.4661 | 0.1987 | 0.8608 | 4.33
27 | ' of' | -0.0503 | -0.0051 | 0.0451 | 0.9510 | 0.9949 | 1.05
28 | ' pe' | -0.0008 | -7.5173 | -7.5165 | 0.9992 | 0.0005 | 5.4e-04
29 | 'bb' | -0.0000 | -17.4416 | -17.4415 | 1.0000 | 0.0000 | 2.7e-08
30 | 'les' | 0.0000 | -17.5843 | -17.5843 | 1.0000 | 0.0000 | 2.3e-08
31 | ' as' | -0.2501 | -7.5775 | -7.3274 | 0.7787 | 0.0005 | 6.6e-04
32 | ' follows' | -0.1428 | -14.5207 | -14.3778 | 0.8669 | 0.0000 | 5.7e-07
33 | ':\n\n' | -0.2948 | -1.4458 | -1.1511 | 0.7447 | 0.2356 | 0.316
34 | '3' | -0.4686 | -4.5624 | -4.0938 | 0.6259 | 0.0104 | 0.017
35 | ' dozens' | -0.0310 | -8.5148 | -8.4839 | 0.9695 | 0.0002 | 2.1e-04
36 | ' *' | -0.7912 | -1.0955 | -0.3043 | 0.4533 | 0.3344 | 0.738
37 | ' ' | -0.0022 | -0.0020 | 0.0003 | 0.9978 | 0.9980 | 1.00
38 | '1' | -0.0001 | -3.7128 | -3.7127 | 0.9999 | 0.0244 | 0.024
39 | '2' | -0.0002 | -0.4215 | -0.4213 | 0.9998 | 0.6561 | 0.656
40 | ' pe' | -0.0012 | -10.5660 | -10.5648 | 0.9988 | 0.0000 | 2.6e-05
41 | 'bb' | -0.0000 | -20.3070 | -20.3069 | 1.0000 | 0.0000 | 1.5e-09
42 | 'les' | 0.0000 | -18.6896 | -18.6896 | 1.0000 | 0.0000 | 7.6e-09
43 | '/do' | -0.0894 | -15.8296 | -15.7401 | 0.9144 | 0.0000 | 1.5e-07
44 | 'zen' | -0.0000 | -19.3306 | -19.3306 | 1.0000 | 0.0000 | 4.0e-09
45 | ' =' | -0.0034 | -0.6942 | -0.6908 | 0.9966 | 0.4995 | 0.501
46 | ' ' | -0.0003 | -0.0023 | -0.0020 | 0.9997 | 0.9977 | 0.998
47 | '3' | -0.0000 | -0.9641 | -0.9641 | 1.0000 | 0.3813 | 0.381
48 | '6' | -0.0068 | -3.0552 | -3.0484 | 0.9932 | 0.0471 | 0.047
49 | ' pe' | -0.0000 | -9.2901 | -9.2900 | 1.0000 | 0.0001 | 9.2e-05
... 66 more tokens (see samples.json)

Possible direction (not yet decided)

Not a committed plan, only a pointer for later. One candidate for the regime where the mismatch is too large for plain TIS is the recent paper Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL (Ye et al., arXiv:2603.19470), which controls heavy-tailed importance ratios instead of masking the offending tokens out. It is shown for quantization / kernel mismatch, not compression, so it would need adapting, and we have not decided whether to use it.


Raw data: step_batch.pt / step_batch.npz (full batch tensors), summary.json, samples.json. Generated by tim_report.py.