A 7B math fine-tune on 8× H100: SFT +6.4, DPO +0.6
Fine-tune a 7B reasoning model on competition math with two stages of LoRA — supervised fine-tuning on self-distilled chain-of-thought, then direct preference optimization on chosen-vs-rejected reasoning pairs — and measure the delta on four held-out benchmarks. The whole pipeline (synthetic data curation, SFT, DPO, two merge steps, and a 3-model × 4-benchmark eval grid) ran on one 8× H100 SXM pod in ~3.5 hours for ~$93.
SFT delivered +6.4 pp averaged across GSM8K, MATH-500, AIME 2024+2025, and MathNet. DPO at the chosen configuration was a +0.6 pp no-op — and the training-time reward margin near zero predicted that before any eval ran.
| Benchmark | Baseline | SFT | SFT+DPO | Δ SFT | Δ DPO |
|---|---|---|---|---|---|
| GSM8K (n=200) | 85.0% | 79.5% | 81.0% | −5.5 pp | +1.5 pp |
| MATH-500 (n=500) | 60.0% | 73.8% | 73.8% | +13.8 pp | 0.0 pp |
| AIME 2024+2025 (n=60) | 5.0% | 11.7% | 11.7% | +6.7 pp | 0.0 pp |
| MathNet (n=500) | 23.6% | 34.0% | 35.0% | +10.4 pp | +1.0 pp |
| Average | 43.4% | 49.8% | 50.4% | +6.4 pp | +0.6 pp |
Figure: Greedy accuracy by benchmark (max_new_tokens=2048, no BoN). Grouped bars for Baseline, SFT, and SFT+DPO on GSM8K, MATH-500, AIME 24+25, and MathNet; y-axis: accuracy (%).
The point of this run was not SOTA. Teacher and student share a base model, so the achievable gain is bounded by self-distillation by construction — the model is being asked to imitate itself more reliably. The point was a working end-to-end pipeline that exercises a defined set of techniques (synthetic curation, LoRA, DeepSpeed ZeRO-3, vLLM tensor parallelism, math_verify grading, preference optimization, multi-benchmark eval) and produces a measurable, honest delta.
Model and the GQA divisibility constraint
| Component | Value |
|---|---|
| Base | DeepSeek-R1-Distill-Qwen-7B |
| Teacher | DeepSeek-R1-Distill-Qwen-7B (same model self-distills) |
| dtype | bfloat16 |
| Attention | 28 query heads, 4 KV heads (GQA) |
The head counts decide how many GPUs the model can use. vLLM’s tensor parallelism shards attention heads across GPUs, so the TP degree must divide both the query-head count (28) and the KV-head count (4). The binding constraint is the 4 KV heads: TP must divide 4, so TP ∈ {1, 2, 4} — never 8. That constraint is baked straight into the data-gen CLI so it can’t be violated by accident:
parser.add_argument("--tp", type=int, default=1, choices=[1, 2, 4],
                    help="vLLM tensor_parallel_size. Must divide BOTH num_attention_heads "
                         "AND num_key_value_heads. DeepSeek-R1-Distill-Qwen-7B has 28 heads "
                         "and 4 KV heads (GQA), so valid values are {1, 2, 4}.")
argparse rejects --tp 8 before any CUDA loads. The cost is that an 8-GPU pod runs vLLM at TP=4, leaving four cards idle through every generation and eval phase — an unavoidable ~$13 on this hardware.
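The divisibility rule above can be checked mechanically. This is a minimal sketch, assuming the head counts from the model table; `valid_tp_degrees` is a hypothetical helper for illustration and is not part of the repo.

```python
def valid_tp_degrees(num_attention_heads: int, num_kv_heads: int, num_gpus: int) -> list[int]:
    """Return every TP degree <= num_gpus that divides both head counts."""
    return [
        tp
        for tp in range(1, num_gpus + 1)
        if num_attention_heads % tp == 0 and num_kv_heads % tp == 0
    ]


# Head counts for DeepSeek-R1-Distill-Qwen-7B, as listed in the table above.
print(valid_tp_degrees(28, 4, 8))  # [1, 2, 4]
```

Deriving the `choices` list this way, rather than hard-coding it, would keep the guard correct if the base model changes.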
Data: teacher self-distillation, three runs to fill the pref floor
The teacher (same weights as the student base) loads into vLLM and samples each NuminaMath-TIR problem n=3 times under a code-CoT instruction. Every completion is graded for symbolic equivalence against the gold answer. The grading is the single source of truth for “is this correct?”, shared by both data-gen and eval:
def is_equivalent(predicted: str, gold: str) -> bool:
    """Symbolic if possible (math_verify), else normalized string compare."""
    pred_n, gold_n = _normalize(predicted), _normalize(gold)
    if pred_n == gold_n:
        return True
    if _HAS_MATH_VERIFY:
        try:
            # parsing_timeout=0 skips the multiprocessing path that breaks on Windows;
            # safe for short boxed answers.
            p = _mv_parse(_boxed(predicted), parsing_timeout=0)
            g = _mv_parse(_boxed(gold), parsing_timeout=0)
            if p and g and _mv_verify(g, p, timeout_seconds=0):
                return True
        except Exception:
            pass
    return False
Two non-obvious details there. math_verify.parse only fires when it sees a recognized anchor, so a bare answer like 42 is wrapped to \boxed{42} before parsing (_boxed). And parsing_timeout=0 is mandatory on Windows — the default spins up a multiprocessing watchdog that deadlocks; setting it to 0 keeps grading single-process, which is fine for short answers.
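The body of `_boxed` is not shown in this post. The sketch below is only a guess at its shape, assuming the helper leaves already-boxed answers untouched and wraps everything else; the name `boxed` and its exact behavior are hypothetical.

```python
def boxed(answer: str) -> str:
    """Wrap a bare answer in \\boxed{...} unless it already contains one."""
    answer = answer.strip()
    if "\\boxed{" in answer:
        # Already anchored; wrapping again would nest the boxes.
        return answer
    return "\\boxed{" + answer + "}"


print(boxed("42"))  # \boxed{42}
```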
Each problem’s samples are split into correct and wrong, and a preference pair is harvested only when a problem has at least one of each:
# SFT data: keep one correct response per problem
if correct_responses:
    f_sft.write(json.dumps({"problem": problems[i],
                            "response": correct_responses[0],
                            "gold_answer": gold_answer}) + "\n")
# DPO pref pair: needs ≥1 correct AND ≥1 wrong from the SAME problem
if correct_responses and wrong_responses:
    f_pref.write(json.dumps({"prompt": problem_prompt,
                             "chosen": correct_responses[0],
                             "rejected": wrong_responses[0]}) + "\n")
That “≥1 correct AND ≥1 wrong” condition is where data-gen got expensive. It took three runs:
| Run | Config | SFT | Pref | Why |
|---|---|---|---|---|
| 1 | temp=0.7, seed=42 | 2535 | 270 | seed collapsed n=3 diversity → only 5.4% of problems mixed-correctness |
| 2 | temp=1.0, no seed | 2884 | 1243 | same 5000 problems, pref harvest jumped 4.6× |
| 3 | temp=1.0, no seed, problems 5000–6999 | 1128 | 448 | top-up to clear the pref-pair floor |
Combined: 3663 SFT examples and 1961 preference pairs.
The culprit was these two lines, which set a seed on a multi-sample request:
sampling = SamplingParams(
    n=args.n_samples,
    temperature=0.7,
    top_p=0.95,
    max_tokens=MAX_NEW_TOKENS,
    seed=SEED,  # ← this collapses diversity across the n samples
)
Setting seed while requesting n>1 collapsed the diversity of samples within a single request: all three completions per problem came out nearly identical, so almost every problem was all-correct or all-wrong, and the pref harvest (which needs both) was starved at 270 pairs from 5000 problems. The fix was two small changes, removing seed and raising temperature to 1.0, and the harvest jumped to 1243 on the same 5000 problems. Because both changed in the same run, the table does not isolate how much of the jump came from each.
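A starved harvest like run 1's 5.4% is cheap to detect on a small pilot batch before committing to thousands of problems. This is a minimal sketch, assuming the grader's verdicts are available as one list of booleans per problem; `mixed_correctness_rate` is a hypothetical helper, not code from the repo.

```python
def mixed_correctness_rate(graded: list[list[bool]]) -> float:
    """Fraction of problems whose samples include both a correct and a wrong answer."""
    if not graded:
        return 0.0
    # A problem yields a preference pair only if its samples disagree in correctness.
    mixed = sum(1 for samples in graded if any(samples) and not all(samples))
    return mixed / len(graded)


# Three problems with n=3 samples each: all-correct, all-wrong, and mixed.
pilot = [[True, True, True], [False, False, False], [True, False, True]]
rate = mixed_correctness_rate(pilot)
print(f"{rate:.1%} of problems can yield a preference pair")
```

Multiplying this rate by the planned problem count gives a rough forecast of the pair yield, which is the number that has to clear the pref-pair floor.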
SFT, and a guard that aborts the packing trap
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
| Knob | Value |
|---|---|
| Trainable | 40.4M (0.53% of 7.66B) |
| Epochs | 3 |
| Effective batch | 64 (per-device 2 × grad-accum 4 × 8 GPUs) |
| LR | 2e-4 cosine, 3% warmup |
| Packing | disabled |
| Max seq len | 2048 |
| Sharding | DeepSpeed ZeRO-3 |
| Steps | 171 (57/epoch × 3) |
| train_loss | 1.25 → 0.245 |
accelerate launch --config_file src/accelerate_config.yaml src/sft.py
packing=disabled is a scar from a prior run, and the code now refuses to launch if the same mistake recurs:
PACKING = False
steps_per_epoch = max(1, len(ds) // EFFECTIVE_BATCH)
total_steps = steps_per_epoch * NUM_EPOCHS
if steps_per_epoch < MIN_STEPS_PER_EPOCH:  # = 30
    sys.exit(
        f"[ABORT] steps_per_epoch={steps_per_epoch} < {MIN_STEPS_PER_EPOCH}. "
        f"This is the packing trap from 2026-05-19. "
        f"Either set PACKING=False, lower max_seq_length, or get more data."
    )
The history: an earlier run packed 2,439 short examples into ~6 packed sequences per epoch at max_seq_length=2048 — which is 6 gradient steps per epoch. The naive “fix” of bumping epochs 3 → 12 just showed the model the same six sequences twelve times and overfit in 15 minutes. The real fix is to not pack a dataset this small, and to fail loudly before training if too few steps would result. The general principle: with packing on, you measure training work in tokens, not steps — the step count lies.
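The arithmetic behind the trap can be sketched as follows. This is an illustration only: `steps_per_epoch` here is a hypothetical standalone function, and the mean length of 330 tokens per example is a made-up figure, not a measurement from either run.

```python
import math


def steps_per_epoch(n_examples: int, effective_batch: int, *, packing: bool = False,
                    mean_tokens: float = 0.0, max_seq_length: int = 2048) -> int:
    """Optimizer steps per epoch, with or without sequence packing."""
    if packing:
        # Packing concatenates examples, so the row count becomes
        # total tokens divided by the window size, not the example count.
        n_rows = math.ceil(n_examples * mean_tokens / max_seq_length)
    else:
        n_rows = n_examples
    return max(1, n_rows // effective_batch)


# Unpacked: the 3663-example SFT set at effective batch 64.
print(steps_per_epoch(3663, 64))  # 57

# Packed: short examples collapse into far fewer rows (mean_tokens is hypothetical).
print(steps_per_epoch(2439, 64, packing=True, mean_tokens=330))
```

The shorter the examples are relative to `max_seq_length`, the more rows packing removes, which is why a small dataset of short examples is the worst case.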
For a 7B LoRA, ZeRO-3 isn’t strictly required (it fits on an H100 80GB with plain DDP — usage was ~20 GB/card, ~60 GB headroom). I used it anyway to exercise the technique and keep the same launch path valid for a full fine-tune at this size.
The two-merge requirement
DPO loads the SFT-merged model as its base, not the LoRA adapter directly. So the pipeline needs two merge calls — one after SFT, one after DPO — each through a single-GPU merge_lora.py (~30 seconds, ~15 GB BF16 output). DPO then trains a second, fresh LoRA on top of the merged SFT model, not stacked on the first adapter.
python src/merge_lora.py --adapter checkpoints/lora_v5 --out checkpoints/merged_sft_v5
# ... DPO trains lora_dpo_v5 on top of merged_sft_v5 ...
python src/merge_lora.py --base checkpoints/merged_sft_v5 \
    --adapter checkpoints/lora_dpo_v5 --out checkpoints/merged_dpo_v5
DPO: one flag that crashed, and a no-op that the metrics predicted
The reference model is the SFT model with its DPO adapter disabled — same weights, no second copy in VRAM. So DPOTrainer is constructed with ref_model=None:
trainer = DPOTrainer(model=model, ref_model=None, args=dpo_config, train_dataset=ds, tokenizer=tokenizer)
The committed config carried an optimization that does not survive ZeRO-3:
# precompute_ref_log_probs=True does ONE pass through the reference model first,
# caches per-sample log probs, then drops the reference...
precompute_ref_log_probs=True,
loss_type="sigmoid",
Under DeepSpeed ZeRO-3 in trl 0.12+, that raises a hard ValueError at DPOTrainer init — ZeRO-3 can’t correctly hold both policy and reference parameter shards while precomputing. The fix was a single flip to False, which falls back to live reference passes via PEFT’s adapter-disable trick. The cost: each step now runs four forward passes (chosen + rejected × policy + reference) instead of one, so DPO step time is ~3× SFT step time. (A cosmetic side effect: DeepSpeed prints Invalidate trace cache @ step 10 10–20× per step, because those four passes keep confusing ZeRO-3’s parameter-prefetch trace into rebuilds. Noise, not a bug.)
Catching that crash cheaply is what the --smoke-steps flag is for — a 5-step run gates the full hour:
accelerate launch --config_file src/accelerate_config.yaml src/dpo.py --smoke-steps 5
| Knob | Value |
|---|---|
| LR | 5e-7 cosine, 3% warmup |
| β | 0.1 |
| Epochs | 2 |
| Steps | 60 (30/epoch × 2) |
| Final rewards/accuracies | 0.225 |
| Final rewards/margins | −0.026 |
The result was a clean no-op. MATH-500 and AIME produced identical correct counts under SFT and SFT+DPO (369/500 and 7/60). At temperature 0, identical counts mean DPO did not flip a single answer extraction on those two benchmarks. The +1.5 pp on GSM8K and +1.0 on MathNet are within noise for n=200 and n=500.
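The "within noise" claim can be sanity-checked with a binomial standard error. This is a rough sketch that treats each benchmark as n independent problems; it ignores that both models were scored on the same problems, so read it as an order-of-magnitude check, not a significance test.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# SFT accuracies and sample sizes from the results table.
for name, acc, n in [("GSM8K", 0.795, 200), ("MathNet", 0.340, 500)]:
    se_pp = 100 * accuracy_standard_error(acc, n)
    print(f"{name}: +/- {se_pp:.1f} pp (one standard error)")
```

One standard error comes out near 2.9 pp for GSM8K and 2.1 pp for MathNet, so the +1.5 pp and +1.0 pp deltas each sit inside a single standard error.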
This wasn’t a data problem — the 1961 pairs were clean (0% identical chosen/rejected, healthy length distributions). It was a configuration problem, and the training-time metrics called it before eval did: a final rewards/accuracies of 0.225 (worse than the 0.5 coin flip) and a slightly negative rewards/margins of −0.026 mean the policy never learned to prefer chosen over rejected. The likely causes, all conservative for a self-distillation setup where the policy starts equal to the reference:
- LR=5e-7 (a common conservative DPO setting) is too small to move a policy that begins identical to its reference.
- 60 total steps on a cosine-to-zero schedule decays the LR to nothing before the policy finds a useful direction.
- β=0.1 (default) is low; self-distillation may need β=0.3 to force differentiation.
Predicted fix for a re-run: LR=5e-6, β=0.3, ≥200 steps. Untested here — but the lesson is cheaper than the re-run: watch rewards/margins during DPO. If it’s hovering at zero, the eval will be a no-op and you can stop paying for it.
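For reference, the two logged diagnostics follow directly from the sigmoid DPO objective. The sketch below recomputes them from summed sequence log-probabilities in plain Python; it assumes the standard definitions (implicit reward = β × (policy log-prob − reference log-prob)) and a strict greater-than comparison for accuracy. `dpo_batch_metrics` is a hypothetical helper, not the trainer's own code.

```python
import math


def dpo_batch_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss plus the two reward diagnostics for one batch.

    Each argument is a list of summed sequence log-probabilities, one per pair.
    """
    margins, losses = [], []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        reward_chosen = beta * (pc - rc)    # implicit reward of the chosen response
        reward_rejected = beta * (pr - rr)  # implicit reward of the rejected response
        margin = reward_chosen - reward_rejected
        margins.append(margin)
        # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/accuracies": sum(m > 0 for m in margins) / n,  # ties count as misses
        "rewards/margins": sum(margins) / n,
    }


# A policy that still equals its reference: every margin is exactly zero.
same = [-10.0, -12.0]
print(dpo_batch_metrics(same, [-11.0, -9.0], same, [-11.0, -9.0]))
```

With policy and reference identical, the margin is zero, the loss sits at log 2 ≈ 0.693, and the accuracy under a strict comparison is 0 instead of 0.5. That is one way a barely-moved policy can report an accuracy well below a coin flip.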
The GSM8K regression is a feature of training on harder data
SFT helped most where the problems are hard (+6.7 to +13.8 pp on MATH-500, AIME, MathNet) and hurt on the easiest benchmark (−5.5 pp on GSM8K). This is a known and explainable failure mode. NuminaMath-TIR is competition-style with long, structured CoT. After SFT the model adopts that verbose style even on simple GSM8K word problems, and some chains now run past max_new_tokens=2048 before reaching \boxed{} — which the grader scores as wrong. The fix is either a 4096-token cap or BoN majority voting at inference; both were skipped here for budget.
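The truncation explanation is testable from saved generations: count how many responses never reach a boxed answer, per model. This is a minimal sketch, assuming the raw response text can be pulled out of each eval record; the function name and the input shape are hypothetical.

```python
def unboxed_fraction(responses: list[str]) -> float:
    """Fraction of responses that never emit a \\boxed{...} answer."""
    if not responses:
        return 0.0
    missing = sum(1 for text in responses if "\\boxed{" not in text)
    return missing / len(responses)


# Toy example: the second response runs out of tokens before answering.
sample = ["... so the total is \\boxed{18}", "First, let us define the variables and"]
print(unboxed_fraction(sample))  # 0.5
```

If the SFT model's unboxed fraction on GSM8K is clearly higher than the baseline's, the regression is a length-cap effect; if the two are similar, the cause lies elsewhere.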
The absolute AIME numbers also look bleak (5% baseline) precisely because of greedy decoding at a 2048-token cap. R1-Distill-Qwen-7B’s published AIME score is ~55% with temperature 0.6 and many samples. The relative comparison is valid — all three models eval identically — but the absolutes are a decoding artifact, not the model’s ceiling.
Eval grid
Each (model, benchmark) pair runs independently through vLLM at TP=4, greedy, max_new_tokens=2048, no BoN. Grading reuses the same grade_response pipeline from data-gen.
| Benchmark | Source | n |
|---|---|---|
| GSM8K | openai/gsm8k:main:test | 200 |
| MATH-500 | HuggingFaceH4/MATH-500 | 500 |
| AIME 2024+2025 | Maxwell-Jia/AIME_2024 + opencompass/AIME2025 | 60 |
| MathNet | ShadenA/MathNet (text-only) | 500 |
One operational wrinkle: eval.py writes eval/{baseline,lora}_<bench>.json, so with three models the lora filename collides between the SFT and DPO blocks. A bash driver renames each output to sft_<bench>.json or dpo_<bench>.json after the call. Twelve evals, ~14 minutes total.
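The post's driver is a bash script that is not shown here. As an illustration of the same rename step, here is a Python sketch using only the standard library; `tag_lora_outputs` and the way it is called are assumptions, not the repo's code.

```python
from pathlib import Path


def tag_lora_outputs(eval_dir: Path, tag: str) -> list[Path]:
    """Rename every lora_<bench>.json in eval_dir to <tag>_<bench>.json."""
    renamed = []
    for src in sorted(eval_dir.glob("lora_*.json")):
        # Keep the "_<bench>.json" suffix and swap only the "lora" prefix.
        dst = src.with_name(tag + src.name[len("lora"):])
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

Calling `tag_lora_outputs(Path("eval"), "sft")` after the SFT block and `tag_lora_outputs(Path("eval"), "dpo")` after the DPO block would keep the two sets of results from overwriting each other.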
Cost ledger
| Phase | Wall-time | Cost @ $26.33/hr |
|---|---|---|
| data_gen (3 runs, ~101 min, TP=4) | 101 min | ~$44 |
| SFT + 2 merges | ~12 min | ~$5 |
| DPO smoke + full | ~15 min | ~$7 |
| Eval (12 runs) | ~14 min | ~$6 |
| Provisioning + idle + uploads | — | ~$31 |
| Total | ~3.5 hr | ~$93 |
The single most expensive line is data-gen, and the TP=4 constraint left half the pod idle for all of it. On a model whose KV-head count is divisible by 8, the same run would have been materially cheaper.
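The per-phase figures in the ledger are wall-clock minutes multiplied by the pod's hourly rate. A small sketch of that arithmetic, using the rate and durations from the table; `phase_cost` is a hypothetical helper.

```python
HOURLY_RATE_USD = 26.33  # pod price from the ledger above


def phase_cost(minutes: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Dollar cost of a phase billed by wall-clock minutes."""
    return minutes / 60.0 * hourly_rate


# Durations taken from the ledger rows.
for phase, minutes in [("data_gen", 101), ("SFT + 2 merges", 12),
                       ("DPO smoke + full", 15), ("Eval", 14)]:
    print(f"{phase}: ~${phase_cost(minutes):.0f}")
```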
What’s next
In priority order, none of it required for proof-of-pipeline:
- Re-run DPO with LR=5e-6, β=0.3, 200 steps (~$15) — does DPO deliver real Δ on top of SFT, or does self-distillation truly ceiling-block it?
- BoN-8 eval across all three models (~$20) — adds test-time-compute coverage, likely recovers the GSM8K regression and lifts AIME absolutes.
- Bump max_new_tokens to 4096: lifts all absolute numbers without changing relative comparisons.
Code, configs, logs, and per-problem eval JSONs: github.com/debtirthasaha/math-slm. The merged 7B model (~15 GB) is on Hugging Face at MR0b0t/math-slm-sft-dpo-v5.