A 7B math fine-tune on 8× H100: SFT +6.4, DPO +0.6

Fine-tune a 7B reasoning model on competition math with two stages of LoRA — supervised fine-tuning on self-distilled chain-of-thought, then direct preference optimization on chosen-vs-rejected reasoning pairs — and measure the delta on four held-out benchmarks. The whole pipeline (synthetic data curation, SFT, DPO, two merge steps, and a 3-model × 4-benchmark eval grid) ran on one 8× H100 SXM pod in ~3.5 hours for ~$93.

SFT delivered +6.4 pp averaged across GSM8K, MATH-500, AIME 2024+2025, and MathNet. DPO at the chosen configuration was a +0.6 pp no-op — and the training-time reward margin near zero predicted that before any eval ran.

Benchmark	Baseline	SFT	SFT+DPO	Δ SFT	Δ DPO
GSM8K (n=200)	85.0%	79.5%	81.0%	−5.5 pp	+1.5 pp
MATH-500 (n=500)	60.0%	73.8%	73.8%	+13.8 pp	0.0 pp
AIME 2024+2025 (n=60)	5.0%	11.7%	11.7%	+6.7 pp	0.0 pp
MathNet (n=500)	23.6%	34.0%	35.0%	+10.4 pp	+1.0 pp
Average	43.4%	49.8%	50.4%	+6.4 pp	+0.6 pp

[Figure: grouped bar chart, "Greedy accuracy by benchmark (max_new_tokens=2048, no BoN)". x-axis: GSM8K, MATH-500, AIME 24+25, MathNet; y-axis: accuracy (%), 0 to 100; series: Baseline, SFT, SFT+DPO. Values match the table above.]

The point of this run was not SOTA. Teacher and student share a base model, so the achievable gain is capped by construction: self-distillation asks the model to imitate itself more reliably, nothing more. The point was a working end-to-end pipeline that exercises a defined set of techniques (synthetic curation, LoRA, DeepSpeed ZeRO-3, vLLM tensor parallelism, math_verify grading, preference optimization, multi-benchmark eval) and produces a measurable, honest delta.

Model and the GQA divisibility constraint

Component	Value
Base	DeepSeek-R1-Distill-Qwen-7B
Teacher	DeepSeek-R1-Distill-Qwen-7B (same model self-distills)
dtype	bfloat16
Attention	28 query heads, 4 KV heads (GQA)

The head counts decide how many GPUs the model can use. vLLM’s tensor parallelism shards attention heads across GPUs, so the TP degree must divide both the query-head count (28) and the KV-head count (4). The binding constraint is the 4 KV heads: TP must divide 4, so TP ∈ {1, 2, 4} — never 8. That constraint is baked straight into the data-gen CLI so it can’t be violated by accident:

parser.add_argument("--tp", type=int, default=1, choices=[1, 2, 4],
    help="vLLM tensor_parallel_size. Must divide BOTH num_attention_heads "
         "AND num_key_value_heads. DeepSeek-R1-Distill-Qwen-7B has 28 heads "
         "and 4 KV heads (GQA), so valid values are {1, 2, 4}.")

argparse rejects --tp 8 before any CUDA loads. The cost is that an 8-GPU pod runs vLLM at TP=4, leaving four cards idle through every generation and eval phase — an unavoidable ~$13 on this hardware.
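The divisibility rule as stated above is easy to check mechanically. A minimal sketch, assuming only the two head counts from the table; `valid_tp_degrees` is a hypothetical helper for illustration, not part of vLLM or of this repo:

```python
def valid_tp_degrees(num_attention_heads: int, num_kv_heads: int, max_gpus: int) -> list[int]:
    """Return every TP degree up to max_gpus that divides both head counts."""
    return [
        tp
        for tp in range(1, max_gpus + 1)
        if num_attention_heads % tp == 0 and num_kv_heads % tp == 0
    ]


# DeepSeek-R1-Distill-Qwen-7B: 28 query heads, 4 KV heads, on an 8-GPU pod.
print(valid_tp_degrees(28, 4, 8))
```

For 28 query heads and 4 KV heads this yields the same {1, 2, 4} that the CLI hard-codes in `choices`.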

Data: teacher self-distillation, three runs to fill the pref floor

The teacher (same weights as the student base) loads into vLLM and samples each NuminaMath-TIR problem n=3 times under a code-CoT instruction. Every completion is graded for symbolic equivalence against the gold answer. The grading is the single source of truth for “is this correct?”, shared by both data-gen and eval:

def is_equivalent(predicted: str, gold: str) -> bool:
    """Symbolic if possible (math_verify), else normalized string compare."""
    pred_n, gold_n = _normalize(predicted), _normalize(gold)
    if pred_n == gold_n:
        return True
    if _HAS_MATH_VERIFY:
        try:
            # parsing_timeout=0 skips the multiprocessing path that breaks on Windows;
            # safe for short boxed answers.
            p = _mv_parse(_boxed(predicted), parsing_timeout=0)
            g = _mv_parse(_boxed(gold), parsing_timeout=0)
            if p and g and _mv_verify(g, p, timeout_seconds=0):
                return True
        except Exception:
            pass
    return False

Two non-obvious details there. math_verify.parse only fires when it sees a recognized anchor, so a bare answer like 42 is wrapped to \boxed{42} before parsing (_boxed). And parsing_timeout=0 is mandatory on Windows — the default spins up a multiprocessing watchdog that deadlocks; setting it to 0 keeps grading single-process, which is fine for short answers.
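The `_boxed` helper is not shown in the excerpt. One plausible shape for it, as a sketch only (the name `boxed_sketch` and its exact behavior are assumptions, not the repo's code):

```python
def boxed_sketch(answer: str) -> str:
    r"""Wrap a bare answer in \boxed{...} unless it already carries a \boxed anchor."""
    text = answer.strip()
    if r"\boxed" in text:
        # Already anchored: leave the expression untouched.
        return text
    return r"\boxed{" + text + "}"


print(boxed_sketch("42"))
```

The intent is only to guarantee that the parser sees an anchor it recognizes; an answer that is already boxed passes through unchanged.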

Each problem’s samples are split into correct and wrong, and a preference pair is harvested only when a problem has at least one of each:

# SFT data: keep one correct response per problem
if correct_responses:
    f_sft.write(json.dumps({"problem": problems[i],
                            "response": correct_responses[0],
                            "gold_answer": gold_answer}) + "\n")

# DPO pref pair: needs ≥1 correct AND ≥1 wrong from the SAME problem
if correct_responses and wrong_responses:
    f_pref.write(json.dumps({"prompt": problem_prompt,
                             "chosen": correct_responses[0],
                             "rejected": wrong_responses[0]}) + "\n")

That “≥1 correct AND ≥1 wrong” condition is where data-gen got expensive. It took three runs:

Run	Config	SFT	Pref	Why
1	temp=0.7, `seed=42`	2535	270	seed collapsed `n=3` diversity → only 5.4% of problems mixed-correctness
2	temp=1.0, no seed	2884	1243	same 5000 problems, pref harvest jumped 4.6×
3	temp=1.0, no seed, problems 5000–6999	1128	448	top-up to clear the pref-pair floor

Combined: 3663 SFT examples and 1961 preference pairs.

The culprit was the seed argument on this multi-sample request:

sampling = SamplingParams(
    n=args.n_samples,
    temperature=0.7,
    top_p=0.95,
    max_tokens=MAX_NEW_TOKENS,
    seed=SEED,        # ← this collapses diversity across the n samples
)

Setting seed while requesting n>1 collapsed the diversity of samples within a single request: all three completions per problem came out nearly identical, so almost every problem was all-correct or all-wrong, and the pref harvest (which needs both) was starved, yielding 270 pairs from 5000 problems. The fix was two small changes, removing seed and raising temperature to 1.0, and the harvest jumped to 1243 on the same 5000 problems.
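A collapse like this is cheap to detect on a small pilot batch before committing to a full generation run. The sketch below is illustrative: both function names are hypothetical, and exact-string duplication is only a lower bound on "nearly identical" completions.

```python
def mixed_correctness_rate(graded: list[list[bool]]) -> float:
    """Fraction of problems with at least one correct AND one wrong sample."""
    if not graded:
        return 0.0
    mixed = sum(1 for flags in graded if any(flags) and not all(flags))
    return mixed / len(graded)


def duplicate_sample_rate(samples: list[list[str]]) -> float:
    """Fraction of problems whose n completions are all textually identical."""
    if not samples:
        return 0.0
    collapsed = sum(1 for texts in samples if len(texts) > 1 and len(set(texts)) == 1)
    return collapsed / len(samples)


# Toy pilot: four problems, three graded samples each.
pilot_grades = [
    [True, True, True],
    [False, False, False],
    [True, False, True],
    [True, True, True],
]
print(mixed_correctness_rate(pilot_grades))
```

A mixed-correctness rate near the 5.4% of run 1, or a high duplicate rate, is the signal to fix sampling before paying for the full run.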

SFT, and a guard that aborts the packing trap

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

Knob	Value
Trainable	40.4M (0.53% of 7.66B)
Epochs	3
Effective batch	64 (per-device 2 × grad-accum 4 × 8 GPUs)
LR	2e-4 cosine, 3% warmup
Packing	disabled
Max seq len	2048
Sharding	DeepSpeed ZeRO-3
Steps	171 (57/epoch × 3)
train_loss	1.25 → 0.245

accelerate launch --config_file src/accelerate_config.yaml src/sft.py
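The trainable-parameter and step counts in the table can be reproduced by hand. The sketch below assumes the architecture values from the model's published config (hidden size 3584, MLP size 18944, 28 layers, head dimension 128); treat them as assumptions to check against `config.json`, and the helper names as hypothetical.

```python
# Assumed architecture values; verify against the model's config.json.
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
HEAD_DIM = 128
KV_DIM = 4 * HEAD_DIM  # 4 KV heads -> 512

# (in_features, out_features) for each targeted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per projection: rank * (in + out)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


def optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Return (effective batch, steps per epoch, total steps), flooring partial batches."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = num_examples // effective_batch
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


print(lora_trainable_params(16))
print(optimizer_steps(3663, 2, 4, 8, 3))
```

Under those assumptions the count comes to about 40.4M parameters, and 3663 examples at an effective batch of 64 give 57 steps per epoch, 171 in total.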

packing=disabled is a scar from a prior run, and the code now refuses to launch if the same mistake recurs:

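import sys

PACKING = False
# Estimate optimizer steps from the rows the trainer will actually see.
# If packing is ever re-enabled, len(ds) must be the packed row count.
steps_per_epoch = max(1, len(ds) // EFFECTIVE_BATCH)
total_steps = steps_per_epoch * NUM_EPOCHS
# Fail before any GPU time is spent if an epoch is too short to train on.
if steps_per_epoch < MIN_STEPS_PER_EPOCH:   # = 30
    sys.exit(
        f"[ABORT] steps_per_epoch={steps_per_epoch} < {MIN_STEPS_PER_EPOCH}. "
        f"This is the packing trap from 2026-05-19. "
        f"Either set PACKING=False, lower max_seq_length, or get more data."
    )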

The history: an earlier run packed 2,439 short examples so tightly at max_seq_length=2048 that an epoch came to only ~6 gradient steps. The naive “fix” of bumping epochs 3 → 12 just showed the model the same six batches twelve times and overfit in 15 minutes. The real fix is to not pack a dataset this small, and to fail loudly before training if too few steps would result. The general principle: with packing on, measure training work in tokens, not steps, because a step count derived from the example count lies.
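To see how packing shrinks the step count, estimate steps from tokens rather than rows. This is a sketch under stated assumptions: it models ideal packing (concatenate, then cut into `max_seq_len` chunks), and the 340-token average length is an illustrative guess, not a measured value from the run.

```python
import math


def steps_per_epoch(token_lengths, effective_batch, packing, max_seq_len=2048):
    """Estimate optimizer steps per epoch with or without sequence packing."""
    if packing:
        # Ideal packing: all tokens concatenated, then split into full-length rows.
        rows = math.ceil(sum(token_lengths) / max_seq_len)
    else:
        # One row per example, regardless of how short it is.
        rows = len(token_lengths)
    return rows // effective_batch


# 2,439 examples at an ASSUMED 340 tokens each, effective batch 64.
lengths = [340] * 2439
print(steps_per_epoch(lengths, 64, packing=False))
print(steps_per_epoch(lengths, 64, packing=True))
```

With those assumed lengths, the unpacked dataset gives 38 steps per epoch and the packed one gives 6, which is the order of magnitude the guard is meant to catch.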

For a 7B LoRA, ZeRO-3 isn’t strictly required (it fits on an H100 80GB with plain DDP — usage was ~20 GB/card, ~60 GB headroom). I used it anyway to exercise the technique and keep the same launch path valid for a full fine-tune at this size.

The two-merge requirement

DPO loads the SFT-merged model as its base, not the LoRA adapter directly. So the pipeline needs two merge calls — one after SFT, one after DPO — each through a single-GPU merge_lora.py (~30 seconds, ~15 GB BF16 output). DPO then trains a second, fresh LoRA on top of the merged SFT model, not stacked on the first adapter.

python src/merge_lora.py --adapter checkpoints/lora_v5 --out checkpoints/merged_sft_v5
# ... DPO trains lora_dpo_v5 on top of merged_sft_v5 ...
python src/merge_lora.py --base checkpoints/merged_sft_v5 \
                         --adapter checkpoints/lora_dpo_v5 --out checkpoints/merged_dpo_v5
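merge_lora.py itself is not shown here. What a merge does arithmetically, for standard LoRA, is fold the low-rank update into the dense weight: W' = W + (alpha / r) · B · A. A toy pure-Python sketch of that arithmetic (helper names are hypothetical; a real merge operates on every targeted layer of the checkpoint):

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) as a new dense matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter: A is (rank x in), B is (out x rank).
base = [[1.0, 0.0], [0.0, 1.0]]
sft_merged = merge_lora(base, lora_a=[[1.0, 2.0]], lora_b=[[0.5], [1.0]], alpha=2, rank=1)
print(sft_merged)
```

After the first merge the result is an ordinary dense weight, which is why the DPO stage can attach a fresh adapter to it and merge a second time the same way.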

DPO: one flag that crashed, and a no-op that the metrics predicted

The reference model is the SFT model with its DPO adapter disabled — same weights, no second copy in VRAM. So DPOTrainer is constructed with ref_model=None:

trainer = DPOTrainer(model=model, ref_model=None, args=dpo_config, train_dataset=ds, tokenizer=tokenizer)

The committed config carried an optimization that does not survive ZeRO-3:

# precompute_ref_log_probs=True does ONE pass through the reference model first,
# caches per-sample log probs, then drops the reference...
precompute_ref_log_probs=True,
loss_type="sigmoid",

Under DeepSpeed ZeRO-3 in trl 0.12+, that raises a hard ValueError at DPOTrainer init: ZeRO-3 can’t correctly hold both policy and reference parameter shards while precomputing. The fix was a single flip to False, which falls back to live reference passes via PEFT’s adapter-disable trick. The cost: each preference pair now needs four sequence forward passes (chosen and rejected, each through policy and reference) where an SFT example needs one, so DPO step time is ~3× SFT step time. (A cosmetic side effect: DeepSpeed prints “Invalidate trace cache @ step 10” warnings 10–20 times per step, because the alternating policy and reference passes keep forcing ZeRO-3 to rebuild its parameter-prefetch trace. Noise, not a bug.)

Catching that crash cheaply is what the --smoke-steps flag is for — a 5-step run gates the full hour:

accelerate launch --config_file src/accelerate_config.yaml src/dpo.py --smoke-steps 5
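How src/dpo.py wires the flag is not shown. A minimal sketch of one way to do it, assuming the flag simply caps the trainer's `max_steps` (where -1 conventionally means "run the full schedule"); the helper names are hypothetical:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI with an optional smoke-test step cap."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--smoke-steps", type=int, default=0,
                        help="If > 0, stop after this many optimizer steps.")
    return parser


def resolve_max_steps(smoke_steps: int) -> int:
    """Map the smoke flag to a max_steps value; -1 leaves the epoch schedule in charge."""
    return smoke_steps if smoke_steps > 0 else -1


args = build_parser().parse_args(["--smoke-steps", "5"])
print(resolve_max_steps(args.smoke_steps))
```

Because the ValueError fires at trainer construction, even a cap of a few steps is enough to surface it before the full run is queued.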

Knob	Value
LR	5e-7 cosine, 3% warmup
β	0.1
Epochs	2
Steps	60 (30/epoch × 2)
Final `rewards/accuracies`	0.225
Final `rewards/margins`	−0.026

The result was a clean no-op. MATH-500 and AIME produced identical correct counts under SFT and SFT+DPO (369/500 and 7/60). Under greedy decoding, identical counts are what you would see if DPO flipped no graded answers on those two benchmarks, though counts alone cannot rule out offsetting flips; the per-problem eval JSONs can. The +1.5 pp on GSM8K and +1.0 pp on MathNet are 3 and 5 problems respectively, within noise for n=200 and n=500.
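The "within noise" claim can be sanity-checked with the binomial standard error of each accuracy. This is a rough yardstick only: both models answer the same problems, so a paired test on the per-problem results would be the sharper tool.

```python
import math


def standard_error_pp(accuracy_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# SFT accuracies and sample sizes from the results table.
print(f"GSM8K   SE: {standard_error_pp(79.5, 200):.2f} pp")
print(f"MathNet SE: {standard_error_pp(34.0, 500):.2f} pp")
```

Both DPO deltas (+1.5 pp and +1.0 pp) are smaller than one standard error of the corresponding SFT score.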

This wasn’t a data problem — the 1961 pairs were clean (0% identical chosen/rejected, healthy length distributions). It was a configuration problem, and the training-time metrics called it before eval did: a final rewards/accuracies of 0.225 (worse than the 0.5 coin flip) and a slightly negative rewards/margins of −0.026 mean the policy never learned to prefer chosen over rejected. The likely causes, all conservative for a self-distillation setup where the policy starts equal to the reference:

LR=5e-7 (a conservative value, below the 1e-6 that recent trl DPOConfig versions default to) is too small to move a policy that begins identical to its reference.
60 total steps on a cosine-to-zero schedule decays the LR to nothing before the policy finds a useful direction.
β=0.1 (default) is low; self-distillation may need β=0.3 to force differentiation.

Predicted fix for a re-run: LR=5e-6, β=0.3, ≥200 steps. Untested here — but the lesson is cheaper than the re-run: watch rewards/margins during DPO. If it’s hovering at zero, the eval will be a no-op and you can stop paying for it.
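For reference, here is what the margin measures and how a stop check might look. This is a sketch from the standard DPO definitions with the sigmoid loss, taking summed sequence log-probs as inputs; the function names, the strict comparison for "correct", and the `window` and `tol` values are assumptions for illustration.

```python
import math


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO quantities from summed sequence log-probs (sigmoid loss)."""
    reward_chosen = beta * (policy_chosen - ref_chosen)
    reward_rejected = beta * (policy_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return {"margin": margin, "loss": loss, "correct": margin > 0}


def margin_is_flat(margins, window=10, tol=0.05):
    """True when the mean of the last `window` logged margins is within tol of zero."""
    recent = margins[-window:]
    return bool(recent) and abs(sum(recent) / len(recent)) < tol


# A policy identical to its reference: zero margin, loss of ln 2.
print(dpo_metrics(-50.0, -60.0, -50.0, -60.0))
```

When the policy has not moved from the reference, both implicit rewards are zero, the margin is zero, and the loss sits at ln 2 ≈ 0.693; a logged margin that stays in that neighborhood is the early warning described above.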

The GSM8K regression is a feature of training on harder data

SFT helped most where the problems are hard (+6.7 to +13.8 pp on MATH-500, AIME, MathNet) and hurt on the easiest benchmark (−5.5 pp on GSM8K). This is a known and explainable failure mode. NuminaMath-TIR is competition-style with long, structured CoT. After SFT the model adopts that verbose style even on simple GSM8K word problems, and some chains now run past max_new_tokens=2048 before reaching \boxed{} — which the grader scores as wrong. The fix is either a 4096-token cap or BoN majority voting at inference; both were skipped here for budget.
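The truncation explanation can be checked directly on the saved generations by counting responses that never reach a boxed answer. A minimal sketch; `missing_boxed_rate` is a hypothetical helper, and how the responses are loaded from the eval JSONs is left out because their schema is not shown here.

```python
def missing_boxed_rate(responses: list[str]) -> float:
    r"""Fraction of responses that never emit a \boxed{ anchor."""
    if not responses:
        return 0.0
    missing = sum(1 for text in responses if r"\boxed{" not in text)
    return missing / len(responses)


# Toy example: one finished answer, one chain cut off mid-reasoning.
sample = [r"... so the total is \boxed{18}", "Let us define the variables and"]
print(missing_boxed_rate(sample))
```

Comparing this rate between the baseline and SFT outputs on GSM8K would show how much of the −5.5 pp is attributable to the token cap.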

The absolute AIME numbers also look bleak (5% baseline) precisely because of greedy decoding at a 2048-token cap. R1-Distill-Qwen-7B’s published AIME score is ~55% with temperature 0.6 and many samples. The relative comparison is valid — all three models eval identically — but the absolutes are a decoding artifact, not the model’s ceiling.

Eval grid

Each (model, benchmark) pair runs independently through vLLM at TP=4, greedy, max_new_tokens=2048, no BoN. Grading reuses the same grade_response pipeline from data-gen.

Benchmark	Source	n
GSM8K	openai/gsm8k:main:test	200
MATH-500	HuggingFaceH4/MATH-500	500
AIME 2024+2025	Maxwell-Jia/AIME_2024 + opencompass/AIME2025	60
MathNet	ShadenA/MathNet (text-only)	500

One operational wrinkle: eval.py writes eval/{baseline,lora}_<bench>.json, so with three models the lora filename collides between the SFT and DPO blocks. A bash driver renames each output to sft_<bench>.json or dpo_<bench>.json after the call. Twelve evals, ~14 minutes total.
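The driver in the repo is bash; the same rename step looks like this in Python. It is a sketch: the function name and the assumption that the colliding file is always named `lora_<bench>.json` inside a single eval directory are illustrative.

```python
from pathlib import Path


def tag_eval_output(eval_dir: Path, bench: str, tag: str) -> Path:
    """Rename <eval_dir>/lora_<bench>.json to <eval_dir>/<tag>_<bench>.json."""
    src = eval_dir / f"lora_{bench}.json"
    dst = eval_dir / f"{tag}_{bench}.json"
    src.replace(dst)  # overwrites dst if a stale copy exists
    return dst
```

Calling `tag_eval_output(Path("eval"), "gsm8k", "sft")` right after the SFT eval frees the `lora_` filename for the DPO eval that follows.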

Cost ledger

Phase	Wall-time	Cost @ $26.33/hr
data_gen (3 runs, ~101 min, TP=4)	101 min	~$44
SFT + 2 merges	~12 min	~$5
DPO smoke + full	~15 min	~$7
Eval (12 runs)	~14 min	~$6
Provisioning + idle + uploads	—	~$31
Total	~3.5 hr	~$93

The single most expensive line is data-gen, and up to half of that pays for the four GPUs that the TP=4 constraint leaves idle. On a model whose head counts allowed TP=8, the same run would have been materially cheaper.
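The per-phase figures follow from wall-time and the hourly rate. A quick check using the minutes and rate from the table (the phase keys are just labels chosen for this sketch):

```python
HOURLY_RATE = 26.33  # USD per hour for the 8x H100 pod, from the ledger


def phase_cost(minutes: float, hourly_rate: float = HOURLY_RATE) -> float:
    """Cost in USD of a phase that occupies the whole pod for `minutes`."""
    return minutes / 60.0 * hourly_rate


phases = {"data_gen": 101, "sft_and_merges": 12, "dpo": 15, "eval": 14}
for name, minutes in phases.items():
    print(f"{name}: ${phase_cost(minutes):.2f}")
```

Rounded to the dollar, these reproduce the $44, $5, $7, and $6 rows; the remaining ~$31 is the provisioning and idle line.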

What’s next

In priority order, none of it required for proof-of-pipeline:

Re-run DPO with LR=5e-6, β=0.3, 200 steps (~$15) — does DPO deliver real Δ on top of SFT, or does self-distillation truly ceiling-block it?
BoN-8 eval across all three models (~$20) — adds test-time-compute coverage, likely recovers the GSM8K regression and lifts AIME absolutes.
Bump max_new_tokens to 4096 — lifts all absolute numbers without changing relative comparisons.
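For the BoN item, the voting step itself is small. A sketch assuming each of the N samples has already been reduced to an extracted answer string (or None when nothing was extracted); a real version would group answers with the same equivalence check the grader uses rather than by exact string match.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common non-empty answer among N samples, or None if there is none."""
    counts = Counter(a for a in answers if a)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Eight samples for one problem; one of them was truncated and produced no answer.
print(majority_vote(["18", "18", "20", None, "18", "7", "20", "18"]))
```

Truncated samples drop out of the vote instead of counting as wrong answers, which is why voting is expected to soften the GSM8K regression.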

Code, configs, logs, and per-problem eval JSONs: github.com/debtirthasaha/math-slm. The merged 7B model (~15 GB) is on Hugging Face at MR0b0t/math-slm-sft-dpo-v5.