Math SLM (in progress)

A 1.5B-parameter math-reasoning model: LoRA SFT of DeepSeek-R1-Distill-Qwen-1.5B on teacher-generated CoT data, with best-of-32 (BoN-32) inference and code-execution voting. Target benchmarks include MathNet (ICLR 2026 text-only 500).

Pipeline:

  • Base. DeepSeek-R1-Distill-Qwen-1.5B.
  • Teacher. DeepSeek-R1-Distill-Qwen-7B for generating chain-of-thought training data.
  • Training. LoRA SFT on teacher-distilled CoT, on a single A100 40GB.
  • Inference. Best-of-N sampling (N=32) with code-execution voting on math problems.
  • Eval. GSM8K, MATH-500, AIME 2024+25, MathNet text-only 500.
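
The inference step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes each of the N samples has already been reduced to a short Python program that prints its final answer on the last line of stdout. The names run_candidate and vote are hypothetical, and a plain subprocess is not a sandbox, so real model output would need proper isolation.

```python
import subprocess
import sys
from collections import Counter
from typing import List, Optional


def run_candidate(code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run one candidate program in a fresh interpreter.

    Returns the last non-empty stdout line, or None if the program
    crashes, times out, or prints nothing.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    lines = proc.stdout.strip().splitlines()
    return lines[-1].strip() if lines else None


def vote(candidates: List[str]) -> Optional[str]:
    """Majority vote over the answers of all candidates that ran cleanly."""
    answers = [a for a in map(run_candidate, candidates) if a is not None]
    if not answers:
        return None
    # Ties resolve to the answer that appeared first among the samples.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Five stand-in samples; the pipeline described above would use N=32.
    samples = [
        "print(6 * 7)",
        "print(42)",
        "print(41)",             # wrong answer, outvoted
        "raise SystemExit(1)",   # failed run, ignored
        "print(40 + 2)",
    ]
    print(vote(samples))
```

Failed runs are dropped before voting, so a candidate that crashes cannot outvote one that executes. Comparing answers as raw strings is the simplest choice; normalizing numeric forms (for example "42" vs "42.0") before counting would be a natural refinement.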

Target: cross 10% on MathNet with a 1.5B model. Frontier 2026 models get 45-78% on the same benchmark, so the small-model gap is the point.

Budget: ~$60 cloud spend. Writeup will follow the same structure as the GPT-2 post.