Math SLM (SFT + DPO)
Two-stage LoRA on DeepSeek-R1-Distill-Qwen-7B. SFT +6.4 pp across four math benchmarks; DPO a config-bottlenecked no-op. End-to-end on 8× H100 for ~$93.