notes

one project per post — architecture, bugs, numbers

A haiku VLM: SFT did the work, KTO collapsed at λ=1.0

A LLaVA-pattern VLM that writes a 5-7-5 haiku for a ukiyo-e woodblock print. SigLIP (frozen) + trained projector + Qwen2.5-3B (LoRA), on 3,913 Met Museum prints, in English and Japanese. SFT delivered ~95% of the lift; preference optimization only helped where the chosen/rejected gap was real; KTO collapsed at its default λ_U.

14 min read · 2026

A 7B math fine-tune on 8× H100: SFT +6.4, DPO +0.6

Two-stage LoRA (SFT then DPO) on DeepSeek-R1-Distill-Qwen-7B, end-to-end on a single 8× H100 pod for ~$93. SFT lifted four math benchmarks by +6.4 pp average; DPO at conservative defaults moved nothing, and the training-time reward margin predicted it.

16 min read · 2026

Eight A100s, $61, and 124M parameters

Full reproduction of GPT-2 124M on rented multi-GPU hardware. Val loss 3.40 vs OpenAI's 3.29 (97% match), HellaSwag 27% vs 29.45%, in 2.5 hours of training.

18 min read · 2026

BPE from scratch, and why your LLM can't count L's

Byte-pair encoding implemented in pure Python. Plus SolidGoldMagikarp, the encode/decode asymmetry, and a list of LLM weirdness, all of it caused by the tokenizer.

17 min read · April 25, 2026

2026 · tokenization bpe gpt · nlp

Birkhoff in 8.7 KB

An 8.71 KB prompt for SAIR's equational-theories competition (Tao + Davis, follow-up to Honda-Murakami-Zhang 2025). Replace free-form LLM reasoning with a 9-magma Birkhoff-sound decision procedure. A 31B model running this prompt beat a 120B one on the hardest set.

16 min read · April 20, 2026

2026 · prompting llm-reasoning benchmarks equational-logic · nlp

Tiny Shakespeare, tiny GPT

A 1.83M-parameter decoder-only transformer trained on 1MB of Shakespeare. Architecture is identical to GPT-2, just smaller.

10 min read · April 15, 2026

2026 · transformer attention gpt · deep-learning

makemore: from counting bigrams to a WaveNet

Five character-level language models trained on 32K baby names. Bigram → MLP → BatchNorm → manual backprop → hierarchical fusion.

10 min read · April 08, 2026

2026 · language-models mlp batchnorm · deep-learning

micrograd: a scalar-valued autograd engine

A 150-line autograd engine that supports +, *, **, tanh, exp, and a tiny MLP on top.

6 min read · April 01, 2026

2026 · autograd backprop · deep-learning
