-
Eight A100s, $61, and 124M parameters
Full reproduction of GPT-2 124M on rented multi-GPU hardware. Val loss 3.40 vs OpenAI's 3.29 (97% match), HellaSwag 27% vs 29.45%, in 2.5 hours of training.
-
BPE from scratch, and why your LLM can't count L's
Byte-pair encoding implemented in pure Python. Plus SolidGoldMagikarp, the encode/decode asymmetry, and a list of LLM weirdness all caused by the tokenizer.
-
Birkhoff in 8.7 KB
An 8.71 KB prompt for SAIR's equational-theories competition (Tao + Davis, follow-up to Honda-Murakami-Zhang 2025). It replaces free-form LLM reasoning with a 9-magma Birkhoff-sound decision procedure. A 31B model running this prompt beat a 120B one on the hardest set.
-
Tiny Shakespeare, tiny GPT
A 1.83M-parameter decoder-only transformer trained on 1 MB of Shakespeare. Same architecture as GPT-2, scaled down.
-
makemore: from counting bigrams to a WaveNet
Five character-level language models trained on 32K baby names. Bigram → MLP → BatchNorm → manual backprop → hierarchical fusion.
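
For a taste of the first step in that chain, here is a minimal sketch of a counting-based bigram model in pure Python. It is not the code from the post: the function names `train_bigram` and `sample_name`, the `.` start/end marker, and the three toy names are all assumptions made for illustration, and the real dataset is the 32K names mentioned above.

```python
import random
from collections import Counter, defaultdict


def train_bigram(names):
    """Count how often each character follows another.

    '.' is used (as an assumption of this sketch) to mark both the
    start and the end of a name.
    """
    counts = defaultdict(Counter)
    for name in names:
        chars = ["."] + list(name) + ["."]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_name(counts, rng):
    """Sample one name by walking the bigram table until '.' is drawn."""
    out = []
    ch = "."
    while True:
        followers = counts[ch]
        # Draw the next character in proportion to its observed count.
        ch = rng.choices(list(followers.keys()),
                         weights=list(followers.values()), k=1)[0]
        if ch == ".":
            return "".join(out)
        out.append(ch)


# Toy stand-in data, not the real names file.
toy_counts = train_bigram(["emma", "ava", "mia"])
toy_sample = sample_name(toy_counts, random.Random(0))
```

Every later model in the series can be read as a better way to estimate the same next-character distribution that this table stores as raw counts.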