
Eight A100s, $61, and 124M parameters

2026-05-17T18:00:00+00:00

End-to-end reproduction of Karpathy’s “Let’s reproduce GPT-2 (124M)” video. The build started on a 4 GB GTX 1650 and finished on 8× A100 SXM4 rented from Lambda. Training ran for 19,073 steps over 10B FineWeb-Edu tokens. Total cost: $61.23. Final val loss 3.40 vs OpenAI’s released baseline of 3.29 — a 97% match. HellaSwag 26.99% vs 29.45% baseline.

Metric	Start	End	OpenAI baseline	% of target
Val loss	10.951	3.3969	3.292	97%
HellaSwag acc_norm	24.82%	26.99%	29.45%	91%
Sustained throughput	—	1.1M tokens/sec	—	—
Steps/min	—	~127	—	—

The val loss gap is small enough that another ~$80 of training would close it. The HellaSwag gap is the well-known FineWeb-Edu artifact — high-quality educational text gives slightly worse commonsense reasoning per perplexity unit than OpenAI’s WebText.

The actual training run, with the OpenAI baseline as a reference line:

[Figure: Training and validation loss, 19,073 steps on FineWeb-Edu. x-axis: step; y-axis: loss. Series: train loss, val loss, and the OpenAI GPT-2 124M val baseline (3.292) as a dashed reference line.]

Model: matching HF’s GPT-2 byte-for-byte

The model class mirrors HuggingFace’s GPT2LMHeadModel parameter names exactly so that from_pretrained("gpt2") can copy weights in. 148 tensors per HF GPT-2 state_dict, in three groups:

Embeddings. transformer.wte.weight (50257, 768), transformer.wpe.weight (1024, 768). Position embeddings are learned, not sinusoidal, a choice GPT-2 (2019) carried over from GPT-1 (2018).
12 identical blocks. Each has pre-norm ln_1, fused QKV projection attn.c_attn.weight (768, 2304), output projection attn.c_proj, pre-norm ln_2, MLP up mlp.c_fc.weight (768, 3072) (4× expansion), MLP down mlp.c_proj. GELU between the MLP linears.
Final. transformer.ln_f (the extra LayerNorm GPT-2 added vs the original 2017 transformer) and lm_head.weight (50257, 768).

Two implementation gotchas:

HF weights are stored transposed. HuggingFace uses a TF-legacy Conv1D layer that stores (in, out)-shaped weights. Standard PyTorch nn.Linear stores (out, in). So when copying HF weights into our nn.Linear-based model, the four matrices c_attn, c_proj (attn), c_fc, c_proj (mlp) need .t(). The other tensors copy directly.

Weight tying. lm_head.weight and transformer.wte.weight are the same tensor in GPT-2 — one physical 38.6M-parameter matrix used both as input embedding and output classifier. That matrix alone is ~30% of the full 124M. Implementation: self.transformer.wte.weight = self.lm_head.weight. Two Python names, one tensor object.

Residual stream init

GPT-2 scales the output projection of each residual sub-layer at init by 1/sqrt(2*n_layer). With n_layer = 12, that’s a factor of ~0.204. Why: each residual addition x = x + sub_layer(x) adds the sub-layer’s output variance to the variance of x. Without rescaling, those contributions accumulate across 24 sub-layers (12 attn + 12 MLP), so the residual stream is already far larger than its input at init, before any training. The rescale keeps post-stack variance close to the input variance.

for pn, p in self.named_parameters():
    if pn.endswith('c_proj.weight'):
        torch.nn.init.normal_(p, mean=0.0, std=0.02 * (2 * config.n_layer) ** -0.5)

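Everything else is drawn from a normal distribution with mean 0 and std 0.02. The GPT-2 paper describes the rescale itself (residual-layer weights scaled by 1/√N at init, with N the number of residual layers); the code above follows nanoGPT’s implementation of it. It makes training much more stable.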
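
To make the variance argument concrete, here is a minimal sketch under an idealized assumption that is mine, not something measured in the run: the input stream has unit variance and every sub-layer adds an independent, zero-mean contribution of equal variance, so variances simply add. The helper names are hypothetical.

```python
def residual_proj_std(n_layer: int, base_std: float = 0.02) -> float:
    """Init std for c_proj weights: base_std scaled by 1/sqrt(2 * n_layer)."""
    return base_std * (2 * n_layer) ** -0.5


def stream_variance(n_layer: int, sublayer_var: float, rescale: bool) -> float:
    """Idealized residual-stream variance after all sub-layers.

    Assumes a unit-variance input and 2 * n_layer independent, zero-mean
    sub-layer outputs, so the variances add up.
    """
    n_sub = 2 * n_layer                       # 12 attention + 12 MLP for 124M
    scale = n_sub ** -0.5 if rescale else 1.0
    return 1.0 + n_sub * sublayer_var * scale ** 2


print(round(residual_proj_std(12), 5))           # 0.00408
print(stream_variance(12, 1.0, rescale=False))   # 25.0
print(stream_variance(12, 1.0, rescale=True))    # approximately 2.0
```

In this toy model the unscaled stack ends at 25× the input variance, while the rescaled stack ends at about 2×, regardless of depth.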

Data: FineWeb-Edu, sharded

10B tokens of FineWeb-Edu (the educational subset), tokenized with tiktoken’s gpt2 encoding, written out as 100 shards of (100M tokens, uint16) .npy files. Tokenization is multiprocess and dominated by Python overhead: ~30 minutes on the 8× A100 instance, which at $15.92/hr is ~$8 just to tokenize.

uint16 because vocab size is 50257, comfortably under 65536. Cuts disk size in half.

DataLoaderLite keeps one shard in memory at a time, advances to the next when exhausted, and respects the DDP rank/world-size partitioning:

def reset(self):
    self.current_shard = 0
    self.tokens = load_tokens(self.shards[self.current_shard])
    self.current_position = self.B * self.T * self.process_rank

def next_batch(self):
    B, T = self.B, self.T
    buf = self.tokens[self.current_position : self.current_position + B*T + 1]
    x = buf[:-1].view(B, T)
    y = buf[1:].view(B, T)
    self.current_position += B * T * self.num_processes
    if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
        self.current_shard = (self.current_shard + 1) % len(self.shards)
        self.tokens = load_tokens(self.shards[self.current_shard])
        self.current_position = self.B * self.T * self.process_rank
    return x, y

current_position starts at B * T * process_rank so each rank reads a different stride. Combined with current_position += B * T * num_processes after every step, the 8 ranks tile the shard without overlap.
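
The tiling claim can be checked without any GPU. The sketch below replays only the position arithmetic of next_batch on toy sizes; input_ranges is a hypothetical helper, and the sizes are made up for illustration.

```python
def input_ranges(shard_len, B, T, rank, world_size):
    """Return the (start, end) input-token ranges one rank reads from one shard.

    Mirrors next_batch: read at the current position, advance by
    B * T * world_size, and stop once the next read would not fit.
    """
    pos = B * T * rank
    ranges = []
    while True:
        ranges.append((pos, pos + B * T))
        pos += B * T * world_size
        if pos + (B * T * world_size + 1) > shard_len:
            break
    return ranges


B, T, world_size, shard_len = 2, 3, 2, 100    # toy sizes, not the real run
seen = []
for rank in range(world_size):
    seen.extend(input_ranges(shard_len, B, T, rank, world_size))

# Flatten to token indices and check that no input index is read twice.
indices = [i for start, end in seen for i in range(start, end)]
assert len(indices) == len(set(indices))
print(sorted(seen)[:4])   # [(0, 6), (6, 12), (12, 18), (18, 24)]
```

Only the input windows are disjoint. Each batch also reads one extra token as the final target, and that token is the first input of the neighbouring window. The tail of a shard that cannot hold a full round is skipped.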

Hyperparameters: every number traced to a source

The single most important insight: papers specify training in tokens, not steps. Everything follows from that.

Knob	Value	Source
Total batch (tokens / step)	524288 = 2¹⁹ ≈ 0.5M	GPT-3 paper Table 2.1, “GPT-3 Small”
`max_steps`	19073	`10_000_000_000 / 524288` = full pass over 10B tokens
`warmup_steps`	715	`375_000_000 / 524288` (GPT-3 §2.3: linear warmup over 375M tokens)
`max_lr`	6e-4	GPT-3 paper Table 2.1
`min_lr`	`max_lr * 0.1`	GPT-3 §2.3: “cosine decay to 10% of max”
`betas`	(0.9, 0.95)	GPT-3 §2.3 — note β₂=0.95, not Adam default 0.999
`weight_decay`	0.1	GPT-3 §2.3
Wd applies to 2D+ params only	—	GPT-3 §2.3
`clip_grad_norm`	1.0	GPT-3 §2.3
`n_layer, n_head, n_embd`	12, 12, 768	GPT-2 paper Table 2 (124M row)
`block_size`	1024	GPT-2 paper
`vocab_size` (padded)	50304	Nearest multiple of 128 ≥ 50257, for tensor-core tile alignment. nanoGPT addition, not a paper.
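
As a quick check of the "tokens, not steps" idea, the step counts and the padded vocab in the table can be re-derived from the token budgets. This is a small sketch using only the numbers quoted above; the variable names are mine.

```python
total_batch_tokens = 2 ** 19           # 524288 tokens per optimizer step
train_tokens = 10_000_000_000          # one pass over the 10B-token dataset
warmup_tokens = 375_000_000            # warmup budget quoted from GPT-3

max_steps = train_tokens // total_batch_tokens
warmup_steps = warmup_tokens // total_batch_tokens

# Round the real vocab size up to the next multiple of 128.
vocab_size = 50257
padded_vocab = ((vocab_size + 127) // 128) * 128

print(max_steps, warmup_steps, padded_vocab)   # 19073 715 50304
```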

The provenance layers cleanly:

GPT-3 paper → all training hyperparameters (LR, betas, weight decay, batch, schedule)
GPT-2 paper → architecture (layers, heads, embd, pre-norm, ln_f, GELU)
nanoGPT → implementation tricks (vocab pad to 50304, residual init rescale, fused AdamW)

Gradient accumulation and the global-batch math

Global batch size is fixed at 524288 tokens. With 8 GPUs and per-GPU B=16, T=1024:

per_gpu_tokens     = B * T                       = 16 * 1024       = 16384
tokens_per_micro   = per_gpu_tokens * world_size = 16384 * 8       = 131072
grad_accum_steps   = 524288 / tokens_per_micro   = 524288 / 131072 = 4

Every “macro step” = 4 forward+backward passes per GPU + 1 optimizer step + 1 all-reduce. Compute is dominated by the forward+backward; the all-reduce is bandwidth-bound but small compared to compute.

In code:

for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits, loss = model(x, y)
    loss = loss / grad_accum_steps     # critical
    if ddp:
        model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
    loss.backward()

Two things to be careful about:

loss / grad_accum_steps because loss.backward() accumulates gradients with +=. Without dividing, four micro-steps give 4× the gradient — equivalent to multiplying the LR by 4. Subtle bug.
require_backward_grad_sync is DDP’s escape hatch (the documented equivalent is the model.no_sync() context manager): gradient all-reduce normally fires on every .backward() call. With grad accumulation, we only want to sync on the last micro-step. The intermediate all-reduces would just waste bandwidth syncing partial gradients that will be added to again before the step. Setting the flag to False for the first 3 micro-steps and True only on the 4th skips 3 of every 4 all-reduces, cutting gradient-sync traffic by ~75% with no correctness cost.
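
The loss-scaling point is easy to verify on a toy problem. The sketch below uses a hand-written gradient for a one-parameter model (a made-up example, not the GPT code) and accumulates micro-batch gradients with +=, the way .backward() does.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.0, 5.0, 9.0, 9.0, 13.0, 15.0, 15.0]
w = 0.5
grad_accum_steps = 4
micro = len(xs) // grad_accum_steps

full = grad_mse(w, xs, ys)            # gradient of the full-batch mean loss

scaled = 0.0                          # accumulates grad of (loss / grad_accum_steps)
unscaled = 0.0                        # accumulates grad of the raw micro-batch loss
for k in range(grad_accum_steps):
    sl = slice(k * micro, (k + 1) * micro)
    g = grad_mse(w, xs[sl], ys[sl])
    scaled += g / grad_accum_steps
    unscaled += g

print(abs(scaled - full) < 1e-9)                          # True
print(abs(unscaled - grad_accum_steps * full) < 1e-9)     # True
```

With equal-sized micro-batches, the scaled accumulation reproduces the full-batch gradient, and the unscaled one is exactly grad_accum_steps times too large.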

Speed: bf16, SDPA, TF32, vocab padding

On the 4 GB 1650 baseline at B=4, T=32: ~1080 tokens/sec at fp32. The GTX 1650 is a Turing card without tensor cores, so the Ampere-era speedups below are not available on it. The cloud instance’s A100s are Ampere and get all of them:

bfloat16 autocast for forward+backward. Halves memory bandwidth and unlocks tensor-core throughput. bf16 has fp32’s exponent range, so unlike fp16 you don’t need a gradient scaler — the simple torch.autocast(dtype=torch.bfloat16) context manager works directly.
Scaled-dot-product attention (F.scaled_dot_product_attention) replaces the manual Q@K.T softmax. On Ampere+ this dispatches to FlashAttention 2 under the hood — fused kernel, no materialized (T, T) attention matrix, much better memory traffic. On the 1650 SDPA falls back to manual and gives ~0% speedup; on A100 it’s a real win.
TF32 matmuls via torch.set_float32_matmul_precision('high'). Same fp32 storage but tensor-core compute. Free speedup, no accuracy cost on any benchmark I checked.
vocab_size padded 50257 → 50304. A multiple of 128 aligns with tensor-core tile sizes. No real token ID maps to the 47 extra rows, so they are never used as inputs or targets; through the output softmax the model only learns to push their logits down. They still make the matmul faster: ~144 KB of extra fp32 weights (47 × 768) for several % throughput. Worth it.
torch.compile off for this run because HellaSwag eval has variable shapes (each example has its own max_len) and compile recompiles per shape — 10042 recompiles is catastrophic. Without HellaSwag in the loop, flip compile back on and reclaim ~20-30%.

DDP wraps after compile (when compile is on):

model = GPT(GPTConfig(vocab_size=50304))
model.to(device)
if use_compile:
    model = torch.compile(model)
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])
raw_model = model.module if ddp else model

torch.compile(DDP(model)) would try to trace DDP’s gradient-sync wrappers, which isn’t real compute. Compile-first, DDP-second.

The optimizer: weight decay split

Following GPT-3 §2.3, weight decay applies to “weight” matrices (2D and higher) but not to biases or LayerNorm γ/β (1D parameters). Implementation:

def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
    param_dict = {pn: p for pn, p in self.named_parameters() if p.requires_grad}
    decay_params   = [p for n, p in param_dict.items() if p.dim() >= 2]
    nodecay_params = [p for n, p in param_dict.items() if p.dim() <  2]
    optim_groups = [
        {'params': decay_params,   'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0},
    ]
    use_fused = device_type == 'cuda'
    return torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas,
                             eps=1e-8, fused=use_fused)

fused=True calls a single CUDA kernel for the AdamW update instead of dispatching per-parameter. Trivially faster, no accuracy impact.

LR schedule: warmup + cosine + floor

def get_lr(it):
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    if it > max_steps:
        return min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

Linear ramp from max_lr / warmup_steps up to max_lr over warmup_steps. Then cosine decay to min_lr = 0.1 * max_lr. The cosine starts at coeff=1 (full max_lr) and ends at coeff=0 (full min_lr). At step 7240, decay_ratio = (7240 - 715) / (19073 - 715) ≈ 0.355, coeff = 0.5 * (1 + cos(π·0.355)) ≈ 0.72, and the LR is (0.1 + 0.9 * 0.72) * max_lr ≈ 0.75 * max_lr ≈ 4.5e-4. The LR reaches half of max_lr later, around step 10,500.
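
To sanity-check the schedule, here is a self-contained version of get_lr evaluated at a few steps. The constants are the ones from the hyperparameter table; the choice of probe steps is mine.

```python
import math

max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 715
max_steps = 19073


def get_lr(it):
    # Linear warmup, then cosine decay down to min_lr.
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    if it > max_steps:
        return min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)


# Print the LR as a fraction of max_lr at a few probe steps.
for step in (0, 714, 7240, 10544, 19072):
    print(step, f"{get_lr(step) / max_lr:.3f}")
```

The fraction is tiny at step 0, reaches 1.0 at the end of warmup, is about 0.75 at step 7240, crosses 0.5 near step 10,544, and ends at about 0.1.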

HellaSwag inline eval

HellaSwag is a 4-way multiple-choice commonsense benchmark. Each example: a context + 4 candidate endings, pick the most plausible. Scoring: for each candidate, average the LM’s per-token loss over just the ending tokens. The candidate with lowest loss is the model’s pick.

def get_most_likely_row(tokens, mask, logits):
    shift_logits = logits[..., :-1, :].contiguous()
    shift_tokens = tokens[..., 1:].contiguous()
    shift_losses = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_tokens.view(-1),
        reduction='none',
    ).view(tokens.size(0), -1)
    shift_mask = mask[..., 1:].contiguous()
    masked_losses = shift_losses * shift_mask
    avg_loss = masked_losses.sum(dim=1) / shift_mask.sum(dim=1)
    return avg_loss.argmin().item()

10042 examples, sharded across the 8 ranks. The accuracies all-reduce at the end of each eval block.

Random chance = 25%. OpenAI’s released GPT-2 (124M) baseline = 29.45%. Our reproduction = 26.99%.
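
The scoring rule is easier to see on plain lists. This is a toy sketch with made-up per-token losses (pick_ending is a hypothetical helper, not the eval code): context tokens are masked out with 0, ending tokens are marked with 1.

```python
def pick_ending(per_token_losses, masks):
    """Return the index of the candidate with the lowest mean loss over masked tokens."""
    avg = []
    for losses, mask in zip(per_token_losses, masks):
        total = sum(l * m for l, m in zip(losses, mask))
        avg.append(total / sum(mask))
    return min(range(len(avg)), key=avg.__getitem__)


# Four candidates; the first two positions are context, the last two the ending.
losses = [
    [9.0, 9.0, 2.0, 4.0],   # ending mean 3.0
    [9.0, 9.0, 1.0, 2.0],   # ending mean 1.5
    [9.0, 9.0, 3.0, 3.0],   # ending mean 3.0
    [0.1, 0.1, 5.0, 5.0],   # cheap context, but ending mean 5.0
]
masks = [[0, 0, 1, 1]] * 4

print(pick_ending(losses, masks))   # 1
```

The last row shows why the mask matters: its context is cheap, but only the ending tokens count, so it loses.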

Training memory: optimizer state is most of it

What lives in GPU memory	Training	Inference
Model weights (124M × 4B)	~500 MB	~500 MB
Gradients (one per param)	~500 MB	—
Adam `m` running mean	~500 MB	—
Adam `v` running variance	~500 MB	—
Activations for backward	huge, ∝ B×T	—
Forward activation buffers	yes	yes (ephemeral)
Total	2-4 GB minimum	~600 MB

The fixed training state is what kills training on small cards. A 4 GB GTX 1650 can comfortably infer 124M at fp16 but can’t comfortably train it. Gradients plus AdamW’s two moment buffers add 3× the model size on top of the weights, before you’ve allocated a single activation.
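
The fixed part of that budget is simple arithmetic. A minimal sketch, assuming fp32 (4 bytes) for weights, gradients, and both Adam buffers, and using the padded-vocab parameter count of 124,475,904:

```python
n_params = 124_475_904
bytes_per_param = 4                       # fp32


def megabytes(n_bytes):
    return n_bytes / 1e6


weights = n_params * bytes_per_param
gradients = weights                       # one gradient per parameter
adam_m = weights                          # running mean
adam_v = weights                          # running variance

train_fixed = weights + gradients + adam_m + adam_v

print(round(megabytes(weights)))          # 498
print(round(megabytes(train_fixed)))      # 1992
```

Roughly 2 GB is gone before any activation is stored, which is why the 2-4 GB range in the table is a floor and not a typical figure.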

Cost ledger

Phase	Time	Cost
FineWeb-Edu tokenization (multiprocess on 8× A100 inst.)	~30 min	~$8
Training: 19,073 steps × ~472 ms/step	~2.5 hours	~$40
Setup, idle, HF download throttles, debugging	~1 hour	~$13
Total	~4 hours	$61.23

8× A100 SXM4 on Lambda was $15.92/hr at the time. Every minute typing slowly costs real money — have the next command ready before SSH’ing in.
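
The training row of the ledger follows directly from the step time and the hourly rate. A small sketch using only the figures quoted in this post:

```python
steps = 19_073
sec_per_step = 0.472
hourly_rate = 15.92                  # USD per hour for the 8x A100 instance
tokens_per_step = 524_288

train_hours = steps * sec_per_step / 3600
train_cost = train_hours * hourly_rate
throughput = tokens_per_step / sec_per_step

print(f"{train_hours:.2f} h, ${train_cost:.2f}, {throughput / 1e6:.2f}M tok/s")
# 2.50 h, $39.81, 1.11M tok/s
```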

tmux new -s train keeps the training run alive across SSH drops. Doesn’t save money, but the alternative is a $40 training run dying because your laptop went to sleep.

What the trained model produces

Sample completions at val_loss 3.40 with temperature=1.0, top_k=50, prompt "Hello, I'm a language model,":

Coherent prose about teachers, students, classrooms. Grammatical for 3 sentences.
Rambling but grammatical paragraph on social media and jobs.
Repetition loop: "I'm a language model. I'm a language model. I'm a language model."
Has dialogue with quote marks, sentence-level coherence.

Diagnosis: working LM with sentence-level coherence and occasional repetition loops — exactly the failure mode of a 124M model at val_loss 3.4. At 3.1 the loops fade. With temperature=0.7, top_k=20 the output is cleaner but less creative.

What the model actually is

124,475,904 floating-point numbers plus ~200 lines of Python that combine them. The checkpoint file is those numbers plus a kilobyte of config. Inference loads the numbers and runs the architecture forward.

Tensor	Shape	Count
`wte.weight` (tied with `lm_head`)	50304 × 768	38,633,472
`wpe.weight`	1024 × 768	786,432
12 × Block	each ~7.09M	~85,054,464
`ln_f` (γ + β)	768 × 2	1,536
Total		~124,475,904
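
The total can be recomputed from the config alone. This is a sketch assuming the layout described earlier (biases on every linear layer, two LayerNorms per block plus ln_f, and lm_head tied to wte so it is counted once); gpt2_param_count is a hypothetical helper.

```python
def gpt2_param_count(vocab_size=50304, block_size=1024, n_layer=12, n_embd=768):
    """Count parameters of a GPT-2 style model with tied input/output embeddings."""
    wte = vocab_size * n_embd
    wpe = block_size * n_embd
    layernorm = 2 * n_embd                                   # gamma + beta
    attn = (n_embd * 3 * n_embd + 3 * n_embd) \
         + (n_embd * n_embd + n_embd)                        # c_attn + c_proj
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) \
        + (4 * n_embd * n_embd + n_embd)                     # c_fc + c_proj
    block = layernorm + attn + layernorm + mlp
    return wte + wpe + n_layer * block + layernorm           # last term is ln_f


print(gpt2_param_count())                    # 124475904
print(gpt2_param_count(vocab_size=50257))    # 124439808
```

The second line is the count with the unpadded vocab, 47 × 768 = 36,096 parameters fewer.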

Random init and trained model have the same 124M numbers in identical shapes. The only difference is the numerical values. Training is 19,073 nudges of lr × gradient × (-1) applied to every number. No single number means anything. Meaning emerges from the collective behavior of all of them running through the matmuls.

What would close the remaining gap

Train 5K more steps on 8× A100 to hit OpenAI val loss exactly. ~$80, gets val loss to ~3.29.
Switch base dataset from FineWeb-Edu to OpenWebText. Educational text is high-quality but narrow. WebText is broader and gives better commonsense — closes the HellaSwag gap.
Quantize for inference. int8 or int4 brings the inference footprint from 600 MB to ~150 MB, runs much faster on the 1650.

Code and plots: github.com/debtirthasaha/gpt2-124m-reproduction. The trained checkpoint (523 MB) is on Hugging Face at MR0b0t/gpt2-124m-reproduction.

BPE from scratch, and why your LLM can’t count L’s

2026-04-25T10:00:00+00:00

A byte-pair-encoding tokenizer in pure Python on byte arrays. No NumPy, no neural net, no gradients. The trained tokenizer is two dicts:

merges: {(int, int): int}    # the parameters of the tokenizer
vocab:  {int: bytes}         # derived from merges, used to decode

That’s the entire model. Then a long second half on what the tokenizer makes weird about LLMs: SolidGoldMagikarp, spelling failures, arithmetic failures, and the encode/decode asymmetry.

Why tokenization exists

LLMs eat integers, not text. The tokenizer maps strings ↔ integer sequences.

Two obvious approaches, both bad:

Unicode code points as tokens. ~150K possible code points → vocab too large. The Unicode standard also keeps changing — not stable.
Raw UTF-8 bytes. Vocab is a clean 256, but every text becomes 3-4× longer. Attention is O(T²). Long sequences blow up compute and exhaust context length.

BPE: start at 256 raw bytes, iteratively merge the most frequent adjacent pair into a new token. Sequences shrink, vocab grows in a controlled way, you stop whenever you like. Vocab size is now a tunable hyperparameter.

GPT-2 uses ~50K. GPT-4 uses ~100K. Llama 2 uses ~32K.

UTF-8 in one paragraph

UTF-8 encodes each Unicode code point as 1-4 bytes. ASCII is 1 byte (compatible with the old world). CJK ideographs are 3 bytes. Most emoji are 4. Crucially, not every byte sequence is valid UTF-8 — b'\x80' alone is not a legal start byte. This matters when we look at encode/decode round-tripping.

"Hello".encode('utf-8')   # b'Hello'             (5 bytes)
"안".encode('utf-8')       # b'\xec\x95\x88'      (3 bytes)
"🌊".encode('utf-8')       # b'\xf0\x9f\x8c\x8a'  (4 bytes)

GPT-2 and GPT-4 run BPE directly on UTF-8 bytes (byte-level BPE), as does Llama 3. SentencePiece runs BPE on code points and falls back to bytes only for rare ones. It is clunkier, but you’ll meet it in Llama 2 and Mistral.

The BPE algorithm

Given a sequence of token IDs:

Count adjacent pairs.
Pick the most frequent pair.
Mint a new token ID (256, 257, 258, …).
Replace every occurrence of the pair with the new ID.
Record the merge.

Repeat N times. The dict of recorded merges is the trained tokenizer.

Start (vocab=4):  a a b d a a b d a a b              (length 11)
Most freq: (a,a) → mint Z
Round 1:          Z b d Z b d Z b                    (length 8, vocab 5)
Most freq: (Z,b) → mint Y       ← Z is brand new but already mergeable
Round 2:          Y d Y d Y                          (length 5, vocab 6)

Token 256 can participate in the round-2 merge that creates token 257. BPE is hierarchical — merges form a forest where new merges build on top of old ones. This hierarchy is load-bearing and shows up again when we build vocab from merges and again when we encode.

Implementation

def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

zip(ids, ids[1:]) is the idiomatic way to iterate adjacent pairs. The i < len(ids) - 1 check must come first — otherwise ids[i+1] on the last element raises IndexError. Python’s and short-circuits, so the bounds check before the comparison is the fix.

Training is the loop:

tokens = list(text.encode('utf-8'))   # ints in [0, 255]
ids    = list(tokens)
merges = {}
for i in range(num_merges):
    stats = get_stats(ids)
    pair  = max(stats, key=stats.get)
    idx   = 256 + i
    ids   = merge(ids, pair, idx)
    merges[pair] = idx

list(text.encode('utf-8')) because iterating a bytes object yields ints in Python 3, so list(bytes_obj) is a flat list of ints in [0, 255]. We keep tokens untouched for the compression-ratio report and mutate ids.
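
Putting the pieces together on the toy string from the worked example: a self-contained sketch that wraps the loop in a train function (the wrapper name is mine) and runs two merges on "aabdaabdaab".

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids


def train(text, num_merges):
    ids = list(text.encode("utf-8"))
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        pair = max(stats, key=stats.get)   # ties go to the pair seen first
        idx = 256 + i
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return ids, merges


ids, merges = train("aabdaabdaab", 2)
print(ids)      # [257, 100, 257, 100, 257]
print(merges)   # {(97, 97): 256, (256, 98): 257}
```

Eleven bytes shrink to eight tokens after the first merge and five after the second, and token 257 is built on top of token 256.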

20 merges on 20868 bytes of text: down to 16154 tokens, 1.29× compression. The first pair selected is (101, 32) = 'e ' — words ending in e followed by a space.

[Figure: Sequence length over 20 BPE merges (20,868 bytes -> 16,154 tokens). x-axis: merge step; y-axis: tokens; each point is labeled with the pair merged at that step.]

The biggest drops come early — 'e ' alone saves 592 tokens. Returns diminish: by merge 20, each new pair removes ~130 tokens. The dominant merges are mostly English space-suffix bigrams ('e ', 's ', 't ', 'd ', 'y ', ', ') and high-frequency root pairs ('in', 'th', 'er', 'an', 'or', 'ar', 'en', 'al', 'on', 'ac'). BPE rediscovers the morphological structure of English suffixes and roots from byte frequencies alone.

Building `vocab` from `merges`

Decode needs to know what each token ID is in bytes. Derive it:

def _build_vocab(merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (p0, p1), idx in merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]
    return vocab

The insertion-order requirement is real. Token 258 might be vocab[256] + vocab[257], so both must already exist when we look them up. Python 3.7+ guarantees that dict.items() iterates in insertion order. On older interpreters with unordered dicts, a merge can be visited before its parents and the vocab[p0] lookup raises KeyError.
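
A small sketch of the ordering dependency, using the two merges from the toy example. The second dict is deliberately written in the wrong order to show what happens when a merge is visited before its parent exists.

```python
def build_vocab(merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (p0, p1), idx in merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]
    return vocab


merges = {(97, 97): 256, (256, 98): 257}
vocab = build_vocab(merges)
print(vocab[256], vocab[257])     # b'aa' b'aab'

# Same merges, but the child (257) is listed before its parent (256).
out_of_order = {(256, 98): 257, (97, 97): 256}
try:
    build_vocab(out_of_order)
    failed = False
except KeyError:
    failed = True
print(failed)                     # True
```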

Decode

def decode(ids, vocab):
    raw_bytes = b"".join(vocab[i] for i in ids)
    return raw_bytes.decode('utf-8', errors='replace')

errors='replace' because not every byte sequence is valid UTF-8. If the LLM emits a sequence of token IDs whose concatenated bytes don’t form a valid UTF-8 string, errors='strict' raises UnicodeDecodeError and the inference call crashes. 'replace' substitutes U+FFFD (the ?-in-a-diamond character) and keeps going. OpenAI’s released code does the same.

Encode

The trick: apply merges in the same order they were created during training. Get this wrong and you produce a different token sequence than the trained vocabulary expects.

def encode(text, merges):
    tokens = list(text.encode('utf-8'))
    if len(tokens) < 2:
        return tokens

    while True:
        stats = get_stats(tokens)
        pair  = min(stats, key=lambda p: merges.get(p, float('inf')))
        if pair not in merges:
            break
        tokens = merge(tokens, pair, merges[pair])
    return tokens

Three subtleties:

min, not max. Training picked the most frequent pair. Encoding picks the lowest-rank merge (the one created earliest). Why: later merges depend on earlier ones existing as their building blocks. If the text contains a pair that became token 258 = (256, 257), then tokens 256 and 257 must merge first. Always do the earliest available merge.
float('inf') as fallback. Pairs not in merges get rank infinity. min never picks them. The loop terminates when every remaining pair has rank infinity.
len(tokens) < 2 guard. Empty or single-byte inputs give empty stats, and min() on an empty dict raises ValueError.

The encode(decode(x)) asymmetry

decode(encode("hello world")) == "hello world"   # always
encode(decode([128]))         == [128]            # NOT guaranteed

decode([128]): byte 128 alone is b'\x80', an invalid UTF-8 start byte. With errors='replace', decode returns the replacement character. Re-encoding the replacement character gives different bytes than [128].

Forward (text → tokens → text) is always lossless. Reverse may not be. If your code ever relies on encode(decode(x)) == x, it has a latent bug.
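
The asymmetry shows up even with an empty merge table, where the tokenizer is just raw UTF-8 bytes. A minimal sketch under that simplifying assumption:

```python
def encode(text):
    # With no merges, the token IDs are simply the UTF-8 bytes.
    return list(text.encode("utf-8"))


def decode(ids):
    return bytes(ids).decode("utf-8", errors="replace")


text = "hello world"
print(decode(encode(text)) == text)      # True

s = decode([128])                        # lone continuation byte, invalid UTF-8
print(s == "\ufffd")                     # True
print(encode(s))                         # [239, 191, 189]
print(encode(s) == [128])                # False
```

One invalid byte goes in, and the three bytes of U+FFFD come out.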

GPT-2 regex pre-splitting

Plain BPE happily merges across word and punctuation boundaries: dog., dog!, dog?, dog, end up as separate tokens. GPT-2 prevents this by forcing a split before BPE runs, using a regex:

import regex as re   # pip install regex — NOT the stdlib re

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

re.findall(GPT2_PATTERN, "Hello world! I'm fine.")
# ['Hello', ' world', '!', " I", "'m", ' fine', '.']

Pattern parts:

's|'t|'re|'ve|'m|'ll|'d — English contraction suffixes
` ?\p{L}+` — optional space + Unicode letters (so “ world” is one chunk)
` ?\p{N}+` — optional space + Unicode numbers
` ?[^\s\p{L}\p{N}]+` — optional space + punctuation
\s+(?!\S) and \s+ — whitespace runs

BPE runs per chunk and IDs are concatenated. Two known quirks of the GPT-2 pattern: it matches only the ASCII apostrophe (curly quotes break the contraction rule), and it is case-sensitive (DON'T doesn’t split the same way as don't). GPT-4’s pattern fixes the second with (?i:...); it still matches only the ASCII apostrophe.

GPT-4 also caps numbers at 3 digits per chunk. With arbitrary-length number tokens, 12345 might be 1 token but 12346 might be [12, 346] — totally inconsistent splits that wreck digit-position arithmetic. The 3-digit cap forces predictable behavior.
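
The effect of the cap on pre-splitting can be illustrated with the standard library. This sketch uses stdlib re with \d{1,3} as a stand-in for the \p{N}{1,3} piece of the real pattern (an approximation on my part), and it only shows the chunking step, not the BPE that runs inside each chunk.

```python
import re


def chunk_digits(s, cap=3):
    """Split a digit string left to right into chunks of at most `cap` digits."""
    return re.findall(r"\d{1,%d}" % cap, s)


print(chunk_digits("12345"))      # ['123', '45']
print(chunk_digits("12346"))      # ['123', '46']
print(chunk_digits("1234567"))    # ['123', '456', '7']
```

Neighbouring numbers now split at the same positions, which is the predictability the cap is meant to provide.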

Whitespace in code: a GPT-2 disaster, a GPT-4 fix

GPT-2 makes every space its own token. Four spaces of Python indent = 4 tokens. GPT-4 merges runs of whitespace into single tokens, roughly doubling effective context for indented code. This is one of the largest single improvements in GPT-4’s tokenizer.

Special tokens

Outside BPE entirely. Matched by string before BPE runs, then assigned a hardcoded ID.

GPT-2 vocab layout:
0..255         raw byte tokens
256..50255     50,000 BPE merges
50256          <|endoftext|>           ← only special token

<|endoftext|> marks document boundaries during training:

doc1_tokens + [50256] + doc2_tokens + [50256] + doc3_tokens + ...

The model learns what 50256 means. It’s special only in that the tokenizer never produces it from regular text — only when explicitly requested.

Adding a special token to a pretrained model is surgery: resize the embedding table by N rows, resize the LM head by N columns, typically freeze the base and train only the new slices.

Security: calling a tiktoken encoding's encode(user_input, allowed_special="all") lets users inject <|endoftext|> and confuse boundary logic. The default is strict (special-token text in the input raises an error); opt in only when you know the input is trusted.

Vocab size: the only two places it shows up

vocab_size appears in exactly two places in the model:

Token embedding: nn.Embedding(vocab_size, n_embd)
LM head: nn.Linear(n_embd, vocab_size)

Everything else (attention, MLP, LayerNorm) is independent of vocab size.

	Small (256)	Large (1M)
Sequence length	very long	short
Embed / head size	tiny	huge
Per-token signal	strong (every token seen often)	weak (rare tokens undertrained)

Sweet spot is 32k–100k.
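To put numbers on the "Embed / head size" row, the sketch below counts the weights in those two layers for a few vocab sizes. n_embd = 768 is GPT-2 124M's width; the bias-free head and the untied weights are simplifying assumptions, and the helper name is made up for this example.

```python
def vocab_params(vocab_size: int, n_embd: int = 768) -> dict[str, int]:
    """Weights that depend on vocab_size (untied, no bias on the head)."""
    embedding = vocab_size * n_embd   # nn.Embedding(vocab_size, n_embd)
    lm_head = n_embd * vocab_size     # nn.Linear(n_embd, vocab_size, bias=False)
    return {"embedding": embedding, "lm_head": lm_head}

for v in (256, 50_257, 1_000_000):
    p = vocab_params(v)
    print(f"vocab={v:>9,}  embedding={p['embedding']:>13,}  lm_head={p['lm_head']:>13,}")
```

At GPT-2's 50,257-token vocab each matrix holds about 38.6M weights, a large share of a 124M-parameter model, which is one reason tying the two matrices is attractive.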

SolidGoldMagikarp

The famous tokenization bug. Mechanism:

OpenAI trained the tokenizer on a dataset that included Reddit.
User SolidGoldMagikarp posted enough that BPE merged the username into a single token.
OpenAI then trained the language model on a different, filtered dataset that didn’t include those Reddit posts.
The token exists in vocab but its embedding row was never updated by gradient descent. It’s still the random initialization.
At inference, typing SolidGoldMagikarp loads that random embedding into the model. Undefined behavior.

Observed: the model evades, hallucinates, insults the user, bypasses safety, or gets stuck looping.

The C analogy: reading uninitialized memory. The slot exists, but no one ever wrote a meaningful value to it.

Root cause is the dataset mismatch between tokenizer training and LM training. Prevention is to use the same (or strictly overlapping) datasets, or audit per-token activation counts after pretraining and remove zero-activation tokens.
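A minimal sketch of the audit, assuming the pretraining corpus is available as an iterable of token-ID lists. The function name, vocab_size, and the toy corpus are placeholders, not taken from any real pipeline, and occurrence counts are used here as a proxy for "this embedding row received gradient".

```python
from collections import Counter
from typing import Iterable

def unseen_tokens(corpus: Iterable[list[int]], vocab_size: int) -> list[int]:
    """Return token IDs that never occur in the tokenized training corpus."""
    counts = Counter()
    for ids in corpus:
        counts.update(ids)
    return [tok for tok in range(vocab_size) if counts[tok] == 0]

# Toy example: a 6-token vocab where IDs 4 and 5 never appear.
toy_corpus = [[0, 1, 2], [2, 3, 0], [1, 1]]
print(unseen_tokens(toy_corpus, vocab_size=6))   # [4, 5]
```

In practice a threshold slightly above zero is probably wiser, since a token seen only a handful of times is barely better trained than one never seen.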

Other examples: " davidjl", " TheNitromeFan", " RandomRedditorWithNo". All Reddit usernames or fragments.

Spelling, reversal, arithmetic — all tokenization

Why GPT can’t count the L’s in DefaultCellStyle. DefaultCellStyle is a single token in GPT-4’s vocab. The model sees one opaque ID, not the constituent letters. Asking how many L’s are inside is like asking how many L’s are in the integer 28139.

Workaround for character-level tasks: prompt the model to first split with spaces (D e f a u l t C e l l S t y l e) so each character becomes its own token.

Why arithmetic is brittle. Number tokenization in GPT-2 is essentially arbitrary. 1024 might be one token, 123456 might be [12, 3456]. Digit position is unaligned across examples. Carrying digits requires aligning positions, which is structurally impossible when chunks are random. GPT-4’s 3-digit cap helps. Llama uses split_digits=True (one digit per token).

Why non-English is undertrained. Korean text takes ~3× more tokens than English for the same content. Less training signal per concept, less effective context.

Trailing whitespace warning. In training, " the" is a single token (space + “the”). If you end a prompt with a bare space, the last token is a lone space — almost never seen during training. The model is now out-of-distribution. Don’t end prompts with spaces.

Partial token glitches. Completing "DefaultCellSty" (a partial token) can produce immediate end-of-text, garbage, or content-policy warnings, because that exact subsequence rarely appears in training. tiktoken’s source has an entire “unstable tokens” module for this case.

Token economy

Same data, different format, different token count. JSON has overhead from {, }, ", :, commas. YAML strips most of it.

'{"name": "Alice", "age": 30}'    # ~12 tokens
'name: Alice\nage: 30'             # ~8 tokens

Roughly 15% savings going from JSON to YAML for the same content (the counts above are approximate, and a tiny example like this one overstates the gap). Multiply across an API bill or a long-context model and it matters.

tiktokenizer.vercel.app shows you how anything tokenizes.

Bugs to remember

#	Bug	Symptom	Fix
1	`merge()` no bounds check	`IndexError` on last element	check `i < len(ids) - 1` first
2	`decode()` with `errors='strict'`	crashes on invalid byte sequences	`errors='replace'`
3	`encode()` on empty/single-char	`min({})` raises `ValueError`	early return for `len < 2`
4	Wrong-order vocab build	Python ≤3.6 vocab is silently wrong	require 3.7+
5	GPT-2 ASCII-only apostrophe	curly-quote contractions break	use GPT-4 pattern
6	GPT-2 case-sensitive split	`DON'T` doesn’t split	use GPT-4 pattern with `(?i:...)`
7	Special tokens in user input	boundary confusion / jailbreak	restrict `allowed_special`
8	Trailing whitespace in prompt	OOD final token	don’t end with space
9	Tokenizer dataset ≠ LM dataset	SolidGoldMagikarp glitch tokens	same datasets, audit activations
10	`encode(decode(x)) == x` assumption	silent breakage	only `decode(encode(x))` is safe
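Bug 1 is the easiest to reproduce. Below is a sketch of a merge() with the bounds check in place, modeled on the usual minbpe-style loop; the exact signature is an assumption.

```python
def merge(ids: list[int], pair: tuple[int, int], idx: int) -> list[int]:
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out = []
    i = 0
    while i < len(ids):
        # Check i < len(ids) - 1 first so ids[i + 1] is never out of range.
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
print(merge([1, 2, 6], (6, 7), 99))            # [1, 2, 6]  (6 is last: no IndexError)
```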

Code: github.com/debtirthasaha/bpe-tokenizer. Reference: karpathy/minbpe.

Birkhoff in 8.7 KB

2026-04-20T10:00:00+00:00

SAIR’s Equational Theories competition (Stage 1) — organized by Damek Davis (UPenn), Terence Tao (UCLA), and the SAIR Foundation — gives you a pair of equations over a single binary operator *:

E1:  L1 = R1
E2:  L2 = R2

All variables are universally quantified. A magma is just a set with a binary operation — no axioms beyond closure. The question: does every magma that satisfies E1 necessarily satisfy E2? Output true or false.

The dataset is drawn from Tao’s Equational Theories Project: 4694 magma laws, giving 4694 × 4693 = 22,028,942 ordered implications. The Stage 1 public subsets are normal (1000 problems, 50/50 true/false split), hard1 (69), hard2 (200, 50/50), hard3 (400, 195/205) and an order5 research subset.

The setup is a follow-up to Honda, Murakami & Zhang (2025), Distilling Many-Shot In-Context Learning into a Cheat Sheet: instead of having one model write the cheatsheet, SAIR runs an open competition so the cheatsheet is discovered across submissions. You submit a single Markdown file — a prompt template the harness fills in with the two equations and sends to a fixed set of frozen models (GPT-OSS 120B, Llama 3.3 70B, Gemma 4 31B). No fine-tuning, no agents, no tool calls, no chain-of-thought tax beyond what fits in the 8192-token completion budget. Hard cap on cheatsheet size: 10 KB. Scoring on Stage 1 is correctness (accuracy and F1) only — no proof artefacts, no calibrated probabilities. Those come in Stage 2 (Lean proofs, counterexamples, calibration).

The cheatsheet I submitted is 8.71 KB — under the cap with room to spare. It replaces free-form reasoning with a 9-magma closed-form decision procedure. The headline result: Gemma 4 31B running this prompt beat GPT-OSS 120B by 16 accuracy points on the hardest band.

The math: Birkhoff completeness

By Birkhoff’s theorem, E1 ⊨ E2 (E1 semantically entails E2) iff E2 is derivable from E1 by equational logic — reflexivity, symmetry, transitivity, congruence, substitution. Equivalently, E1 ⊨ E2 iff there is no magma satisfying E1 but not E2.

This gives you exactly two sound moves:

Return false: exhibit a specific magma where E1 holds and E2 fails.
Return true: derive E2 from E1 (or argue no counterexample exists).

Anything else is not a proof. In particular, “the equations look similar / share variables / I don’t see a derivation” is not sound. This is where LLMs get into trouble.

Why free-form prompting struggles

Ask an LLM “does E1 imply E2?” and it will produce English reasoning that looks like a proof. Sometimes it is one. Often it is structural pattern-matching that happens to be wrong on subtle cases: an implication that needs you to construct a specific failing magma, or a non-implication where the equation pair shares enough surface structure that the model is fooled.

The fix is to remove the freedom. Instead of asking the model to reason, give it a closed-form procedure over a finite catalog of magmas, and a hard rule that the only way to return false is to point to a specific catalog entry that refutes.

The 9 magmas

A magma is (M, *). For each one, the cheatsheet supplies a closed-form predicate on the equation tree that decides whether the magma satisfies a given equation. No enumeration over M, no search — a direct formula.

#	Magma `a*b`	Satisfies `L = R` iff
0	`a*b = b` (right-projection)	`rm(L) == rm(R)`
1	`a*b = a` (left-projection)	`lm(L) == lm(R)`
2	`a*b = c` (constant)	both depths ≥ 1, or `L`, `R` same bare var
3	`a*b = a + b` on ℤ/2 (XOR)	`count(v, L) ≡ count(v, R)` mod 2 ∀v
4	`a*b = a + b` on ℤ/3	`count(v, L) ≡ count(v, R)` mod 3 ∀v
5	`a*b = a + b` on ℤ	`count(v, L) == count(v, R)` ∀v
6	`a*b = b + 1` on ℤ/3 (right-successor)	`rm(L) == rm(R)` and `drm(L) ≡ drm(R)` mod 3
7	`a*b = a + 1` on ℤ/3 (left-successor)	`lm(L) == lm(R)` and `dlm(L) ≡ dlm(R)` mod 3
8	`a*b = −a − b` on ℤ/3 (negation-sum)	`signed_count(v, L) ≡ signed_count(v, R)` mod 3 ∀v

Where the tree primitives are:

lm(v) = v,        lm(a*b) = lm(a)               # leftmost leaf
rm(v) = v,        rm(a*b) = rm(b)               # rightmost leaf
dlm(v) = 0,       dlm(a*b) = 1 + dlm(a)         # left-spine length
drm(v) = 0,       drm(a*b) = 1 + drm(b)         # right-spine length
count(v, t) = number of times v appears as a leaf in t
signed_count(v, t) = Σ (-1)^depth(leaf) over leaf occurrences of v

Each row of the table is a one-line check the model can run by walking the equation tree once. No symbolic manipulation, no equational rewriting, no induction.
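The primitives can be sketched as short recursive functions. The term encoding used here (a variable is a string, a product a*b is a 2-tuple) is an assumption made for illustration, as is folding count and signed_count into one helper.

```python
from collections import Counter

# A term is either a variable name (str) or a pair (left, right) meaning left*right.

def lm(t):
    return t if isinstance(t, str) else lm(t[0])        # leftmost leaf

def rm(t):
    return t if isinstance(t, str) else rm(t[1])        # rightmost leaf

def dlm(t):
    return 0 if isinstance(t, str) else 1 + dlm(t[0])   # left-spine length

def drm(t):
    return 0 if isinstance(t, str) else 1 + drm(t[1])   # right-spine length

def count(t, depth=0, signed=False, acc=None):
    """Per-variable leaf counts; with signed=True each leaf adds (-1)**depth."""
    acc = Counter() if acc is None else acc
    if isinstance(t, str):
        acc[t] += (-1) ** depth if signed else 1
    else:
        count(t[0], depth + 1, signed, acc)
        count(t[1], depth + 1, signed, acc)
    return acc

R = ((("x", "y"), "z"), "w")            # ((x*y)*z)*w
print(lm(R), rm(R), dlm(R), drm(R))     # x w 3 1
print(dict(count(R)))                   # {'x': 1, 'y': 1, 'z': 1, 'w': 1}
print(dict(count(R, signed=True)))      # {'x': -1, 'y': -1, 'z': 1, 'w': -1}
```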

The decision procedure

For each equation E, compute sig(E) = (h0, h1, …, h8) — the 9-bit vector of which catalog magmas satisfy E.

def implies(E1, E2):
    s1, s2 = sig(E1), sig(E2)
    for i in range(9):
        if s1[i] and not s2[i]:
            return "false"     # magma i refutes
    return "true"

If any catalog magma satisfies E1 but falsifies E2, return false and the witness is that magma. Otherwise return true — the sound default when no catalog magma produces a refutation. Note that this can be wrong (the true answer might be false via some magma outside the catalog), but it’s wrong in the safe direction: the catalog produces no false falses.
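For reference, here is one way the whole procedure could look end to end. This is a sketch under assumptions: terms are encoded as strings (variables) or 2-tuples (products), the helper names are invented, and implies returns the witness index alongside the answer, which the pseudocode above does not.

```python
from collections import Counter

def leaves(t, depth=0):
    """Yield (variable, depth) per leaf; a term is a str or a (left, right) pair."""
    if isinstance(t, str):
        yield t, depth
    else:
        yield from leaves(t[0], depth + 1)
        yield from leaves(t[1], depth + 1)

def spine(t, side):
    """Return (end leaf, length) of the left (side=0) or right (side=1) spine."""
    n = 0
    while not isinstance(t, str):
        t, n = t[side], n + 1
    return t, n

def counts(t, mod=0, signed=False):
    """Per-variable counts, optionally signed by depth parity and reduced mod `mod`."""
    c = Counter()
    for v, d in leaves(t):
        c[v] += (-1) ** d if signed else 1
    reduced = {v: (n % mod if mod else n) for v, n in c.items()}
    return {v: n for v, n in reduced.items() if n != 0}   # drop zeros so dicts compare

def sig(eq):
    """9-bit signature: which catalog magmas satisfy the equation eq = (L, R)."""
    L, R = eq
    (lmL, dlmL), (lmR, dlmR) = spine(L, 0), spine(R, 0)
    (rmL, drmL), (rmR, drmR) = spine(L, 1), spine(R, 1)
    both_products = not isinstance(L, str) and not isinstance(R, str)
    return (
        rmL == rmR,                                              # 0 right-projection
        lmL == lmR,                                              # 1 left-projection
        both_products or L == R,                                 # 2 constant
        counts(L, 2) == counts(R, 2),                            # 3 a+b on Z/2
        counts(L, 3) == counts(R, 3),                            # 4 a+b on Z/3
        counts(L) == counts(R),                                  # 5 a+b on Z
        rmL == rmR and (drmL - drmR) % 3 == 0,                   # 6 b+1 on Z/3
        lmL == lmR and (dlmL - dlmR) % 3 == 0,                   # 7 a+1 on Z/3
        counts(L, 3, signed=True) == counts(R, 3, signed=True),  # 8 -a-b on Z/3
    )

def implies(e1, e2):
    """Return ('false', i) if catalog magma i refutes e1 => e2, else ('true', None)."""
    s1, s2 = sig(e1), sig(e2)
    for i in range(9):
        if s1[i] and not s2[i]:
            return "false", i
    return "true", None

# x = x*y does not imply x = y*x: left-projection (magma 1) is the witness.
print(implies(("x", ("x", "y")), ("x", ("y", "x"))))   # ('false', 1)
```

Python's % already returns a non-negative residue for negative operands, so the signed counts for magma 8 need no special handling here.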

Refutation discipline

The hardest part of getting the LLM to be sound is making it stop hallucinating counterexamples. The cheatsheet enforces:

A refutation must name an index i ∈ {0..8} with sig(E1)[i]=T and sig(E2)[i]=F. Anything else is not a refutation. Do not infer false from structural similarity, shared letters, or the absence of an obvious derivation.

And the cheatsheet’s last instruction before falling back to true:

If either bit is uncertain after re-check, return true. Sound procedure: a hallucinated refutation is worse than a missed one, because a missed refutation at least falls back to the mathematically honest default.

This asymmetric default is the lever. Free-form LLM reasoning falsely says false constantly. Constraining false to “name your i” cuts those errors almost entirely, at the cost of a few extra false-true answers (which the catalog can’t help anyway).

The mod-3 magmas are where errors happen

Magmas 6, 7, 8 are the ones the cheatsheet spends the most space on, because they’re the ones models reliably get wrong. The error pattern: get lm/rm right, then either skip computing dlm/drm or compute them wrong, or get the mod-3 arithmetic wrong (signed counts can be negative; −1 mod 3 = 2).

The fix is to inline worked examples for each, written so the model has to walk the tree. Example for magma 7:

Example A (TRUE): x = ((x*y)*z)*w
  lm(L) = x. For R, descend left: R → (x*y)*z → x*y → x. lm(R) = x. Match.
  dlm(L) = 0. dlm(R) counts those 3 left-descents = 3.
  0 mod 3 = 0, 3 mod 3 = 0. Equal. h7 = TRUE.

Example B (FALSE): x = (x*y)*z
  lm(L) = lm(R) = x. Match.
  dlm(L) = 0, dlm(R) = 2.
  0 mod 3 = 0, 2 mod 3 = 2. Differ. h7 = FALSE.

The example doesn’t teach the model what dlm is. It teaches it that it has to walk the tree before answering, by being a worked instance with the walk shown.
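Both examples can be cross-checked by brute force: evaluate each side in the actual magma (a*b = a + 1 on ℤ/3) for every assignment of the variables. The tuple encoding of terms and the helper names are assumptions for illustration.

```python
from itertools import product

def evaluate(t, env, op):
    """Evaluate a term (str variable or (left, right) pair) under assignment env."""
    if isinstance(t, str):
        return env[t]
    return op(evaluate(t[0], env, op), evaluate(t[1], env, op))

def variables(t):
    """Set of variable names occurring in a term."""
    return {t} if isinstance(t, str) else variables(t[0]) | variables(t[1])

def holds(L, R, op, carrier):
    """True iff L = R under every assignment of carrier elements to the variables."""
    names = sorted(variables(L) | variables(R))
    return all(
        evaluate(L, dict(zip(names, vals)), op) == evaluate(R, dict(zip(names, vals)), op)
        for vals in product(carrier, repeat=len(names))
    )

def left_successor(a, b):
    return (a + 1) % 3          # magma 7

example_a = holds("x", ((("x", "y"), "z"), "w"), left_successor, range(3))
example_b = holds("x", (("x", "y"), "z"), left_successor, range(3))
print(example_a, example_b)     # True False
```

This only works for the finite catalog entries; the closed-form predicates are what make the check cheap enough to do by hand in a prompt.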

Magma 8 (signed counts mod 3) gets the most defensive treatment, because depth-parity arithmetic with negative residues is the most error-prone single check in the whole procedure.

The mandatory PARSE step

Before any rule fires, the cheatsheet requires the model to explicitly produce lm, rm, dlm, drm, depth, and the per-variable counts for both equation sides. Skipping this step is the single largest source of wrong answers — models guess lm(R) instead of descending the tree.

Forcing the structured output reframes the problem. The model isn’t reasoning about magma implication anymore; it’s filling in a six-slot form, then running nine if-statements, then doing one loop with a hard stop condition. This format is much friendlier to the LLM substrate than free-form math.

Results

The Stage 1 leaderboard has 1,061 participants. It scores on accuracy and F1 across three difficulty buckets — normal, hard, extra_hard — for each of the three frozen models. Restricted leaderboards (single model, or the order-5 research subset) score the same submission from a different angle.

Cheatsheet size: 8.71 KB. Mean parse success: 100%. Mean per-query cost (across all models, all sets): roughly $0.0004–$0.0009 depending on model.

Overall leaderboard (all models, all sets) — rank 85

Model	Set	Accuracy	F1
GPT-OSS 120B	normal	77.8%	81.7%
GPT-OSS 120B	hard	74.0%	79.3%
GPT-OSS 120B	extra_hard	49.7%	66.4%
Llama 3.3 70B	normal	61.0%	63.1%
Llama 3.3 70B	hard	56.5%	61.6%
Llama 3.3 70B	extra_hard	31.7%	41.1%
Gemma 4 31B	normal	52.0%	13.3%
Gemma 4 31B	hard	51.0%	16.5%
Gemma 4 31B	extra_hard	65.8%	48.4%

Aggregate: 57.7% acc / 52.4% F1 → rank 85.

[Figure: Accuracy by model and difficulty. Grouped bar chart of accuracy (%) by difficulty (normal, hard, extra_hard) for GPT-OSS 120B, Llama 3.3 70B, and Gemma 4 31B; values as in the table above.]

The crossover on the extra_hard band is the result the rest of the post is about. Gemma (the smallest model) lands above GPT-OSS (the largest) for the first time.

GPT-OSS-only leaderboard — rank 13

Restricting the same submission to the 120B model: 67.2% acc / 75.8% F1.

Order-5 research subset (GPT-OSS) — rank 21

Order-5 equations are deeper trees and form a separate research-tier leaderboard. 79.8% acc / 83.2% F1.

The Gemma extra-hard anomaly — rank 20

On the extra-hard set specifically, the per-model ranks tell a different story:

Model	Params	Extra-hard accuracy	Rank on extra-hard
Gemma 4 31B	31B	65.8%	20
GPT-OSS 120B	120B	49.7%	157
Llama 3.3 70B	70B	31.7%	278

Gemma 4 31B — the smallest model in the eval — got the best extra-hard accuracy with this prompt, by a margin (16 points over GPT-OSS, 34 points over Llama). On a model 4× smaller than GPT-OSS.

The likely explanation: extra-hard problems benefit most from structured procedure-following. GPT-OSS at 120B has more “smart enough to deviate” headroom — it interprets the cheatsheet, decides parts of it are unnecessary, and falls back to free-form reasoning that fails on the hardest cases. Gemma at 31B has less headroom for that kind of agency. It follows the procedure step by step because that’s what fits in its working memory, and the procedure is sound. On the easier sets where GPT-OSS’s looser interpretation usually gets the right answer anyway, the gap is reversed.

If true, this is a real prompt-engineering result: highly structured, low-freedom prompts may favor smaller models on the hardest problems, because the largest models will second-guess the structure and lose.

F1 is low on Gemma — why

Gemma’s F1 on normal/hard is low (13.3%, 16.5%) despite reasonable accuracy. This is the asymmetric default at work: Gemma takes the “if uncertain, return true” instruction very literally and answers true on most edge cases. Accuracy stays okay because the base rate of true in the dataset is high. F1 collapses because Gemma is producing few false answers, so the precision/recall on the false class is bad. The same instruction that fixes GPT-OSS’s hallucinated refutations over-fixes Gemma’s. This is the cost of a single uniform prompt across very different models.

Sanity check: Qwen 2.5 7B locally

After the submission I ran the same cheatsheet against a much smaller model — Qwen 2.5 7B, 4-bit quantized, running locally on a 4 GB GTX 1650 via Ollama — over the first 50 problems of hard3 (the local dataset roughly aligned with the competition’s extra-hard band). Same prompt, same temperature 0, same parse rule (take the last true/false token).

Model	Params	Quant	Set	N	Accuracy	F1 (true)	Precision	Recall
Qwen 2.5 7B	7B	4-bit	hard3 (first 50)	50	56.0%	66.7%	53.8%	87.5%
Gemma 4 31B	31B	fp16	extra_hard	500	65.8%	48.4%	—	—
GPT-OSS 120B	120B	—	extra_hard	500	49.7%	66.4%	—	—
Llama 3.3 70B	70B	—	extra_hard	500	31.7%	41.1%	—	—

Confusion matrix on Qwen: TP=21, TN=7, FP=18, FN=3 — recall on true is 87.5%, precision is 53.8%. Same lopsided pattern as Gemma. The model defaults to true and is dragged down by the false-positive count on actual-false problems.
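The precision, recall, and F1 figures follow directly from those counts. A small helper (the function name is illustrative) that treats true as the positive class:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the positive ('true') class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf1(tp=21, fp=18, fn=3)
print(f"precision={p:.1%}  recall={r:.1%}  F1={f1:.1%}")
# precision=53.8%  recall=87.5%  F1=66.7%
```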

50 problems is a small slice and the bands aren’t a perfect match. Even so: a 7B model at 4-bit, running on a single laptop GPU, comes in above GPT-OSS 120B and Llama 3.3 70B on the hardest band. The cheatsheet is doing more of the work than the model is.

A representative failure on Qwen (problem 1 of hard3):

E1:  x = x * (x * y)
E2:  x = (x * ((x * x) * x)) * x
actual: true     pred: false

The model’s trace claims sig(E1)[2] = T for the Constant magma. The rule says depth(L) ≥ 1 ∧ depth(R) ≥ 1 or L, R same bare var. Here L = x (a leaf — depth 0), so the first conjunct fails; they’re not the same bare variable either. h2 should be F, the model wrote T, and a non-existent refutation gets reported as a real one.

This is the exact category the cheatsheet warns about: a tree-walking error on the depth primitive, propagating into a wrong bit, propagating into a hallucinated refutation. The fix for a future version is more aggressive worked examples on the Constant rule — the same defensive treatment magmas 6, 7, 8 already get. The PARSE step probably also needs to force the model to emit depth(L) and depth(R) as explicit lines before any rule fires.

What I’d change

Per-model branches. The Gemma F1 collapse is fixable: weaken the refutation discipline slightly for smaller models so they produce false more often, while keeping it strict for the larger ones. The competition’s “one prompt, three models” rule made this not an option for the submission, but it’s the obvious next experiment.
More magmas. Extending the catalog past 9 is mostly mechanical: every magma you add comes with its closed-form predicate, and you plug it into sig. The hard part is finding magmas that actually refute on the test distribution. An earlier draft had a 4-element bit-swap magma a*b = 2(a mod 2) + floor(b/2) (the “C4” magma), which can be characterized algebraically via slot-pairs; it added measurable coverage on the harder sets in offline testing, but I cut it from the submission to keep the cheatsheet at a size where smaller models still parse it reliably.
Re-run the PARSE step under model-specific delimiters. Different model families parse code blocks and bullet lists with slightly different reliability. The PARSE step is the single most important determinant of accuracy on the hard set; getting that step to fire correctly for Gemma vs Llama vs GPT-OSS is worth more than any new magma.

Apply the procedure to this instance

The submission file ends with the actual placeholder block the harness substitutes into. The full cheatsheet, the Qwen 2.5 7B sanity-check results, and the verification scripts are at github.com/debtirthasaha/equational-theories-cheatsheet.

Tiny Shakespeare, tiny GPT

2026-04-15T10:00:00+00:00

Same architecture as GPT-2, scaled to fit a 4 GB GPU and trained on 1 MB of Shakespeare. Built one mechanism at a time, measuring val loss after each addition.

After adding	Val loss
Bigram baseline	2.88
+ single-head self-attention	2.41
+ multi-head	2.32
+ feed-forward MLP	2.23
+ residual + LayerNorm (3 blocks)	2.09
Scaled up (4 layers, 192-d, 6 heads)	1.59

[Figure: Validation loss after each architectural addition. Bar chart of val loss for bigram, + single-head attn, + multi-head, + feed-forward, + residual + LN (x3), and scaled up; values as in the table above.]

1.83M parameters, val loss 1.59. Output has speaker tags, sentence rhythm, and (mostly) closed quotes.

Data, in one tensor

Dataset is input.txt — every Shakespeare play, 1,115,394 characters, 65 unique characters including \n.

import torch

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
stoi  = {ch: i for i, ch in enumerate(chars)}
itos  = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]   # 1,003,854 tokens
val_data   = data[n:]   #   111,540 tokens

That’s the tokenizer. (A real BPE tokenizer comes in the next post.)

Sampling chunks, not whole documents

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i    : i + block_size    ] for i in ix])
    y = torch.stack([data[i + 1: i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

y is x offset by one position, so y[t] is the character that follows x[t]. Each position in the chunk is one training example: the model predicts position t+1 given positions 0..t. A chunk of length 8 produces 8 examples in parallel. Stacking batch_size chunks gives independent training signal; sequences in a batch don’t communicate.
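Spelled out for a single chunk with block_size = 8. Plain lists are used here instead of tensors, and the token IDs are arbitrary placeholders.

```python
block_size = 8
data = [18, 47, 56, 57, 58, 1, 15, 47, 58]    # block_size + 1 token IDs (placeholder values)

x = data[:block_size]          # inputs
y = data[1:block_size + 1]     # targets: same window, one position later

# Position t contributes one example: context x[0..t] -> target y[t].
examples = [(x[: t + 1], y[t]) for t in range(block_size)]
for context, target in examples:
    print(f"when input is {context} the target is {target}")
```

One chunk of 9 tokens therefore yields 8 (context, target) pairs, with context lengths from 1 up to block_size.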

The mathematical trick

Before self-attention there’s one observation worth dwelling on: how do you let each position aggregate information from all previous positions in parallel?

Goal: xbow[b, t] = mean(x[b, 0..t]). The naive version is a double loop over b and t; the fast version is a matrix multiply.

Build a lower-triangular matrix of equal weights:

wei = [[1.00, 0.00, 0.00, 0.00],
       [0.50, 0.50, 0.00, 0.00],
       [0.33, 0.33, 0.33, 0.00],
       [0.25, 0.25, 0.25, 0.25]]

Now wei @ x produces, for each row of wei, a weighted sum over the value vectors. Row t only mixes positions 0..t. Same answer as the loop, one matmul.
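A sketch of the equivalence for one sequence, written with plain lists so it runs without PyTorch. In tensor code the weights would typically come from a lower-triangular matrix of ones followed by a row normalization; that detail, and the toy values, are assumptions.

```python
T, C = 4, 2
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # (T, C) toy values

# Version 1: explicit loop, xbow[t] = mean of x[0..t].
xbow_loop = [
    [sum(x[s][c] for s in range(t + 1)) / (t + 1) for c in range(C)]
    for t in range(T)
]

# Version 2: lower-triangular weights, each row normalized to sum to 1.
wei = [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

# wei @ x as a plain matrix multiply: (T, T) @ (T, C) -> (T, C).
xbow_matmul = [
    [sum(wei[t][s] * x[s][c] for s in range(T)) for c in range(C)]
    for t in range(T)
]

same = all(
    abs(a - b) < 1e-9
    for row_a, row_b in zip(xbow_loop, xbow_matmul)
    for a, b in zip(row_a, row_b)
)
print(xbow_matmul)
print(same)
```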

Self-attention will turn wei from “equal weights” into “weights computed from the data.” The matrix-multiply-with-causal-mask structure stays.

Self-attention, single head

Three linear projections from the residual stream to a smaller head_size:

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)
        return wei @ v

Three details worth pointing at:

1/sqrt(head_size) scaling. Without it, the variance of the q·k dot products grows with head_size. Softmax then collapses to near-one-hot and gradients stop flowing. Scaling keeps the softmax diffuse at init.
register_buffer for tril. The mask is a constant; we don’t want it tracked as a learnable parameter, but we do want it to move to GPU when model.to(device) is called.
Mask is −inf on upper triangle. After softmax, exp(−inf) = 0. Future positions get exactly zero weight.
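The scaling and masking effects can both be seen on a single row of scores. The numbers below are toy values chosen for illustration, and the softmax helper is a plain-Python stand-in for F.softmax.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list; -inf entries get exactly zero weight."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

head_size = 64
raw = [16.0, -8.0, 4.0, 24.0]     # toy dot products, spread on the order of sqrt(head_size)

peaky = softmax(raw)                                       # no scaling
diffuse = softmax([r * head_size ** -0.5 for r in raw])    # scaled by 1/sqrt(head_size)

# Causal mask for the row at position t = 1: positions 2 and 3 are in the future.
masked = softmax([raw[0] / 8, raw[1] / 8, float("-inf"), float("-inf")])

print([round(p, 3) for p in peaky])
print([round(p, 3) for p in diffuse])
print([round(p, 3) for p in masked])
```

Without scaling, nearly all the weight lands on one position; with scaling, several positions keep a meaningful share. The masked entries come out as exactly 0.0, not merely small.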

Single head added on top of the bigram baseline drops val loss 2.88 → 2.41.

Multi-head attention

Run n_head heads in parallel, concatenate, then project back to n_embd:

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj  = nn.Linear(head_size * num_heads, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

head_size = n_embd // n_head, so concatenation gets you back to n_embd. The final proj is what lets heads interact — concatenation alone just glues outputs together.

nn.ModuleList instead of a plain Python list: PyTorch needs to know about these submodules to register their parameters. A plain list is invisible to model.parameters().

Val loss: 2.32.

Feed-forward: per-token computation

After attention mixes information across positions, the FFN lets each position think about what it gathered. Same MLP applied to every token independently:

self.net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

4× expansion is from the original Transformer paper. Val loss: 2.23.

A useful mental model from this point on: attention is communication, feed-forward is computation. Tokens talk to each other in attention, then each one updates its own representation in the FFN.

The block: pre-norm + residual

class Block(nn.Module):
    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

Three things going on:

Residual x + ...: the addition creates a gradient highway. Gradients flow back through + undisturbed to earlier layers. Without it, deep networks have vanishing gradients.
LayerNorm before the sub-block (pre-norm): the original 2017 paper put LN after; modern practice puts it before. Pre-norm trains more stably at depth.
Sub-blocks start near-zero in output: at init, attention and FFN contribute tiny perturbations to the residual stream. Useful signal accumulates gradually.

Val loss after stacking 3 such blocks: 2.09.

Scaling up

batch_size    = 64
block_size    = 128
n_embd        = 192
n_head        = 6      # head_size = 192/6 = 32
n_layer       = 4
dropout       = 0.2

1.83M parameters. Val loss 1.59. The drop from 2.09 → 1.59 is mostly capacity — same architecture, just more of it.
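The parameter count can be reproduced by hand from these hyperparameters. The layer layout below is an assumption (the post shows only parts of the model): bias-free key/query/value, biased projection and MLP layers, two LayerNorms per block, a final LayerNorm, a learned position-embedding table, and an untied, biased lm_head.

```python
vocab_size, block_size = 65, 128
n_embd, n_head, n_layer = 192, 6, 4
head_size = n_embd // n_head

# Per-block weights under the assumed layout.
attn = n_head * 3 * n_embd * head_size            # key, query, value per head (no bias)
proj = n_embd * n_embd + n_embd                   # attention output projection
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                    # ln1 + ln2, gain and bias each
block = attn + proj + mlp + layer_norms

embeddings = vocab_size * n_embd + block_size * n_embd   # token + position tables
final = 2 * n_embd + (n_embd * vocab_size + vocab_size)  # final LayerNorm + lm_head

total = embeddings + n_layer * block + final
print(f"{total:,} parameters ({total / 1e6:.2f}M)")   # 1,827,137 parameters (1.83M)
```

Almost all of the budget sits in the four blocks; the embeddings and head together are under 3% at a 65-character vocab.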

Sample output:

DUKE VINCENTIO:
Whither dost thou pursue, and what shall be done
To these things that he was forc'd to make?

ANGELO:
My lord, I will entreat your grace's hand.

The model is making it up character by character. There is no concept of a word in its vocabulary. It learned word boundaries, speaker tags, and sentence structure from 1 MB of text.

What’s missing vs GPT-2

This is GPT-2’s architecture at small scale. The pieces not in this build:

A real tokenizer (we used per-character; GPT-2 uses BPE, covered in the BPE tokenizer post).
Weight tying between wte and lm_head (the input embedding and output classifier share the same matrix in GPT-2 — lm_head.weight = wte.weight).
Initialization variants (GPT-2 scales the output projection of each residual layer by 1/sqrt(2*n_layer) to control variance through depth).
A proper optimizer recipe (cosine LR schedule, weight decay split, warmup) and DDP for multi-GPU.

All of those show up in the GPT-2 reproduction.

Code: github.com/debtirthasaha/tiny-gpt-shakespeare.

makemore: from counting bigrams to a WaveNet

2026-04-08T10:00:00+00:00

names.txt: 32,033 names, one per line. Vocabulary is 26 letters + . (start/end token) = 27 characters. Every name emma is wrapped to .emma. and the bigrams are (.,e), (e,m), (m,m), (m,a), (a,.). Goal: predict the next character.

Five models, each adding one mechanism. Loss is negative log likelihood, lower is better.

Model	Mechanism	Val NLL
Bigram counts	27×27 count matrix, +1 smoothing	2.45
Bigram NN	27→27 logits, softmax, gradient descent	2.46
MLP (Bengio 2003)	3-char context, 10-dim embedding, 200-hidden tanh	2.10
MLP + BN + Kaiming	same + proper init + batch norm	2.05
WaveNet-style	hierarchical pairwise fusion, 8-char context	1.99

[Figure: Validation NLL across the five models. Bar chart of val NLL (lower = better) for bigram counts, bigram NN, MLP, MLP + BN + Kaiming, and WaveNet-style; values as in the table above.]

1. Counting bigrams

Build the count matrix directly:

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

P = (N + 1).float()
P /= P.sum(dim=1, keepdim=True)

+1 smoothing avoids log(0) on bigrams that never appeared in training.

Sampling is torch.multinomial(P[ix], num_samples=1) in a loop until you draw the . token.

NLL = −mean(log P[bigram]) over the training set = 2.4543. As a sanity check: exp(−2.45) ≈ 8.6%, vs 1/27 ≈ 3.7% for uniform random. In geometric-mean terms, the bigram model assigns roughly 2.3× the probability of chance to the correct next character.
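The NLL computation itself is short. Here is a pure-Python sketch on a three-name toy list; the names are a stand-in for names.txt, so the resulting number is not comparable to the figure quoted for the full dataset.

```python
import math
from collections import Counter

words = ["emma", "olivia", "ava"]                # toy stand-in for names.txt
alphabet = "." + "abcdefghijklmnopqrstuvwxyz"    # 27 symbols, '.' marks start/end

# Count bigrams, wrapping each name in '.' on both sides.
counts = Counter()
for w in words:
    for ch1, ch2 in zip("." + w, w + "."):
        counts[ch1, ch2] += 1

def prob(ch1, ch2):
    """P(ch2 | ch1) with +1 smoothing over the 27-symbol alphabet."""
    row_total = sum(counts[ch1, c] + 1 for c in alphabet)
    return (counts[ch1, ch2] + 1) / row_total

# Average negative log likelihood over every bigram in the data.
log_probs = [
    math.log(prob(ch1, ch2))
    for w in words
    for ch1, ch2 in zip("." + w, w + ".")
]
nll = -sum(log_probs) / len(log_probs)
print(f"NLL = {nll:.4f}, uniform baseline = {math.log(27):.4f}")
```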

2. The same bigram model as a neural net

Same model, found by gradient descent instead of counting:

xenc = F.one_hot(xs, num_classes=27).float()   # (N, 27)
W    = torch.randn((27, 27), requires_grad=True)

logits = xenc @ W
counts = logits.exp()
probs  = counts / counts.sum(1, keepdim=True)
n      = xs.nelement()                         # number of bigram examples
loss   = -probs[torch.arange(n), ys].log().mean()

xenc @ W is a row lookup (one-hot times matrix = pick a row of W). The “logits” are log-counts up to a constant. softmax(logits) matches the row-normalized count matrix. Trained 200 steps with lr=50, lands at NLL 2.4576 — within 0.01 of the count model.

The takeaway: this is the same model, just parameterized differently. The neural net’s W converges to the log of the count matrix. The equivalence breaks the moment you add nonlinearity or more context.

3. MLP, Bengio 2003

Bigrams are too local. With context .. you can’t tell e from o; with context ..em you can. Bump context from 1 → 3 characters.

input: 3 char indices, e.g. [0, 0, 5]
  → embedding C: (27, 10)             → (3, 10)
  → concatenate                       → (30,)
  → Linear(30, 200) + tanh
  → Linear(200, 27)                   → logits
  → softmax                           → next-char distribution

Dataset built by sliding a 3-window over each name:

context           target
['.', '.', '.']   'e'
['.', '.', 'e']   'm'
['.', 'e', 'm']   'm'
['e', 'm', 'm']   'a'
['m', 'm', 'a']   '.'

build_dataset() returns X (228146, 3) and Y (228146,). 80/10/10 train/dev/test split.
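A sketch of the sliding window in plain Python. It keeps characters rather than integer indices so the output is readable; the post's own build_dataset() returns index tensors, and this signature is an assumption.

```python
def build_dataset(words, block_size=3):
    """Slide a block_size window over each name; '.' pads the start and marks the end."""
    X, Y = [], []
    for w in words:
        context = ["."] * block_size
        for ch in w + ".":
            X.append(list(context))
            Y.append(ch)
            context = context[1:] + [ch]   # drop the oldest char, append the new one
    return X, Y

X, Y = build_dataset(["emma"])
for context, target in zip(X, Y):
    print(context, "->", repr(target))
```

A name of length n contributes n + 1 examples, because the closing '.' is a target too.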

Forward:

emb = C[Xb]                          # (B, 3, 10)
h   = torch.tanh(emb.view(-1, 30) @ W1 + b1)   # (B, 200)
logits = h @ W2 + b2                  # (B, 27)
loss = F.cross_entropy(logits, Yb)

emb.view(-1, 30) flattens the 3-char window into a 30-d vector. Same network sees position-dependent patterns because each character’s embedding occupies a different slice of the input.

Trains in ~30 sec. Val loss ~2.10. Sampled names start sounding like names: montelle, kymbry, madiet.

4. The three init bugs nobody tells you about

The MLP works, but if you instrument it, three things are quietly broken at step 0.

Initial loss is too high. Loss at random init is ~27 (exploded softmax). The expected value for a uniform guess is −log(1/27) ≈ 3.30. Cause: W2 and b2 initialized from N(0, 1) produce logits with huge variance, so softmax assigns near-1 probability to one random class, and if it’s not the right one, −log(tiny) ≈ huge. Fix: multiply W2 by ~0.01 and zero b2. Initial loss drops to 3.32.
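Both numbers are easy to reproduce by hand. The sketch below computes cross-entropy for one example in pure Python: once with all logits equal (the uniform case) and once with a single large wrong logit. The value 25.0 is made up to mimic an exploded logit, not taken from the run.

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

uniform_loss = cross_entropy([0.0] * 27, target=5)   # all logits equal

# One confidently wrong logit (illustrative value only).
confident_wrong = [0.0] * 27
confident_wrong[3] = 25.0
bad_loss = cross_entropy(confident_wrong, target=5)

print(f"uniform: {uniform_loss:.4f}, confidently wrong: {bad_loss:.4f}")
```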

Tanh saturation. Most pre-activations land outside [-2, 2] at init, where tanh is flat. The local gradient (1 − tanh²(x)) is near 0 there, so gradients can’t flow through these neurons and they’re effectively dead. Diagnose with (h.abs() > 0.99).float().mean(0), which gives the saturated fraction per neuron; at init this is >97% for some neurons. Fix: scale W1 so that the pre-activation (x @ W1) has variance ~1.
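To see how quickly the local gradient collapses, here is the factor 1 − tanh²(x) evaluated at a few pre-activation values. A small illustration in plain Python, nothing model-specific:

```python
import math

def tanh_local_grad(x):
    """d tanh(x) / dx = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 2.0, 3.0, 5.0):
    print(f"x={x:4.1f}  tanh={math.tanh(x):.4f}  local grad={tanh_local_grad(x):.6f}")
```

Whatever gradient arrives from the loss is multiplied by this factor, so a neuron sitting at |x| > 2 passes back only a few percent of it.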

Eyeballing the scaling factor. Kaiming He’s paper gives the formula directly: for a layer with fan_in inputs, initialize weights from a zero-mean normal with standard deviation gain/sqrt(fan_in), where gain = 5/3 for tanh and sqrt(2) for relu. PyTorch ships this as torch.nn.init.kaiming_normal_.

After Kaiming init: pre-activations stay in [-2, 2], no dead neurons, loss starts where it should. Val loss drops from 2.10 to ~2.07 just from fixing initialization.
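A rough numerical check of the scaling rule, assuming unit-variance inputs and fan_in = 30 (3 characters × 10 embedding dims). It uses Python's random module rather than PyTorch, and the sample count is arbitrary. With std 1 weights the pre-activation spread is about sqrt(30); with std = gain/sqrt(fan_in) it comes out near the gain.

```python
import math
import random

def kaiming_std(fan_in, gain):
    return gain / math.sqrt(fan_in)

fan_in = 30                               # 3 characters x 10-dim embeddings
std_tanh = kaiming_std(fan_in, gain=5 / 3)

random.seed(0)

def preact_std(weight_std, n_samples=4000):
    """Empirical std of sum_i x_i * w_i; each sample draws fresh inputs and fresh weights."""
    vals = []
    for _ in range(n_samples):
        vals.append(sum(random.gauss(0, 1) * random.gauss(0, weight_std)
                        for _ in range(fan_in)))
    mean = sum(vals) / len(vals)
    return math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))

std_unscaled = preact_std(1.0)            # roughly sqrt(30)
std_scaled = preact_std(std_tanh)         # roughly 5/3

print(f"weight std {std_tanh:.4f}: pre-activation std {std_scaled:.2f} (unscaled: {std_unscaled:.2f})")
```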

5. BatchNorm: forcing the distribution post-hoc

Kaiming gets you into the right range at init. As you train, weights drift, distributions shift again. BatchNorm normalizes the pre-activation distribution every forward pass:

bnmeani = hpreact.mean(0, keepdim=True)
bnstdi  = hpreact.std(0, keepdim=True)
hpreact = bngain * (hpreact - bnmeani) / bnstdi + bnbias

bngain and bnbias are learnable; they let the network un-do the normalization if it wants. In practice they stay small — the network mostly wants the normalized version.

The annoying part is inference. At inference there is no batch — you might be predicting one example at a time. So BatchNorm keeps an exponential moving average of train-time batch statistics and uses those at eval. Two extra non-learnable buffers per BN layer. Train/eval modes diverge.

This is also why model.eval() matters: without it, BatchNorm at inference would use the single-example statistics (variance = 0, division by zero, garbage output).
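The train/eval split is easier to see in code. Below is a toy batch norm over a single scalar feature, in pure Python. It is a sketch of the idea, not PyTorch's implementation: it uses the biased variance for simplicity, and the momentum and eps values are assumptions.

```python
import math

class TinyBatchNorm1d:
    """Batch norm over one scalar feature (illustrative only)."""
    def __init__(self, momentum=0.1, eps=1e-5):
        self.gain, self.bias = 1.0, 0.0                  # learnable in a real layer
        self.running_mean, self.running_var = 0.0, 1.0   # buffers, not trained
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)   # biased variance
            # exponential moving average of the batch statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        return [self.gain * (x - mean) / math.sqrt(var + self.eps) + self.bias for x in xs]

bn = TinyBatchNorm1d()
for _ in range(200):                  # "training" on the same batch repeatedly
    bn([2.0, 4.0, 6.0, 8.0])

bn.training = False                   # the equivalent of model.eval()
single = bn([5.0])                    # one example, normalized with the running stats
```

In eval mode the single example is normalized against the accumulated mean and variance, so the output is finite even though the "batch" has no spread of its own.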

Val loss with init fixes + BN: ~2.05.

6. Manual backprop, every gradient by hand

For one block of training I deleted loss.backward() and computed every gradient by hand, layer by layer.

The cross-entropy case is the one worth writing out. Cross-entropy fuses three ops: softmax, pick the correct-class probability, −log. Differentiating directly:

For the correct class y:

p_y = exp(z_y) / S        where S = Σ exp(z_j)
dL/dz_y = p_y − 1

For any other class i ≠ y:

dL/dz_i = p_i

So dlogits = probs.clone(); dlogits[range(n), y] -= 1; dlogits /= n, where the final division comes from averaging the loss over the n examples in the batch. That’s it. The most common loss function in deep learning has a three-line gradient.
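A finite-difference check is a cheap way to confirm the closed form. The sketch below does it for a single example (so there is no division by n) with arbitrary logits:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss(z, y):
    return -math.log(softmax(z)[y])

def analytic_grad(z, y):
    g = softmax(z)          # dL/dz_i = p_i for every class ...
    g[y] -= 1.0             # ... except the correct one, which is p_y - 1
    return g

def numeric_grad(z, y, h=1e-6):
    g = []
    for i in range(len(z)):
        zp = list(z); zp[i] += h
        zm = list(z); zm[i] -= h
        g.append((loss(zp, y) - loss(zm, y)) / (2 * h))   # central difference
    return g

z, y = [1.5, -0.3, 0.8, 2.1], 2
ga, gn = analytic_grad(z, y), numeric_grad(z, y)
max_err = max(abs(a - b) for a, b in zip(ga, gn))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

The gradient entries sum to zero: pushing the correct logit up is paid for by pushing the others down in proportion to their probability.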

Once you’ve done this, autograd stops being a black box. PyTorch is registering a _backward closure on each op, exactly like micrograd, then walking the DAG in reverse and applying these closed-form rules.

7. WaveNet-style hierarchical fusion

With the context widened from 3 to 8 characters, the MLP smashes all 8 into one vector and runs a single Linear over it. Every character has to interact with every other character in one shot.

WaveNet processes pairs of adjacent characters, then pairs of pairs, then pairs of those:

[c1 c2  c3 c4  c5 c6  c7 c8]   8 chars, 10-dim each
  \_/    \_/    \_/    \_/
  [b1    b2    b3    b4]        4 bigram reps
    \___/        \___/
    [q1          q2]              2 four-gram reps
       \________/
           [o1]                    1 output → predict next char

Each fusion is the same operation: Linear((B, T/2, 2C) → (B, T/2, C)) + tanh. Local context builds up gradually.
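The shape bookkeeping can be traced without PyTorch. This sketch drops the batch dimension, uses random untrained weights, and gives each level its own Linear (an assumption on my part); it only demonstrates how 8 vectors collapse to 1 through three pair-and-project steps.

```python
import math
import random

random.seed(0)
C = 10   # embedding / hidden width, matching the diagram above

def make_linear(fan_in, fan_out):
    w = [[random.gauss(0, 1 / math.sqrt(fan_in)) for _ in range(fan_in)]
         for _ in range(fan_out)]
    b = [0.0] * fan_out
    return w, b

def fuse_level(seq, layer):
    """Pair adjacent vectors (T, C) -> (T/2, 2C), then Linear(2C -> C) + tanh."""
    w, b = layer
    out = []
    for i in range(0, len(seq), 2):
        pair = seq[i] + seq[i + 1]        # list concatenation gives the 2C-dim vector
        out.append([math.tanh(sum(wi * xi for wi, xi in zip(row, pair)) + bi)
                    for row, bi in zip(w, b)])
    return out

chars = [[random.gauss(0, 1) for _ in range(C)] for _ in range(8)]   # 8 char embeddings
layers = [make_linear(2 * C, C) for _ in range(3)]                   # one layer per level

x = chars
shapes = [(len(x), len(x[0]))]
for layer in layers:
    x = fuse_level(x, layer)
    shapes.append((len(x), len(x[0])))

print(shapes)
```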

Same dataset, same training loop. Val NLL: ~1.99.

Where val loss can keep dropping

Add	Expected drop
Longer context (12, 16 chars)	small, diminishing
More embedding dims	small
Multi-head self-attention	substantial — bigrams → attention is the biggest single step
More data	this dataset is tiny

Attention is what the tiny GPT post picks up.

Code: github.com/debtirthasaha/makemore-from-scratch.

micrograd: a scalar-valued autograd engine

2026-04-01T10:00:00+00:00

A scalar-valued autograd engine. Every value is a Python float wrapped in a Value object that knows what produced it. Operator overloads build a DAG implicitly. backward() topologically sorts the DAG and runs each node’s local gradient rule in reverse. About 150 lines total.

The `Value` class

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None

data: the forward scalar.
grad: filled in by backward, starts at 0.
_prev: the input nodes this value was computed from (the constructor argument calls them _children).
_op: string label (debugging only).
_backward: closure each operation sets; default no-op for leaves.

Operator overloads register local gradient rules

Addition:

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad  += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward
    return out

Three things at once: forward arithmetic, DAG construction ((self, other) recorded as the inputs of out), and the local rule (d(a+b)/da = 1, d(a+b)/db = 1).

Multiplication: same shape, different local rule.

def _backward():
    self.grad  += other.data * out.grad
    other.grad += self.data  * out.grad

tanh:

def tanh(self):
    t = (math.exp(2*self.data) - 1) / (math.exp(2*self.data) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward
    return out

** and exp are the same pattern.

Why `+=` and not `=`

If a node feeds into multiple downstream nodes, the chain rule sums contributions over all paths. += is exactly that sum.

This is why optimizer.zero_grad() exists in PyTorch. Gradients accumulate by design; you have to clear them between training steps or you accumulate gradients across batches.

Backward via topo-sort

def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    self.grad = 1.0
    for v in reversed(topo):
        v._backward()

Topo sort guarantees that when v._backward() runs, v.grad already has its final value — every downstream node has already pushed into it. Walking in reverse without the sort gives stale gradients.

The base case self.grad = 1.0 is the seed: the gradient of the final scalar with respect to itself is 1.
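Putting the pieces together, here is a trimmed Value with only + and *, assembled from the snippets above, run on a graph where one leaf is used twice. The example values are arbitrary. It shows the += accumulation and the reverse topological walk producing the right sums.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(3.0), Value(2.0)
y = a * b + a          # a feeds two nodes: dy/da = b + 1, dy/db = a
y.backward()

c = Value(5.0)
z = c + c              # the same node on both sides of one op: dz/dc = 2
z.backward()
```

With = in place of += inside the closures, a.grad would hold whichever contribution was written last instead of the sum over both paths.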

A 3-4-4-1 MLP, no PyTorch

A Neuron is a list of weight Values, a bias Value, and a tanh. A Layer is a list of Neurons. An MLP is a list of Layers.
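Since each Neuron holds one weight per input plus a bias, the parameter count follows directly from the layer sizes. The sizes below (3 inputs, two hidden layers of 4, 1 output) are an assumption on my part, chosen because they reproduce the 41-parameter figure quoted for this demo.

```python
def count_params(sizes):
    """sizes = [n_inputs, hidden..., n_outputs]; each neuron has fan_in weights + 1 bias."""
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

n_params = count_params([3, 4, 4, 1])   # (3+1)*4 + (4+1)*4 + (4+1)*1
print(n_params)
```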

Training loop:

for k in range(50):
    ypred = [n(x) for x in xs]
    loss  = sum((yp - y)**2 for yp, y in zip(ypred, ys))

    for p in n.parameters():
        p.grad = 0.0
    loss.backward()

    for p in n.parameters():
        p.data -= 0.05 * p.grad

(yp - y)**2, sum, every neuron’s tanh — all of it constructs nodes in the same Value DAG. loss.backward() walks the whole thing.

41 parameters. The actual training trajectory on the four-point demo, every step:

[Plot: Training loss, 41-parameter MLP on 4 points. x-axis: step (0 to 49); y-axis: sum-of-squares loss (log scale), falling from 3.764 at step 0 to 0.0148 at step 49.]

3.76 → 0.015 in 50 steps. Steep drop for the first ~10 steps as the network finds the rough direction, then a slow log-linear decline as it refines. Every gradient was computed by my own code.

Why scalar-valued

PyTorch operates on tensors and broadcasts the same chain rule across millions of elements per op. The math is identical; tensor ops just batch it. Once the scalar engine works, the upgrade to a tensor engine is engineering, not concept.

Code: github.com/debtirthasaha/micrograd-from-scratch.

A transformer that reads C++ and writes Python

2026-03-01T10:00:00+00:00

Encoder-decoder transformer that takes C++ source and emits the equivalent Python. Trained on XLCoST (a corpus of competitive-programming problems with parallel solutions in seven languages). 16.4M parameters, sized to fit a GTX 1650 4 GB. Best checkpoint at epoch 19, val_loss 2.0474.

Most “transformer from scratch” implementations are English → French. This is C++ → Python, which makes the input distribution different (structured code, lots of repeated tokens, hard syntax) and exposes a tokenization problem the dataset’s stock format papers over.

Problem setup

XLCoST ships parallel source files. Pairs look like:

// C++
int binary_search(vector<int>& a, int x) {
    int lo = 0, hi = a.size() - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (a[mid] == x) return mid;
        if (a[mid] < x) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

def binary_search(a, x):
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == x: return mid
        if a[mid] < x: lo = mid + 1
        else: hi = mid - 1
    return -1

An encoder-decoder transformer is the right shape: the encoder reads the entire C++ source bidirectionally, the decoder generates Python autoregressively with cross-attention into the encoder’s output. This is the original 2017 architecture.

Hyperparameters, sized to 4 GB

d_model         = 256
N (layers)      = 4
h (heads)       = 8
d_ff            = 512
dropout         = 0.1
label_smoothing = 0.1
max_seq_len     = 350
batch_size      = 8   (with dynamic padding)

16.4M parameters. The constraints are real: with d_model = 512 and full padding to 350 tokens, the VRAM math doesn’t close. Every knob got pulled until the model both fit and trained.

VRAM math, roughly: B × T × T × h × 4 bytes for attention scores per layer. At B=8, T=350, h=8: 8 × 350² × 8 × 4 = ~31 MB per layer, and you need this for forward and backward. Across 8 layers (4 encoder + 4 decoder) and other activations it adds up fast. Dynamic padding (pad each batch to its longest sequence rather than to max_seq_len) is what made training viable.
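The same arithmetic as a function, to show what dynamic padding buys. The T=120 case is a hypothetical batch whose longest sequence is 120 tokens; the other numbers are the ones above.

```python
def attn_score_bytes(batch, seq_len, heads, bytes_per_elem=4):
    """Memory for one layer's attention score tensor of shape (B, h, T, T) in fp32."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

full_pad = attn_score_bytes(batch=8, seq_len=350, heads=8)   # padded to max_seq_len
dyn_pad = attn_score_bytes(batch=8, seq_len=120, heads=8)    # hypothetical shorter batch

print(f"T=350: {full_pad / 1e6:.1f} MB per layer")
print(f"T=120: {dyn_pad / 1e6:.1f} MB per layer ({full_pad / dyn_pad:.1f}x smaller)")
```

Because the score tensor grows with T², halving the padded length cuts this term by four, which is why padding to the batch maximum matters more than trimming the batch size.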

The dataset problem

XLCoST is pre-tokenized. The files look like:

int NEW_LINE binary_search ( vector < int > & a , int x ) { NEW_LINE INDENT ...

NEW_LINE, INDENT, DEDENT are XLCoST’s whitespace-preserving tokens. Splitting on whitespace gives you tokens directly — no tokenizer required.

This is fine for training and evaluation on XLCoST. It is not fine for inference on real code, because real code doesn’t ship with NEW_LINE tokens.

So a second tokenizer had to be built: a raw C++ tokenizer that takes ordinary source and produces the XLCoST tokenization. It handles:

Comments (//, /* */) — stripped before tokenization
String literals — preserved as single tokens (don’t split inside quoted strings)
Multi-char operators (<<, >>, ==, !=, <=, >=, &&, ||, ++, --, ->, ::, +=, etc.) — match greedy
Numbers, identifiers — match maximally
Whitespace → NEW_LINE, INDENT, DEDENT based on column position

The inference path is raw C++ → my tokenizer → XLCoST tokens → model → Python tokens → join.
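To make the first four rules concrete, here is a rough single-regex sketch. It is not the tokenizer in the repo: tokenize_cpp and the operator list are my own, INDENT/DEDENT tracking and character literals are left out, and NEW_LINE is emitted per source line only.

```python
import re

# Longest operators first so that '<<=' wins over '<<' and '<'.
OPERATORS = ['<<=', '>>=', '<<', '>>', '==', '!=', '<=', '>=', '&&', '||',
             '++', '--', '->', '::', '+=', '-=', '*=', '/=']

TOKEN_RE = re.compile(
    r'(?P<comment>//[^\n]*|/\*.*?\*/)'          # comments: matched, then dropped
    r'|(?P<newline>\n)'
    r'|(?P<string>"(?:\\.|[^"\\\n])*")'         # string literal kept as one token
    r'|(?P<number>\d+\.\d+|\d+)'
    r'|(?P<ident>[A-Za-z_]\w*)'
    r'|(?P<op>' + '|'.join(re.escape(op) for op in OPERATORS) + r')'
    r'|(?P<other>\S)',                          # any remaining single character
    re.DOTALL,
)

def tokenize_cpp(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == 'comment':
            continue
        if kind == 'newline':
            # collapse blank lines: at most one NEW_LINE in a row, none at the start
            if tokens and tokens[-1] != 'NEW_LINE':
                tokens.append('NEW_LINE')
            continue
        tokens.append(m.group(0))
    return tokens

tokens = tokenize_cpp('int x = a << 2; // shift\ncout << "a // b" << endl;')
print(' '.join(tokens))
```

Handling comments and strings in the same pass avoids a classic bug: stripping comments first with a separate regex would cut the "//" inside the string literal.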

Vocabulary coverage and UNKs

Vocab is built from the training set with min_freq=2: any token appearing fewer than 2 times is replaced with an UNK (unknown) token. Final vocab is ~12K source tokens, ~10K target tokens.

This means common things work and uncommon things fail. binary_search is in vocab. Hello, World!\n is not: the string literal "Hello, World!\n" is a single rare token, gets mapped to UNK, and the model has no signal to translate it. You can confirm this by tokenizing cout << "Hello, World!" << endl; and watching the string vanish into an UNK.
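A minimal sketch of the min_freq cutoff on a two-line toy corpus. The "[UNK]" spelling, the helper names, and the index layout are placeholders for illustration; the project's actual vocabulary code may differ.

```python
from collections import Counter

UNK = "[UNK]"   # placeholder spelling for the unknown token

def build_vocab(token_lists, min_freq=2):
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate([UNK] + kept)}

def encode(tokens, vocab):
    # anything below the frequency cutoff falls back to the UNK id
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

train = [
    ['cout', '<<', 'x', '<<', 'endl', ';'],
    ['cout', '<<', '"Hello, World!"', '<<', 'endl', ';'],
]
vocab = build_vocab(train)
ids = encode(['cout', '<<', '"Hello, World!"', ';'], vocab)
print(vocab, ids)
```

The string literal appears once, misses the cutoff, and encodes to the same id as every other rare token, so the decoder has nothing to copy from.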

For competitive-programming-style code (loops, arrays, recursion, math) coverage is good and translation is fluent. For anything string-heavy it falls apart.

The architecture, 12 components

All twelve are in model.py. Quick map:

InputEmbeddings — nn.Embedding(vocab_size, d_model), output scaled by sqrt(d_model)
PositionalEncoding — sinusoidal, fixed (not learned)
LayerNormalization — manual implementation with learnable γ, β
FeedForwardBlock — Linear(d_model, d_ff) → ReLU → Dropout → Linear(d_ff, d_model)
MultiHeadAttentionBlock — Q/K/V projections, scaled-dot-product, output projection. Stores attention_scores as a buffer for later visualization.
ResidualConnection — x + dropout(sublayer(norm(x))) (pre-norm)
EncoderBlock — self-attention + FFN, each wrapped in residual
Encoder — stack of N encoder blocks + final LayerNorm
DecoderBlock — masked self-attention + cross-attention + FFN
Decoder — stack of N decoder blocks + final LayerNorm
ProjectionLayer — Linear(d_model, vocab_size), no softmax (cross-entropy applies it internally)
Transformer — encoder + decoder + source/target embeddings + source/target positional + projection

Pre-norm everywhere. Output of the projection layer is logits, not log-softmax — nn.CrossEntropyLoss expects logits and applies log-softmax internally for numerical stability.

The cross-attention in the decoder is where the two halves meet: decoder Q comes from the decoder’s own residual stream, but K and V come from the encoder’s final output. Each decoder position queries the encoded source to decide what to translate next.
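The Q-from-decoder, K/V-from-encoder wiring, reduced to a single head in pure Python. This sketch skips the learned Q/K/V projections and uses tiny hand-picked vectors, so it only illustrates the data flow and the (T_tgt, T_src) shape of the weights.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(dec_q, enc_k, enc_v):
    """Single-head scaled dot-product attention.
    dec_q: (T_tgt, d) queries from the decoder stream
    enc_k, enc_v: (T_src, d) keys and values from the encoder output
    """
    d = len(dec_q[0])
    weights = []
    for q in dec_q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in enc_k]
        weights.append(softmax(scores))
    out = [[sum(w * v[j] for w, v in zip(row, enc_v)) for j in range(len(enc_v[0]))]
           for row in weights]
    return out, weights

# Toy tensors: 2 decoder positions attending over 3 encoder positions, d = 2.
dec_q = [[1.0, 0.0], [0.0, 1.0]]
enc_k = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
enc_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = cross_attention(dec_q, enc_k, enc_v)
```

There is no causal mask here: every decoder position may look at every source position, since the whole C++ input is available before decoding starts.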

Training: warmup + label smoothing

optimizer = Adam(params, lr=1e-4, betas=(0.9, 0.98), eps=1e-9)
# LambdaLR scales the base lr by this factor; max(step, 1) avoids 0 ** -0.5 at step 0
scheduler = LambdaLR(optimizer,
                     lambda step: d_model ** -0.5 * min(max(step, 1) ** -0.5, max(step, 1) * warmup_steps ** -1.5))
criterion = nn.CrossEntropyLoss(ignore_index=PAD, label_smoothing=0.1)

The LR schedule is from the original transformer paper: linear warmup for warmup_steps, then 1/sqrt(step) decay. ignore_index=PAD masks padding tokens from the loss. label_smoothing=0.1 moves 10% of the target probability mass off the one-hot label and spreads it uniformly across the vocabulary, which softens the optimization target and regularizes.
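The shape of the schedule is easiest to see by evaluating the multiplier at a few steps. warmup_steps=4000 below is the paper's default, used here as an assumption since the post does not state the value; d_model=256 matches the hyperparameters above.

```python
def noam_factor(step, d_model=256, warmup_steps=4000):
    """LR multiplier from the original transformer paper; step is clamped to 1."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = noam_factor(4000)
factors = {s: noam_factor(s) for s in (1, 1000, 4000, 16000)}
for s, f in factors.items():
    print(f"step {s:>6}: factor {f:.6f}")
```

The two branches of the min meet exactly at warmup_steps, which is where the factor peaks. Since LambdaLR multiplies the optimizer's base lr by this factor, the effective learning rate is the product of the two.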

Greedy decoding for inference. No beam search.

Results

20 epochs. Train and val loss at selected epochs:

Epoch	Train	Val
13	1.9109	2.0615
15	1.8708	2.0545
16	1.8542	2.0511
19	1.8103	2.0474 ← best
20	1.7964	2.0576 ← train still dropping, val rising

Best checkpoint at epoch 19. Train kept dropping past that point but val stopped — classic overfitting signature. The save-best-by-val-loss logic kept epoch 19 as best_model.pt.
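The save-best logic amounts to a running minimum over val loss. A sketch replaying the table's numbers; the dictionary and variable names are mine, and the actual checkpoint write is left as a comment.

```python
# (train_loss, val_loss) per epoch, copied from the table above
history = {13: (1.9109, 2.0615), 15: (1.8708, 2.0545), 16: (1.8542, 2.0511),
           19: (1.8103, 2.0474), 20: (1.7964, 2.0576)}

best_val, best_epoch = float('inf'), None
for epoch in sorted(history):
    train_loss, val_loss = history[epoch]
    if val_loss < best_val:          # only overwrite the checkpoint on improvement
        best_val, best_epoch = val_loss, epoch
        # a real loop would save the model weights to best_model.pt here

print(best_epoch, best_val)
```

Epoch 20 has the lowest train loss in the table but never triggers a save, because its val loss is above the epoch 19 minimum.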

Sample translation:

int binary_search(vector<int>& a, int x) {
    int lo = 0, hi = a.size() - 1;
    ...
}

def binary_search(a, x):
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        ...

Hello-world-style code with rare string literals fails: cout << "Hello" produces print(). Loops, math, recursion, array indexing all translate cleanly.

What the model actually learned

Inspired by Anthropic’s circuits work, I loaded the checkpoint and probed two things: the token embedding matrices on each side, and the attention pattern of the last encoder layer.

Embedding nearest neighbors

For a seed token, find the closest tokens in embedding space by cosine similarity. Both the source-side (C++) and target-side (Python) embedding tables show clean semantic structure even though no one taught the model what these tokens mean.
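The probe itself is a few lines. The sketch below runs it on made-up 3-d vectors; in the real probe each row would come from the trained embedding table, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(seed, table, k=3):
    sims = [(tok, cosine(table[seed], vec)) for tok, vec in table.items() if tok != seed]
    return sorted(sims, key=lambda t: t[1], reverse=True)[:k]

# Made-up embeddings purely for illustration.
table = {
    'if':    [0.9, 0.1, 0.0],
    'elif':  [0.8, 0.2, 0.1],
    'while': [0.5, 0.5, 0.1],
    'print': [0.0, 0.1, 0.9],
}
neighbours = nearest('if', table)
for tok, sim in neighbours:
    print(f"{tok:6s} {sim:.2f}")
```

Cosine similarity ignores vector length, so frequent tokens with large-norm embeddings do not dominate the ranking the way they would under a raw dot product.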

C++ side (top 6 per row):

int    -> endl, ll, ;, [EOS], long, <<
for    -> while, memset, getline, case, faces, sortRowWise
if     -> ==, 127, while, case, break, fast
vector -> calloc, begin, multiset, sizeof, NthPostordernode, word_size
<      -> >, ::, %, <=, &, #
==     -> case, 127, <=, if, checkAbundant, !=
true   -> Magic, False, ||, True, slope3, npos
string -> char, "4", chanceA, 122, modifyString, findNumberOfLIS

Python side (top 6 per row):

def    -> class, divTermCount, for, in, NEW_LINE, Euler
if     -> elif(0.70), while, or, and, isPower2, checkPerfectcube
for    -> while, in, def, [, range, within
range  -> in, while, sqrt, ord, int, xrange
+      -> +=, -, >>=, -=, <<=, >
==     -> !=(0.67), >=, >, <=, <, than
print  -> return, PrintList, format, Squares, cout, round
<      -> >(0.65), >=, <=, ==, <, ->
True   -> False, 82, 3.14159265, 0.25, 4.5, None
str    -> acos, log2, trailingZero, int, string, singlePrimeFactor

A few clusters that aren’t accidents:

Comparison operators. On the Python side, < is nearest to > (0.65), then >= (0.63), <= (0.63), == (0.53), < (0.51), -> — a tight cluster of every binary comparison the model has seen.
Boolean values. True finds False and None. true (on the C++ side) finds True, False, and ||. The model puts truth values close together regardless of casing or language.
Control flow. if on the Python side has elif as its nearest neighbor at cosine 0.70 — by a clear margin. for is nearest to while. def is nearest to class.
Cross-language synonyms. print (Python) has cout in its top-6. The model learned that the C++ side’s cout and the Python side’s print play structurally similar roles, even though they live in different vocabularies and different embedding tables.
C++ integer family. int has ll (the typedef long long ll shorthand competitive programmers use) and long in its top-6. The model picked up that these are interchangeable integer types.

None of this is taught explicitly. The supervision signal is a cross-entropy loss on next-token prediction in a sequence-to-sequence setup. Semantically related tokens end up close together because the loss is lower when interchangeable tokens have similar representations.

Encoder attention on `int sum = a + b ; NEW_LINE return sum ;`

Pulling out head 0 of the last encoder layer’s self-attention on one short example. Rows = query positions, columns = key positions.

[Heatmap: Encoder self-attention, head 0, last layer. x-axis: key (attended to); y-axis: query (attending); color: attention weight.]

In text form:

              int   sum    =     a     +     b     ;   NEW_  ret   sum    ;   [EOS]
   int    [ 0.02  0.00  0.04  0.00  0.00  0.00  0.16  0.57  0.02  0.00  0.14  0.07 ]
   sum    [ 0.00  0.00  0.02  0.00  0.00  0.00  0.09  0.76  0.01  0.00  0.07  0.05 ]
     =    [ 0.01  0.00  0.06  0.00  0.00  0.00  0.11  0.64  0.03  0.00  0.12  0.02 ]
     a    [ 0.01  0.00  0.05  0.00  0.00  0.00  0.16  0.57  0.02  0.00  0.16  0.02 ]
     +    [ 0.01  0.00  0.03  0.00  0.00  0.00  0.10  0.75  0.01  0.00  0.08  0.03 ]
     b    [ 0.01  0.00  0.06  0.00  0.00  0.00  0.18  0.51  0.03  0.00  0.18  0.03 ]
     ;    [ 0.03  0.00  0.06  0.00  0.00  0.00  0.19  0.41  0.05  0.00  0.16  0.09 ]
NEW_LI    [ 0.07  0.00  0.08  0.00  0.01  0.00  0.22  0.26  0.11  0.00  0.18  0.08 ]
return    [ 0.02  0.00  0.04  0.00  0.00  0.00  0.17  0.52  0.03  0.00  0.15  0.08 ]
   sum    [ 0.01  0.00  0.02  0.00  0.00  0.00  0.09  0.75  0.01  0.00  0.08  0.05 ]
     ;    [ 0.03  0.00  0.06  0.00  0.00  0.00  0.18  0.46  0.04  0.00  0.15  0.08 ]
 [EOS]    [ 0.05  0.00  0.05  0.00  0.00  0.00  0.23  0.31  0.06  0.00  0.18  0.11 ]

The argmax-key per query position:

q[ 0] int       -> k[ 7] NEW_LINE  (0.57)
q[ 1] sum       -> k[ 7] NEW_LINE  (0.76)
q[ 2] =         -> k[ 7] NEW_LINE  (0.64)
q[ 3] a         -> k[ 7] NEW_LINE  (0.57)
q[ 4] +         -> k[ 7] NEW_LINE  (0.75)
q[ 5] b         -> k[ 7] NEW_LINE  (0.51)
q[ 6] ;         -> k[ 7] NEW_LINE  (0.41)
q[ 7] NEW_LINE  -> k[ 7] NEW_LINE  (0.26)
q[ 8] return    -> k[ 7] NEW_LINE  (0.52)
q[ 9] sum       -> k[ 7] NEW_LINE  (0.75)
q[10] ;         -> k[ 7] NEW_LINE  (0.46)
q[11] [EOS]     -> k[ 7] NEW_LINE  (0.31)

Every single position is attending most heavily to position 7, the NEW_LINE statement boundary. The mass on that one column ranges from 0.26 (the boundary attending to itself) to 0.76 (the first sum attending to the boundary). The two ; columns pick up most of the remainder (0.07 to 0.23 each), and every other column stays at or below 0.11.

This head has specialized into something like a statement-end aggregator: route information from anywhere in the current statement to the boundary marker that closes it. Cross-attention from the decoder then has a privileged column at NEW_LINE that has gathered everything about the C++ statement, and the decoder can read from it to emit the Python equivalent. XLCoST’s pre-tokenization emits these explicit boundary tokens to preserve line structure; a side effect is that the model has somewhere to put statement-level information, and head 0 of the last encoder layer is using them for that.

Other heads in the same layer attend differently (some local, some diagonal, some on operators) — the specialization isn’t uniform. But this one’s job is clear, and is the kind of mechanistic finding that motivates the circuits-style probing in the first place.

What would close the gap

Subword tokenization (BPE on the raw source) instead of per-word vocab + UNKs. The whole “rare string literal” failure goes away.
Bigger model if you have the VRAM. d_model=512 and 6 layers is the standard small-transformer scale, but doesn’t fit at T=350 on 4 GB.
Beam search at decode time. Greedy is fine for code, but a small beam (say 4) typically picks better completions for long sequences.

Code: github.com/debtirthasaha/cpp-to-python-transformer. The 16 numbered tests in test_step*.py build up each of the 12 components in isolation before the full model is assembled. Trained checkpoint (189 MB) is on Hugging Face at MR0b0t/cpp-to-python-transformer.
