Tiny Shakespeare, tiny GPT
Same architecture as GPT-2, scaled to fit a 4 GB GPU and trained on 1 MB of Shakespeare. Built one mechanism at a time, measuring val loss after each addition.
| After adding | Val loss |
|---|---|
| Bigram baseline | 2.88 |
| + single-head self-attention | 2.41 |
| + multi-head | 2.32 |
| + feed-forward MLP | 2.23 |
| + residual + LayerNorm (3 blocks) | 2.09 |
| Scaled up (4 layers, 192-d, 6 heads) | 1.59 |
[Figure: bar chart, "Validation loss after each architectural addition". x-axis: the six stages from the table above; y-axis: val loss.]
1.83M parameters, val loss 1.59. Output has speaker tags, sentence rhythm, and (mostly) closed quotes.
Data, in one tensor
Dataset is input.txt: a concatenated selection of Shakespeare’s plays, 1,115,394 characters, 65 unique characters including \n.
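import torch

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Character-level vocabulary: one integer id per unique character.
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]           # string -> list of ids
decode = lambda l: ''.join([itos[i] for i in l])  # list of ids -> string

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))  # 90/10 train/val split
train_data = data[:n]  # 1,003,854 tokens
val_data = data[n:]    # 111,540 tokens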
That’s the tokenizer. (A real BPE tokenizer comes in the next post.)
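A quick round-trip check makes the mapping concrete. This is a minimal sketch in plain Python; the short `text` string is a stand-in for the full input.txt (an assumption made so the snippet runs on its own), and `round_trip` is a name introduced here for illustration.

```python
# Stand-in corpus (assumption): the post builds the same tables from input.txt.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Map each character to its integer id."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hear me")
round_trip = decode(ids)
print(ids, repr(round_trip))
```

One id per character, and decoding inverts encoding exactly. A character that never appears in the corpus has no entry in stoi, so encoding it raises KeyError.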
Sampling chunks, not whole documents
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)
y is x offset by one position: y[t] = x[t+1]. Each position in the chunk is one training example: the model predicts the token at position t+1 given positions 0..t. A chunk of length 8 produces 8 examples in parallel. Stacking batch_size chunks adds independent training signal; sequences in a batch don’t communicate.
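To see the 8 examples hiding in one chunk, the following sketch unrolls a single (x, y) pair into its context/target pairs. It uses a plain Python list of made-up token ids instead of the real tensor, and a fixed start index `i` instead of a random one; both are assumptions for illustration.

```python
block_size = 8
data = list(range(100, 120))  # stand-in token ids (assumption, not real data)
i = 3                         # fixed start index; get_batch draws this at random

x = data[i : i + block_size]          # inputs
y = data[i + 1 : i + block_size + 1]  # targets: the same window, one step later

# Position t sees x[0..t] and must predict y[t].
pairs = [(x[: t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"context {context} -> target {target}")
```

The first pair has a one-token context and the last uses the full chunk, so the model is trained on every context length from 1 up to block_size.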
The mathematical trick
Before self-attention there’s one observation worth dwelling on: how do you let each position aggregate information from all previous positions in parallel?
Goal: xbow[b, t] = mean(x[b, 0..t]). The naive version is a double loop over b and t. The fast version is a matrix multiply.
Build a lower-triangular matrix of equal weights:
wei = [[1.00, 0.00, 0.00, 0.00],
       [0.50, 0.50, 0.00, 0.00],
       [0.33, 0.33, 0.33, 0.00],
       [0.25, 0.25, 0.25, 0.25]]
Now wei @ x produces, for each row of wei, a weighted sum over the value vectors. Row t only mixes positions 0..t. Same answer as the loop, one matmul.
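The equivalence can be checked by hand on a tiny input. This is a sketch in plain Python lists for a single batch element (the B dimension is dropped, and the values in `x` are made up), comparing the prefix-mean loop against the triangular matmul.

```python
T, C = 4, 2
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # (T, C), made-up values

# Naive version: for each position t, average rows 0..t.
xbow_loop = []
for t in range(T):
    prefix = x[: t + 1]
    xbow_loop.append([sum(row[c] for row in prefix) / (t + 1) for c in range(C)])

# Lower-triangular weights: row t puts 1/(t+1) on positions 0..t, 0 elsewhere.
wei = [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

# wei @ x written out as explicit sums.
xbow_matmul = [
    [sum(wei[t][s] * x[s][c] for s in range(T)) for c in range(C)]
    for t in range(T)
]

print(xbow_loop)
print(xbow_matmul)
```

Both print the same running averages up to floating-point rounding. In PyTorch the second version is a single wei @ x, which is why it parallelizes across positions.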
Self-attention will turn wei from “equal weights” into “weights computed from the data.” The matrix-multiply-with-causal-mask structure stays.
Self-attention, single head
Three linear projections from the residual stream to a smaller head_size:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Causal mask: a constant buffer, not a learnable parameter.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Scaled dot-product scores between every query and every key.
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T)
        # Hide future positions, then normalize each row to sum to 1.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)
Three details worth pointing at:
- 1/sqrt(head_size) scaling. Without it, large head_size produces dot products with large variance. Softmax collapses to near-one-hot and gradients stop flowing. Scaling holds the softmax diffuse at init.
- register_buffer for tril. The mask is a constant; we don’t want it tracked as a learnable parameter, but we do want it to move to GPU when model.to(device) is called.
- Mask is −inf on upper triangle. After softmax, exp(−inf) = 0. Future positions get exactly zero weight.
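The scaling and masking details can be illustrated with a hand-rolled softmax. This is a sketch in plain Python, not the model's code: the four `raw` scores are hypothetical, chosen to be on the order of sqrt(head_size), which is the typical spread of a dot product between unit-variance vectors of that length, and the `softmax` helper is written here for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


head_size = 64
raw = [8.0, -4.0, 12.0, 0.0]                   # hypothetical unscaled q.k scores
scaled = [v * head_size ** -0.5 for v in raw]  # [1.0, -0.5, 1.5, 0.0]

p_raw = softmax(raw)        # nearly one-hot on the largest score
p_scaled = softmax(scaled)  # noticeably more diffuse

# Causal mask for the query at position 1: positions 2 and 3 are in the future.
masked = [v if s <= 1 else float("-inf") for s, v in enumerate(scaled)]
p_masked = softmax(masked)

print([round(p, 3) for p in p_raw])
print([round(p, 3) for p in p_scaled])
print([round(p, 3) for p in p_masked])
```

Without scaling, almost all of the weight lands on one key. After dividing by sqrt(head_size) the weights spread out, and the masked row gives the future positions a weight of exactly 0.0 while the remaining entries still sum to 1.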
Single head added on top of the bigram baseline drops val loss 2.88 → 2.41.
Multi-head attention
Run n_head heads in parallel, concatenate, then project back to n_embd:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)
head_size = n_embd // n_head, so concatenation gets you back to n_embd. The final proj is what lets heads interact — concatenation alone just glues outputs together.
nn.ModuleList instead of a plain Python list: PyTorch needs to know about these submodules to register their parameters. A plain list is invisible to model.parameters().
Val loss: 2.32.
Feed-forward: per-token computation
After attention mixes information across positions, the FFN lets each position think about what it gathered. Same MLP applied to every token independently:
self.net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)
4× expansion is from the original Transformer paper. Val loss: 2.23.
A useful mental model from this point on: attention is communication, feed-forward is computation. Tokens talk to each other in attention, then each one updates its own representation in the FFN.
The block: pre-norm + residual
class Block(nn.Module):
    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
Three things going on:
- Residual x + ...: the addition creates a gradient highway. Gradients flow back through + undisturbed to earlier layers. Without it, gradients in deep stacks tend to vanish.
- LayerNorm before the sub-block (pre-norm): the original 2017 paper put LN after; modern practice puts it before. Pre-norm trains more stably at depth.
- Sub-blocks start near-zero in output: at init, attention and FFN contribute tiny perturbations to the residual stream. Useful signal accumulates gradually.
Val loss after stacking 3 such blocks: 2.09.
Scaling up
batch_size = 64
block_size = 128
n_embd = 192
n_head = 6 # head_size = 192/6 = 32
n_layer = 4
dropout = 0.2
1.83M parameters. Val loss 1.59. The drop from 2.09 → 1.59 is mostly capacity — same architecture, just more of it.
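The 1.83M figure can be reproduced with a back-of-the-envelope count. The sketch below makes three assumptions that are not shown in the snippets above: a learned position-embedding table of shape block_size × n_embd, a final LayerNorm before the output layer, and an lm_head with a bias term. The helper names `linear` and `layernorm` are introduced here for illustration.

```python
vocab_size = 65
block_size = 128
n_embd = 192
n_head = 6
n_layer = 4
head_size = n_embd // n_head  # 32


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layernorm(n):
    """Parameter count of LayerNorm: one scale and one shift per feature."""
    return 2 * n


# Per head: key, query, value projections without bias, then one output proj.
attention = n_head * 3 * linear(n_embd, head_size, bias=False) + linear(n_embd, n_embd)
ffn = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = attention + ffn + 2 * layernorm(n_embd)

total = (
    vocab_size * n_embd           # token embedding table
    + block_size * n_embd         # position embedding table (assumed)
    + n_layer * block             # transformer blocks
    + layernorm(n_embd)           # final LayerNorm (assumed)
    + linear(n_embd, vocab_size)  # lm_head with bias (assumed)
)
print(f"{total:,} parameters")
```

Under those assumptions the total is 1,827,137, which rounds to the 1.83M quoted above. The blocks account for about 97% of it; the embeddings and output layer are small because the vocabulary has only 65 entries. The tril mask is a buffer, so it does not count.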
Sample output:
DUKE VINCENTIO:
Whither dost thou pursue, and what shall be done
To these things that he was forc'd to make?
ANGELO:
My lord, I will entreat your grace's hand.
The model is making it up character by character. There is no concept of a word in its vocabulary. It learned word boundaries, speaker tags, and sentence structure from 1 MB of text.
What’s missing vs GPT-2
This is GPT-2’s architecture at small scale. The pieces not in this build:
- A real tokenizer (we used per-character; GPT-2 uses BPE, covered in the next post).
- Weight tying between wte and lm_head (the input embedding and output classifier share the same matrix in GPT-2: lm_head.weight = wte.weight).
- Initialization variants (GPT-2 scales the output projection of each residual layer by 1/sqrt(2*n_layer) to control variance through depth).
- A proper optimizer recipe (cosine LR schedule, weight decay split, warmup) and DDP for multi-GPU.
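The 1/sqrt(2*n_layer) factor has a simple motivation: each block adds to the residual stream twice (once from attention, once from the FFN), so there are 2*n_layer additions in total. The arithmetic below is a sketch under an idealized assumption, namely that each addition contributes an independent term of equal variance; real activations are not that tidy.

```python
import math

n_layer = 4
n_residual_adds = 2 * n_layer  # attention + FFN in every block

scale = 1 / math.sqrt(n_residual_adds)

# Idealized: variances of independent terms add.
unit_var = 1.0
var_unscaled = n_residual_adds * unit_var           # grows with depth
var_scaled = n_residual_adds * unit_var * scale**2  # held at the unit level

print(f"scale = {scale:.4f}")
print(f"variance without scaling: {var_unscaled}, with scaling: {var_scaled:.4f}")
```

For this 4-layer model the factor is about 0.354. Without it the accumulated variance would grow linearly in the number of layers; with it the sum stays at the scale of a single contribution.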
All of those show up in the GPT-2 reproduction.