BPE tokenizer

Pure-Python byte-pair encoding (BPE), plus a deep dive on why tokenization makes LLMs weird (glitch tokens like SolidGoldMagikarp, spelling, arithmetic).
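
For orientation, here is a minimal sketch of byte-level BPE training and decoding in plain Python. It is an illustration under assumptions, not necessarily how this project's code is organized: the function names (`get_pair_counts`, `merge`, `train_bpe`, `decode`) are hypothetical, new token ids are assumed to start at 256 (right after the raw byte values), and there is no pre-tokenization regex or special-token handling.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Returns the encoded training text and the merge table
    {(left_id, right_id): new_id}, in the order the merges were learned.
    """
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step  # assumption: new ids follow the 256 byte values
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Map token ids back to text by expanding merges into their bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    # dicts keep insertion order, so each merge only refers to ids already defined
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    ids, merges = train_bpe("aaabdaaabac", num_merges=3)
    print(ids, merges)
    print(decode(ids, merges))
```

Every merge rewrites the most frequent adjacent pair as a single new token, so common substrings collapse into one id while rare strings stay split into many small pieces. That unevenness is a plausible starting point for the spelling and arithmetic oddities mentioned above.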