BPE tokenizer
Pure-Python byte-pair encoding, plus a deep dive on why tokenization makes LLMs weird (SolidGoldMagikarp, spelling, arithmetic).
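
For orientation, here is a minimal sketch of byte-level BPE in pure Python. It is not necessarily how this repo implements it: the function names (`train_bpe`, `merge`, `encode`, `decode`) are hypothetical, and it assumes training starts from raw UTF-8 bytes and stops as soon as no pair occurs at least twice.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns a list of ((left, right), new_id)."""
    ids = list(text.encode("utf-8"))  # start from raw byte values 0..255
    merges = []
    next_id = 256  # first id not taken by a raw byte
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # assumption: a pair seen once is not worth merging
            break
        ids = merge(ids, pair, next_id)
        merges.append((pair, next_id))
        next_id += 1
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each merge into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

With `merges = train_bpe("aaabdaaabac", 3)`, `encode("aaabdaaabac", merges)` compresses the 11 input bytes to 5 tokens, and `decode` restores the original string. Because a token can span several characters, the model never sees individual letters or digits directly, which is the root of the spelling and arithmetic quirks mentioned above.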