- Tiktokenizer is a handy website for seeing how different models tokenize strings.
- Tokenizers
- Character tokenization
- inefficient use of the vocab, since every character is allotted its own id; rare characters waste vocabulary slots (see the snippet below)
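A quick sketch of the vocabulary problem (my own illustration, not from the notes): Python's ord gives each character its Unicode code point, so a character-level vocab must cover the entire Unicode range even though most code points are rare.

text = "hello 🌍"
ids = [ord(c) for c in text]  # one id per character
print(ids)  # [104, 101, 108, 108, 111, 32, 127757]
# The emoji alone needs id 127757, so the vocab has to span all of Unicode.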
- Byte-based tokenizer
- Unicode strings can be represented as a sequence of bytes (e.g., via UTF-8 encoding).
- small vocab of only 256 ids, since every character can be represented as a sequence of bytes
- results in long sequences; attention is quadratic in sequence length, so efficiency is poor (see the snippet below)
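A minimal byte-level sketch (my own illustration): the vocab is fixed at 256, but non-ASCII characters expand to several bytes, so token sequences grow long.

text = "hello 🌍"
tokens = list(text.encode("utf-8"))  # each byte is a token id in [0, 255]
print(tokens)  # [104, 101, 108, 108, 111, 32, 240, 159, 140, 141]
# 7 characters become 10 tokens; the emoji alone costs 4 bytes.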
- word tokenizer
- split the text into a sequence of word strings (e.g., with a regex) and assign each word an index in the vocab
- no fixed vocab size; the number of distinct words is unbounded
- new words at inference time map to an UNK token (see the sketch below)
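A minimal word-tokenizer sketch (my own illustration with a naive whitespace split and a hypothetical <UNK> token; real word tokenizers use more careful regex-based splitting):

def build_vocab(corpus: str) -> dict[str, int]:
    # Assign an index to every distinct word in the training text.
    vocab = {"<UNK>": 0}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

vocab = build_vocab("the cat sat on the mat")
tokens = [vocab.get(w, vocab["<UNK>"]) for w in "the dog sat".split()]
print(tokens)  # [1, 0, 3] -- "dog" was never seen, so it maps to <UNK>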
- Byte-pair encoding (BPE)
- basic idea: train the tokenizer on raw text to automatically determine the vocabulary
- intuition: common sequences of characters are represented by a single token, rare sequences are represented by many tokens
- algorithm: start with each byte as a token, and successively merge the most common pair of adjacent tokens
BPE Algorithm
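The training function below relies on a BPETokenizerParams container and a merge helper that are not shown in this section; a minimal sketch of both, assuming merge simply replaces every adjacent occurrence of a pair with the new index:

from dataclasses import dataclass

@dataclass(frozen=True)
class BPETokenizerParams:
    vocab: dict[int, bytes]             # index -> bytes it decodes to
    merges: dict[tuple[int, int], int]  # (index1, index2) -> merged index

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    # Replace every adjacent occurrence of pair in indices with new_index.
    new_indices = []
    i = 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            new_indices.append(new_index)
            i += 2
        else:
            new_indices.append(indices[i])
            i += 1
    return new_indices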
from collections import defaultdict

def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:
    # Start with the list of bytes of string.
    indices = list(map(int, string.encode("utf-8")))
    merges: dict[tuple[int, int], int] = {}  # (index1, index2) => merged index
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}  # index -> bytes
    for i in range(num_merges):
        # Count the number of occurrences of each pair of tokens.
        counts = defaultdict(int)
        for index1, index2 in zip(indices, indices[1:]):  # For each adjacent pair
            counts[(index1, index2)] += 1  # @inspect counts
        # Find the most common pair.
        pair = max(counts, key=counts.get)  # @inspect pair
        index1, index2 = pair
        # Merge that pair.
        new_index = 256 + i  # @inspect new_index
        merges[pair] = new_index  # @inspect merges
        vocab[new_index] = vocab[index1] + vocab[index2]  # @inspect vocab
        indices = merge(indices, pair, new_index)  # @inspect indices
    return BPETokenizerParams(vocab=vocab, merges=merges)
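A quick usage sketch (my own example; the @inspect comments above are presumably annotations for the notes' tracing tool and are inert in plain Python): train on a tiny string and look at the new tokens created by the merges.

params = train_bpe("the cat in the hat", num_merges=3)
for idx, token_bytes in params.vocab.items():
    if idx >= 256:  # only the merged tokens, not the 256 base bytes
        print(idx, token_bytes)
# Common pairs get merged first: b'th', then b'the', then b'the '.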