Introduction
Pre-Training
Step 1: Download and preprocess the internet
Step 2: Tokenization
Step 3: Neural network training
Step 4: Inference
Base model
Post-Training: Supervised Finetuning
Conversations
Hallucinations
Knowledge of Self
Models need tokens to think
Things the model cannot do well
Post-Training: Reinforcement Learning
Reinforcement learning
DeepSeek-R1
AlphaGo
Reinforcement learning from human feedback (RLHF)
Preview of things to come
Keeping track of LLMs
Where to find LLMs
Tokenization is the process by which a large language model (LLM) breaks text into smaller units (tokens) so that it can process it.
Convert text to numbers so the model can understand it
Reduce sequence length by merging frequently co-occurring character sequences into single tokens
① UTF-8 encoding: Convert all characters to bytes
However, simply converting all characters to bytes would result in a sequence that is too long.
➡️ Therefore, a more efficient tokenization method is needed! (=BPE)
② Byte Pair Encoding (BPE): Register frequently co-occurring pairs as a single token → shorter sequences (see the sketch after the example below)
Example
Original: "internationalization"
UTF-8: ['i', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', 'i', 'z', 'a', 't', 'i', 'o', 'n']
After applying BPE: ['international', 'ization']
➡️ The same word is represented with far fewer tokens, i.e. the sequence length is reduced!
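A minimal sketch of the BPE idea in Python (an illustration only, not GPT-4's actual tokenizer or merge rules): start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token ID.

```python
# Toy BPE: merge the most frequent adjacent pair until no pair repeats.
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent pairs and return the most common one (or None)."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None, 0
    pair, count = pairs.most_common(1)[0]
    return pair, count

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "internationalization internationalization"
ids = list(text.encode("utf-8"))      # step ①: UTF-8 bytes (values 0-255)
print("bytes:", len(ids))

next_id = 256                         # new token IDs start after the byte range
while True:
    pair, count = most_frequent_pair(ids)
    if pair is None or count < 2:
        break
    ids = merge(ids, pair, next_id)   # step ②: register the pair as one token
    next_id += 1

print("after BPE:", len(ids), "tokens; vocabulary size:", next_id)
```

Real tokenizers learn these merges once on a huge training corpus and then reuse the fixed merge table, rather than re-deriving it per input as this toy loop does.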
Vocabulary size of GPT-4's tokenizer: 100,277 tokens
Token ID conversion process:
"Hello world"→ ["Hello", " world"]→[15339, 1917]
✅ Important Features
Spaces are included in tokens (" world" vs. "world")
Case differences also affect tokenization results ("Hello" vs. "hello")
Sequences are shortened by encoding meaningful chunks into a minimal number of tokens.
📌 Real-world tokenization demo site using tiktoken: Tiktokenizer
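The "Hello world" example can also be reproduced locally with the tiktoken library (assuming the cl100k_base encoding, which is the one associated with GPT-4; the exact IDs depend on the encoding you load):

```python
# Reproducing the "Hello world" example with the tiktoken library
# (assumes the cl100k_base encoding associated with GPT-4).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)                         # vocabulary size (100,277)

ids = enc.encode("Hello world")
print(ids)                                 # token IDs for "Hello" and " world"
print([enc.decode([i]) for i in ids])      # ['Hello', ' world']

# Spaces and case change the result:
print(enc.encode("hello world"))           # lowercase tokenizes differently
print(enc.encode("Helloworld"))            # no space -> different tokens again
```

The Tiktokenizer site linked above shows the same splits interactively in the browser.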
When the model receives input, it converts the text into a sequence of token IDs.
Example:
"The quick brown fox"→ [324, 9821, 4321, 294](Convert to token ID)
What the model predicts is the ID of the next token.
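A minimal sketch of what "predicting the next token" means (a toy illustration that assumes the network has already produced one score per vocabulary entry; real models pick the next token from these scores by greedy decoding or sampling):

```python
# Toy illustration of next-token prediction over a 100,277-entry vocabulary.
# The random logits stand in for the network's real output.
import numpy as np

vocab_size = 100_277                    # GPT-4 tokenizer vocabulary size
rng = np.random.default_rng(0)
logits = rng.normal(size=vocab_size)    # one score per token ID

probs = np.exp(logits - logits.max())   # softmax -> probability per token ID
probs /= probs.sum()

next_token_id = int(np.argmax(probs))   # greedy pick; sampling is also common
print("predicted next token ID:", next_token_id)
```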