Introduction
Pre-Training
Step 1: Download and preprocess the internet
Step 2: Tokenization
Step 3: Neural network training
Step 4: Inference
Base model
Post-Training: Supervised Finetuning
Conversations
Hallucinations
Knowledge of Self
Models need tokens to think
Things the model cannot do well
Post-Training: Reinforcement Learning
Reinforcement learning
DeepSeek-R1
AlphaGo
Reinforcement learning from human feedback (RLHF)
Preview of things to come
Keeping track of LLMs
Where to find LLMs
Reinforcement learning is easy to apply in domains where the correct answer can be clearly checked (✅ Verifiable). For example, for a problem like "3+2=?", the correct answer is clearly 5, and checking it is simple. But this kind of evaluation breaks down in domains where verification is hard (Unverifiable). Representative examples are creative tasks such as writing jokes, writing poetry, and summarizing.
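As a rough sketch of the difference (both toy functions below are my own illustration, not from the original material):

```python
# A minimal sketch of why "verifiable" rewards are easy to automate and
# "unverifiable" ones are not. Both toy functions are illustrative assumptions.

def math_reward(answer: str) -> float:
    """Verifiable: the answer to '3+2=?' can be checked programmatically."""
    return 1.0 if answer.strip() == "5" else 0.0

def joke_reward(joke: str) -> float:
    """Unverifiable: there is no formula for 'funny' -- a human (or a stand-in
    for a human) has to judge the output."""
    raise NotImplementedError("no automatic check exists for humor")

print(math_reward("5"))  # 1.0 -- millions of these can be scored for free
# joke_reward("Why did the pelican ...")  # this is the part that needs humans
```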
Applying reinforcement learning requires a huge amount of data and evaluation.
For example, to improve a joke-writing model with reinforcement learning, you would need to generate millions of jokes and have humans evaluate them by hand. 😵
Having a person directly evaluate every output is impractically inefficient.
👉 Automation using a reward model learned from human evaluation data!
1️⃣ A person looks at a small sample of outputs (e.g. 5 jokes) and ranks them
2️⃣ A reward model is trained on this ranking data
3️⃣ From then on, outputs are scored automatically by the reward model, not by a real person
4️⃣ Reinforcement learning is run against the reward model's scores
💡 That is, instead of having an actual person evaluate every output, we build a "model that evaluates like a person" and let it stand in for the human!
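As a rough sketch of steps 1️⃣–2️⃣, training a reward model on human rankings could look like this (the tiny linear scorer, random "embeddings", and pairwise ranking loss are simplifying assumptions, not the setup of any particular lab):

```python
# A toy reward model trained on human preference pairs (assumption: jokes are
# already embedded as 16-dim vectors; real systems score raw text with an LLM head).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a joke embedding to a single scalar 'how good is this' score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, joke_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(joke_embedding).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Step 1️⃣: a human ranked pairs of jokes -- stored as (preferred, rejected) embeddings.
preferred = torch.randn(5, 16)  # jokes the human ranked higher (toy data)
rejected = torch.randn(5, 16)   # jokes the human ranked lower (toy data)

# Step 2️⃣: train the reward model so preferred jokes score above rejected ones
# (a standard pairwise / Bradley-Terry style ranking loss).
for step in range(200):
    loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("preferred scores:", reward_model(preferred).detach())
print("rejected scores: ", reward_model(rejected).detach())
```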
🎯 Reinforcement learning is possible even in unverifiable domains!
🎯 Data labeling becomes easier (just rank the answers instead of writing them down)
🎯 Model performance improves (the kind of boost seen in models like ChatGPT)
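Steps 3️⃣–4️⃣ then look roughly like this: the trained reward model stands in for the human judge inside the RL loop (the `embed` helper and the stand-in reward model below are placeholders I made up for illustration):

```python
# The reward model replaces the human judge: every generated joke gets a score
# automatically, and those scores drive the policy update. Both the reward model
# and the `embed` helper here are toy stand-ins (assumptions), not real components.
import torch

reward_model = torch.nn.Linear(16, 1)  # pretend this is the trained reward model from above

def embed(joke: str) -> torch.Tensor:
    """Hypothetical stand-in for turning generated text into features."""
    torch.manual_seed(abs(hash(joke)) % (2**31))
    return torch.randn(16)

# Step 3️⃣: score a batch of freshly generated jokes with no human in the loop.
jokes = [
    "Why did the pelican get kicked out of the restaurant? Huge bill.",
    "A pelican walks into a bar...",
    "the the the the the",
]
with torch.no_grad():
    scores = [reward_model(embed(j)).item() for j in jokes]

for joke, score in zip(jokes, scores):
    print(f"{score:+.3f}  {joke}")

# Step 4️⃣: the policy is then updated to make high-scoring jokes more likely,
# e.g. with a policy-gradient step weighted by these scores (omitted here).
```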
The reward model is only a simulation of a human, not an actual human.
So the model may not perfectly replicate how humans evaluate things 🤔
If you run reinforcement learning for a long time, the model can find loopholes in the reward model and learn to get high scores for the wrong reasons.
For example, while training on "pelican jokes", a nonsensical output like "the the the" can end up getting a top score from the reward model! 🤯
To prevent this, you should not keep training for too long and should stop at an appropriate point.
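A rough sketch of that stopping discipline (the repetition heuristic and the stubbed-out policy/reward calls are purely illustrative assumptions):

```python
# Don't let RL run forever against the reward model: keep a hard step budget and
# watch for degenerate outputs it loves but a human would not. All of this is a
# toy illustration -- the stubs below stand in for the real policy and reward model.
import random

def looks_degenerate(text: str, min_distinct_ratio: float = 0.5) -> bool:
    """Crude heuristic: too few distinct words usually means 'the the the ...'."""
    words = text.split()
    return bool(words) and len(set(words)) / len(words) < min_distinct_ratio

def generate_joke(step: int) -> str:          # hypothetical policy sample
    return "Why did the pelican cross the road?" if step < 3 else "the the the the the"

def reward_model_score(joke: str) -> float:   # hypothetical reward model call
    return random.uniform(0.0, 1.0)

MAX_RL_STEPS = 10_000  # hard budget: stop before the policy over-fits the reward model

for step in range(MAX_RL_STEPS):
    joke = generate_joke(step)
    score = reward_model_score(joke)
    if looks_degenerate(joke):
        print(f"step {step}: degenerate joke scored {score:.2f} -- stopping RL here")
        break
    # ...otherwise perform the usual policy-gradient update using `score`...
```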
💡 RLHF is not a perfect solution that lets you run RL indefinitely!
✅ It makes reinforcement learning practical by replacing human evaluation with a reward model, but
⚠️ it cannot go beyond the limitations of that reward model, and training has to stop at a certain point.