[YouTube Lecture Summary] Andrej Karpathy - Deep Dive into LLMs like ChatGPT

Introduction

Pre-Training

Step 1: Download and preprocess the internet

Step 2: Tokenization

Step 3: Neural network training

Step 4: Inference

Base model

Post-Training: Supervised Finetuning

Conversations

Hallucinations

Knowledge of Self

Models need tokens to think

Things the model cannot do well

Post-Training: Reinforcement Learning

Reinforcement learning

DeepSeek-R1

AlphaGo

Reinforcement learning from human feedback (RLHF)

Preview of things to come

Keeping track of LLMs

Where to find LLMs

Reinforcement learning from human feedback (RLHF)

🔍 RLHF (Reinforcement Learning from Human Feedback) and Learning in Unverifiable Domains

Standard reinforcement learning is easy to apply in domains where the correct answer can be clearly checked (✅ Verifiable). For example, for a problem like "3+2=?", the correct answer is unambiguously 5, and evaluating it is simple. This kind of evaluation breaks down in domains where verification is difficult (Unverifiable). Representative examples are creative tasks such as humor generation, poetry writing, and summarization.
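To make the contrast concrete, here is a minimal sketch (not from the lecture; the function names and logic are illustrative assumptions): a verifiable task reduces to an exact-match check a program can run, while an unverifiable one has no such check at all.

```python
# Illustrative sketch of verifiable vs. unverifiable reward.
# All names here are assumptions for illustration, not from the lecture.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Math-style task: the reward is a trivial exact-match check."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

print(verifiable_reward("5", "5"))  # "3+2=?" -> 1.0, scored by a program

def unverifiable_reward(joke: str) -> float:
    """Humor task: there is no ground truth to compare against.
    This is exactly the gap RLHF fills with a learned reward model."""
    raise NotImplementedError("no automatic check for 'how funny'")
```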


🏆 RLHF: Reinforcement Learning from Human Feedback

🔹 Limitations of the existing method

  • Applying reinforcement learning directly requires a huge amount of generated data and evaluation.

  • For example, to reinforce a humor-generating model, you would need to generate millions of jokes and have them manually evaluated by humans. 😵

  • However, it is impractically inefficient for a person to evaluate every output by hand.

🔹 The core idea of RLHF

👉 Automation using a reward model learned from human evaluation data!

1️⃣ A person looks at a small sample of outputs (e.g. 5 jokes) and ranks them
2️⃣ A reward model is trained on this ranking data
3️⃣ After that, automatic evaluation is performed by the reward model, not a real person
4️⃣ Reinforcement learning is run against the reward model

💡 That is, instead of an actual person evaluating every output, we create a "model that evaluates like a person" and substitute it (sketched in code below)!
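As a concrete illustration of steps 1️⃣ and 2️⃣, here is a minimal PyTorch sketch of reward-model training from pairwise human preferences (a Bradley-Terry-style loss, common in published RLHF recipes; the lecture describes the idea, not this exact code, so the shapes and names below are assumptions):

```python
# Minimal sketch: train a reward model from human preference pairs.
# A human ranking of e.g. 5 jokes decomposes into pairs (better, worse).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps an (already embedded) response to a single scalar score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)  # one score per response

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

# Toy batch: embeddings of human-preferred vs. rejected responses.
preferred = torch.randn(8, 128)
rejected = torch.randn(8, 128)

# Bradley-Terry loss: push score(preferred) above score(rejected).
loss = -F.logsigmoid(rm(preferred) - rm(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Once trained, this scalar scorer stands in for the human in step 3️⃣, and step 4️⃣ runs ordinary reinforcement learning against its scores.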


✅ Advantages of RLHF

🎯 Reinforcement learning becomes possible even in unverifiable domains!
🎯 Data labeling becomes easier (annotators just rank candidate answers instead of writing them)
🎯 Model performance improves (as seen in assistants like ChatGPT)


⚠️ Limitations and Issues of RLHF

❌ 1. Not a perfect substitute for human evaluation

  • The reward model is only a simulation of a human judge, not an actual human.

  • So it may not perfectly replicate how humans actually evaluate things 🤔

❌ 2. Reinforcement learning can game the reward model

  • If reinforcement learning runs for too long, the model can find loopholes in the reward model and learn to achieve high scores undeservedly.

  • For example, while training on "pelican jokes", an unusual word combination (e.g. "the the the") may end up getting a high score from the reward model! 🤯

  • To prevent this, you should not keep training for too long; stop at an appropriate point (see the toy sketch after this list).
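Here is a hypothetical, self-contained toy of that failure mode (everything in it, including the fake scorer and its loophole, is an illustrative assumption rather than the lecture's code):

```python
# Toy illustration of reward hacking. The "reward model" below is a fake
# stand-in with a deliberate loophole; all names and numbers are assumptions.
import random

random.seed(0)

def toy_reward_model(text: str) -> float:
    """Pretend learned scorer. Its loophole: it was never trained on
    degenerate repetition, so strings stuffed with "the" score high."""
    return sum(1.5 if w == "the" else random.random() for w in text.split())

candidates = [
    "why did the pelican cross the road",  # a real (if weak) joke setup
    "the the the the the the",             # degenerate loophole output
]
for c in candidates:
    print(f"{toy_reward_model(c):6.2f}  {c}")

# An RL policy optimized against this scorer drifts toward the second
# string; hence the advice to stop the optimization at a sensible point.
```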


📌 Conclusion

💡 RLHF is not a magic solution that lets you run RL indefinitely!
✅ It applies reinforcement learning efficiently by replacing human evaluation with a reward model, but
⚠️ it cannot go beyond the limitations of that reward model, so training must stop at a certain point.