Introduction
Pre-Training
Step 1: Download and preprocess the internet
Step 2: Tokenization
Step 3: Neural network training
Step 4: Inference
Base model
Post-Training: Supervised Finetuning
Conversations
Hallucinations
Knowledge of Self
Models need tokens to think
Things the model cannot do well
Post-Training: Reinforcement Learning
Reinforcement learning
DeepSeek-R1
AlphaGo
Reinforcement learning from human feedback (RLHF)
Preview of things to come
Keeping track of LLMs
Where to find LLMs
A large language model (LLM) is first pre-trained on internet documents, but at that stage it is only a text predictor (a base model) and cannot function as a natural conversational AI assistant.
To address this, the post-training stage fine-tunes the model on a new dataset of conversations.
Each example in this conversation dataset pairs a user query with the ideal response from the AI assistant: “User Query → Assistant Response.”
Example:
User: "What is 2 + 2?"
Assistant: "2 + 2 is 4."
User: "What if it's '*' instead of '+'?"
Assistant: "2 × 2 is 4."
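In code, one such training example can be represented as an ordered list of messages, each tagged with a role. This is a minimal sketch: the `Message` dataclass and the `is_well_formed` check are hypothetical, and the rule that turns strictly alternate and end on an assistant reply is an assumption for illustration. Real datasets use their own schemas.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str     # "user" or "assistant"
    content: str  # the text of this turn

# The two-turn arithmetic conversation above as one training example.
conversation = [
    Message("user", "What is 2 + 2?"),
    Message("assistant", "2 + 2 is 4."),
    Message("user", "What if it's '*' instead of '+'?"),
    Message("assistant", "2 × 2 is 4."),
]

def is_well_formed(conv: list[Message]) -> bool:
    """Assumed rule: turns alternate user/assistant, starting with the user
    and ending with an assistant reply (the part the model learns to imitate)."""
    if not conv or conv[-1].role != "assistant":
        return False
    expected = ("user", "assistant")
    return all(m.role == expected[i % 2] for i, m in enumerate(conv))
```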
Conversation data is primarily written by human labelers, although AI-generated data has come into wider use recently.
Professional labelers write the conversations directly.
For example, programming questions are answered by developers and scientific questions by scientists.
Example prompts from such a dataset:
"5 ways to regain passion for my career?"
"Translate the following sentence into Spanish."
"What are the five must-see landmarks in Paris?"
Nowadays, it is common for an AI model to draft the responses first, with labelers then reviewing and correcting them.
Most modern conversation datasets consist largely of AI-generated “synthetic data.”
Other projects, such as OpenAssistant, rely on crowdsourcing, letting users write and review both the questions and the answers.
Each conversation is converted into a single token sequence, with special tokens marking where each turn starts and ends, so that the model can process it.
For example:
<|im_start|>user<|im_sep|>What are the top 5 must-see landmarks in Paris?<|im_end|><|im_start|>assistant<|im_sep|>1. Eiffel Tower 2. Louvre 3. Notre Dame Cathedral 4. Champs-Elysees 5. Montmartre<|im_end|>
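The rendering step can be sketched as plain string concatenation with the `<|im_start|>` / `<|im_sep|>` / `<|im_end|>` markers. This is a minimal illustration: `render_chat` and `render_prompt` are hypothetical helpers, and a real tokenizer maps each marker to a single reserved token ID rather than treating it as ordinary text. `render_prompt` reflects the assumed inference-time use, where the conversation so far is rendered and an assistant turn is left open for the model to complete.

```python
def render_chat(conversation: list[tuple[str, str]]) -> str:
    """Flatten (role, content) turns into one string using the
    <|im_start|>role<|im_sep|>content<|im_end|> layout."""
    return "".join(
        f"<|im_start|>{role}<|im_sep|>{content}<|im_end|>"
        for role, content in conversation
    )

def render_prompt(history: list[tuple[str, str]]) -> str:
    """Assumed inference-time usage: render the conversation so far, then
    open an assistant turn that the model is left to complete."""
    return render_chat(history) + "<|im_start|>assistant<|im_sep|>"

example = [
    ("user", "What are the top 5 must-see landmarks in Paris?"),
    ("assistant", "1. Eiffel Tower 2. Louvre 3. Notre Dame Cathedral "
                  "4. Champs-Elysees 5. Montmartre"),
]
```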
By training on these sequences, the model learns how an assistant turn typically follows a user turn, so it responds in the same way in new conversations.
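Training on these sequences uses the same next-token prediction objective as pre-training. The sketch below shows how a conversation might be turned into (context, target) training pairs. It assumes, as is common practice but not stated above, that only tokens of assistant turns contribute to the loss. `toy_encode` is a hypothetical character-level stand-in for a real tokenizer, and the special-token IDs are made up.

```python
# Made-up IDs for the three special tokens (a real tokenizer reserves its own).
SPECIAL = {"<|im_start|>": 0, "<|im_sep|>": 1, "<|im_end|>": 2}

def toy_encode(text: str) -> list[int]:
    """Hypothetical character-level tokenizer: one ID per character,
    offset so it never collides with the special-token IDs."""
    return [len(SPECIAL) + ord(ch) for ch in text]

def encode_with_mask(conversation, encode=toy_encode):
    """Return (tokens, mask); mask[i] == 1 marks tokens of an assistant reply,
    including its closing <|im_end|>, and 0 marks everything else."""
    tokens: list[int] = []
    mask: list[int] = []

    def emit(ids, trainable):
        tokens.extend(ids)
        mask.extend([int(trainable)] * len(ids))

    for role, content in conversation:
        is_assistant = role == "assistant"
        emit([SPECIAL["<|im_start|>"]], False)
        emit(encode(role), False)
        emit([SPECIAL["<|im_sep|>"]], False)
        emit(encode(content), is_assistant)
        emit([SPECIAL["<|im_end|>"]], is_assistant)
    return tokens, mask

def training_pairs(tokens, mask):
    """Next-token prediction: given tokens[:i], predict tokens[i].
    Only positions where mask[i] == 1 would count toward the loss."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens)) if mask[i]]
```

Under this assumed masking, the model is trained to reproduce the labeler-written answers, including when to end its turn, rather than the user's questions.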
When a user asks a question, the model generates an answer by statistically predicting "how would a labeler answer this question?"
In other words, rather than thinking on its own, the AI simulates the labelers' responses.
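Generating the answer is ordinary autoregressive sampling: the model repeatedly samples the next token from its predicted distribution until it emits `<|im_end|>`. A minimal sketch, in which `next_token_probs` is a hypothetical stand-in for the trained network (a function returning a token → probability dict) and `toy_model` is a made-up example of such a function:

```python
import random

def generate(prompt_tokens, next_token_probs, end_token="<|im_end|>",
             max_new_tokens=50, seed=0):
    """Sample one token at a time from the model's predicted distribution,
    appending each to the context, until the end-of-turn token appears."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # {token: probability}
        candidates, weights = zip(*probs.items())
        token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(token)
        if token == end_token:
            break
    return tokens[len(prompt_tokens):]

def toy_model(tokens):
    """Hypothetical stand-in for the trained network: it always answers
    "4" and then closes its turn."""
    return {"<|im_end|>": 1.0} if tokens[-1] == "4" else {"4": 1.0}

print(generate(["What", "is", "2+2?"], toy_model))  # ['4', '<|im_end|>']
```

With a real model, the distribution would come from running the network on the rendered conversation, so the sampled answer reflects whatever the labeler-written training data made most likely.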
If similar questions appear in the training data, the model is likely to generate nearly identical responses.
Even for questions that are not in the training data, it produces “similar-feeling” answers, drawing on the knowledge it already acquired during pre-training.