Introduction
Pre-Training
Step 1: Download and preprocess the internet
Step 2: Tokenization
Step 3: Neural network training
Step 4: Inference
Base model
Post-Training: Supervised Finetuning
Conversations
Hallucinations
Knowledge of Self
Models need tokens to think
Things the model cannot do well
Post-Training: Reinforcement Learning
Reinforcement learning
DeepSeek-R1
AlphaGo
Reinforcement learning from human feedback (RLHF)
Preview of things to come
Keeping track of LLMs
Where to find LLMs
The first step of pre-training is to collect a large amount of text data from the Internet.
Representative data source: Common Crawl
Common Crawl is a public database that has been crawling the web since 2007, storing over 2.7 billion web pages.
Contains the raw content of pages crawled from websites across the web.
Other data sources: Wikipedia, books, papers, news, code repositories (GitHub), blogs, etc.
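One way to look at this raw data directly: a minimal sketch that iterates over the pages in a locally downloaded Common Crawl WARC segment, assuming the warcio library (the segment filename below is illustrative).

```python
# Iterate over raw pages in a locally downloaded Common Crawl WARC segment.
# Assumes the warcio library; the segment filename is illustrative.
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path):
    """Yield (url, raw_html_bytes) for every HTTP response record."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                yield url, record.content_stream().read()

for url, html in iter_html_pages("CC-MAIN-example.warc.gz"):
    print(url, len(html))
    break  # just peek at the first page
```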
The raw crawl also contains a great deal of unwanted content, so a high-quality dataset is built through a series of filtering steps.
Apply blacklist domain filtering to remove untrustworthy websites.
Filtering targets: spam, malware, marketing sites, adult content, websites containing hate speech, etc.
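A minimal sketch of this URL/domain filtering step; the blocklist entries are hypothetical, and real pipelines use large curated blocklists.

```python
# URL/domain blacklist filtering; the domains below are hypothetical examples.
from urllib.parse import urlparse

BLACKLISTED_DOMAINS = {"spam-example.com", "malware-example.net"}

def passes_url_filter(url: str) -> bool:
    """Return False if the page's domain (or a parent domain) is blocklisted."""
    domain = urlparse(url).netloc.lower()
    return not any(domain == d or domain.endswith("." + d) for d in BLACKLISTED_DOMAINS)

assert passes_url_filter("https://en.wikipedia.org/wiki/Language_model")
assert not passes_url_filter("http://spam-example.com/buy-now")
```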
Since the raw data is stored as HTML markup, a step is needed to extract only the plain text.
Remove the website's structural elements (CSS, JavaScript, ads, etc.), leaving only the text.
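A minimal sketch of text extraction using BeautifulSoup (an assumption on my part; production pipelines often use dedicated extractors such as trafilatura).

```python
# Strip HTML structure and keep only the visible text of a page.
from bs4 import BeautifulSoup

def extract_text(html: bytes) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-content elements: scripts, styles, navigation chrome, etc.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Collapse whitespace and return the remaining plain text.
    return " ".join(soup.get_text(separator=" ").split())
```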
Apply a language-detection model to restrict the training data to a specific language.
Example: the FineWeb dataset only retains pages whose English-language classifier score is 0.65 or higher.
If you want a multilingual model, you can use a dataset that contains multiple languages.
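A minimal sketch of language filtering with the fastText language-ID model (lid.176.bin); FineWeb's filter is based on this kind of classifier with a 0.65 threshold.

```python
# Keep only documents whose English-language score clears a threshold.
# Assumes the fasttext package and the lid.176.bin model file, downloaded separately.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.65) -> bool:
    # fastText's predict() expects a single line of text.
    labels, scores = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and scores[0] >= threshold
```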
If the same document or near-identical passages appear many times, the model risks overfitting to that content.
Apply deduplication techniques to detect and remove duplicate or near-duplicate documents.
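A minimal sketch of near-duplicate detection with MinHash LSH, using the datasketch library (one common approach; large pipelines apply the same idea at scale, typically over shingled text).

```python
# Detect and drop near-duplicate documents with MinHash locality-sensitive hashing.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for word in set(text.lower().split()):
        m.update(word.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # estimated-Jaccard threshold
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the cat sat on a mat",        # near-duplicate of doc1
    "doc3": "completely different text",
}
kept = []
for doc_id, text in docs.items():
    sig = minhash(text)
    if not lsh.query(sig):  # no sufficiently similar document seen yet
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
print(kept)  # doc2 is very likely dropped as a near-duplicate of doc1
```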
Web pages may contain Personally Identifiable Information (PII).
Detect and remove personal information (e.g. name, address, phone number, credit card number, social security number, etc.).
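A minimal sketch of rule-based PII redaction with regular expressions; the patterns are deliberately simplified, and real pipelines combine such rules with trained PII-detection models.

```python
# Redact common PII patterns (email, SSN, credit card, phone) from text.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,3}[ -]?)?(?:\(\d{3}\)|\d{3})[ -]?\d{3}[ -]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace every matched PII span with a placeholder like [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or 555-123-4567."))
# -> Contact [EMAIL] or [PHONE].
```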