[YouTube Lecture Summary] Andrej Karpathy - Deep Dive into LLMs like ChatGPT

Introduction

Pre-Training

Step 1: Download and preprocess the internet

Step 2: Tokenization

Step 3: Neural network training

Step 4: Inference

Base model

Post-Training: Supervised Finetuning

Conversations

Hallucinations

Knowledge of Self

Models need tokens to think

Things the model cannot do well

Post-Training: Reinforcement Learning

Reinforcement learning

DeepSeek-R1

AlphaGo

Reinforcement learning from human feedback (RLHF)

Preview of things to come

Keeping track of LLMs

Where to find LLMs

Step 1: Download and preprocess the internet

1. Data Collection

  • The first step of pre-training is to collect a large amount of text data from the internet.

  • Representative data source: Common Crawl

    • Common Crawl is a public web-crawl archive that has been crawling the web since 2007 and has indexed over 2.7 billion web pages.

    • Contains the raw text captured from a wide range of websites (see the reading sketch after this list).

  • Other data sources: Wikipedia, books, papers, news, code repositories (GitHub), blogs, etc.
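A minimal sketch of reading one Common Crawl snapshot file, assuming the warcio library is installed and a WARC archive has already been downloaded; the file name below is a placeholder, not from the lecture.

```python
# Sketch: iterate over the response records of one Common Crawl WARC file.
# "example.warc.gz" is a placeholder path; real snapshot listings are on commoncrawl.org.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only 'response' records carry the fetched page content.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()  # raw bytes of the page
            print(url, len(html))
```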


2. Data Filtering

The collected raw data contains a lot of low-quality and irrelevant content, so a high-quality dataset is built through several filtering steps.

① URL Filtering (Domain Filtering)

  • Apply a domain blacklist to remove untrustworthy websites (sketched below).

  • Filtering targets: spam, malware, marketing sites, adult content, websites containing hate speech, etc.
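A minimal sketch of blacklist-based domain filtering; the blocked-domain set and the helper function are illustrative placeholders, not from the lecture.

```python
# Sketch: drop pages whose domain (or parent domain) appears on a blocklist.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example", "malware.example", "ads.example"}  # illustrative

def is_allowed(url: str) -> bool:
    domain = urlparse(url).netloc.lower()
    # Block the listed domains and all of their subdomains.
    return not any(domain == d or domain.endswith("." + d) for d in BLOCKED_DOMAINS)

urls = ["https://en.wikipedia.org/wiki/Language_model", "http://spam.example/buy-now"]
kept = [u for u in urls if is_allowed(u)]  # -> only the Wikipedia URL survives
```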

② Text Extraction

  • Since the raw data is stored as HTML, the plain text must be extracted from the markup.

  • Strip the page's structural elements (CSS, JavaScript, ads, navigation, etc.), keeping only the text (see the sketch below).
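A minimal extraction sketch, assuming BeautifulSoup as one possible HTML parser; production pipelines use dedicated extraction tools, and this is only an illustration.

```python
# Sketch: strip scripts, styles, and markup from an HTML page, keeping visible text.
from bs4 import BeautifulSoup

html = ("<html><head><style>p{color:red}</style></head>"
        "<body><script>var x=1;</script><p>Hello, <b>world</b>!</p></body></html>")

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()  # remove non-content elements entirely

text = soup.get_text(separator=" ", strip=True)
print(text)  # -> "Hello, world !" (exact spacing depends on the markup)
```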

③ Language Filtering

  • Apply a language detection model so the LLM is trained on the desired language(s).

  • Example: the FineWeb dataset keeps only pages whose English score from the language classifier is above 65% (see the sketch after this list).

  • If you want a multilingual model, you can use a dataset that contains multiple languages.
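A minimal sketch of such a filter, assuming the fastText language-identification model file (lid.176.bin) has been downloaded; the 0.65 threshold mirrors the FineWeb example above.

```python
# Sketch: keep only documents classified as English with score > 0.65.
# Assumes the fasttext package is installed and lid.176.bin is present locally.
import fasttext

model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.65) -> bool:
    labels, probs = model.predict(text.replace("\n", " "))  # predict needs single-line input
    return labels[0] == "__label__en" and probs[0] > threshold

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "빠른 갈색 여우가 게으른 개를 뛰어넘는다.",
]
english_docs = [d for d in docs if is_english(d)]  # keeps only the English sentence
```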

④ Deduplication

  • If the same document or near-identical passages appear many times, the model risks overfitting to that data.

  • Apply deduplication techniques to detect and remove duplicate or near-duplicate documents (toy sketch below).
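A toy sketch of exact deduplication by hashing normalized text; real pipelines typically add fuzzy matching (e.g. MinHash/LSH) to catch near-duplicates, which is not shown here.

```python
# Sketch: exact deduplication by hashing normalized document text.
import hashlib

def dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        # Normalize whitespace and case so trivially different copies still collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world", "hello   WORLD", "Something else"]
print(dedup(docs))  # -> ['Hello world', 'Something else']
```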

⑤ Personal Information Protection Filtering (PII Removal)

  • Web pages may contain personally identifiable information (PII).

  • Detect and remove or mask personal information (e.g. names, addresses, phone numbers, credit card numbers, social security numbers), as in the regex sketch below.
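A minimal regex-based sketch of PII masking; the patterns are illustrative placeholders and far from exhaustive, and production filters combine many more detectors.

```python
# Sketch: mask a few common PII patterns with regular expressions (illustrative only).
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3,4}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or 555-123-4567."))
# -> "Contact [EMAIL] or [PHONE]."
```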