[YouTube Lecture Summary] Andrej Karpathy - Deep Dive into LLMs like ChatGPT

Introduction

Pre-Training

Step 1: Download and preprocess the internet

Step 2: Tokenization

Step 3: Neural network training

Step 4: Inference

Base model

Post-Training: Supervised Finetuning

Conversations

Hallucinations

Knowledge of Self

Models need tokens to think

Things the model cannot do well

Post-Training: Reinforcement Learning

Reinforcement learning

DeepSeek-R1

AlphaGo

Reinforcement learning from human feedback (RLHF)

Preview of things to come

Keeping track of LLMs

Where to find LLMs

Step 2: Tokenization

1. What is Tokenization?

The process by which a large language model (LLM) breaks text into smaller units (tokens) before processing it.


2. The need for tokenization

  1. Convert text into numbers so the model can process it

  2. Reduce sequence length by merging frequently occurring character sequences into single tokens


3. Tokenization Process

① UTF-8 encoding: Convert all characters to bytes

  • However, simply converting all characters to bytes would result in a sequence that is too long.

  • ➡️ Therefore, a more efficient tokenization method is needed! (=BPE)
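The UTF-8 step above can be seen directly in Python with the standard library alone:

```python
text = "internationalization"
byte_ids = list(text.encode("utf-8"))

# Every character here is ASCII, so each one becomes exactly one byte ID.
print(byte_ids[:4])   # [105, 110, 116, 101]  ('i', 'n', 't', 'e')
print(len(byte_ids))  # 20 IDs for a 20-character word -> sequence too long
```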

② Byte Pair Encoding (BPE): Merge frequently co-occurring pairs into a single new token → shortens the sequence


example

  • Original: "internationalization"

  • UTF-8 bytes: ['i', 'n', 't', 'e', 'r', 'n', 'a', 't', 'i', 'o', 'n', 'a', 'l', 'i', 'z', 'a', 't', 'i', 'o', 'n'] (one byte per character, since all are ASCII)

  • After applying BPE: ['international', 'ization']

➡️ The same text is represented with far fewer tokens, i.e. the sequence length is optimized!
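One BPE merge step can be sketched in a few lines of toy Python (a simplified sketch, not the actual GPT implementation): count adjacent ID pairs, then replace the most frequent pair everywhere with a new token ID.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent ID pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))   # [97, 97, 97, 98, 100, ...]
pair = most_frequent_pair(ids)              # (97, 97), i.e. 'aa'
ids = merge(ids, pair, 256)                 # first new ID beyond the 0-255 byte range
print(ids)  # [256, 97, 98, 100, 256, 97, 98, 97, 99] -- 11 IDs shrank to 9
```

Real tokenizers repeat this merge loop tens of thousands of times on a large corpus; each merge adds one vocabulary entry and shortens typical sequences further.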


4. Tokenization in the actual GPT model

  • Vocabulary size of GPT-4's tokenizer (cl100k_base): 100,277 tokens

  • Token ID conversion process:

    • "hello world" → ["hello", " world"] → [15339, 1917]

Important Features

  • Spaces are part of tokens (" world" vs. "world" are different tokens)

  • Case differences also affect tokenization results ("Hello" vs. "hello")

  • Meaningful chunks are converted into as few tokens as possible, keeping sequences short

📌 Interactive tokenization demo site using tiktoken: Tiktokenizer


5. Model input using tokenization

  • When the model receives input, it converts the text into a sequence of token IDs.

  • Example:

    • "The quick brown fox" → [324, 9821, 4321, 294] (converted to token IDs; illustrative values)

  • What the model predicts is the ID of the next token.
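As a toy illustration of that last point (hypothetical vocabulary and probabilities, not a real model): the model maps the input ID sequence to a probability distribution over every token in the vocabulary, and the next token is chosen from that distribution.

```python
# Hypothetical 4-token vocabulary using the illustrative IDs from above.
vocab = {324: "The", 9821: " quick", 4321: " brown", 294: " fox"}
input_ids = [324, 9821, 4321]  # "The quick brown"

# Pretend these are the model's output probabilities for the next token
# (a real model produces one probability per token in its ~100k vocabulary).
next_token_probs = {324: 0.01, 9821: 0.04, 4321: 0.05, 294: 0.90}

next_id = max(next_token_probs, key=next_token_probs.get)  # greedy pick
print(next_id, repr(vocab[next_id]))  # 294 ' fox'
```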