Introduction
Pre-Training
Step 1: Download and preprocess the internet
Step 2: Tokenization
Step 3: Neural network training
Step 4: Inference
Base model
Post-Training: Supervised Finetuning
Conversations
Hallucinations
Knowledge of Self
Models need tokens to think
Things the model cannot do well
Post-Training: Reinforcement Learning
Reinforcement learning
DeepSeek-R1
AlphaGo
Reinforcement learning from human feedback (RLHF)
Preview of things to come
Keeping track of LLMs
Where to find LLMs
Reinforcement learning is easy to apply in domains where the correct answer can be clearly checked (✅ Verifiable). For example, for a problem like "3+2=?", the correct answer is clearly 5, and checking it is simple. But this kind of evaluation breaks down in domains where verification is hard (Unverifiable). Representative examples are creative tasks such as writing jokes, writing poetry, and summarizing.
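As a rough sketch of the difference (both toy functions below are my own illustration, not from the original material):

```python
# A minimal sketch of why "verifiable" rewards are easy to automate and
# "unverifiable" ones are not. Both toy functions are illustrative assumptions.

def math_reward(answer: str) -> float:
    """Verifiable: the answer to '3+2=?' can be checked programmatically."""
    return 1.0 if answer.strip() == "5" else 0.0

def joke_reward(joke: str) -> float:
    """Unverifiable: there is no formula for 'funny' -- a human (or a stand-in
    for a human) has to judge the output."""
    raise NotImplementedError("no automatic check exists for humor")

print(math_reward("5"))  # 1.0 -- millions of these can be scored for free
# joke_reward("Why did the pelican ...")  # this is the part that needs humans
```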
Applying reinforcement learning requires a huge amount of data and evaluation.
For example, to improve a joke-writing model with reinforcement learning, you would need to generate millions of jokes and have humans evaluate them by hand. 😵
Having a person directly evaluate every output is impractically inefficient.
👉 Automation using a reward model learned from human evaluation data!
1️⃣ A person looks at a small sample of outputs (e.g. 5 jokes) and ranks them
2️⃣ A reward model is trained on this ranking data
3️⃣ From then on, outputs are scored automatically by the reward model, not by a real person
4️⃣ Reinforcement learning is run against the reward model's scores
💡 That is, instead of having an actual person evaluate every output, we build a "model that evaluates like a person" and let it stand in for the human!
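As a rough sketch of steps 1️⃣–2️⃣, training a reward model on human rankings could look like this (the tiny linear scorer, random "embeddings", and pairwise ranking loss are simplifying assumptions, not the setup of any particular lab):

```python
# A toy reward model trained on human preference pairs (assumption: jokes are
# already embedded as 16-dim vectors; real systems score raw text with an LLM head).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a joke embedding to a single scalar 'how good is this' score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, joke_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(joke_embedding).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Step 1️⃣: a human ranked pairs of jokes -- stored as (preferred, rejected) embeddings.
preferred = torch.randn(5, 16)  # jokes the human ranked higher (toy data)
rejected = torch.randn(5, 16)   # jokes the human ranked lower (toy data)

# Step 2️⃣: train the reward model so preferred jokes score above rejected ones
# (a standard pairwise / Bradley-Terry style ranking loss).
for step in range(200):
    loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("preferred scores:", reward_model(preferred).detach())
print("rejected scores: ", reward_model(rejected).detach())
```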
🎯 Reinforcement learning is possible even in unverifiable domains!
🎯 Data labeling becomes easier (just rank the answers instead of writing them down)
🎯 Model performance improves (the kind of boost seen in models like ChatGPT)
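Steps 3️⃣–4️⃣ then look roughly like this: the trained reward model stands in for the human judge inside the RL loop (the `embed` helper and the stand-in reward model below are placeholders I made up for illustration):

```python
# The reward model replaces the human judge: every generated joke gets a score
# automatically, and those scores drive the policy update. Both the reward model
# and the `embed` helper here are toy stand-ins (assumptions), not real components.
import torch

reward_model = torch.nn.Linear(16, 1)  # pretend this is the trained reward model from above

def embed(joke: str) -> torch.Tensor:
    """Hypothetical stand-in for turning generated text into features."""
    torch.manual_seed(abs(hash(joke)) % (2**31))
    return torch.randn(16)

# Step 3️⃣: score a batch of freshly generated jokes with no human in the loop.
jokes = [
    "Why did the pelican get kicked out of the restaurant? Huge bill.",
    "A pelican walks into a bar...",
    "the the the the the",
]
with torch.no_grad():
    scores = [reward_model(embed(j)).item() for j in jokes]

for joke, score in zip(jokes, scores):
    print(f"{score:+.3f}  {joke}")

# Step 4️⃣: the policy is then updated to make high-scoring jokes more likely,
# e.g. with a policy-gradient step weighted by these scores (omitted here).
```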
The reward model is only a simulation of a human, not an actual human.
So the model may not perfectly replicate how humans evaluate things 🤔
If you run reinforcement learning for a long time, the model can find loopholes in the reward model and learn to get high scores for the wrong reasons.
For example, while training on "pelican jokes", a nonsensical output like "the the the" can end up getting a top score from the reward model! 🤯
To prevent this, you should not keep training for too long and should stop at an appropriate point.
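A rough sketch of that stopping discipline (the repetition heuristic and the stubbed-out policy/reward calls are purely illustrative assumptions):

```python
# Don't let RL run forever against the reward model: keep a hard step budget and
# watch for degenerate outputs it loves but a human would not. All of this is a
# toy illustration -- the stubs below stand in for the real policy and reward model.
import random

def looks_degenerate(text: str, min_distinct_ratio: float = 0.5) -> bool:
    """Crude heuristic: too few distinct words usually means 'the the the ...'."""
    words = text.split()
    return bool(words) and len(set(words)) / len(words) < min_distinct_ratio

def generate_joke(step: int) -> str:          # hypothetical policy sample
    return "Why did the pelican cross the road?" if step < 3 else "the the the the the"

def reward_model_score(joke: str) -> float:   # hypothetical reward model call
    return random.uniform(0.0, 1.0)

MAX_RL_STEPS = 10_000  # hard budget: stop before the policy over-fits the reward model

for step in range(MAX_RL_STEPS):
    joke = generate_joke(step)
    score = reward_model_score(joke)
    if looks_degenerate(joke):
        print(f"step {step}: degenerate joke scored {score:.2f} -- stopping RL here")
        break
    # ...otherwise perform the usual policy-gradient update using `score`...
```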
💡 RLHF is not a perfect solution that lets you run RL indefinitely!
✅ It makes reinforcement learning practical by replacing human evaluation with a reward model, but
⚠️ it cannot go beyond the limitations of that reward model, and training has to stop at a certain point.