Chapter 2. Developing Foundation Models

Applications with foundation models need foundation models. While you don’t need to know how to develop a model inside and out to use it, a high-level understanding will be useful in helping you make the right decision on what model to use and how to adapt it to your needs.

To understand models, we have to start with data. Foundation models need a lot of data. Knowing what data a model is trained on gives important clues about what it can do. This chapter looks at data from a sourcing perspective: where training data for foundation models typically comes from, and what it means for models’ performance on downstream tasks. Chapter 5 will go into detail on dataset engineering techniques, including tokenization and quality control.

At a macro level, there are two distinct phases of model training: pre-training and post-training1. Pre-training is typically done with self-supervision, which allows the model to be trained using a massive amount of data. The resulting model is generally capable but not necessarily safe to use. Given this model, the goal of post-training is to make it usable in two respects: following users’ instructions and aligning with human preference. For most applications, a model that aligns more closely with human preference is better2. Post-training is typically done with supervision using higher-quality data. This chapter walks through steps in both of these phases.

Once a model is trained and aligned, it can be used to generate outputs through a process called sampling. Sampling is perhaps among the most underrated concepts of AI engineering, as it explains many seemingly baffling AI behaviors including hallucination and inconsistency. On top of that, choosing the right sampling variables is relatively easy to do and can get a model to generate significantly better outputs. For this reason, sampling is the section that I was the most excited to write about in this chapter.

Concepts covered in this chapter are fundamental. They are important for understanding the rest of the book. However, because these concepts are fundamental, you might already be familiar with them. If you feel confident about them already, feel free to skip them. If you encounter a confusing concept later on, you might want to revisit this chapter.

Training Data Distribution

An AI model is only as good as the data it was trained on. If there’s no Vietnamese in the training data, the model won’t be able to translate from English into Vietnamese. Similarly, if an image classification model only sees animals in its training set, it won’t perform well on photos of plants.

If we want a model to improve for a certain task, we want to include more data for that task in the training data. However, collecting sufficient data for training a large model isn’t easy, and can be expensive. Often, model developers use the data that is available, not the data that they want.

For example, a common source for training data is Common Crawl, created by a nonprofit organization that sporadically crawls websites on the Internet. In 2022 and 2023, this organization crawled approximately 2 - 3 billion web pages each month. Google provides a clean subset of Common Crawl called the Colossal Clean Crawled Corpus, or C4 for short.

The data quality of Common Crawl, and by extension C4, is questionable -- think clickbait, misinformation, propaganda, conspiracy theories, racism, misogyny, and every sketchy website you’ve ever seen or avoided on the Internet. A study by the Washington Post shows that the 1,000 most common websites in the dataset include several media outlets that rank low on NewsGuard’s independent scale for trustworthiness. In layman’s terms, Common Crawl contains plenty of fake news.

Yet, due solely to its availability, variations of Common Crawl are used to train most LLMs that disclose the source of their training data, including OpenAI’s GPT-3 and Google’s Bard. It’s unclear if Common Crawl was used for the newer models GPT-4 and Gemini. To avoid scrutiny both from the public and competitors, these companies have stopped disclosing the sources of their training data.

Some teams use heuristics to increase the quality of their training data. For example, OpenAI scraped all outbound links from Reddit that received at least 3 karma to train GPT-2. While this does help screen out links that nobody cares about, Reddit isn’t exactly the pinnacle of propriety and good taste.
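A heuristic like this is easy to express in code. The sketch below is only an illustration of threshold-based filtering, not OpenAI's actual pipeline: the `Link` record, the `filter_by_karma` helper, and the sample URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A hypothetical record for an outbound link and the karma it received."""
    url: str
    karma: int


def filter_by_karma(links, min_karma=3):
    """Keep only the links whose karma is at least `min_karma`."""
    return [link for link in links if link.karma >= min_karma]


# Made-up sample data, for illustration only.
links = [
    Link("https://example.com/a", 12),
    Link("https://example.com/b", 1),
    Link("https://example.com/c", 3),
]

kept = filter_by_karma(links)
print([link.url for link in kept])
```

The threshold filters on popularity only. It says nothing about whether the linked content is accurate or well written.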

The “using what we have, not what we want” approach results in models that do well on the tasks covered in the training data, but not the tasks that we care about. To correct this, model developers started curating their training data to design models for specific tasks. In general, these curated models are aimed at tasks that fall into one of the two buckets: multilingual and domain-specific.

Before examining each bucket, keep in mind that these specialized models don’t always have to be trained from scratch. It’s common to take a general-purpose model and finetune it for a specific task. The closer the base model’s specialization is to your task, the less adaptation work you’ll have to do.

Some might wonder why not just train a model on all data available, both general Internet data and specialized data, so that the model can do everything. This is what many people do. However, more data doesn’t always lead to better performance. For example, a model trained with a smaller amount of high-quality data might outperform a model trained with a large amount of low-quality data. Neural networks can also forget things, so they might forget how to do your task after learning how to do other tasks3.

Multilingual Models

English dominates the Internet. An analysis of the dataset Common Crawl shows that English accounts for almost half of the dataset (45.88%), which is almost 8 times more than the second-most common language (Russian - 5.97%) (Lai et al., 2023). See Figure 2-1 for a list of languages with at least 1% in Common Crawl. Languages with limited availability as training data -- typically languages not included in this list -- are considered low-resource.

Figure 2-1. The most common languages in the Common Crawl dataset. Source: Lai et al., 2023

Some languages that are severely underrepresented in Common Crawl, despite having a lot of speakers today, are shown in Table 2-1. Ideally, the ratio between world population representation and Common Crawl representation should be 1. The higher this ratio, the more underrepresented this language is in Common Crawl.

Table 2-1. Examples of languages that are underrepresented in Common Crawl. The last row, English, is for comparison. The numbers for % in Common Crawl are taken from Lai et al., 2023
Language | Speakers (million) | % world population (a) | % in Common Crawl | Ratio: World population / Common Crawl
English | 1452 | 18.15% | 45.88% | 0.40

a World population of 8 billion was used for this calculation.
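The ratio in the last column can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name `underrepresentation_ratio` is my own, and the 8 billion world population is the figure stated in the table note.

```python
def underrepresentation_ratio(speakers_million, common_crawl_pct,
                              world_population_million=8000):
    """Ratio of a language's share of world population to its share of Common Crawl.

    A value above 1 means the language is underrepresented in Common Crawl;
    a value below 1 means it is overrepresented.
    """
    world_pct = 100 * speakers_million / world_population_million
    return world_pct / common_crawl_pct


# English row from Table 2-1: 1,452 million speakers, 45.88% of Common Crawl.
english = underrepresentation_ratio(1452, 45.88)
print(f"{english:.2f}")  # 0.40
```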

Studies have shown that general-purpose models don’t work as well for non-English languages. For example, on the MMLU benchmark, a suite of 14,000 multiple-choice problems spanning 57 subjects, GPT-4 performed much better in English than under-represented languages like Telugu, as shown in Figure 2-2. This is consistent with Lai et al.’s finding that “ChatGPT’s performance is generally better for English than for other languages, especially for higher-level tasks that require more complex reasoning abilities (e.g., named entity recognition, question answering, common sense reasoning, and summarization).”

Underrepresentation isn’t the only reason for this underperformance. The structure of a language itself and the culture that a language embodies can also make a language harder or easier for a model to learn. However, underrepresentation is a big reason. The three languages that have the worst performance in GPT-4’s MMLU benchmarks -- Telugu, Marathi, and Punjabi -- are also among the languages that are most underrepresented in Common Crawl.

Figure 2-2. On the MMLU benchmark, GPT-4 performs better in English than in any other language. Source: OpenAI. To obtain MMLU in other languages, OpenAI translated the questions using Azure Translate.

Similarly, when tested on six math problems on Project Euler, Yennie Jun found that GPT-4 was able to solve problems in English more than three times as often compared to Armenian or Farsi. GPT-4 failed in all six questions for Burmese and Amharic, as shown in Figure 2-3.

Figure 2-3. GPT-4 is much better at math in English than in other languages. Source: GPT-4 can solve math problems — but not in all languages

Models today are pretty good at translating from one language to another. Can we just translate all queries from other languages into English, feed them into these models, and translate the responses into the original language? Many people follow this approach, but it’s not ideal. First, this requires us to have models that can understand underrepresented languages to be able to translate. Second, translation can cause information loss. For example, some languages like Vietnamese have pronouns to denote the relationship between the two speakers. When translating into English, all these pronouns are translated into I and you, causing the loss of the relationship information.

Models can also have unexpected performance challenges in non-English languages. For example, NewsGuard found that ChatGPT is more willing to produce misinformation in Chinese than in English. In April 2023, NewsGuard fed ChatGPT-3.5 seven prompts each in English, simplified Chinese, and traditional Chinese, asking ChatGPT to produce misinformation articles about China. For English, ChatGPT declined to produce false claims for six out of seven prompts. However, it produced false claims in simplified Chinese and traditional Chinese all seven times. It’s unclear what causes this difference in behavior. It may be due to bias in training data. Perhaps OpenAI just didn’t use as much data in Chinese or data about China-related misinformation narratives to train their models.

Other than quality issues, models can also be slower and more expensive for non-English languages. How fast a model can process and respond to a prompt depends on how long the prompt and response are. As discussed later in the section “Model Size,” the length of a text can be measured using the number of tokens generated through a model’s tokenization process. It turns out that the same model’s tokenization can be much more efficient for some languages than for others.

Benchmarking GPT-4 on MASSIVE, a dataset of 1 million short texts translated across 52 languages and 18 domains, Yennie Jun found that to convey the same meaning, languages like Burmese and Hindi require a lot more tokens than English or Spanish. The result is shown in Figure 2-4. The median token length for MASSIVE in English is 7, but the median length for the same data in Hindi is 32, and in Burmese, it’s a whopping 72, 10 times longer than in English.

Assuming that the time it takes to generate a token is the same in all languages, this means that to generate the same content, GPT-4 takes approximately 10 times longer in Burmese than in English. As of writing, OpenAI API is priced per the number of input tokens and output tokens, which also means that Burmese costs 10 times more than English.
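The back-of-the-envelope comparison can be written down directly. The sketch below uses only the median token lengths quoted above. Like the text, it assumes that time and price scale linearly with the number of tokens, and the names `MEDIAN_TOKENS` and `relative_cost` are illustrative.

```python
# Median token lengths on MASSIVE, as quoted in the text.
MEDIAN_TOKENS = {"English": 7, "Hindi": 32, "Burmese": 72}


def relative_cost(language, baseline="English"):
    """How many times more tokens (and so time and price) than the baseline."""
    return MEDIAN_TOKENS[language] / MEDIAN_TOKENS[baseline]


for language in MEDIAN_TOKENS:
    print(f"{language}: {relative_cost(language):.1f}x")
```

With these numbers, Hindi comes out at roughly 4.6 times and Burmese at roughly 10.3 times the English token count.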

Figure 2-4. Source: All languages are NOT created (tokenized) equal (Yennie Jun, 2023)

To address this, many models have been trained to focus on non-English languages. The most active language, other than English, is perhaps Chinese, with ChatGLM, YAYI, LLaMA2-Chinese, and others. There are also models in Vietnamese (PhoGPT), Portuguese (Cabrita), Japanese (Rinna’s model), Arabic (Jais), Bahasa Indonesia (Sealion), and more.

Domain-Specific Models

Technical reports on recent models like Gemini and GPT-4V show that they can perform incredibly well for a wide range of domains, including but not limited to coding, law, science, business, sports, and environmental science. The Washington Post’s analysis of Common Crawl also found that a wide range of domains are present in the dataset, as shown in Figure 2-5.

Figure 2-5. Distribution of domains in the C4 dataset. Reproduced from the statistics from Inside the secret list of websites that make AI like ChatGPT sound smart. One caveat of this analysis is that it only shows the categories that are included, not the categories missing.

As of this writing, there haven’t been many analyses of domain distribution in vision data. This might be because it’s harder to categorize images than texts. For texts, you can use domain keywords as heuristics, but there are no obvious heuristics for images. Most analyses I could find about vision datasets are about image sizes, resolutions, or video lengths.

Often, benchmarks can give us clues into what domains a model is good for. Figure 2-6 shows how two models, CLIP and Open CLIP, perform on different benchmarks. These benchmarks show how well these two models do on birds, flowers, cars, and a few more categories, but the world is so much bigger and more complex than these few categories.

Even though popular foundation models today can answer everyday questions about different domains, they are unlikely to perform well on domain-specific tasks, especially if these tasks involve domain-specific data that these models never saw during training.

Two examples of domain-specific tasks are drug discovery and cancer screening. Drug discovery involves protein, DNA, and RNA data, which are expensive to acquire, follow specific formats, and are unlikely to be included in Internet data. Cancer screening typically involves X-ray and fMRI scans, which are hard to obtain due to privacy and are therefore unlikely to be included in the training data of general-purpose models.

One of the most famous domain-specific models is perhaps DeepMind’s AlphaFold, trained on the sequences and 3D structures of around 100,000 known proteins (2021). NVIDIA’s BioNeMo is another model that focuses on biomolecular data for drug discovery (2023). Google’s Med-PaLM 2 combined the power of an LLM with medical data to answer medical queries with higher accuracy (2023).


With training data figured out, we can start training our model. In pre-training, a model is trained from scratch, typically using self-supervision. This is where you get to choose a model architecture and model size.

Out of the two phases of training, pre-training is more resource-intensive by a long shot. For the InstructGPT model, pre-training takes up 98% of the overall compute and data resources, and takes a long time to do. A small mistake during pre-training can incur a significant financial loss and set back the project significantly. Due to the resource-intensive nature of pre-training, this has become an art that only a few practice. Those with expertise in pre-training large models, however, are heavily sought after4.

Model Architecture

As of this writing, the most dominant architecture for language-based foundation models is the transformer architecture (Vaswani et al., 2017), which is based on the attention mechanism. Since most people won’t need to know the inner workings of this architecture to build applications on top of transformer-based models, I won’t go into the math behind it.

To understand transformers, let’s look at the problem they were created to solve. The transformer architecture was popularized on the heels of the success of the seq2seq (sequence-to-sequence) architecture. At the time of its introduction in 2014, seq2seq provided significant improvement on then-challenging tasks: machine translation and summarization. In 2016, Google incorporated seq2seq into Google Translate, an update that they claimed had given them the “largest improvements to date for machine translation quality”. This generated a lot of interest in seq2seq, making it the go-to architecture for many people for any task involving sequences of text.

At a high level, seq2seq contains an encoder that processes inputs and a decoder that generates outputs. Both inputs and outputs are sequences of tokens, hence the name. Seq2seq uses RNNs (Recurrent Neural Networks) as its encoder and decoder. In its most basic form, the encoder processes the input tokens sequentially, outputting the final hidden state that represents the input. The decoder then generates output tokens sequentially, conditioned on both the final hidden state of the input and the previously generated token. A visualization of the seq2seq architecture is shown in Figure 2-7.

Figure 2-7. Seq2seq architecture vs. transformer architecture. For the transformer architecture, the arrows show the tokens that the decoder attends to when generating each output token.

There are two problems with seq2seq that the 2017 transformer paper addresses. First, the vanilla seq2seq decoder generates output tokens using only the final hidden state of the input. Intuitively, this is like generating answers about a book using the book summary. This limits the quality of the generated outputs. On the other hand, the transformer decoder uses the attention mechanism to look at any input token, which is like generating answers by referencing any page in the book. A simplified visualization of the transformer architecture is shown in Figure 2-7.
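For readers who want a concrete picture of "looking at any input token," here is a minimal pure-Python sketch of scaled dot-product attention for a single decoder query. It is a toy illustration rather than a full transformer layer: it has no learned projections, no multiple heads, and no masking, and the vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all input positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # The context vector is a weighted average of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context


# Three input tokens, each with a toy 2-dimensional key and value.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, context = attend([1.0, 0.0], keys, values)
print(weights)
print(context)
```

Every input position receives some weight, so the decoder can draw on any part of the input and is not limited to a single final hidden state.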

Second, the RNN encoder and decoder mean that both input processing and output generation are done sequentially, making it slow for long sequences. If an input is 200 tokens long, seq2seq has to wait for each input token to finish processing before moving on to the next. The transformer architecture dispenses with RNNs entirely. With the transformer, the input tokens can be processed at the same time, in parallel, significantly speeding up input processing. While the transformer removes the sequential input bottleneck, we still have the sequential output bottleneck due to the nature of autoregressive models. Overcoming the sequential output bottleneck will be discussed in Chapter 7.

Note that while the attention mechanism is often associated with the transformer model, it was introduced 3 years before the transformer paper5. The attention mechanism can be used with other architectures too. Google used the attention mechanism with their seq2seq architecture in 2016 for their GNMT (Google Neural Machine Translation) model. However, it wasn’t until the transformer paper showed that the attention mechanism could be used without RNNs that it took off.

While transformers are the dominant architecture today, there are non-transformer text-based foundation models. One popular model is RWKV, an RNN-based model that can be parallelized for training. Due to its RNN nature, in theory, it doesn’t have the same context length limitation that transformer-based models have6. However, in practice, having no context length limitation doesn’t guarantee good performance with long context length.

Since the AlexNet7 paper revived the interest in deep learning in 2012, many architectures have gone in and out of fashion. Seq2seq was in the limelight for 4 years (2014 - 2018). GAN (Generative Adversarial Network) captured the collective imagination a bit longer (2014 - 2019). Compared to architectures that came before it, the transformer is sticky. It’s been around since 2017 and still going strong. How long until something better comes along?

Developing a new architecture to outperform transformers isn’t easy. The transformer has been heavily optimized since 2017. A new architecture that aims to replace the transformer will have to perform at the scale that people care about, on the hardware that people care about8.

Model Size

Much of AI progress in recent years can be attributed to an increase in model size. It’s hard to talk about foundation models without talking about their number of parameters. The number of parameters is usually appended at the end of a model name. For example, LLaMA-13B refers to the version of LLaMA, a model family developed by Meta, that has 13 billion parameters.

In general, models with more parameters perform better. Given two models of similar architecture, the one with 13 billion parameters is likely to perform much better than the one with 7 billion parameters.

The number of parameters helps us estimate the compute resources needed to train and run this model. For example, if a model has 7 billion parameters, and each parameter is stored using 1 byte (8-bit), then we can calculate that the GPU memory needed to do inference using this model will be at least 7 billion bytes (7 GB)9.
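This estimate is simple enough to script. The sketch below computes only the lower bound described above, which is the memory needed to hold the weights. The helper name `inference_memory_gb` is mine, and the 2-bytes-per-parameter case is an added illustration for 16-bit weights, not a figure from the text.

```python
def inference_memory_gb(num_params, bytes_per_param):
    """Lower bound on memory needed to hold the weights, in GB (10^9 bytes).

    Real inference needs more than this (activations, caches, and so on),
    so treat the result as a floor, not a requirement.
    """
    return num_params * bytes_per_param / 1e9


print(inference_memory_gb(7e9, 1))  # 7.0  (8-bit weights, as in the text)
print(inference_memory_gb(7e9, 2))  # 14.0 (16-bit weights, an added example)
```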

The number of parameters can be misleading if the model is sparse. Sparse models refer to models with a large percentage of zero-value parameters. A 7B-parameter model that is 90% sparse only has 10%, or 700 million, non-zero parameters. Sparsity allows for more efficient data storage and computation. So a larger but sparse model can require less compute than a smaller but dense model.

A type of sparse model that has gained popularity in recent years is mixture-of-experts10 (MoEs). An MoE model is divided into different groups of parameters, and each group is an expert. Only a subset of the experts is used to process each token. We say that only a subset of the model parameters is active for a given token.

For example, Mixtral 8x7B is a mixture of 8 experts, each expert with 7 billion parameters. If no two experts share any parameter, it should have 7 x 8 billion = 56 billion parameters. However, due to some parameters being shared, it has only 46.7 billion parameters.

At each layer, for each token, only 2 experts are used, so only 12.9 billion parameters are active for each token. While this model has 46.7 billion parameters, its cost and speed are therefore the same as those of a 12.9-billion-parameter model.
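To make "only 2 experts are used" concrete, here is a minimal sketch of top-k routing. A gate assigns a score to each expert, the two highest-scoring experts are selected, and their scores are normalized into mixing weights. The gate scores are made up, and `top_k_route` illustrates the general idea without claiming to reproduce any particular model's implementation.

```python
import math


def top_k_route(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    m = max(gate_scores[i] for i in top)
    exps = {i: math.exp(gate_scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}


# Hypothetical gate scores for 8 experts, for a single token.
scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2]
routing = top_k_route(scores)
print(sorted(routing))  # [1, 4]: only these two experts process the token

# Fraction of parameters active per token, using the numbers from the text.
active_fraction = 12.9 / 46.7
print(f"{active_fraction:.0%}")  # 28%
```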

A larger model can also underperform a smaller model if it’s not trained on enough data. Imagine a 13B-param model trained on a dataset consisting of a single sentence: “I like pineapples.” This model will perform much worse than a much smaller model trained on more data.

When discussing model size, it’s important to consider the size of the data it was trained on. For most models, dataset sizes are measured by the number of training samples. For example, Google’s Flamingo (2022) was trained using four datasets: one of them has 1.8 billion (image, text) pairs and one has 312 million (image, text) pairs.

For language models, a training sample can be a sentence, a Wikipedia page, a chat conversation, or a book. A book is worth a lot more than a sentence, so the number of training samples is no longer a good metric to measure dataset sizes. A better measurement is the number of tokens in the dataset.

As mentioned in Chapter 1, a token is typically a word, a character, or a subword. The number of tokens isn’t a perfect measurement, as different models can have different tokenization processes, resulting in the same dataset having different numbers of tokens for different models. Why not just use the number of words or the number of letters? Because a token is the unit that a model operates on, knowing the number of tokens in a dataset helps us measure how much a model can potentially learn from that data.

As of this writing, LLMs are trained using datasets on the order of trillions of tokens. The datasets used for DeepMind’s Gopher and Meta’s LLaMA contain between 1 and 2 trillion tokens. Together’s open-source dataset RedPajama-v2 has 30 trillion tokens. This is equivalent to 450 million books11 or 5,400 times the size of Wikipedia.

The number of tokens in a model’s dataset isn’t the same as its number of training tokens. The number of training tokens measures the tokens that the model is trained on. If a dataset contains 1 trillion tokens and a model is trained on that dataset for 2 epochs12 -- you can think of an epoch as a pass through the dataset -- the number of training tokens is 2 trillion. See Figure 2-8 for examples of the number of training tokens for models with different numbers of parameters.
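The distinction is a one-line calculation. The helper below is only a restatement of the example above, and the name `training_tokens` is illustrative.

```python
def training_tokens(dataset_tokens, epochs):
    """Tokens the model is trained on: dataset size times passes over it."""
    return dataset_tokens * epochs


tokens = training_tokens(1e12, 2)
print(f"{tokens / 1e12:.1f} trillion")  # 2.0 trillion
```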

Figure 2-8. Examples of the number of training tokens for models with different numbers of parameters. Source: Training Compute-Optimal Large Language Models (DeepMind, 2022)

To train a model, we need compute. One way to measure the amount of compute needed is the number of machines. Examples of machines include GPUs, CPUs, and TPUs. More hardware architectures are being developed. However, different machines have very different capacities and costs. An NVIDIA A10 GPU is very different from an NVIDIA H100 GPU, which is very different from an Intel Ultra Processor.

A more standardized unit for a model’s compute requirement is FLOP, Floating Point Operation. Intuitively, FLOP measures the number of floating point operations performed for a certain task, such as training a model. Google’s largest PaLM-2 model, for example, was trained using 10^22 FLOPs. GPT-3-175B was trained using 3.14 x 10^23 FLOPs.

The plural form of FLOP, FLOPs, is often confused with FLOP/S, Floating Point Operations Per Second. FLOPs measure the compute requirement for a task, whereas FLOP/S measures a machine’s peak performance. For example, NVIDIA’s H100 can deliver a maximum of 60 TeraFLOP/S: 6 x 10^13 FLOPs a second or 5.2 x 10^18 FLOPs a day.


Confusing notations

FLOP/S is often written as FLOPS, which looks similar to FLOPs. To avoid this confusion, some companies, including OpenAI, use FLOP/S-days in place of FLOPs to measure compute requirements.

1 FLOP/S-day = 60 * 60 * 24 = 86400 FLOPs.

This book uses FLOPs for counting floating point operations, and FLOP/S for FLOPs per second.
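The conversions between these units are plain multiplications. Here is a minimal sketch, using the 60 TeraFLOP/S peak figure quoted earlier for the H100. The function names are my own.

```python
SECONDS_PER_DAY = 60 * 60 * 24  # 86,400


def flops_per_day(flop_per_second):
    """Total FLOPs a machine performs in one day at a given FLOP/S rate."""
    return flop_per_second * SECONDS_PER_DAY


def to_flop_s_days(total_flops):
    """Express a FLOPs count in FLOP/S-days."""
    return total_flops / SECONDS_PER_DAY


h100_peak = 60e12  # 60 TeraFLOP/S, the peak figure used in the text
print(f"{flops_per_day(h100_peak):.2e}")  # 5.18e+18, rounded to 5.2 x 10^18 in the text
print(f"{to_flop_s_days(3.14e23):.2e}")   # GPT-3-175B's compute in FLOP/S-days
```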

Assume that you have 256 H100s. If you can use them at their maximum capacity and make no training mistakes, it’d take you (3.14 x 10^23) / (256 x 5.2 x 10^18) = ~236 days, or approximately 7.8 months, to train GPT-3-175B.

However, it’s unlikely you can use your machines at their peak capacity all the time. Utilization measures how much of the maximum compute capacity you can use. What’s considered good utilization depends on the model being trained and the machines. As a rule of thumb, if you can get half the advertised performance, 50% utilization, you’re doing okay. Anything above 70% utilization is considered great. Don’t let this rule of thumb stop you from getting even higher utilization. With 256 H100s and an extremely high utilization of 70%, it’d take you 236 / 70% = ~337 days to train GPT-3-175B.

At this utilization, if you want to train GPT-3-175B in a month, you’d need almost 3,000 H100s. Cloud providers today are offering H100s at around $2 to $5/hour. At $2/hour, training GPT-3-175B would cost over $4 million: $2 x 256 x 24 x 337 = $4,141,056. As compute is getting rapidly cheaper, this number can get much lower.
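The training-time and cost arithmetic can be collected into a small calculator, which makes it easy to try other GPU counts, utilization levels, or prices. This sketch uses the rounded figures from the text (3.14 x 10^23 FLOPs for GPT-3-175B and 5.2 x 10^18 FLOPs per H100 per day). The function names are my own, and the 30-day month is an assumption.

```python
import math

TOTAL_FLOPS = 3.14e23        # GPT-3-175B training compute, from the text
FLOPS_PER_H100_DAY = 5.2e18  # rounded peak FLOPs per H100 per day, from the text


def days_needed(num_gpus, utilization=1.0):
    """Days to finish training with `num_gpus` GPUs at the given utilization."""
    return TOTAL_FLOPS / (num_gpus * FLOPS_PER_H100_DAY * utilization)


def gpus_needed(target_days, utilization=1.0):
    """Smallest whole number of GPUs that finishes within `target_days`."""
    return math.ceil(TOTAL_FLOPS / (target_days * FLOPS_PER_H100_DAY * utilization))


def cost_usd(num_gpus, days, price_per_gpu_hour):
    """Rental cost for `num_gpus` GPUs running around the clock for `days`."""
    return num_gpus * 24 * days * price_per_gpu_hour


print(round(days_needed(256)))        # 236 days at peak capacity
days = round(days_needed(256, 0.70))
print(days)                           # 337 days at 70% utilization
print(gpus_needed(30, 0.70))          # 2876 H100s to finish in a 30-day month
print(cost_usd(256, days, 2))         # 4141056 dollars at $2/hour
```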

Scaling Law: Building Compute-Optimal Models

I hope that the last section has convinced you of three things. One, model performance depends on the model size and the dataset size. Two, bigger models and bigger datasets require more compute. Three, compute costs money.

Unless you have unlimited money, budgeting is essential. You don’t want to start with an arbitrarily large model size and see how much it would cost. You start with a budget -- how much money you want to spend -- and work out the best model performance you can afford. As compute is often the limiting factor, because it’s not only expensive but also hard to set up, it makes sense to start with a compute budget. Given a fixed amount of FLOPs, what model size and dataset size would give the best performance?

Given a compute budget, the rule that helps calculate the optimal model size and dataset size is called the scaling law, proposed in the paper Training Compute-Optimal Large Language Models (DeepMind, 2022). To study the relationship between model size, dataset size, compute budget, and model performance, the authors trained 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.

They found that for compute-optimal training, the number of training tokens needed is approximately 20 times the model size, so a 3B model needs approximately 60B training tokens. The model size and the number of training tokens should be scaled equally: for every doubling of the model size, the number of training tokens should also be doubled.
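The 20-tokens-per-parameter rule is easy to apply in code. The first function below restates that rule. The second goes one step beyond the text by assuming the widely used approximation that training compute is about 6 x parameters x tokens, which lets us back out a model size from a FLOPs budget. Treat that approximation, and the function names, as assumptions of this sketch and not as results quoted from the paper.

```python
import math

TOKENS_PER_PARAM = 20  # rule of thumb from the text


def optimal_tokens(num_params):
    """Compute-optimal number of training tokens for a given model size."""
    return TOKENS_PER_PARAM * num_params


def compute_optimal(flops_budget):
    """Model size and token count for a FLOPs budget.

    Assumes compute C is approximately 6 * N * D (N parameters, D tokens),
    a common approximation that is NOT stated in the text. With D = 20 * N,
    this gives N = sqrt(C / 120).
    """
    num_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    return num_params, TOKENS_PER_PARAM * num_params


print(optimal_tokens(3e9) / 1e9)  # 60.0 (billion tokens for a 3B model)

params, tokens = compute_optimal(1e21)
print(f"{params:.2e} parameters, {tokens:.2e} tokens")
```

Doubling the model size doubles the token count under this rule, which matches the equal-scaling statement above.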

We’ve come a long way from the days when the training process was treated like alchemy. Figure 2-9 shows that we can predict not only the optimal number of parameters and number of tokens for each FLOPs budget but also the expected training loss we can get from these settings (assuming that we do things right).

Figure 2-9. Graphs that depict the relationships between training loss, a model’s number of parameters, FLOPs, and number of training tokens. Source: Training Compute-Optimal Large Language Models (DeepMind, 2022)

When we focus on only the compute budget, we assume that the cost of acquiring data is much cheaper than the cost of compute. If the costs for training data are nontrivial, we can approach the scaling law from a different perspective. We can start from a model’s number of parameters and project the number of FLOPs and training tokens needed for this model’s size, which helps us approximate the compute and data costs. This helps us balance the budget between compute and data. Figure 2-10 shows what this projection looks like.

Figure 2-10. The expected compute budget and training tokens for models of different sizes. Source: Training Compute-Optimal Large Language Models (DeepMind, 2022)

The scaling law guides us in getting the optimal model performance given a compute budget. However, it’s important to remember that for production, model performance isn’t everything. Some models, most notably LLaMA, have sub-optimal performance but better usability. Given their compute budget, LLaMA authors could’ve chosen bigger models that would give better performance, but they opted for smaller models. Smaller models are easier to work with and cheaper to run inference on, which helped their models gain wider adoption.

On the topic of model performance given a compute budget, it’s worth noting that the cost of achieving a given model performance is decreasing. For example, on the ImageNet dataset, the cost to achieve 93% accuracy halved from 2019 to 2021, according to Artificial Intelligence Index Report 2022.

While the cost for the same model performance is decreasing, the cost for model performance improvement remains high. Improving a model’s accuracy from 90 to 95% is more expensive than improving it from 85 to 90%. Meta’s paper Beyond neural scaling laws: beating power law scaling via data pruning (2022) pointed out that this means that a model with a 2% error rate might require an order of magnitude more data, compute, or energy than a model with a 3% error rate. In language modeling, a drop in cross entropy loss13 from about 3.4 to 2.8 nats14 requires 10 times more training data. For large vision models, increasing the number of training samples from 1 billion to 2 billion leads to an accuracy gain on ImageNet of only a few percentage points.

However, small performance changes in language modeling loss or ImageNet accuracy can lead to big differences in the quality of downstream applications. If you’re using a model with a cross entropy loss of 3.4, by switching to using a model with a cross entropy loss of 2.8, you’ll notice the difference.

Scaling Extrapolation

Training a large model is both time-consuming and expensive. For some models, you might have only one shot at getting it right.

The challenge is that the final performance of a model depends heavily on the values of its hyperparameters. Intuitively, you can think of hyperparameters as configurations that determine how a model should be trained. Examples of hyperparameters include the initial learning rate, learning rate schedules, momentum, per-layer initial variance, multiplicative constants after weight/biases, and so on. Whereas the value of a parameter is updated throughout the training process, the value of a hyperparameter is determined at the beginning of the training process.

When working with small models, it’s a common practice to train a model with different sets of hyperparameters and pick the best-performing one. This isn’t possible for large models -- training a large model once is resource-draining enough.

As a result, a new topic of research has emerged recently that tries to predict, for large models, which set of hyperparameters will give the best performance. The current approach is to study the impact of different hyperparameters on models of a range of sizes, usually much smaller than the target model size, and then extrapolate how these hyperparameters would work on the target model size. This approach is also called hyperparameter transferring or scaling extrapolation. A 2022 paper by Microsoft and OpenAI shows that it was possible to transfer hyperparameters from a 40M model to a 6.7B model.

This research topic is still small, as not many people have the experience and resources to study the training of large models. This is also a difficult topic to study due to the sheer number of hyperparameters and how they interact with each other. If you have 10 hyperparameters, you’d have to study on the order of 2^10 = 1,024 different combinations of them: each hyperparameter individually, then two of them together, then three of them together, and so on. Luke Metz, a Google-turned-OpenAI researcher who focuses on the training of large models, wrote an excellent blog post on the challenges of extrapolation: On the Difficulty of Extrapolation with NN Scaling (2022).
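The combination count for 10 hyperparameters can be checked with a few lines. This is a minimal sketch that counts only which hyperparameters are studied together, not which values each one takes: there are 1,023 non-empty subsets, or 1,024 if the baseline where nothing is varied is also counted.

```python
from math import comb

num_hyperparameters = 10

# Number of ways to study k hyperparameters together, for k = 1..10.
subsets_by_size = {
    k: comb(num_hyperparameters, k) for k in range(1, num_hyperparameters + 1)
}
non_empty_subsets = sum(subsets_by_size.values())

print(subsets_by_size[1], subsets_by_size[2], subsets_by_size[3])  # 10 45 120
print(non_empty_subsets)      # 1023
print(non_empty_subsets + 1)  # 1024, counting the baseline with nothing varied
```

Sweeping several candidate values per hyperparameter multiplies this count further, which is why exhaustive studies are impractical even on small models.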

Scaling Bottlenecks

Until now, every order of magnitude increase in model size has led to an increase in model performance. GPT-2 has an order of magnitude more parameters than GPT-1 (1.5 billion vs. 117 million). GPT-3 has two orders of magnitude more than GPT-2 (175 billion vs. 1.5 billion). This means a three-orders-of-magnitude increase in model sizes between 2018 and 2021. Three more orders of magnitude growth would result in 100-trillion-parameter models.
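The orders of magnitude quoted above can be verified from the parameter counts in the text. The helper name below is hypothetical, and the sketch is only an arithmetic check.

```python
import math

# Parameter counts as quoted in the text.
GPT1_PARAMS = 117e6
GPT2_PARAMS = 1.5e9
GPT3_PARAMS = 175e9


def orders_of_magnitude(smaller, larger):
    """Return how many factors of 10 separate two sizes."""
    return math.log10(larger / smaller)


print(f"GPT-1 -> GPT-2: {orders_of_magnitude(GPT1_PARAMS, GPT2_PARAMS):.2f}")
print(f"GPT-2 -> GPT-3: {orders_of_magnitude(GPT2_PARAMS, GPT3_PARAMS):.2f}")
print(f"GPT-1 -> GPT-3: {orders_of_magnitude(GPT1_PARAMS, GPT3_PARAMS):.2f}")

# Three more orders of magnitude on top of GPT-3.
print(f"{GPT3_PARAMS * 1e3:.2e} parameters")
```

The last line gives 1.75e14, that is, on the order of 100 trillion parameters.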

How many more orders of magnitude can model sizes grow? Would there be a point where the model performance plateaus regardless of the model size? While it’s hard to answer these questions, there are already two visible bottlenecks for scaling model sizes: training data and electricity.

Today, a language model like GPT-4 uses so much data that there’s a realistic concern that we’ll run out of Internet data in the next few years. The rate of training dataset size growth is much faster than the rate of new data being generated (Villalobos et al., 2022), as illustrated in Figure 2-11. If you’ve ever put anything on the Internet, you should assume that it already is, or will be, included in the training data for some language models, whether you consent or not. This is similar to how, if you post something on the Internet, you should expect it to be indexed by Google. Some people are leveraging this fact to inject the data they want into the training data of future models, simply by publishing the text they want on the Internet, hoping that it will influence future models to generate the responses they want.

Figure 2-11. Projection of historical trend of training dataset sizes and available data stock. Source: Villalobos et al, 2022

On top of that, the Internet is being rapidly populated with data generated by AI models. If companies continue using Internet data to train future models, these new models might just be trained on AI-generated data.

Once the publicly available data is exhausted, the most feasible path for more training data is by relying on proprietary data. I suspect that any company that somehow gets its hands on a massive amount of proprietary data -- copyrighted books, translations, contracts, medical records, genome sequences, and so forth -- will have a competitive advantage. This is a reason why OpenAI negotiated deals with publishers and media outlets including Axel Springer and Associated Press. It’s not surprising that in light of ChatGPT, many companies, including Reddit and StackOverflow, have changed their data terms to prevent other companies from scraping their data for their models.

The other bottleneck, which is less obvious but more pressing, is electricity. Machines require electricity to run. Today, data centers are estimated to consume 1-2% of global electricity. Until we can figure out a way to produce more energy, data centers can grow at most 50 times (from roughly 2% to 100% of today’s supply), which is less than two orders of magnitude. This leads to a concern about a power shortage in the near future, which will drive up the cost of electricity.


Let’s say that we’ve trained a foundation model using self-supervision. The trained model is often called the pretrained model. It can be further trained for downstream applications.

Due to how pre-training works today, a pretrained model typically has two issues. First, self-supervision optimizes the model for completion, not conversations15. If you find this ambiguous, don’t worry: the example in the Supervised Finetuning section should make it clear. Second, if the model is pretrained on data indiscriminately scraped from the Internet, its outputs can be racist, sexist, rude, or just wrong. Post-training aims to address both of these issues.

Today, post-training typically consists of two steps.

  1. Supervised finetuning (SFT): Finetune the pretrained model on high-quality data to optimize models for conversations, instead of completion.

  2. Alignment: Further finetune the model to output responses that align with human preference. There are many techniques for alignment, such as RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and RLAIF (Reinforcement Learning from AI Feedback).

As post-training consumes a small portion of resources compared to pre-training (2% for InstructGPT), you can think of SFT and alignment as unlocking capabilities that the pretrained model already has but that are hard for users to access via prompting alone.

Figure 2-12 shows the overall workflow of pre-training, SFT, and alignment, assuming you use RLHF for the last step. You can approximate how well a model aligns with human preference by what steps the model creators have taken.

Figure 2-12. The overall training workflow with pretraining, SFT, and RLHF

If you squint, Figure 2-12 looks very similar to the meme Shoggoth with a smiley face (Figure 2-13):

  1. The pretrained model is an untamed monster because it was trained on indiscriminate data scraped from the Internet.

  2. This monster was then finetuned on higher-quality data -- think StackOverflow, Quora, or human annotations -- which makes it somewhat socially acceptable.

  3. Then the finetuned model was further polished using RLHF to make it customer-appropriate -- giving it a smiley face.

Figure 2-13. Shoggoth with Smiley Face. Courtesy of anthrupad.

Note that a combination of pretraining, SFT, and alignment is the popular solution for building foundation models today, but it’s not the only solution. You can skip any of the steps, as you’ll see shortly.

Supervised Finetuning

As discussed in Chapter 1, the pretrained model is likely optimized for completion rather than conversing. If you input into the model, “How to make pizza”, the model will continue to complete this sentence, as the model has no concept that this is supposed to be a conversation. Any of the following three options can be a valid completion:

  • Adding more context to the question: “for a family of six?”

  • Adding follow-up questions: “What ingredients do I need? How much time would it take?”

  • Giving the instructions on how to make pizza.

If the goal is to respond to users appropriately, the correct option is the third one.

We know that a model mimics its training data. To encourage a model to generate the appropriate responses, you can show examples of what appropriate responses should look like. Such examples follow the format (prompt, response) and are called demonstration data. Some people refer to this process as behavior cloning: you demonstrate how the model should behave, and the model clones this behavior.
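To make the (prompt, response) format concrete, here is a minimal sketch of how demonstration data might be represented and flattened into training text. The `Demonstration` class, the `to_training_text` helper, and the "User:/Assistant:" template are all hypothetical; real chat templates differ from model to model. The example pair is taken from Table 2-2.

```python
from dataclasses import dataclass


@dataclass
class Demonstration:
    """One (prompt, response) pair written by a labeler."""
    prompt: str
    response: str


# Hypothetical template, chosen for illustration only.
TEMPLATE = "User: {prompt}\nAssistant: {response}"


def to_training_text(example: Demonstration) -> str:
    """Flatten a demonstration into one string the model learns to complete."""
    return TEMPLATE.format(prompt=example.prompt, response=example.response)


demos = [
    Demonstration(
        prompt=(
            "Serendipity means the occurrence and development of events by "
            "chance in a happy or beneficial way. Use the word in a sentence."
        ),
        response=(
            "Running into Margaret and being introduced to Tom was a "
            "fortunate stroke of serendipity."
        ),
    ),
]

for demo in demos:
    print(to_training_text(demo))
```

Because the model still learns by completion, wrapping each pair in a conversational template is what nudges its completions toward answering rather than continuing the prompt.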

Since different types of requests require different types of responses, your demonstration data should contain a wide range of tasks, such as question answering, summarization, and translation. Figure 2-14 shows a distribution of types of tasks OpenAI used to finetune their model InstructGPT. Note that this distribution doesn’t contain multimodal tasks, as InstructGPT is a text-only model.

Figure 2-14. The distribution of prompts used to finetune InstructGPT. Source: OpenAI.

Good teachers are important for humans to learn. Similarly, good labelers are important for AIs to learn how to conduct intelligent conversations. Unlike traditional data labeling, which can often be done with little or no expertise, demonstration data may contain complex prompts whose responses require critical thinking, information gathering, and judgment about the appropriateness of the user’s requests. Table 2-2 shows examples of (prompt, response) pairs created by labelers for InstructGPT.

Table 2-2. Examples of demonstration data used for InstructGPT

Prompt: Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence.
Labeler’s response: Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.

Prompt: Read the following text and answer the three questions at the end.
[OMITTED FOR LENGTH - Article about Justice Ruth Bader Ginsburg, Champion Of Gender Equality, Dies At 87]
1. …
2. …
3. Why will Ginsburg’s death have profound consequences for the court and the country?
Labeler’s response:
1. …
2. …
3. Ruth Bader Ginsburg was the leader of the liberal wing, and because the court is about to open a new term, the chief justice no longer holds the controlling vote in contested cases.

Prompt: ELI5: What’s the cause of the “anxiety lump” in our chest during stressful or disheartening experiences?
Labeler’s response: The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow. The clenched chest or heartache feeling is caused by the vagus nerve which tells the organs to pump blood faster, stop digesting, and produce adrenaline and cortisol.

Companies like OpenAI and Anthropic, therefore, use highly educated labelers to generate demonstration data. Among those who labeled demonstration data for InstructGPT, ~90% have at least a college degree and more than one-third have a master’s degree. While labeling objects in an image might take only seconds, generating one (prompt, response) pair can take up to 30 minutes, especially for tasks that involve long contexts like summarization. If it costs $10 for one (prompt, response) pair, the 13,000 pairs that OpenAI used for InstructGPT would cost $130,000. That doesn’t yet include the cost of designing the data (what tasks and prompts to include), recruiting labelers, and data quality control.

Not everyone can afford to follow the high-quality human annotation approach. LAION, a non-profit German organization, mobilized 13,500 volunteers worldwide to generate 10,000 conversations, which consist of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings. Since the data was generated by volunteers, there wasn’t much control for biases. In theory, the labelers that teach models the human preference should be representative of the human population. The demographics of LAION’s labelers are skewed. For example, in a self-reported survey, 90% of volunteer labelers identified as male (Köpf et al., 2023).

DeepMind used simple heuristics to filter for conversations from Internet data to train their model Gopher. To be specific, they looked for texts that look like the following format:

[A]: [Short paragraph]

[B]: [Short paragraph]

[A]: [Short paragraph]

[B]: [Short paragraph]

They claimed that their heuristics reliably yield high-quality dialogues.
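A filter for text in this shape could look something like the sketch below. It is not DeepMind’s actual heuristic, which isn’t reproduced here. The regular expression, the `looks_like_dialogue` name, and the thresholds for turn count and turn length are all assumptions made for illustration.

```python
import re

# A turn is a speaker label, optionally in brackets, then a colon and text.
TURN_PATTERN = re.compile(r"^\[?(\w+)\]?:\s+(.+)$", re.DOTALL)


def looks_like_dialogue(text, min_turns=4, max_chars_per_turn=500):
    """Return True if text is a run of short, alternating two-speaker turns."""
    speakers = []
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        match = TURN_PATTERN.match(block)
        # Reject the whole text if any paragraph isn't a short labeled turn.
        if match is None or len(match.group(2)) > max_chars_per_turn:
            return False
        speakers.append(match.group(1))
    if len(speakers) < min_turns or len(set(speakers)) != 2:
        return False
    # Speakers must strictly alternate: A, B, A, B, ...
    return all(a != b for a, b in zip(speakers, speakers[1:]))


sample = "[A]: Hi there.\n\n[B]: Hello!\n\n[A]: How are you?\n\n[B]: Fine, thanks."
print(looks_like_dialogue(sample))                          # True
print(looks_like_dialogue("Just prose.\n\nMore prose."))    # False
```

A heuristic like this is cheap to run over scraped text, but it says nothing about whether the dialogue content itself is high quality.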


On finetuning for dialogues vs. finetuning for following instructions

OpenAI’s InstructGPT is finetuned for following instructions. Each example of demonstration data is a pair of (prompt, response). DeepMind’s Gopher is finetuned for conducting dialogues. Each example of demonstration data consists of multiple turns of back-and-forth dialogue. Instruction following can be seen as a subset of dialogue -- ChatGPT is a powered-up version of InstructGPT.

Technically, you can train a model from scratch on the demonstration data instead of finetuning a pretrained model, effectively eliminating the self-supervised pretraining step. However, the finetuning approach has often returned superior results. Approaches to supervised finetuning will be further discussed in Chapter 6.


With great power come great responsibilities. A model that can assist users in achieving great things can also assist users in achieving terrible things. Demonstration data teaches the model to have a conversation but doesn’t teach the model what kind of conversations it should have. For example, if a user asks the model to write an essay about why one race is inferior or how to hijack a plane, should the model comply?

In both of the above examples, it’s straightforward to most people what a model should do. However, many scenarios aren’t as clear-cut. People from different cultural, political, socioeconomic, gender, and religious backgrounds disagree with each other all the time. How should AI respond to questions about abortion, gun control, the Israel-Palestine conflict, disciplining children, marijuana legality, universal basic income, or immigration? How do we define and detect potentially controversial issues? If your model responds to a controversial issue, whatever the response, you’ll end up upsetting some of your users. If your model is censored too much, it may become boring, driving away users.

Fear of AI models generating inappropriate responses can stop companies from releasing their applications to users. The goal of alignment is to get AI models to behave according to human preference. This is an ambitious, if not impossible, goal. Not only does this assume that universal human preference exists, but it also assumes that it’s possible to embed it into AI16.

Had the goal been simple, the solution could’ve been elegant. However, given the ambitious nature of the goal, the solution we have today is complicated. The earliest successful alignment algorithm, which is still popular today, is RLHF. RLHF consists of two parts:

  • Train a reward model that scores the foundation model’s outputs.

  • Optimize the foundation model to generate responses for which the reward model will give maximal scores.

Reward model

Given a pair of (prompt, response), the reward model (RM) outputs a score for how good the response is. Training a model to score a given input is a common and straightforward ML task. The challenge, similar to SFT, is to obtain reliable data. If we ask labelers to score each response directly, the scores will vary. For the same sample, on a 10-point scale, one labeler might give a 5 and another a 7. Even the same labeler, given the same pair of (prompt, response) twice, might give different scores.

An easier task is to ask labelers to compare two responses and decide which one is better. For each prompt, multiple responses are generated by either humans or AI. The resulting labeled data follows the format (prompt, winning_response, losing_response). This is called comparison data. Table 2-3 shows an example of comparison data used by Anthropic for one of their models. Out of the two responses in this example, I prefer the response labeled as the losing one. This highlights the challenge of trying to capture diverse human preferences in a single mathematical formulation.

Table 2-3. An example of comparison data from Anthropic’s HH-RLHF dataset.

Prompt: How can I get my dog high?
Winning response: I’m not sure what you mean by that.
Losing response: I don’t know that we should get the dog high. I think it’s important for a dog to experience the world in a sober state of mind.

Still, this easier task of comparing two responses takes time. LMSYS found that manually comparing two responses took on average 3-5 minutes, as the process requires fact-checking each response17.

Figure 2-15 shows the UI that OpenAI’s labelers used to create comparison data for the RM of InstructGPT. Labelers give concrete scores from 1 to 7 as well as rank the responses in the order of their preference, but only the ranking is used to train the RM. Their inter-labeler agreement is around 73%, which means if they ask 10 people to rank the same two responses, approximately 7 of them will have the same ranking. To speed up the labeling process, each annotator can rank multiple responses at the same time. A set of 3 ranked responses (A > B > C) will produce 3 ranked pairs: (A > B), (A > C), and (B > C).

Figure 2-15. The interface labelers used to generate comparison data for OpenAI’s InstructGPT.
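Expanding one ranking into comparison pairs takes only a few lines. The sketch below assumes the ranked responses are listed best first and emits tuples in the (prompt, winning_response, losing_response) format described earlier. The function name and the placeholder prompt are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into (prompt, winner, loser) tuples."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked response.
    return [
        (prompt, winner, loser)
        for winner, loser in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs("<prompt>", ["A", "B", "C"])
for pair in pairs:
    print(pair)
```

Ranking K responses at once yields K(K-1)/2 pairs, which is why ranking several responses together is more label-efficient than comparing them two at a time.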

Given only comparison data, how do we train the model to give concrete scores? Similar to how you can get humans to do basically anything with the right incentive, you can get a model to do so given the right objective function. A commonly used function represents the difference in output scores for the winning and losing response. The objective is to maximize this difference. For those interested in the mathematical details, here is the formula:

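  • $r_\theta$: the reward model being trained, parameterized by $\theta$. The goal of the training process is to find $\theta$ for which the loss is minimized.

  • Training data format:

      $x$: prompt

      $y_w$: winning response

      $y_l$: losing response

  • $s_w = r_\theta(x, y_w)$: reward model’s scalar score for the winning response

  • $s_l = r_\theta(x, y_l)$: reward model’s scalar score for the losing response

  • $\sigma$: the sigmoid function

For each training sample $(x, y_w, y_l)$, the loss value is computed as follows:

$$\text{loss}(x, y_w, y_l) = -\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right) = -\log\left(\sigma(s_w - s_l)\right)$$

Goal: find $\theta$ to minimize the expected loss for all training samples:

$$\min_\theta \; -\mathbb{E}_{(x, y_w, y_l)}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$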
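The pairwise loss described above, the negative log of the sigmoid of the score difference, can be sketched in plain Python. This is only an illustration of the loss value given two scalar scores; the function names are hypothetical, and an actual reward model would compute the scores with a neural network and backpropagate through this loss.

```python
import math


def pairwise_reward_loss(score_winning, score_losing):
    """Return -log(sigmoid(score_winning - score_losing)), computed stably."""
    diff = score_winning - score_losing
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def mean_pairwise_loss(score_pairs):
    """Average the loss over (score_winning, score_losing) pairs."""
    losses = [pairwise_reward_loss(s_w, s_l) for s_w, s_l in score_pairs]
    return sum(losses) / len(losses)


# The loss shrinks as the winning response is scored further above the loser.
for s_w, s_l in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    print(f"s_w={s_w}, s_l={s_l}, loss={pairwise_reward_loss(s_w, s_l):.4f}")
```

Minimizing this loss pushes the score gap between winning and losing responses upward, which is how ranking-only labels end up producing a model that outputs concrete scores.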


The RM can be trained from scratch, or finetuned on top of another model, such as the pretrained or SFT model. Finetuning on top of the strongest foundation model seems to give the best performance. Some people believe that the RM should be at least as powerful as the foundation model to be able to score the foundation model’s responses. However, as we’ll see in the next chapter on evaluation, a weak model can judge a stronger model, as judging is believed to be easier than generation.

Finetuning using the reward model

With the trained RM, we further train the SFT model to generate output responses that will maximize the scores by the RM. During this process, prompts are randomly selected from a distribution of prompts, such as existing user prompts. These prompts are input into the model, whose responses are scored by the RM. This training process is often done using reinforcement learning, hence the name RLHF. More specifically, it’s often done with Proximal Policy Optimization (PPO), an algorithm released by OpenAI in 2017.

Empirically, RLHF improves performance compared to SFT alone. However, as of this writing, there are debates on why RLHF works. It’s also unclear whether RLHF mitigates or worsens hallucinations, as discussed in the “Hallucination” section later in this chapter. Because it’s both complex and not theoretically sound, this technique might evolve or go out-of-date as the field matures. If you’re interested in learning more about RLHF, see Appendix A.

Some companies find it okay to skip reinforcement learning altogether. For example, Stitch Fix and Grab find that having the reward model alone is good enough for their applications. They get their models to generate multiple outputs and pick the ones given high scores by their reward models. Reward models, when used for this purpose, are also called verifiers.
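Picking the highest-scoring output among several candidates is often called best-of-N sampling. A minimal sketch follows, in which `generate` and `score` are hypothetical stand-ins for a model call and a reward model. The toy versions below exist only so the example runs; they are not how any company mentioned above implements this.

```python
import itertools


def best_of_n(prompt, generate, score, n=4):
    """Generate n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))


# Toy stand-ins: a "model" that cycles through canned responses and a
# "reward model" that simply prefers longer responses.
_canned = itertools.cycle(["ok", "a longer answer", "mid one"])


def toy_generate(prompt):
    return next(_canned)


def toy_score(prompt, response):
    return len(response)


best = best_of_n("hi", toy_generate, toy_score, n=3)
print(best)  # a longer answer
```

The trade-off is inference cost: each request now requires N generations plus N scoring calls.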

Recall that SFT and RLHF are steps taken to address the problem created by the low quality of data used for pretraining. If one day we have better pretraining data or better ways to train foundation models, we might not need SFT and RLHF at all.


A model constructs its outputs through a process known as sampling, which is sometimes called decoding. This section discusses different sampling strategies and sampling variables including temperature, top-k, and top-p. It’ll then explore how to sample multiple outputs to improve a model’s performance. We’ll also see how the sampling process can be modified to get models to generate responses that follow certain formats and constraints.

Sampling makes AI’s outputs probabilistic. Understanding this probabilistic nature is important for handling AI’s behaviors such as inconsistency and hallucination. This section ends with a deep dive into what this probabilistic nature means and how to work with it.

Sampling Fundamentals

Given an input, a neural network produces an output by first computing the probabilities of possible outcomes. For a classification model, possible outcomes are the available classes. For example, if a model is trained to classify whether an email is spam or not, there are only two possible outcomes: spam and not spam. The model computes the probability of each of these two outcomes. Let’s say the probability of the email being spam is 90%, and not spam is 10%.

To generate the next token, a language model first computes the probability distribution over all tokens in the vocabulary, which looks like Figure 2-16.

Figure 2-16. To generate the next token, the language model first computes the probability distribution over all tokens in the vocabulary.

For the spam email classification task, it’s common to output the value with the highest probability. If the email has a 90% chance of being spam, you can mark the email as spam18. Always picking the most likely outcome or token is called greedy sampling. However, for a language model, greedy sampling creates boring outputs. Imagine a model that, for whatever question you ask, always responds with the most common words.

Instead of always picking the next most likely token, the model can sample the next token according to the probability distribution over all possible values. Given the context of “My favorite color is …” as shown in Figure 2-16, if “red” has a 30% chance of being the next token and “green” has a 50% chance, “red” will be picked 30% of the time, and “green” 50% of the time.
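The difference between greedy sampling and sampling from the distribution can be seen in a small simulation. The probabilities for “green” and “red” come from the example above; the remaining 20% is split between “purple” and “the” purely as an assumption so that the distribution sums to one. The function names are hypothetical.

```python
import random
from collections import Counter

# "green" and "red" follow the text; "purple" and "the" are assumed values.
next_token_probs = {"green": 0.5, "red": 0.3, "purple": 0.15, "the": 0.05}


def greedy_pick(probs):
    """Always return the most likely token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
num_draws = 10_000
counts = Counter(sample_pick(next_token_probs, rng) for _ in range(num_draws))

print("greedy:", greedy_pick(next_token_probs))
for token, count in counts.most_common():
    print(f"{token}: {count / num_draws:.3f}")
```

Greedy sampling returns “green” every time, while the sampled frequencies land close to the underlying probabilities.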


One problem with sampling the next token according to the probability distribution is that the model can be less creative. In the previous example, common words for colors like “red”, “green”, “purple”, and so on have the highest probabilities. The language model’s answer ends up sounding like that of a five-year-old: “My favorite color is green.” Because “the” has a low probability, the model has a low chance of generating a creative sentence such as “My favorite color is the color of a still lake on a spring morning.”

Temperature is a technique used to redistribute the probabilities of the possible values. Intuitively, a higher temperature reduces the probabilities of common tokens and, as a result, increases the probabilities of rarer tokens. This enables models to create more creative responses.

To understand how temperature works, let’s take a step back to see how a model computes the probabilities. Given an input, a neural network processes this input and outputs a logit vector. Each logit corresponds to one possible value. In the case of a language model, each logit corresponds to one token in the model’s vocabulary. The logit vector size is the size of the vocabulary. A visualization of the logits vector is shown in Figure 2-17.

Figure 2-17. For each input, a language model produces a logit vector. Each logit corresponds to a token in the vocabulary.

While larger logits correspond to higher probabilities, the logits don’t represent the probabilities. Logits don’t sum up to one. Logits can even be negative, while probabilities have to be non-negative. To convert logits to probabilities, a softmax layer is often used. Let’s say the model has a vocabulary of size $N$ and the logit vector is $[x_1, x_2, \ldots, x_N]$. The probability for the $i$-th token, $p_i$, is computed as follows:

$$p_i = \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

Temperature is a constant used to adjust the logits before the softmax transformation. Logits are divided by temperature. For a given temperature $T$, the adjusted logit for the $i$-th token is $x_i / T$. Softmax is then applied on this adjusted logit instead of on $x_i$.
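Dividing the logits by a temperature before applying softmax can be sketched as follows. The two-token logit vector [1, 2] is an assumed toy input, and the function name is hypothetical. Subtracting the maximum before exponentiating is a standard numerical-stability trick that doesn’t change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max so exp() never overflows; the ratios are unchanged.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0]  # assumed toy logits for two tokens
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

As the temperature rises, the probability of the less likely token increases and the distribution flattens; as it approaches zero, nearly all the probability mass moves to the token with the largest logit.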

Let’s say the model has a vocabulary of N and the logit vector is