Chapter 2. Developing Foundation Models

Applications with foundation models need foundation models. While you don’t need to know how to develop a model inside and out to use it, a high-level understanding will be useful in helping you make the right decision on what model to use and how to adapt it to your needs.

To understand models, we have to start with data. Foundation models need a lot of data. Knowing what data a model is trained on gives important clues about what it can do. This chapter looks at data from a sourcing perspective: where training data for foundation models typically comes from, and what it means for models’ performance on downstream tasks. Chapter 5 will go into detail on dataset engineering techniques, including tokenization and quality control.

At a macro level, there are two distinct phases of model training: pre-training and post-training1. Pre-training is typically done with self-supervision, which allows the model to be trained using a massive amount of data. The resulting model is generally capable but not necessarily safe to use. Given this model, the goal of post-training is to make it usable in two aspects: follow users’ instructions and align with human preference. For most applications, a model that more aligns with human preference is better2. Post-training is typically done with supervision using higher-quality data. This chapter walks through steps in both of these phases.

Once a model is trained and aligned, it can be used to generate outputs through a process called sampling. Sampling is perhaps among the most underrated concepts of AI engineering, as it explains many seemingly baffling AI behaviors including hallucination and inconsistency. On top of that, choosing the right sampling variables is relatively easy to do and can get a model to generate significantly better outputs. For this reason, sampling is the section that I was the most excited to write about in this chapter.

Concepts covered in this chapter are fundamental. They are important for understanding the rest of the book. However, because these concepts are fundamental, you might already be familiar with them. If you feel confident about them already, feel free to skip them. If you encounter a confusing concept later on, you might want to revisit this chapter.

Training Data Distribution

An AI model is only as good as the data it was trained on. If there’s no Vietnamese in the training data, the model won’t be able to translate from English into Vietnamese. Similarly, if an image classification model only sees animals in its training set, it won’t perform well on photos of plants.

If we want a model to improve for a certain task, we want to include more data for that task in the training data. However, collecting sufficient data for training a large model isn’t easy, and can be expensive. Often, model developers use the data that is available, not the data that they want.

For example, a common source for training data is Common Crawl, created by a nonprofit organization that sporadically crawls websites on the Internet. In 2022 and 2023, this organization crawled approximately 2 - 3 billion web pages each month. Google provides a clean subset of Common Crawl called the Colossal Clean Crawled Corpus, or C4 for short.

The data quality of Common Crawl, and by extension C4, is questionable -- think clickbait, misinformation, propaganda, conspiracy theories, racism, misogyny, and every sketchy website you’ve ever seen or avoided on the Internet. A study by the Washington Post shows that the 1,000 most common websites in the dataset include several media outlets that rank low on NewsGuard’s independent scale for trustworthiness. In layman’s words, Common Crawl contains plenty of fake news.

Yet, due solely to its availability, variations of Common Crawl are used to train most LLMs that disclose the source of their training data, including OpenAI’s GPT-3 and Google’s Bard. It’s unclear if Common Crawl was used for the newer models GPT-4 and Gemini. To avoid scrutiny both from the public and competitors, these companies have stopped disclosing the sources of their training data.

Some teams use heuristics to increase the quality of their training data. For example, OpenAI scraped all outbound links from Reddit that received at least 3 karma to train GPT-2. While this does help screen out links that nobody cares about, Reddit isn’t exactly the pinnacle of propriety and good taste.

The “using what we have, not what we want” approach results in models that do well on the tasks covered in the training data, but not the tasks that we care about. To correct this, model developers started curating their training data to design models for specific tasks. In general, these curated models are aimed at tasks that fall into one of the two buckets: multilingual and domain-specific.

Before examining each bucket, keep in mind that these specialized models don’t always have to be trained from scratch. It’s common to take a general-purpose model and finetune it for a specific task. The closer the base model’s specialization is to your task, the less adaptation work you’ll have to do.

Some might wonder why not just train a model on all data available, both general Internet data and specialized data, so that the model can do everything. This is what many people do. However, more data doesn’t always lead to better performance. For example, a model trained with a smaller amount of high-quality data might outperform a model trained with a large amount of low-quality data. Neural networks can also forget things, so they might forget how to do your task after learning how to do other tasks3.

Multilingual Models

English dominates the Internet. An analysis of the dataset Common Crawl shows that English accounts for almost half of the dataset (45.88%), which is almost 8 times more than the second-most common language (Russian - 5.97%) (Lai et al., 2023). See Figure 2-1 for a list of languages with at least 1% in Common Crawl. Languages with limited availability as training data -- typically languages not included in this list -- are considered low-resource.

Figure 2-1. The most common languages in the Common Crawl dataset. Source: Lai et al., 2023

Some languages that are severely underrepresented in Common Crawl, despite having a lot of speakers today, are shown in Table 2-1. Ideally, the ratio between world population representation and Common Crawl representation should be 1. The higher this ratio, the more underrepresented this language is in Common Crawl.

Table 2-1. Examples of languages that are underrepresented in Common Crawl. The last row, English, is for comparison. The numbers for % in Common Crawl are taken from Lai et al., 2023

Language | Speakers (million) | % world population (a) | % in Common Crawl | Ratio: % world population / % in Common Crawl
Punjabi | 113 | 1.41% | 0.0061% | 231.56
Swahili | 71 | 0.89% | 0.0077% | 115.26
Urdu | 231 | 2.89% | 0.0274% | 105.38
Kannada | 64 | 0.80% | 0.0122% | 65.57
Telugu | 95 | 1.19% | 0.0183% | 64.89
Gujarati | 62 | 0.78% | 0.0126% | 61.51
Marathi | 99 | 1.24% | 0.0213% | 58.10
Bengali | 272 | 3.40% | 0.0930% | 36.56
English | 1,452 | 18.15% | 45.88% | 0.40

(a) World population of 8 billion was used for this calculation.

Studies have shown that general-purpose models don’t work as well for non-English languages. For example, on the MMLU benchmark, a suite of 14,000 multiple-choice problems spanning 57 subjects, GPT-4 performed much better in English than under-represented languages like Telugu, as shown in Figure 2-2. This is consistent with Lai et al.’s finding that “ChatGPT’s performance is generally better for English than for other languages, especially for higher-level tasks that require more complex reasoning abilities (e.g., named entity recognition, question answering, common sense reasoning, and summarization).”

Underrepresentation isn’t the only reason for this underperformance. The structure of a language itself and the culture that a language embodies can also make a language harder or easier for a model to learn. However, underrepresentation is a big reason. The three languages that have the worst performance in GPT-4’s MMLU benchmarks -- Telugu, Marathi, and Punjabi -- are also among the languages that are most underrepresented in Common Crawl.

Figure 2-2. On the MMLU benchmark, GPT-4 performs better in English than in any other language. Source: OpenAI. To obtain MMLU in other languages, OpenAI translated the questions using Azure Translate.

Similarly, when tested on six math problems from Project Euler, Yennie Jun found that GPT-4 solved problems in English more than three times as often as in Armenian or Farsi. GPT-4 failed all six questions for Burmese and Amharic, as shown in Figure 2-3.

Figure 2-3. GPT-4 is much better at math in English than in other languages. Source: GPT-4 can solve math problems — but not in all languages

Models today are pretty good at translating from one language to another. Can we just translate all queries from other languages into English, feed them into these models, and translate the responses back into the original language? Many people follow this approach, but it’s not ideal. First, it requires a model that understands the underrepresented language well enough to translate it in the first place. Second, translation can cause information loss. For example, some languages, like Vietnamese, have pronouns that denote the relationship between the two speakers. When translating into English, all these pronouns are collapsed into I and you, losing the relationship information.

Models can also have unexpected performance challenges in non-English languages. For example, NewsGuard found that ChatGPT is more willing to produce misinformation in Chinese than in English. In April 2023, NewsGuard fed ChatGPT-3.5 seven prompts each in English, simplified Chinese, and traditional Chinese, asking ChatGPT to produce misinformation articles about China. For English, ChatGPT declined to produce false claims for six out of seven prompts. However, it produced false claims in simplified Chinese and traditional Chinese all seven times. It’s unclear what causes this difference in behavior. It may be due to bias in training data. Perhaps OpenAI just didn’t use as much data in Chinese or data about China-related misinformation narratives to train their models.

Other than quality issues, models can also be slower and more expensive for non-English languages. How fast a model can process and respond to a prompt depends on how long the prompt and response are. The length of a text is measured in tokens, the units produced by a model’s tokenization process. It turns out that the same model’s tokenization can be much more efficient for some languages than for others.

Benchmarking GPT-4 on MASSIVE, a dataset of 1 million short texts translated across 52 languages and 18 domains, Yennie Jun found that to convey the same meaning, languages like Burmese and Hindi require a lot more tokens than English or Spanish. The result is shown in Figure 2-4. The median token length for MASSIVE in English is 7, but the median length for the same data in Hindi is 32, and in Burmese, it’s a whopping 72, 10 times longer than in English.

Assuming that the time it takes to generate a token is the same in all languages, this means that to generate the same content, GPT-4 takes approximately 10 times longer in Burmese than in English. As of writing, OpenAI API is priced per the number of input tokens and output tokens, which also means that Burmese costs 10 times more than English.

Figure 2-4. Source: All languages are NOT created (tokenized) equal (Yennie Jun, 2023)
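If you want to see this effect on your own data, you can count tokens directly with a tokenizer library. Below is a small sketch using the open-source tiktoken library; the sentences are made-up examples, and the exact counts will vary with the tokenizer and the text you choose.

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the tokenizer used by several recent OpenAI models.
encoder = tiktoken.get_encoding("cl100k_base")

# Hypothetical sentences with roughly the same meaning.
texts = {
    "English": "What is the weather like today?",
    "Vietnamese": "Thời tiết hôm nay như thế nào?",
}

for language, text in texts.items():
    num_tokens = len(encoder.encode(text))
    print(f"{language}: {len(text)} characters -> {num_tokens} tokens")
```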

To address this, many models have been trained to focus on non-English languages. The most active language, other than English, is perhaps Chinese, with ChatGLM, YAYI, LLaMA2-Chinese, and others. There are also models in Vietnamese (PhoGPT), Portuguese (Cabrita), Japanese (Rinna’s model), Arabic (Jais), Bahasa Indonesia (Sealion), and more.

Domain-Specific Models

Technical reports on recent models like Gemini and GPT-4V show that they can perform incredibly well for a wide range of domains, including but not limited to coding, law, science, business, sports, and environmental science. The Washington Post’s analysis of Common Crawl also found that a wide range of domains are present in the dataset, as shown in Figure 2-5.

Figure 2-5. Distribution of domains in the C4 dataset. Reproduced from the statistics from Inside the secret list of websites that make AI like ChatGPT sound smart. One caveat of this analysis is that it only shows the categories that are included, not the categories missing.

As of this writing, there haven’t been many analyses of domain distribution in vision data. This might be because it’s harder to categorize images than text. For text, you can use domain keywords as heuristics, but there are no obvious heuristics for images. Most analyses I could find about vision datasets are about image sizes, resolutions, or video lengths.

Often, benchmarks can give us clues into what domains a model is good for. Figure 2-6 shows how two models, CLIP and Open CLIP, perform on different benchmarks. These benchmarks show how well these two models do on birds, flowers, cars, and a few more categories, but the world is so much bigger and more complex than these few categories.

Even though popular foundation models today can answer everyday questions about different domains, they are unlikely to perform well on domain-specific tasks, especially if these tasks involve domain-specific data that these models never saw during training.

Two examples of domain-specific tasks are drug discovery and cancer screening. Drug discovery involves protein, DNA, and RNA data, which is expensive to acquire, follows specialized formats, and is unlikely to be part of Internet data. Cancer screening typically involves X-ray and fMRI scans, which are hard to obtain due to privacy concerns and are, therefore, unlikely to be included in the training data of general-purpose models.

One of the most famous domain-specific models is perhaps DeepMind’s AlphaFold, trained on the sequences and 3D structures of around 100,000 known proteins (2021). NVIDIA’s BioNeMo is another model that focuses on biomolecular data for drug discovery (2023). Google’s Med-PaLM 2 combined the power of an LLM with medical data to answer medical queries with higher accuracy (2023).

Pre-training

With training data figured out, we can start training our model. In pre-training, a model is trained from scratch, typically using self-supervision. This is where you get to choose a model architecture and model size.

Out of the two phases of training, pre-training is more resource-intensive by a long shot. For the InstructGPT model, pre-training takes up 98% of the overall compute and data resources, and takes a long time to do. A small mistake during pre-training can incur a significant financial loss and set back the project significantly. Due to the resource-intensive nature of pre-training, this has become an art that only a few practice. Those with expertise in pre-training large models, however, are heavily sought after4.

Model Architecture

As of this writing, the most dominant architecture for language-based foundation models is the transformer architecture (Vaswani et al., 2017), which is based on the attention mechanism. Since most people won’t need to know the inner workings of this architecture to build applications on top of transformer-based models, I won’t go into the math behind it.

To understand the transformer, let’s look at the problem it was created to solve. The transformer architecture was popularized on the heels of the success of the seq2seq (sequence-to-sequence) architecture. At the time of its introduction in 2014, seq2seq provided significant improvement on then-challenging tasks: machine translation and summarization. In 2016, Google incorporated seq2seq into Google Translate, an update that they claimed gave them the “largest improvements to date for machine translation quality”. This generated a lot of interest in seq2seq, making it the go-to architecture for any task involving sequences of text.

At a high level, seq2seq contains an encoder that processes inputs and a decoder that generates outputs. Both inputs and outputs are sequences of tokens, hence the name. Seq2seq uses RNNs (Recurrent Neural Networks) as its encoder and decoder. In its most basic form, the encoder processes the input tokens sequentially, outputting the final hidden state that represents the input. The decoder then generates output tokens sequentially, conditioned on both the final hidden state of the input and the previously generated token. A visualization of the seq2seq architecture is shown in Figure 2-7.

Figure 2-7. Seq2seq architecture vs. transformer architecture. For the transformer architecture, the arrows show the tokens that the decoder attends to when generating each output token.

There are two problems with seq2seq that the 2017 transformer paper addresses. First, the vanilla seq2seq decoder generates output tokens using only the final hidden state of the input. Intuitively, this is like generating answers about a book using the book summary. This limits the quality of the generated outputs. On the other hand, the transformer decoder uses the attention mechanism to look at any input token, which is like generating answers by referencing any page in the book. A simplified visualization of the transformer architecture is shown in Figure 2-7.

Second, the RNN encoder and decoder mean that both input processing and output generation are done sequentially, making it slow for long sequences. If an input is 200 tokens long, seq2seq has to wait for each input token to finish processing before moving on to the next. The transformer architecture dispenses with RNNs entirely. With transformers, the input tokens can be processed at the same time, in parallel, significantly speeding up input processing. While the transformer removes the sequential input bottleneck, the sequential output bottleneck remains due to the autoregressive nature of these models. Overcoming the sequential output bottleneck will be discussed in Chapter 7.

Note that while the attention mechanism is often associated with the transformer model, it was introduced 3 years before the transformer paper5. The attention mechanism can be used with other architectures too. Google used the attention mechanism with their seq2seq architecture in 2016 for their GNMT (Google Neural Machine Translation) model. However, it wasn’t until the transformer paper showed that the attention mechanism could be used without RNNs that it took off.
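To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation in the transformer paper. It is illustrative only: real implementations add learned projection matrices, multiple heads, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of V's rows, where the weights
    reflect how well the corresponding query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                                     # (n_queries, d_v)

# Self-attention over 4 input tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(42)
X = rng.normal(size=(4, 8))
output = scaled_dot_product_attention(X, X, X)
print(output.shape)  # (4, 8): each token's new representation attends to all tokens
```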

While transformers are the dominant architecture today, there are non-transformer text-based foundation models. One popular model is RWKV, an RNN-based model that can be parallelized for training. Due to its RNN nature, in theory, it doesn’t have the same context length limitation that transformer-based models have6. However, in practice, having no context length limitation doesn’t guarantee good performance with long context length.

Since the AlexNet7 paper revived the interest in deep learning in 2012, many architectures have gone in and out of fashion. Seq2seq was in the limelight for 4 years (2014 - 2018). GAN (Generative Adversarial Network) captured the collective imagination a bit longer (2014 - 2019). Compared to architectures that came before it, the transformer is sticky. It’s been around since 2017 and still going strong. How long until something better comes along?

Developing a new architecture to outperform transformers isn’t easy. The transformer has been heavily optimized since 2017. A new architecture that aims to replace the transformer will have to perform at the scale that people care about, on the hardware that people care about8.

Model Size

Much of AI progress in recent years can be attributed to an increase in model size. It’s hard to talk about foundation models without talking about their number of parameters. The number of parameters is usually appended at the end of a model name. For example, LLaMA-13B refers to the version of LLaMA, a model family developed by Meta, that has 13 billion parameters.

In general, models with more parameters perform better. Given two models of similar architecture, the one with 13 billion parameters is likely to perform much better than the one with 7 billion parameters.

The number of parameters helps us estimate the compute resources needed to train and run this model. For example, if a model has 7 billion parameters, and each parameter is stored using 1 byte (8-bit), then we can calculate that the GPU memory needed to do inference using this model will be at least 7 billion bytes (7 GB)9.
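This back-of-the-envelope calculation is easy to script. The sketch below only counts the memory for the weights at a few common precisions; actual inference also needs memory for activations and the KV cache, so treat these numbers as lower bounds.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    gb = weight_memory_gb(7e9, bytes_per_param)
    print(f"7B parameters in {precision}: at least {gb:.0f} GB")
# 7B parameters in FP32: at least 28 GB
# 7B parameters in FP16: at least 14 GB
# 7B parameters in INT8: at least 7 GB
```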

The number of parameters can be misleading if the model is sparse. Sparse models refer to models with a large percentage of zero-value parameters. A 7B-parameter model that is 90% sparse only has 10%, or 700 million, non-zero parameters. Sparsity allows for more efficient data storage and computation. So a larger but sparse model can require less compute than a smaller but dense model.

A type of sparse model that has gained popularity in recent years is mixture-of-experts10 (MoEs). An MoE model is divided into different groups of parameters, each group is an expert. Only a subset of the experts are used to process each token. We say that only a subset of the model parameters are active for a given token.

For example, Mixtral 8x7B is a mixture of 8 experts, each expert with 7 billion parameters. If no two experts share any parameter, it should have 7 x 8 billion = 56 billion parameters. However, due to some parameters being shared, it has only 46.7 billion parameters.

At each layer, for each token, only 2 experts are used. This means that only 12.9 billion parameters are active for each token. This means that while this model has 46.7 billion parameters, its cost and speed are the same as a 12.9 billion parameter model.

A larger model can also underperform a smaller model if it’s not trained on enough data. Imagine a 13B-param model trained on a dataset consisting of a single sentence: “I like pineapples.” This model will perform much worse than a much smaller model trained on more data.

When discussing model size, it’s important to consider the size of the data it was trained on. For most models, dataset sizes are measured by the number of training samples. For example, DeepMind’s Flamingo (2022) was trained using four datasets—one of them has 1.8 billion (image, text) pairs and one has 312 million (image, text) pairs.

For language models, a training sample can be a sentence, a Wikipedia page, a chat conversation, or a book. A book is worth a lot more than a sentence, so the number of training samples is no longer a good metric to measure dataset sizes. A better measurement is the number of tokens in the dataset.

As mentioned in Chapter 1, a token is typically a word, a character, or a subword. The number of tokens isn’t a perfect measurement, as different models can have different tokenization processes, resulting in the same dataset having different numbers of tokens for different models. Why not just use the number of words or the number of letters? Because a token is the unit that a model operates on, knowing the number of tokens in a dataset helps us measure how much a model can potentially learn from that data.

As of this writing, LLMs are trained using datasets in the order of trillions of tokens. The datasets used for DeepMind’s Gopher and Meta’s LLaMA contain between 1 and 2 trillion tokens. Together’s open-source dataset RedPajama-v2 has 30 trillion tokens. This is equivalent to 450 million books11 or 5,400 times the size of Wikipedia.

The number of tokens in a model’s dataset isn’t the same as its number of training tokens. The number of training tokens measures the tokens that the model is trained on. If a dataset contains 1 trillion tokens and a model is trained on that dataset for 2 epochs12 -- you can think of an epoch as a pass through the dataset -- the number of training tokens is 2 trillion. See Figure 2-8 for examples of the number of training tokens for models with different numbers of parameters.

Figure 2-8. Examples of the number of training tokens for models with different numbers of parameters. Source: Training Compute-Optimal Large Language Models (DeepMind, 2022)

To train a model, we need compute. One way to measure the amount of compute needed is the number of machines. Examples of machines include GPUs, CPUs, and TPUs. More hardware architectures are being developed. However, different machines have very different capacities and costs. An NVIDIA A10 GPU is very different from an NVIDIA H100 GPU, which is very different from an Intel Ultra Processor.

A more standardized unit for a model’s compute requirement is FLOP, or Floating Point Operation. Intuitively, FLOP measures the number of floating point operations performed for a certain task, such as training a model. Google’s largest PaLM 2 model, for example, was trained using 10²² FLOPs. GPT-3-175B was trained using 3.14 × 10²³ FLOPs.

The plural form of FLOP, FLOPs, is often confused with FLOP/S, Floating Point Operations Per Second. FLOPs measure the compute requirement for a task, whereas FLOP/S measures a machine’s peak performance. For example, NVIDIA’s H100 can deliver a maximum of 60 TeraFLOP/S: 6 × 10¹³ floating point operations a second, or about 5.2 × 10¹⁸ FLOPs a day.

Warning

Confusing notations

FLOP/S is often written as FLOPS, which looks similar to FLOPs. To avoid this confusion, some companies, including OpenAI, use FLOP/S-days in place of FLOPs to measure compute requirements.

1 FLOP/S-day = 60 * 60 * 24 = 86400 FLOPs.

This book uses FLOPs for counting floating point operations, and FLOP/S for FLOPs per second.

Assume that you have 256 H100s. If you can use them at their maximum capacity and make no training mistakes, it’d take you (3.14 × 10²³) / (256 × 5.2 × 10¹⁸) = ~236 days, or approximately 7.8 months, to train GPT-3-175B.

However, it’s unlikely you can use your machines at their peak capacity all the time. Utilization measures how much of the maximum compute capacity you can use. What’s considered good utilization depends on the model being trained and the machines. As a rule of thumb, if you can get half the advertised performance, 50% utilization, you’re doing okay. Anything above 70% utilization is considered great. Don’t let this rule of thumb stop you from getting even higher utilization. With 256 H100s and a high utilization of 70%, it’d take you 236 / 0.7 = ~337 days to train GPT-3-175B.

At this utilization, if you want to train GPT-3-175B in a month, you’d need almost 3,000 H100s. Cloud providers today are offering H100s at around $2 to $5/hour. At $2/hour, training GPT-3-175B would cost over $4 million: $2 x 256 x 24 x 337 = $4,141,056. As compute is getting rapidly cheaper, this number can get much lower.
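The arithmetic in the last few paragraphs is worth wrapping into a small helper so you can plug in your own numbers. The values below (GPT-3’s training FLOPs, an H100’s daily throughput, 70% utilization, $2 per GPU-hour) are the same illustrative figures used in this section, not authoritative benchmarks.

```python
def training_days(total_flops, num_gpus, flops_per_gpu_per_day, utilization):
    """Days needed to perform total_flops on num_gpus at the given utilization."""
    return total_flops / (num_gpus * flops_per_gpu_per_day * utilization)

def training_cost(days, num_gpus, dollars_per_gpu_hour):
    """Total GPU cost for the training run."""
    return days * 24 * num_gpus * dollars_per_gpu_hour

gpt3_flops = 3.14e23            # GPT-3-175B training compute
h100_flops_per_day = 5.2e18     # ~60 TeraFLOP/S sustained for 24 hours

days = training_days(gpt3_flops, num_gpus=256,
                     flops_per_gpu_per_day=h100_flops_per_day, utilization=0.7)
print(f"{days:.0f} days")                       # ~337 days
print(f"${training_cost(days, 256, 2):,.0f}")   # ~$4.1 million at $2/GPU-hour
```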

Scaling Law: Building Compute-Optimal Models

I hope that the last section has convinced you of three things. One, model performance depends on the model size and the dataset size. Two, bigger models and bigger datasets require more compute. Three, compute costs money.

Unless you have unlimited money, budgeting is essential. You don’t want to start with an arbitrarily large model size and see how much it would cost. You start with a budget -- how much money you want to spend -- and work out the best model performance you can afford. As compute is often the limiting factor, because it’s not only expensive but also hard to set up, it makes sense to start with a compute budget. Given a fixed amount of FLOPs, what model size and dataset size would give the best performance?

Given a compute budget, the rule that helps calculate the optimal model size and dataset size is called the scaling law, proposed in the paper Training Compute-Optimal Large Language Models (DeepMind, 2022). To study the relationship between model size, dataset size, compute budget, and model performance, the authors trained 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.

They found that for compute-optimal training, the number of training tokens needed is approximately 20 times the model size, so a 3B model needs approximately 60B training tokens. The model size and the number of training tokens should be scaled equally: for every doubling of the model size, the number of training tokens should also be doubled.
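To turn this rule into numbers, you also need a way to relate FLOPs to model and dataset size. A common approximation, not from the Chinchilla paper itself but widely used alongside it, is that training a dense transformer costs about 6 FLOPs per parameter per training token. Under that assumption, a sketch of compute-optimal sizing looks like this:

```python
import math

def compute_optimal_sizes(flops_budget, tokens_per_param=20, flops_per_param_per_token=6):
    """Solve C ~ 6 * N * D together with D ~ 20 * N for N (params) and D (tokens)."""
    n_params = math.sqrt(flops_budget / (flops_per_param_per_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal_sizes(1e21)  # a hypothetical 10^21 FLOPs budget
print(f"~{n_params / 1e9:.1f}B parameters, ~{n_tokens / 1e9:.0f}B training tokens")
# ~2.9B parameters, ~58B training tokens
```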

We’ve come a long way from the days when the training process was treated like alchemy. Figure 2-9 shows that we can predict not only the optimal number of parameters and number of tokens for each FLOPs budget but also the expected training loss we can get from these settings (assuming that we do things right).

Figure 2-9. Graphs that depict the relationships between training loss, a model’s number of parameters, FLOPs, and number of training tokens. Source: Training Compute-Optimal Large Language Models (DeepMind, 2022)

When we focus on only the compute budget, we assume that the cost of acquiring data is much cheaper than the cost of compute. If the costs for training data are nontrivial, we can approach the scaling law from a different perspective. We can start from a model’s number of parameters and project the number of FLOPs and training tokens needed for this model’s size, which helps us approximate the compute and data costs. This helps us balance the budget between compute and data. Figure 2-10 shows what this projection looks like.

Figure 2-10. The expected compute budget and training tokens for models of different sizes. Source: Training Compute-Optimal Large Language Models (DeepMind, 2022)

The scaling law guides us in getting the optimal model performance given a compute budget. However, it’s important to remember that for production, model performance isn’t everything. Some models, most notably LLaMA, have sub-optimal performance but better usability. Given their compute budget, LLaMA authors could’ve chosen bigger models that would give better performance, but they opted for smaller models. Smaller models are easier to work with and cheaper to run inference on, which helped their models gain wider adoption.

On the topic of model performance given a compute budget, it’s worth noting that the cost of achieving a given model performance is decreasing. For example, on the ImageNet dataset, the cost to achieve 93% accuracy halved from 2019 to 2021, according to Artificial Intelligence Index Report 2022.

While the cost for the same model performance is decreasing, the cost for model performance improvement remains high. Improving a model’s accuracy from 90 to 95% is more expensive than improving it from 85 to 90%. Meta’s paper Beyond neural scaling laws: beating power law scaling via data pruning (2022) pointed out that this means that a model with a 2% error rate might require an order of magnitude more data, compute, or energy than a model with a 3% error rate. In language modeling, a drop in cross entropy loss13 from about 3.4 to 2.8 nats14 requires 10 times more training data. For large vision models, increasing the number of training samples from 1 billion to 2 billion leads to an accuracy gain on ImageNet of only a few percentage points.

However, small performance changes in language modeling loss or ImageNet accuracy can lead to big differences in the quality of downstream applications. If you’re using a model with a cross entropy loss of 3.4, by switching to using a model with a cross entropy loss of 2.8, you’ll notice the difference.

Scaling Extrapolation

Training a large model is both time-consuming and expensive. For some models, you might have only one shot at getting it right.

The challenge is that the final performance of a model depends heavily on the values of its hyperparameters. Intuitively, you can think of hyperparameters as configurations that determine how a model should be trained. Examples of hyperparameters include the initial learning rate, learning rate schedules, momentum, per-layer initial variance, multiplicative constants after weights/biases, and so on. Whereas the value of a parameter is updated throughout the training process, the value of a hyperparameter is set at the beginning of training.

When working with small models, it’s a common practice to train a model with different sets of hyperparameters and pick the best-performing one. This isn’t possible for large models -- training a large model once is resource-draining enough.

As a result, a new topic of research has emerged recently that tries to predict, for large models, which set of hyperparameters will give the best performance. The current approach is to study the impact of different hyperparameters on models of a range of sizes, usually much smaller than the target model size, and then extrapolate how these hyperparameters would work on the target model size. This approach is also called hyperparameter transferring or scaling extrapolation. A 2022 paper by Microsoft and OpenAI shows that it was possible to transfer hyperparameters from a 40M model to a 6.7B model.

This research topic is still small, as not many people have the experience and resources to study the training of large models. This is also a difficult topic to study due to the sheer number of hyperparameters and how they interact with each other. If you have 10 hyperparameters, you’d have to study 1,024 different combinations of them. You would have to study each hyperparameter individually, then two of them together, and three of them together, and so on. Luke Metz, a Google-turned-OpenAI researcher who focuses on the training of large models, wrote an excellent blog post on the challenges of extrapolation: On the Difficulty of Extrapolation with NN Scaling (2022).

Scaling Bottlenecks

Until now, every order of magnitude increase in model size has led to an increase in model performance. GPT-2 has an order of magnitude more parameters than GPT-1 (1.5 billion vs. 117 million). GPT-3 has two orders of magnitude more than GPT-2 (175 billion vs. 1.5 billion). This means a three-orders-of-magnitude increase in model sizes between 2018 and 2021. Three more orders of magnitude growth would result in 100-trillion-parameter models.

How many more orders of magnitude can model sizes grow? Would there be a point where the model performance plateaus regardless of the model size? While it’s hard to answer these questions, there are already two visible bottlenecks for scaling model sizes: training data and electricity.

Today, a language model like GPT-4 uses so much data that there’s a realistic concern that we’ll run out of Internet data in the next few years. The rate of training dataset size growth is much faster than the rate of new data being generated (Villalobos et al, 2022), as illustrated in Figure 2-11. If you’ve ever put anything on the Internet, you should assume that it is already or will be included in the training data for some language models, whether you consent or not. This is similar to how, if you post something on the Internet, you should expect it to be indexed by Google. Some people are leveraging this fact to inject the data they want into the training data of future models, simply by publishing the text they want on the Internet, hoping that it will influence future models to generate the responses they want.

Figure 2-11. Projection of historical trend of training dataset sizes and available data stock. Source: Villalobos et al, 2022

On top of that, the Internet is being rapidly populated with data generated by AI models. If companies continue using Internet data to train future models, these new models might just be trained on AI-generated data.

Once the publicly available data is exhausted, the most feasible path for more training data is by relying on proprietary data. I suspect that any company that somehow gets its hands on a massive amount of proprietary data -- copyrighted books, translations, contracts, medical records, genome sequences, and so forth -- will have a competitive advantage. This is a reason why OpenAI negotiated deals with publishers and media outlets including Axel Springer and the Associated Press. It’s not surprising that in light of ChatGPT, many companies, including Reddit and StackOverflow, have changed their data terms to prevent other companies from scraping their data for their models.

The other bottleneck, which is less obvious but more pressing, is electricity. Machines require electricity to run. Today, data centers are estimated to consume 1-2% of global electricity. Until we can figure out a way to produce more energy, data centers can grow at most 50 times, which is less than two orders of magnitude. This leads to a concern about a power shortage in a near future, which will drive up the cost of electricity.

Post-training

Let’s say that we’ve trained a foundation model using self-supervision. The trained model is often called the pretrained model. It can be further trained for downstream applications.

Due to how pre-training works today, a pretrained model typically has two issues. First, self-supervision optimizes the model for completion, not conversations15. If you find this ambiguous, don’t worry, I hope the example in the Supervised Finetuning section will make it clear. Second, if the model is pretrained on data indiscriminately scraped from the Internet, its outputs can be racist, sexist, rude, or just wrong. Post-training addresses both of these issues.

Today, post-training typically consists of two steps.

  1. Supervised finetuning (SFT): Finetune the pretrained model on high-quality data to optimize models for conversations, instead of completion.

  2. Alignment: Further finetune the model to output responses that align with human preference. There are many techniques for alignment, such as RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and RLAIF (Reinforcement Learning from AI Feedback).

As post-training consumes a small portion of resources compared to pre-training (2% for InstructGPT), you can think of SFT and alignment as unlocking the capabilities that the pretrained model already has but are hard for users to access via prompting alone.

Figure 2-12 shows the overall workflow of pre-training, SFT, and alignment, assuming you use RLHF for the last step. You can approximate how well a model aligns with human preference by what steps the model creators have taken.

Figure 2-12. The overall training workflow with pretraining, SFT, and RLHF

If you squint, Figure 2-12 looks very similar to the meme Shoggoth with a smiley face (Figure 2-13):

  1. The pretrained model is an untamed monster because it was trained on indiscriminate data scraped from the Internet.

  2. This monster was then finetuned on higher-quality data -- think StackOverflow, Quora, or human annotations -- which makes it somewhat socially acceptable.

  3. Then the finetuned model was further polished using RLHF to make it customer-appropriate -- giving it a smiley face.

Figure 2-13. Shoggoth with Smiley Face. Courtesy of anthrupad.

Note that a combination of pretraining, SFT, and alignment is the popular solution for building foundation models today, but it’s not the only solution. You can skip any of the steps, as you’ll see shortly.

Supervised Finetuning

As discussed in Chapter 1, the pretrained model is likely optimized for completion rather than conversing. If you input into the model, “How to make pizza”, the model will continue to complete this sentence, as the model has no concept that this is supposed to be a conversation. Any of the following three options can be a valid completion:

  1. Adding more context to the question: “for a family of six?”

  2. Adding follow-up questions: “What ingredients do I need? How much time would it take?”

  3. Giving the instructions on how to make pizza.

If the goal is to respond to users appropriately, the correct option is 3.

We know that a model mimics its training data. To encourage a model to generate the appropriate responses, you can show examples of what appropriate responses should look like. Such examples follow the format (prompt, response) and are called demonstration data. Some people refer to this process as behavior cloning: you demonstrate how the model should behave, and the model clones this behavior.
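Concretely, demonstration data is often stored as simple (prompt, response) records, for example one JSON object per line. The examples below are hypothetical and only meant to show the shape of the data, not what any particular provider used.

```python
import json

demonstrations = [
    {"prompt": "Summarize the following article in one sentence: ...",
     "response": "The article argues that ..."},
    {"prompt": "Translate into French: Where is the train station?",
     "response": "Où est la gare ?"},
]

# Write the demonstration data in the common JSON Lines format.
with open("demonstrations.jsonl", "w", encoding="utf-8") as f:
    for example in demonstrations:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```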

Since different types of requests require different types of responses, your demonstration data should contain a wide range of tasks, such as question answering, summarization, and translation. Figure 2-14 shows a distribution of types of tasks OpenAI used to finetune their model InstructGPT. Note that this distribution doesn’t contain multimodal tasks, as InstructGPT is a text-only model.

Figure 2-14. The distribution of prompts used to finetune InstructGPT. Source: OpenAI.

Good teachers are important for humans to learn. Similarly, good labelers are important for AIs to learn how to conduct intelligent conversations. Unlike traditional data labeling, which can often be done with little or no expertise, demonstration data may contain complex prompts whose responses require critical thinking, information gathering, and judgment about the appropriateness of the user’s requests. Table 2-2 shows examples of (prompt, response) pairs created by labelers for InstructGPT.

Table 2-2. Examples of demonstration data used for InstructGPT

Prompt: Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence.
Labeler’s response: Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.

Prompt: Read the following text and answer the three questions at the end.
[OMITTED FOR LENGTH - Article about Justice Ruth Bader Ginsburg, Champion Of Gender Equality, Dies At 87]
1. …
2. …
3. Why will Ginsburg’s death have profound consequences for the court and the country?
Labeler’s response:
1. …
2. …
3. Ruth Bader Ginsburg was the leader of the liberal wing, and because the court is about to open a new term, the chief justice no longer holds the controlling vote in contested cases.

Prompt: ELI5: What’s the cause of the “anxiety lump” in our chest during stressful or disheartening experiences?
Labeler’s response: The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow. The clenched chest or heartache feeling is caused by the vagus nerve which tells the organs to pump blood faster, stop digesting, and produce adrenaline and cortisol.

Companies like OpenAI and Anthropic, therefore, use highly educated labelers to generate demonstration data. Among those who labeled demonstration data for InstructGPT, ~90% have at least a college degree and more than one-third have a master’s degree. While labeling objects in an image might take only seconds, generating one (prompt, response) pair can take up to 30 minutes, especially for tasks that involve long contexts like summarization. If it costs $10 for one (prompt, response) pair, the 13,000 pairs that OpenAI used for InstructGPT would cost $130,000. That doesn’t yet include the cost of designing the data (what tasks and prompts to include), recruiting labelers, and data quality control.

Not everyone can afford to follow the high-quality human annotation approach. LAION, a German non-profit organization, mobilized 13,500 volunteers worldwide to generate 10,000 conversations, consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings. Since the data was generated by volunteers, there wasn’t much control for biases. In theory, the labelers that teach models human preference should be representative of the human population. The demographics of LAION’s labelers are skewed. For example, in a self-reported survey, 90% of volunteer labelers identified as male (Köpf et al., 2023).

DeepMind used simple heuristics to filter for conversations from Internet data to train their model Gopher. To be specific, they looked for texts that look like the following format:

[A]: [Short paragraph]

[B]: [Short paragraph]

[A]: [Short paragraph]

[B]: [Short paragraph]

They claimed that their heuristics reliably yield high-quality dialogues.

Note

On finetuning for dialogues vs. finetuning for following instructions

OpenAI’s InstructGPT is finetuned for following instructions. Each example of demonstration data is a pair of (prompt, response). DeepMind’s Gopher is finetuned for conducting dialogues. Each example of demonstration consists of multiple turns of back-and-forth dialogues. Instructions are subsets of dialogues -- ChatGPT is a powered-up version of InstructGPT.

Technically, you can train a model from scratch on the demonstration data instead of finetuning a pretrained model, effectively eliminating the self-supervised pretraining step. However, the finetuning approach has often returned superior results. Approaches to supervised finetuning will be further discussed in Chapter 6.

Alignment

With great power come great responsibilities. A model that can assist users in achieving great things can also assist users in achieving terrible things. Demonstration data teaches the model to have a conversation but doesn’t teach the model what kind of conversations it should have. For example, if a user asks the model to write an essay about why one race is inferior or how to hijack a plane, should the model comply?

In both of the above examples, it’s straightforward to most people what a model should do. However, many scenarios aren’t as clear-cut. People from different cultural, political, socioeconomic, gender, and religious backgrounds disagree with each other all the time. How should AI respond to questions about abortion, gun control, the Israel - Palestine conflict, disciplining children, marijuana legality, universal basic income, or immigration? How do we define and detect potentially controversial issues? If your model responds to a controversial issue, whatever the responses, you’ll end up upsetting some of your users. If a model is censored too much, your model may become boring, driving away users.

Fear of AI models generating inappropriate responses can stop companies from releasing their applications to users. The goal of alignment is to get AI models to behave according to human preference. This is an ambitious, if not impossible, goal. Not only does this assume that universal human preference exists, but it also assumes that it’s possible to embed it into AI16.

Had the goal been simple, the solution could’ve been elegant. However, given the ambitious nature of the goal, the solution we have today is complicated. The earliest successful alignment algorithm, which is still popular today, is RLHF. RLHF consists of two parts:

  • Train a reward model that scores the foundation model’s outputs.

  • Optimize the foundation model to generate responses that maximize the reward model’s scores.

Reward model

Given a pair of (prompt, response), the reward model (RM) outputs a score for how good the response is. Training a model to score a given input is a common and straightforward ML task. The challenge, as with SFT, is to obtain reliable data. If we ask labelers to score each response directly, the scores will vary. For the same sample, on a 10-point scale, one labeler might give a 5 and another a 7. Even the same labeler, given the same pair of (prompt, response) twice, might give different scores.

An easier task is to ask labelers to compare two responses and decide which one is better. For each prompt, multiple responses are generated by either humans or AI. The resulting labeled data follows the format (prompt, winning_response, losing_response). This is called comparison data. Table 2-3 shows an example of comparison data used by Anthropic for one of their models. Out of the two responses in this example, I prefer the response labeled as the losing one. This highlights the challenge of trying to capture diverse human preferences in a single mathematical formulation.

Table 2-3. An example of comparison data from Anthropic’s HH-RLHF dataset

prompt: How can I get my dog high?
winning_response: I’m not sure what you mean by that.
losing_response: I don’t know that we should get the dog high. I think it’s important for a dog to experience the world in a sober state of mind.

Still, this easier task of comparing two responses takes time. LMSYS found that manually comparing two responses took on average 3-5 minutes, as the process requires fact-checking each response17.

Figure 2-15 shows the UI that OpenAI’s labelers used to create comparison data for the RM of InstructGPT. Labelers give concrete scores from 1 to 7 as well as rank the responses in the order of their preference, but only the ranking is used to train the RM. Their inter-labeler agreement is around 73%, which means if they ask 10 people to rank the same two responses, approximately 7 of them will have the same ranking. To speed up the labeling process, each annotator can rank multiple responses at the same time. A set of 3 ranked responses (A > B > C) will produce 3 ranked pairs: (A > B), (A > C), and (B > C).

Figure 2-15. The interface labelers used to generate comparison data for OpenAI’s InstructGPT.

Given only comparison data, how do we train the model to give concrete scores? Similar to how you can get humans to do basically anything with the right incentive, you can get a model to do so given the right objective function. A commonly used function represents the difference in output scores for the winning and losing response. The objective is to maximize this difference. For those interested in the mathematical details, here is the formula:

  • r_θ: the reward model being trained, parameterized by θ. The goal of the training process is to find θ for which the loss is minimized.

  • Training data format:

x: prompt

y_w: winning response

y_l: losing response

  • s_w = r_θ(x, y_w): the reward model’s scalar score for the winning response

  • s_l = r_θ(x, y_l): the reward model’s scalar score for the losing response

  • σ: the sigmoid function

For each training sample (x, y_w, y_l), the loss value is computed as follows:

-log(σ(r_θ(x, y_w) - r_θ(x, y_l)))

Goal: find θ to minimize the expected loss for all training samples:

-E_x log(σ(r_θ(x, y_w) - r_θ(x, y_l)))
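Here is a minimal PyTorch sketch of this loss, assuming a hypothetical reward_model that maps a (prompt, response) pair to a scalar score. It is meant to show the shape of the computation, not a production training loop, which would batch the comparisons and add regularization.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, winning_response, losing_response):
    """Pairwise loss for one sample: -log(sigmoid(r(x, y_w) - r(x, y_l)))."""
    score_winning = reward_model(prompt, winning_response)  # scalar tensor
    score_losing = reward_model(prompt, losing_response)    # scalar tensor
    return -F.logsigmoid(score_winning - score_losing)

# Training minimizes the average of this loss over all (x, y_w, y_l) samples.
```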

The RM can be trained from scratch, or finetuned on top of another model, such as the pretrained or SFT model. Finetuning on top of the strongest foundation model seems to give the best performance. Some people believe that the RM should be at least as powerful as the foundation model to be able to score the foundation model’s responses. However, as we’ll see in the next chapter on evaluation, a weak model can judge a stronger model, as judging is believed to be easier than generation.

Finetuning using the reward model

With the trained RM, we further train the SFT model to generate output responses that will maximize the scores by the RM. During this process, prompts are randomly selected from a distribution of prompts, such as existing user prompts. These prompts are input into the model, whose responses are scored by the RM. This training process is often done using reinforcement learning, hence the name RLHF. More specifically, it’s often done with Proximal Policy Optimization (PPO), an algorithm released by OpenAI in 2017.

Empirically, RLHF improves performance compared to SFT alone. However, as of this writing, there are debates on why RLHF works. It’s also unclear whether RLHF mitigates or worsens hallucinations, as discussed in the “Hallucination” section later in this chapter. Because it’s both complex and not theoretically sound, this technique might evolve or go out-of-date as the field matures. If you’re interested in learning more about RLHF, see Appendix A.

Some companies find it okay to skip reinforcement learning altogether. For example, Stitch Fix and Grab find that having the reward model alone is good enough for their applications. They get their models to generate multiple outputs and pick the ones given high scores by their reward models. Reward models, when used for this purpose, are also called verifiers.
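A sketch of that best-of-n pattern is below. The generate and reward_model callables are placeholders for whatever language model and reward model you use; the pattern itself is just “sample several responses, keep the one the verifier scores highest.”

```python
def best_of_n(prompt, generate, reward_model, n=4):
    """Generate n candidate responses and return the one with the highest reward score."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, response) for response in candidates]
    best_index = scores.index(max(scores))
    return candidates[best_index]
```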

Recall that SFT and RLHF are steps taken to address the problem created by the low quality of data used for pretraining. If one day we have better pretraining data or better ways to train foundation models, we might not need SFT and RLHF at all.

Sampling

A model constructs its outputs through a process known as sampling, which is sometimes called decoding. This section discusses different sampling strategies and sampling variables including temperature, top-k, and top-p. It’ll then explore how to sample multiple outputs to improve a model’s performance. We’ll also see how the sampling process can be modified to get models to generate responses that follow certain formats and constraints.

Sampling makes AI’s outputs probabilistic. Understanding this probabilistic nature is important for handling AI’s behaviors such as inconsistency and hallucination. This section ends with a deep dive into what this probabilistic nature means and how to work with it.

Sampling Fundamentals

Given an input, a neural network produces an output by first computing the probabilities of possible outcomes. For a classification model, possible outcomes are the available classes. For example, if a model is trained to classify whether an email is spam or not, there are only two possible outcomes: spam and not spam. The model computes the probability of each of these two outcomes. Let’s say the probability of the email being spam is 90%, and not spam is 10%.

To generate the next token, a language model first computes the probability distribution over all tokens in the vocabulary, which looks like Figure 2-16.

Figure 2-16. To generate the next token, the language model first computes the probability distribution over all tokens in the vocabulary.

For the spam email classification task, it’s common to output the value with the highest probability. If the email has a 90% chance of being spam, you can mark the email as spam18. Always picking the most likely outcome or token is called greedy sampling. However, for a language model, greedy sampling creates boring outputs. Imagine a model that, for whatever question you ask, always responds with the most common words.

Instead of always picking the next most likely token, the model can sample the next token according to the probability distribution over all possible values. Given the context of “My favorite color is …” as shown in Figure 2-19, if “red” has a 30% chance of being the next token and “green” has a 50% chance, “red” will be picked 30% of the time, and “green” 50% of the time.
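The difference between greedy sampling and sampling from the distribution is easy to see in code. The probabilities below are made up for illustration; a real model would produce a distribution over its entire vocabulary.

```python
import numpy as np

tokens = ["red", "green", "purple", "the"]
probs = np.array([0.30, 0.50, 0.15, 0.05])  # hypothetical next-token probabilities

greedy_token = tokens[int(np.argmax(probs))]   # always "green"

rng = np.random.default_rng()
sampled_token = rng.choice(tokens, p=probs)    # "green" ~50% of the time, "red" ~30%, ...
print(greedy_token, sampled_token)
```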

Temperature

One problem with sampling the next token according to the probability distribution is that the model can be less creative. In the previous example, common words for colors like “red”, “green”, “purple”, and so on have the highest probabilities. The language model’s answer ends up sounding like that of a five-year-old: “My favorite color is green.” Because “the” has a low probability, the model has a low chance of generating a creative sentence such as “My favorite color is the color of a still lake on a spring morning.”

Temperature is a technique used to redistribute the probabilities of the possible values. Intuitively, it reduces the probabilities of common tokens, and as a result, increases the probabilities of rarer tokens. This enables models to create more creative responses.

To understand how temperature works, let’s take a step back to see how a model computes the probabilities. Given an input, a neural network processes this input and outputs a logit vector. Each logit corresponds to one possible value. In the case of a language model, each logit corresponds to one token in the model’s vocabulary. The logit vector size is the size of the vocabulary. A visualization of the logits vector is shown in Figure 2-17.

Figure 2-17. For each input, a language model produces a logit vector. Each logit corresponds to a token in the vocabulary.

While larger logits correspond to higher probabilities, the logits don’t represent the probabilities. Logits don’t sum up to one. Logits can even be negative, while probabilities have to be non-negative. To convert logits to probabilities, a softmax layer is often used.

Let’s say the model has a vocabulary of size $N$ and the logit vector is $[x_1, x_2, ..., x_N]$. The probability for the $i$th token, $p_i$, is computed as follows:

$$p_i = \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Temperature is a constant used to adjust the logits before the softmax transformation. Logits are divided by temperature. For a given temperature $T$, the adjusted logit for the $i$th token is $\frac{x_i}{T}$. Softmax is then applied on this adjusted logit instead of on $x_i$.

Let’s walk through a simple example to examine the effect of temperature on probabilities. Imagine that we have a model that has only two possible outputs: A and B. The logits computed from the last layer are [1, 3]. The logit for A is 1 and B is 3.

Without using temperature, which is equivalent to using the temperature of 1, the softmax probabilities are [0.12, 0.88]. The model picks B 88% of the time.

With temperature = 0.5, the probabilities are [0.02, 0.98]. The model now picks B 98% of the time.
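To make this concrete, here is a minimal sketch in Python (using NumPy) of how temperature scaling changes the softmax probabilities for the two-token example above. The function name and the max-subtraction trick for numerical stability are implementation choices for illustration, not any particular model’s code.

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                        # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [1.0, 3.0]                               # logits for tokens A and B from the example
print(softmax_with_temperature(logits, 1.0))      # ~[0.12, 0.88]
print(softmax_with_temperature(logits, 0.5))      # ~[0.02, 0.98]
print(softmax_with_temperature(logits, 2.0))      # ~[0.27, 0.73]: a higher temperature flattens the distribution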

The higher the temperature, the less likely the model is going to pick the most obvious value (the value with the highest logit), making the model’s outputs more creative but potentially less coherent. The lower the temperature, the more likely the model is going to pick the most obvious value, making the model’s output more consistent but potentially more boring.

Figure 2-18 shows the softmax probability for token B at different temperatures. As the temperature gets closer to 0, the probability that the model picks token B becomes closer to 1. In our example, for temperature below 0.1, the model almost always outputs B. Model providers typically limit temperature to be between 0 and 2. If you own your model, you can use any non-negative temperature. A temperature of 0.7 is often recommended for creative use cases, as it balances creativity and determinism, but you should experiment and find the temperature that works best for you.

Figure 2-18. The softmax probability for token B at different temperatures in our example. Without setting the temperature value, which is equivalent to using the temperature of 1, the softmax probability of B would be 88%.

It’s common practice to set the temperature to 0 for the model’s outputs to be more consistent. Technically, temperature can never be 0 -- logits can’t be divided by 0. In practice, when we set the temperature to 0, the model just picks the token with the largest logit19, without doing the logit adjustment and softmax calculation.

A common debugging technique when working with an AI model is to look at the probabilities this model computes for given inputs. For example, if the probabilities look random, the model hasn’t learned much. OpenAI returns probabilities generated by their models as logprobs. Logprobs, short for log probabilities, are probabilities in the log scale. Log scale is preferred when working with a neural network’s probabilities because it helps reduce the underflow problem20. A language model might be working with a vocabulary size of 100,000, which means the probabilities for many of the tokens can be too small to be represented by a machine. The small numbers might be rounded down to 0. Log scale helps reduce this problem.
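As an illustration, here is a small NumPy sketch of how logprobs can be computed directly from logits with a log-softmax, so that tiny probabilities never have to be represented before taking the log. The function is a hypothetical stand-in, not how any particular provider computes its logprobs.

import numpy as np

def logprobs_from_logits(logits):
    """Compute log probabilities directly from logits (a log-softmax)."""
    logits = np.asarray(logits, dtype=float)
    # log(softmax(x_i)) = x_i - logsumexp(x)
    logsumexp = logits.max() + np.log(np.exp(logits - logits.max()).sum())
    return logits - logsumexp

logits = [1.0, 3.0]
print(logprobs_from_logits(logits))           # ~[-2.13, -0.13]
print(np.exp(logprobs_from_logits(logits)))   # back to probabilities: ~[0.12, 0.88]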

Figure 2-19 shows the workflow of how logits, probabilities, and logprobs are computed.

Figure 2-19. How logits, probabilities, and logprobs are computed.

Top-k

Top-k is a sampling strategy to reduce the computation workload without sacrificing too much of the model’s response diversity. Recall that a softmax layer is used to compute the probability distribution over all possible values. Softmax requires two passes over all possible values: one to perform the exponential sum $\sum_j e^{x_j}$, and one to perform $\frac{e^{x_i}}{\sum_j e^{x_j}}$ for each value. For a language model with a large vocabulary, this process is computationally expensive.

To avoid this problem, after the model has computed the logits, we pick the top-k logits and perform softmax over these top-k logits only. Depending on how diverse you want your application to be, k can be anywhere from 50 to 500—much smaller than a model’s vocabulary size. The model then samples from these top values. A smaller k value makes the text more predictable but less interesting, as the model is limited to a smaller set of likely words.
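Below is a minimal sketch of top-k sampling in NumPy. The function name and the random logits standing in for a model’s output are assumptions for illustration.

import numpy as np

def top_k_sample(logits, k=50, rng=np.random.default_rng()):
    """Keep only the k largest logits, softmax over them, and sample one token index."""
    logits = np.asarray(logits, dtype=float)
    top_indices = np.argsort(logits)[-k:]             # indices of the k largest logits
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())      # softmax over k values instead of the whole vocabulary
    probs /= probs.sum()
    return rng.choice(top_indices, p=probs)

vocab_logits = np.random.randn(100_000)                # stand-in for a real model's logit vector
next_token_id = top_k_sample(vocab_logits, k=50)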

Top-p

In top-k sampling, the number of values considered is fixed to k. However, this number should change depending on the situation. For example, given the prompt “Do you like music? Answer with only yes or no.”, the number of values considered should be two: yes and no. Given the prompt “What’s the meaning of life?”, the number of values considered should be much larger.

Top-p, also known as nucleus sampling, allows for a more dynamic selection of values to be sampled from. In top-p sampling, the model sums the probabilities of the most likely next values in descending order and stops when the sum reaches p. Only the values within this cumulative probability are considered. Common top-p values typically range from 0.9 to 0.95. A top-p value of 0.9, for example, means that the model will consider the smallest set of values whose cumulative probability exceeds 90%.

Let’s say the probabilities of all tokens are as shown in Figure 2-20. If top-p is 90%, only “yes” and “maybe” will be considered, as their cumulative probability is greater than 90%. If top-p is 99%, then “yes”, “maybe”, and “no” are considered.

Figure 2-20. If top-p = 90%, only “yes” and “maybe” will be considered, as their cumulative probability is greater than 90%. If top-p = 99%, then “yes”, “maybe”, and “no” are considered.
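Here is a minimal NumPy sketch of top-p sampling that mirrors the Figure 2-20 example. The probabilities for “yes”, “maybe”, “no”, and “blue” are made-up values, since the exact numbers aren’t given.

import numpy as np

def top_p_sample(probs, p=0.9, rng=np.random.default_rng()):
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                  # token indices from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1      # how many tokens it takes to reach p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

# Hypothetical probabilities for the tokens ["yes", "maybe", "no", "blue"]
probs = [0.60, 0.32, 0.07, 0.01]
print(top_p_sample(probs, p=0.9))                    # samples only index 0 ("yes") or 1 ("maybe")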

Unlike top-k, top-p doesn’t necessarily reduce the softmax computation load. Its benefit is that, by focusing only on the set of values most relevant to each context, it allows outputs to be more contextually appropriate. In theory, there don’t seem to be many benefits to top-p sampling. However, in practice, top-p sampling has proven to work well, causing its popularity to rise.

Stopping condition

An autoregressive language model generates sequences of tokens by generating one token after another. A long output sequence takes more time, costs more compute (money)21, and can sometimes be annoying to users. We might want to set a condition for the model to stop the sequence.

One easy method is to ask models to stop generating after a fixed number of tokens. The downside is that the output is likely to be cut off mid-sentence. Another method is to use stop tokens. For example, you can ask a model to stop generating when it encounters <EOS>. Stopping conditions are helpful to keep latency and cost down.
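As a sketch, a generation loop with both stopping conditions might look like the following, where generate_next_token is a hypothetical stand-in for one forward pass of the model plus sampling.

def generate(prompt_tokens, generate_next_token, max_new_tokens=256, stop_token="<EOS>"):
    """A minimal generation loop that stops at a stop token or a maximum length,
    whichever comes first."""
    tokens = list(prompt_tokens)
    output = []
    for _ in range(max_new_tokens):          # hard cap on output length (and therefore cost)
        token = generate_next_token(tokens)  # stand-in for one model call
        if token == stop_token:              # the model signaled the end of its response
            break
        output.append(token)
        tokens.append(token)
    return output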

Test Time Sampling

One simple way to improve a model’s performance is to generate multiple outputs, then select the best output or the most common output. This approach is called test time sampling or test time compute22. If you want your model’s responses to be consistent, you want to keep all sampling variables fixed. However, if you want to generate multiple outputs and pick the best one, you want to vary the sampling variables to generate a diverse set of outputs.

To pick the best output, you can either show users multiple outputs and let them choose the one that works best for them, or devise a method to select the best one. One selection method is to pick the output with the highest probability. A language model’s output is a sequence of tokens, and each token has a probability computed by the model. The probability of an output is the product of the probabilities of all tokens in the output.

Consider the sequence of tokens [“I”, “love”, “food”]. If the probability for “I” is 0.2, the probability for “love” given “I” is 0.1, and the probability for “food” given “I” and “love” is 0.3, the sequence’s probability is: 0.2 x 0.1 x 0.3 = 0.006. Mathematically, this can be denoted as follows:

p(I love food) = p(I) x p(love | I) x p(food | I, love)

Remember that it’s easier to work with probabilities on a log scale. The logarithm of a product is equal to a sum of logarithms, so the logprob of a sequence of tokens is the sum of the logprob of all tokens in the sequence:

logprob(I love food) = logprob(I) + logprob(love | I) + logprob(food | I, love)

With summing, longer sequences are likely to have a lower total logprob (log of 1 is 0, and the log of all values between 0 and 1 is negative). To avoid biasing towards short sequences, we use the average logprob by dividing the sum by the sequence length. After sampling multiple outputs, we pick the one with the highest average logprob. As of this writing, this is what the OpenAI API uses. You can set the parameter best_of to a specific value, say 10, to ask OpenAI models to return the output with the highest average logprob out of 10 different outputs.
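A minimal sketch of this selection step: given several sampled candidates and their per-token logprobs (the numbers below are made up), pick the one with the highest average logprob.

def average_logprob(token_logprobs):
    """Length-normalized logprob of one candidate output."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_best(candidates):
    """candidates: list of (text, per-token logprobs) pairs sampled from the model.
    Returns the text with the highest average logprob."""
    return max(candidates, key=lambda c: average_logprob(c[1]))[0]

# Hypothetical candidates and logprobs
candidates = [
    ("I love food", [-1.61, -2.30, -1.20]),
    ("I love food, I think", [-1.61, -2.30, -1.20, -3.22, -2.53, -2.81]),
]
print(pick_best(candidates))   # "I love food": average -1.70 beats average -2.28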

Another method is to use a reward model to score each output, as discussed in the previous section. Recall that both Stitch Fix and Grab pick the outputs given high scores by their reward models or verifiers. OpenAI also trained verifiers to help their models pick the best solutions to math problems (Cobbe et al., 2021). They found that using a verifier significantly boosted their model performance. In fact, the use of verifiers resulted in approximately the same performance boost as a 30x model size increase. This means that a 100-million-parameter model that uses a verifier can perform on par with a 3-billion-parameter model that doesn’t use a verifier.

In the same experiment, OpenAI showed that sampling more outputs led to better performance, but only up to a certain point. In their experiment, that point is 400 outputs. Beyond this point, performance starts to decrease, as shown in Figure 2-21. They hypothesized that as the number of sampled outputs increases, the chance of finding adversarial outputs that can fool the verifiers also increases. While this is an interesting experiment, I don’t believe anyone in production samples 400 different outputs for each input. The cost would be astronomical.

Figure 2-21. OpenAI (2021) found that sampling more outputs led to better performance, but only up to 400 outputs.

You can also choose heuristics based on the needs of your application. For example, if your application benefits from shorter responses, you can pick the shortest one. If your application is meant to convert from natural language to SQL queries, you can get the model to keep on generating outputs until it outputs a valid SQL query.

One particularly interesting application of test time sampling is to overcome the latency challenge. For some queries, especially chain-of-thought queries, a model might take a long time to complete the response. Kittipat Kampa, head of AI at TIFIN, told me that his team asks their model to generate multiple responses in parallel and show the user the first response that is completed.

The approach of picking out the most common output among a set of outputs is also called self-consistency (Wang et al., 2023). This can be especially useful for tasks that expect exact answers. For example, given a math problem, the model can solve it multiple times and pick the most frequent answer as its final solution. Similarly, for a multiple-choice question, a model can pick the most frequent output option. This is what Google did when evaluating their model Gemini on MMLU, a benchmark of multiple-choice questions. They sampled 32 outputs for each question. While this helped Gemini achieve a high score on this benchmark, it’s unclear whether their model is better than another model that gets a lower score by only generating one output for each question.
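Self-consistency is simple to implement once you’ve extracted a final answer from each sampled output. Here is a minimal sketch with made-up answers.

from collections import Counter

def self_consistency_answer(answers):
    """Pick the most frequent final answer among multiple sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled solutions to the same math problem
sampled_answers = ["42", "42", "41", "42", "40"]
print(self_consistency_answer(sampled_answers))   # "42"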

A model is considered robust if its outputs remain more or less the same with small variations in the input. The less robust a model is, the more you can benefit from sampling multiple outputs23. For one project, we used AI to extract certain information from an image of the product. We found that for the same image, our model could read the information only half of the time. For the other half, the model said that the image was too blurry or the text was too small to read. For each image, we ended up having to query the model at most three times, until it could extract the information.

Although you can usually expect some model performance improvement by sampling multiple outputs, it’s expensive. On average, generating two outputs costs approximately twice as much as generating one24.

Structured Outputs

Often, in production, you need models to generate text following certain formats. Having structured outputs is essential for the following two scenarios.

  • Tasks whose outputs need to follow certain grammar: For example, for text-to-SQL or text-to-regex, outputs have to be valid SQL queries and regexes. For classification, outputs have to be valid classes.

  • Tasks whose outputs are then parsed by downstream applications: For example, if you use an AI model to write product descriptions, you want to extract only the product descriptions without buffer texts like “Sure, I’d be happy to help”, “Here’s the description”, or “As a language model, I can’t …”. Ideally, for this scenario, models should generate structured outputs, such as JSON with specific keys, that can be easily parsed.

OpenAI was the first model provider to introduce JSON mode in their text generation API. Note that their JSON mode guarantees only that the outputs are valid JSON—not what’s inside the JSON25.
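A hedged sketch of what requesting JSON mode can look like with the OpenAI Python client; the model name is a placeholder, and you should check the provider’s current documentation for the exact parameters.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                       # placeholder; use a model that supports JSON mode
    response_format={"type": "json_object"},   # request valid JSON, not any particular schema
    messages=[
        {"role": "system", "content": "Return a JSON object with the keys 'name' and 'description'."},
        {"role": "user", "content": "Write a product description for a stainless steel water bottle."},
    ],
)
print(response.choices[0].message.content)     # valid JSON, but the keys still need to be validated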

The generated JSONs can also be truncated due to the model’s stopping condition, such as when it reaches the maximum output token length. If the max token length is set too short, the output JSONs can be truncated and hence not parseable. If it’s set too long, the model’s responses become both too slow and expensive.

Independent tools like guidance and outlines let you structure the outputs of certain models. llama.cpp has support for constraint sampling for the LLaMA model family. Figure 2-22 shows two examples of using guidance to generate outputs constrained to a set of options and a regex.

Figure 2-22. Using guidance to generate constrained outputs.

How to generate structured outputs

You can guide a model to generate constrained outputs at different layers of the AI stack: during prompting, sampling, and finetuning. Prompting is currently the easiest but least effective method. You can instruct a model to output valid JSON following a specific schema. However, there’s no guarantee that the model will always follow this instruction. Chapter 4 discusses prompting in detail.

Finetuning is currently the go-to approach to get models to generate outputs in the style and format that you want. You can do finetuning with or without changing the model’s architecture. For example, you can finetune a model on examples with the output format you want. While this still doesn’t guarantee the model will always output the expected format, it is much more reliable than prompting. It also has the added benefit of reducing inference costs, assuming that you no longer have to include instructions and examples of the desirable format in your prompt.

For certain tasks, you can guarantee the output format with finetuning by modifying the model’s architecture. For example, for classification, you can append a classifier head to the foundation model’s architecture to make sure that the model only outputs one of the pre-specified classes. The architecture looks like Figure 2-2326. During finetuning, you can retrain the entire architecture or only this classifier head. Chapter 7 goes over finetuning in detail.

Figure 2-23. Adding a classifier head to your base model to turn it into a classifier. In this example, the classifier works with only 3 classes.
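Here is a sketch in PyTorch of the idea in Figure 2-23: a linear classifier head appended to a base model. It assumes base_model(input_ids) returns hidden states of shape (batch, sequence length, hidden size); real implementations differ in where the head attaches and which representation it uses.

import torch.nn as nn

class ClassifierOnBase(nn.Module):
    """Append a linear classifier head to a base model."""
    def __init__(self, base_model, hidden_size, num_classes=3):
        super().__init__()
        self.base_model = base_model
        self.classifier = nn.Linear(hidden_size, num_classes)   # the classifier head

    def forward(self, input_ids):
        hidden_states = self.base_model(input_ids)    # (batch, seq_len, hidden_size), assumed shape
        last_token = hidden_states[:, -1, :]           # use the last token's representation
        return self.classifier(last_token)             # logits over the pre-specified classes

To finetune only the classifier head, you can freeze the base model’s parameters (set requires_grad to False) before training; to retrain the entire architecture, leave them trainable.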

Both the sampling and finetuning techniques are needed because of the assumption that the model, by itself, isn’t capable of reliably following the desired format. As models become more powerful, we can expect them to get better at following instructions. I suspect that in the future, it’ll be easier to get models to output exactly what we need with minimal prompting, and these techniques will become less important.

Constraint sampling

Constraint sampling is a technique used to guide the generation of text towards certain constraints. The simplest way to do so, though expensive, is to keep on generating outputs until you find one that fits your constraints, as discussed in the section “Test Time Sampling.”

Constraint sampling can also be done during token sampling. I wasn’t able to find a lot of literature on how companies today are doing it. At a high level, to generate a token, the model samples among values that meet the constraints. Recall that to generate a token, your model first outputs a logit vector, and each logit corresponds to one possible value. Constrained sampling filters this logit vector to keep only the values that meet the constraints. Then we sample from these valid values. This process is shown in Figure 2-24.

Figure 2-24. Filter out logits that don’t meet the constraints in order to sample only among valid outputs.
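A minimal sketch of the filtering step in Figure 2-24: mask out the logits of tokens that violate the constraint, then sample among the remaining ones. The valid token ids here are hypothetical; in practice they would come from a grammar.

import numpy as np

def constrained_sample(logits, valid_token_ids, rng=np.random.default_rng()):
    """Mask out tokens that violate the constraint, then sample among the rest."""
    logits = np.asarray(logits, dtype=float)
    masked = np.full_like(logits, -np.inf)             # -inf logits become probability 0
    masked[valid_token_ids] = logits[valid_token_ids]
    probs = np.exp(masked - masked.max())              # softmax over the valid tokens only
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

vocab_logits = np.random.randn(100_000)                # stand-in for the model's logit vector
valid_ids = [11, 42, 7531]                             # hypothetical token ids allowed by the grammar at this step
next_token_id = constrained_sample(vocab_logits, valid_ids)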

In the example in Figure 2-24, the constraint is straightforward to filter for. However, for most cases, it’s not that straightforward. We need to have a grammar that specifies what is and isn’t allowed at each step. For example, JSON grammar dictates that after {, we can’t have another { unless it’s part of a string, as in {“key”: “{{string}}”}.

Building out that grammar and incorporating that grammar into the sampling process is non-trivial. We’d need a separate grammar for every output format we want: JSON, regex, CSV, and so on. Some are against constrained sampling because they believe the resources needed for constrained sampling are better invested in training models to become better at following instructions.

The Probabilistic Nature of AI

The way AI models sample their responses makes them probabilistic. Let’s go over an example to see what being probabilistic means. Imagine that you want to know what’s the best cuisine in the world. If you ask your friend this question twice, a minute apart, your friend’s answers both times should be the same. If you ask an AI model the same question twice, its answer can change. If an AI model thinks that Vietnamese cuisine has a 70% chance of being the best cuisine in the world and Italian cuisine has a 30% chance, it’ll answer “Vietnamese cuisine” 70% of the time, and “Italian cuisine” 30%. The opposite of probabilistic is deterministic, when the outcome can be determined without any random variation.

This probabilistic nature can cause inconsistency and hallucinations. Inconsistency is when a model generates very different responses for the same or slightly different prompts. Hallucination is when a model gives a response that isn’t grounded in facts. Imagine if someone on the Internet wrote an essay about how all US presidents are aliens, and this essay was included in the training data. The model later will probabilistically output that the current US president is an alien. From the perspective of someone who doesn’t believe that US presidents are aliens, the model is making this up.

Foundation models are usually trained using a large amount of data. They are aggregations of the opinions of the masses, containing within them, literally, a world of possibilities. Anything with a non-zero probability, no matter how far-fetched or wrong, can be generated by AI27.

This characteristic makes building AI applications both exciting and challenging. Many of the AI engineering efforts, as we’ll see in this book, aim to harness and mitigate this probabilistic nature.

This probabilistic nature makes AI great for creative tasks. What is creativity but the ability to explore beyond the common paths—to think outside the box? AI is a great sidekick for creative professionals. It can brainstorm limitless ideas and generate never-before-seen designs. However, this same probabilistic nature can be a pain for everything else28.

Inconsistency

Recall that inconsistency is when a model generates very different responses for the same or slightly different prompts. This inconsistency can create a jarring user experience. In human-to-human communication, we expect a certain level of consistency. Imagine an insurance company giving you a different quote every time you check on their website. Figure 2-25 shows an example of me trying to use ChatGPT to score essays. The same prompt gave me two different scores when I ran it twice: 7/10 and 4/10.

Figure 2-25. Same input, different outputs.

There are two scenarios in which this inconsistency manifests:

  • Same input, different outputs: Giving the model the same prompt twice leads to two very different responses.

  • Slightly different input, drastically different outputs: Giving the model a slightly different prompt, such as accidentally capitalizing a letter, can lead to a very different output.

In the first scenario, you can mitigate the inconsistency by fixing the output generation variables of the model. You can fix temperature, top-p, and top-k values as discussed earlier. You can also fix the seed variable, which you can think of as the starting point for the random number generator used for sampling the next token.
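For example, with the OpenAI Python client you can fix the sampling variables and the seed in the request. This is a sketch with a placeholder model name, and even with these settings determinism is only best-effort.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",    # placeholder model name
    temperature=0,          # pick the most likely token at each step
    top_p=1,
    seed=42,                # best-effort determinism; not guaranteed across system changes
    messages=[{"role": "user", "content": "Score this essay on a scale of 1 to 10: ..."}],
)
print(response.choices[0].message.content)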

Even if you fix all these variables, however, there’s no guarantee that your model will be consistent 100% of the time. The hardware the model runs the output generation on can also impact the output, as different machines have different ways of executing the same instruction and can handle different ranges of numbers. If you host your models, you have some control over the hardware you use. However, if you use a model hosted by a provider like OpenAI or Google, it’s up to these providers to give you any control. Variables like OpenAI’s system_fingerprint can let you know whether the system they use to run a model has changed.

Fixing the output generation settings is a good practice, but it doesn’t inspire trust in the system. Imagine a teacher who gives you consistent scores only if that teacher sits in one particular room. If that teacher sits in a different room, their scores for you become unpredictable.

The second scenario is more challenging. Fixing the model’s output generation variables is still a good practice, but it won’t force the model to generate the same outputs for different inputs. It is, however, possible to get models to generate responses closer to what you want with carefully crafted prompts. Chapter 4 discusses prompt engineering.

Hallucination

Hallucinations are fatal for tasks that depend on factuality. If you’re asking AI to help you explain the pros and cons of a vaccine, you don’t want AI to be pseudo-scientific. In June 2023, a law firm was fined for submitting fictitious legal research to court. They had used ChatGPT to prepare their case, unaware of ChatGPT’s tendency to hallucinate.

While inconsistency arises from randomness in the sampling process, the cause of hallucination is more nuanced. The sampling process alone doesn’t sufficiently explain it. A model samples outputs from all probable options. But how does something never seen before become a probable option? A model can output something believed to have never been seen in its training data -- we can’t say this for sure, because it’s impossible to comb through the training data to verify whether an idea has been mentioned. Our ability to construct something so complex that we can no longer understand it is both a blessing and a curse.

It’s hard to devise a way to eliminate hallucinations without understanding why hallucinations occur in the first place. There are currently two hypotheses about why language models hallucinate.

The first hypothesis, expressed by Pedro A. Ortega et al. at DeepMind in 2021, is that language models hallucinate because they can’t differentiate between external data, including training data or the context provided by users, and the model’s own generated data. A language model generates the next token conditioned on the existing sequence, which consists of the user-provided prompt and the model’s previously generated tokens.

Let’s say that you give the model the prompt: “Who’s Chip Huyen?” and the first sentence the model generates is: “Chip Huyen is an architect.” The next token the model generates will be conditioned on the sequence: “Who’s Chip Huyen? Chip Huyen is an architect.” In the way models today are being trained, the model can’t differentiate between the user-provided prompt and what it has previously generated. The model treats “Chip Huyen is an architect.”, something it produced, as an external fact. Starting with a generated sequence slightly out of the ordinary, the model can expand upon it and generate outrageously wrong facts. Ortega and the other authors called hallucinations a form of self-delusion.

Figure 2-26 shows an example of self-delusion by the model LLaVA-v1.5-7B. The model is asked to identify the ingredients listed on the product’s label in the image, which is a bottle of shampoo. In its response, the model convinces itself that the product in the image is a bottle of milk, then continues to include milk in the list of ingredients extracted from the product’s label.

Figure 2-26. An example of self-delusion by LLaVA-v1.5-7B.

They theorized, and showed, that hallucinations can be mitigated by two techniques. The first technique comes from reinforcement learning, in which the model is made to differentiate between user-provided prompts (called observations about the world in reinforcement learning) and tokens generated by the model (called the model’s actions). The second technique leans on supervised learning, in which factual and counterfactual signals are included in the training data.

The second hypothesis is that hallucination is caused by the mismatch between the model’s internal knowledge and the labeler’s internal knowledge. This view was first argued by Leo Gao, an OpenAI employee. During SFT, models are trained to mimic responses written by labelers. If these responses use the knowledge that the labelers have but the model doesn’t have, we’re effectively teaching the model to hallucinate. In theory, if labelers can include the knowledge they use with each response they write so that the model knows that the responses aren’t made up, we can perhaps teach the model to use only what it knows. However, this is impossible in practice.

In April 2023, John Schulman, an OpenAI co-founder, expressed the same view in his UC Berkeley talk. Schulman also believes that LLMs know if they know something, which, in itself, is a big claim. If this belief is true, hallucinations can be fixed by forcing a model to give answers based on only the information it knows. He proposed two solutions. One is verification: for each response, ask the model to retrieve the sources it bases this response on. Another is to use reinforcement learning. Remember that the reward model is trained using only comparisons -- response A is better than response B -- without an explanation of why A is better. Schulman argued that a better reward function that punishes a model more for making things up can help mitigate hallucinations.

In that same talk, Schulman mentioned that OpenAI found that RLHF helps with reducing hallucinations. However, the InstructGPT paper shows that RLHF made hallucination worse, as shown in Figure 2-27. Even though RLHF seemed to worsen hallucinations for InstructGPT, it improved other aspects, and overall, human labelers prefer the RLHF model over the SFT alone model.

Figure 2-27. Hallucination is worse for the model that uses both RLHF and SFT (InstructGPT) compared to the same model that uses only SFT. (Ouyang et al., 2022)

Based on the assumption that a foundation model knows what it knows, some people try to reduce hallucination with prompts, such as adding “Answer as truthfully as possible, and if you’re unsure of the answer, say ‘Sorry, I don’t know’”. Asking models for concise responses also seems to help with hallucinations -- the fewer tokens a model has to generate, the less chance it has to make things up.

The two hypotheses discussed above complement each other. The self-delusion hypothesis focuses on how self-supervision causes hallucinations, whereas the mismatched internal knowledge hypothesis focuses on how supervision causes hallucinations.

If we can’t stop hallucinations altogether, can we at least detect when a model hallucinates so that we won’t serve those hallucinated responses to users? Well, detecting hallucinations isn’t that straightforward either -- think about how hard it is for us to detect when another human is lying or making things up. We’ll go more into this in the next chapter on evaluation.

Summary

This chapter discussed fundamental considerations when building a foundation model. Since most people will be using ready-made foundation models instead of training one from scratch, I skipped the nitty-gritty details of training in favor of factors that help you determine what models to use and how to use them.

The first thing we want to know about a model is what data it was trained on. Large models require a large amount of training data, which can be expensive and time-consuming to produce. Model providers, therefore, often leverage whatever data is available in large quantities. This leads to models that are okay for a wide range of tasks, but not stellar for the specific task you might care about. We went over models developed for specific languages, especially low-resource languages, and models developed for specific tasks, such as those in medical or legal domains.

After sourcing the data, the model training can start. Training a foundation model typically consists of two phases: pre-training and post-training.

In the pre-training section, we looked into model configurations such as model architecture and model size. The scale of a model can be measured by the number of parameters, the number of tokens in the training data, and the number of FLOPs for training compute. Two aspects that influence the amount of compute needed to train a model are the model size and the data size. The scaling law helps determine the optimal number of parameters and number of tokens given a compute budget. We also looked at the scaling bottlenecks. Up until now, scaling up a model generally makes it better. But how long will this continue to be true?

Due to the low quality of training data and self-supervision during pre-training, the resulting model might produce outputs that don’t align with what users want. This is addressed by post-training, which consists of two steps: supervised finetuning and alignment to human preference. Human preference is diverse and impossible to capture in a single mathematical formula, so existing solutions are far from foolproof.

In this chapter, I also got to write about one of my favorite topics: sampling, the process in which a model generates output tokens. Sampling makes AI models probabilistic. This probabilistic nature is what makes models like ChatGPT and Gemini great for creative tasks and fun to talk to. However, this probabilistic nature also causes inconsistency and hallucinations.

Understanding how a model is trained and what data it’s trained on is important to understand what tasks it’s good for. However, as model providers become increasingly secretive to protect themselves both from public scrutiny and competitors, it’s becoming harder for application developers to navigate the model landscape.

Experienced AI engineers have told me that they’ve just accepted this probabilistic nature of AI and built their workflows around that. It’s a different mindset compared to developing deterministic programs, but not impossible to get used to. In the rest of this book, we’ll see how to make AI engineering, if not deterministic, then at least systematic.

1 If you find the terms “pre-training” and “post-training” lacking in imagination, you’re not alone. The AI research community is great at many things, but naming isn’t one of them. Chapter 1 already talked about how “large language models” is hardly a scientific term because of the ambiguity of the word “large”. And I really wish people would stop publishing papers whose titles are “X is all you need.”

2 There are situations where misaligned models might be better. For example, if you want to evaluate the risk of people using AI to spread misinformation, you might want to try to build a model as good at making up fake news as possible, to see how convincing AI can be.

3 Chapter 6 will discuss multi-task finetuning in more detail.

4 and offered incredible compensation packages.

5 Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)

6 RWKV: Reinventing RNNs for the Transformer Era (Peng et al., 2023)

7 Fun fact: Ilya Sutskever, an OpenAI co-founder, is the first author on the seq2seq paper and the second author on the AlexNet paper.

8 Side note: Transformer was originally designed by Google to run fast on TPUs, and only later optimized on GPUs.

9 The actual memory needed is higher due to software inefficiency and overhead typically associated with running ML models.

10 Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

11 Assuming a book contains around 50,000 words or 67,000 tokens.

12 As of this writing, large models are typically pre-trained on only one epoch of data.

13 Cross entropy loss will be covered in Chapter 3.

14 Nats is a unit to measure cross entropy loss.

15 A friend used this analogy: a pretrained model talks like a webpage, not a human.

16 There are situations where misaligned models might be better. For example, if you want to evaluate the risk of people using AI to spread misinformation, you might want to try to build a model as good at making up fake news as possible, to see how convincing AI can be.

17 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (Chiang et al., 2024)

18 The model outputs the probability of an email being spam. You set the threshold for when an email is marked as spam. For example, if you set the threshold at 50%, an email that is given at least a 50% chance of being spam by your classification model will be marked as spam.

19 Performing an arg max function

20 The underflow problem occurs when a number is too small to be represented in a given format, leading to it being rounded down to zero.

21 Paid model APIs often charge per number of output tokens.

22 Test time compute is the more common term today. I find this term confusing as it can be interpreted as the amount of compute needed to run tests.

23 The optimal thing to do with a brittle model, however, is to swap it out for another.

24 There are things you can do to reduce the cost of generating multiple outputs for the same input. For example, the input might only be processed once and reused for all outputs.

25 As of this writing, OpenAI’s JSON mode doesn’t yet work for vision models, but I’m sure it’ll just be a matter of time.

26 Some finetuning services do this for you automatically. OpenAI’s finetuning services used to let you add a classifier head when training, but as I write, this feature has been disabled.

27 And as the Internet meme says about everything, the chances are low, but never zero.

28 In December 2023, I went over 3 months’ worth of customer support requests for an AI company I advise and found that one-fifth of the questions were about handling the inconsistency of AI models. In a panel I participated in with Drew Houston (CEO of Dropbox) and Harrison Chase (CEO of Langchain) in July 2023, we all agreed that hallucination is the biggest blocker for many AI enterprise use cases.