At the time of writing, generative large language models (LLMs) are the talk of the town. After the release of OpenAI's ChatGPT and, later, GPT-4, "GPT" has become the buzzword of the day. In this blog post, we will answer the following questions:
What are GPT models and how do they work?
Can I run GPT on my local computer?
Can I train or fine-tune a GPT model myself?
This article is generally suitable for beginners, but basic Python and deep learning knowledge is required. We will introduce you to the Hugging Face transformers framework and the OpenAI GPT-3 API (Python and CLI).
Language models like GPT belong to the branch of computer science called natural language processing (NLP). By the way, in case you did not know, the acronym "GPT" stands for "Generative Pre-Training" (an acronym actually never used in the original GPT-1 paper), or, as it is sometimes expanded nowadays, "Generative Pre-trained Transformer".
Can I Run GPT Locally? Hugging Face Transformers and GPT-2
Let us start with the second question. The short answer is: you can easily run GPT-2 (and many other language models) on your local computer, in the cloud, or in Google Colab. What you cannot run on your computer is GPT-3, ChatGPT, or GPT-4. These models are not open; they are available only via a paid OpenAI subscription, through the OpenAI API or through the web interface. Obviously, any software using this API must pay OpenAI and depends on a stable internet connection.
The easiest way to run GPT-2 is with Hugging Face transformers, a modern Python deep learning framework from the company Hugging Face. It is primarily based on PyTorch but also supports TensorFlow and Flax (JAX) models. Before we start, you need a working Python 3.x environment (either plain Python or Anaconda; we use the former). Follow the installation instructions here. For plain Python with PIP, type in the terminal:
pip3 install torch 'transformers[torch]'
For Anaconda, type:
conda install -c huggingface transformers
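To quickly check that the installation works, you can run this small sanity check of ours:
import torch
import transformers

print('transformers', transformers.__version__, '| torch', torch.__version__)
print('CUDA available:', torch.cuda.is_available())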
Now we are ready to start coding. The complete code for the Hugging Face part can be found here. What do we need to know about the transformers framework? First, it does not implement neural networks from scratch but relies on the lower-level frameworks PyTorch, TensorFlow, and Flax. Second, it makes heavy use of the Hugging Face Hub, another Hugging Face project: a hub of downloadable neural networks for the various frameworks. Using it is very simple: find a model you like there and look at its page (called a model card), see the screenshot.
What information do we see here?
- The model name, distilbert-base-uncased-finetuned-sst-2-english, at the top.
- The framework (red box ours): Transformers
- The task (blue box ours): Text Classification
- The supported backend frameworks: PyTorch + TensorFlow
- The Files and versions tab, which contains information such as model file sizes.
- Other things: an online demo, usage code examples, etc.
Now let's try GPT-2 in Python. It is very easy. For our first example, we use pipelines, the highest-level entities of the transformers framework. Our full code is in fun_gpt2_1.py in our repo. First, we create the pipeline object:
MODEL_NAME = 'gpt2'
pipe = transformers.pipeline(task='text-generation', model=MODEL_NAME, device='cpu')
On the first run, it downloads the model gpt2 from the Hugging Face Hub and caches it locally in the cache directory (~/.cache/huggingface on Linux). On subsequent runs, the cached model is loaded and no internet connection is required. Now we generate text from a prompt:
print(pipe('The elf queen'))
Our output is:
However, if you run the code, your result will be different. Why? Because GPT text generation is random: every time you run the code, the result is different. This randomness is controlled by a parameter called temperature. Note that the result is actually a Python dict with a single key, generated_text. In fact, we will see dicts and dict-like objects over and over again; the transformers framework uses them a lot.
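If you want to play with this randomness, generation parameters can be passed directly in the pipeline call; here is a small example of ours (do_sample, temperature and max_length are standard generation arguments that the pipeline forwards to the model):
print(pipe('The elf queen', do_sample=True, temperature=0.7, max_length=30))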
Note that in transformers, a "pipeline" is quite different from a "model". The "model" is what we download from the Hub, gpt2 in our case; it is actually a valid PyTorch model with some additional restrictions and naming conventions introduced by the transformers framework. A "pipeline" is an object that runs a model under the hood to perform some high-level task, such as text generation. The correspondence is not one-to-one: you can use various models for text generation: gpt2, gpt2-medium, gpt2-large, fine-tuned GPT-2 versions, and custom user models. But you cannot use a model without generation capability, such as Bert, in this pipeline.
While pipelines are what Hugging Face newcomers usually start with, they are not very interesting for us. Pipelines perform many steps behind the scenes that are hard to understand and even harder to reproduce. They are difficult to customize and completely useless for model training or fine-tuning, custom models, custom tasks, or, in general, anything the Hugging Face developers did not plan for in advance. You can only truly understand the transformers framework without pipelines.
Let's try to reproduce the text generation example without a pipeline. First, we create the model and tokenizer objects, downloading their weights from the Hugging Face Hub:
model = transformers.GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
Note that only the trained model weights are downloaded from the Hub, while the PyTorch model itself is defined as a Python class in the transformers framework code. GPT2LMHeadModel is the GPT-2 model with a generation head.
But what is a tokenizer? Neural networks cannot process raw text; they only understand numbers. We need a tokenizer to convert a text string into a list of numbers. But first, it breaks the string into individual tokens, which most often means "words", although some models can use word parts or even individual characters. Tokenization is a classical natural language processing task. Once the text is broken into tokens, each token is replaced by an integer code from a fixed dictionary. Note that a tokenizer, and especially its dictionary, is model-dependent: you cannot use a Bert tokenizer with GPT-2, unless you train the model from scratch. Some models, especially those of the Bert family, like to use special tokens, such as [PAD], [CLS], [SEP], etc. GPT-2, in contrast, uses them very sparingly.
Next, we tokenize the prompt:
enc = tokenizer(['The elf queen'], return_tensors='pt')
print('enc =', enc)
print(tokenizer.batch_decode(enc['input_ids']))
The output is:
The result is a dict-like object with two keys: input_ids (tokens), and attention_mask (an array of ones in all our experiments). The return_tensors='pt' option means returning PyTorch tensors; lists are returned otherwise. The batch_decode() method decodes tokens back to the string "The elf queen". Finally, we generate the text using the generate() method of our model, then decode the new tokens.
out = model.generate(input_ids=enc['input_ids'],
attention_mask=enc['attention_mask'], max_length=20)
print('out=', out)
print(tokenizer.batch_decode(out))
We’ll get the result:
This is perfect! Or is it? If we run this code several times, we’ll see that something is wrong. The result is always the same! Why is that? Because the pipeline tweaks the model config while we use the default one. Let’s look at the config.
config = transformers.GPT2Config.from_pretrained(MODEL_NAME)
It’s not a dict, but a Python class with numerous fields:
There are tons of parameters here, most of which we probably do not want to modify, such as model size. The dict task_specific_params contains parameter adjustments for pipeline tasks, in this case text-generation. To activate these parameters, we copy them by hand to the object proper:
config.do_sample = \
config.task_specific_params['text-generation']['do_sample']
config.max_length = \
config.task_specific_params['text-generation']['max_length']
Now we create the model from pretrained weights, but with a modified config:
model = transformers.GPT2LMHeadModel.from_pretrained(MODEL_NAME,
config=config)
Voila, now the model generates random results (with the default temperature of 1.0 in the config). But how exactly does the generation work? We’ll explain it in the next chapter.
How Does GPT Work? Transformer Encoders, Decoders, Auto-Regressive Models
GPT-2 is a transformer decoder model (here, the word "transformer" refers to the network architecture, not the Hugging Face transformers framework). "Transformers and attention" is a very interesting topic, which we don't have space to cover in detail in this article. If you look for introductory-level articles on transformers, you will often find a picture like this:
This “classical transformer” architecture has two blocks: encoder on the left and decoder on the right. This “encoder-decoder” architecture is rather arbitrary, and that is not how most transformer models work today. Typically, a modern transformer is either an encoder (Bert family) or a decoder (GPT family). So, GPT architecture looks more like this:
The only difference between encoder and decoder is that the latter is causal, i.e., it cannot go back in time. By “time” here, we mean the position t=1..T of the token (word) in the sequence. Only decoders can be used for text generation. GPT models are pretty much your garden variety transformer decoders, and different GPT versions differ pretty much only in size, minor details, and the dataset+training regime. If you understand how GPT-2 or even GPT-1 works, you can, to a large extent, understand GPT-4 also. For our purposes, we drew our own simplified GPT-2 diagram with explicit tensor dimensions. Don’t worry if it confuses you, we’ll explain it step by step in a moment.
Previously, we transformed the text "The elf queen" into a sequence of tokens [464, 23878, 16599]. This integer tensor has the size BxT, where B is the batch size (B=1 for us), and T is the sequence length (T=3). Most transformers are able to receive sequences of variable length without re-training; however, all sequences in a batch must be of the same length (or padded). The transformer itself works with a D-dimensional vector at every position; for GPT-2, D=768. The total dimension of the transformer data at each layer is thus BxTxD, and the data is floating-point. This is different from the integer BxT encodings produced by the tokenizer. Thus, every integer token has to be transformed into a D-dimensional floating-point vector input_embeddings by the embedder. Unlike the tokenizer, the embedder is a part of the GPT-2 model itself (class GPT2LMHeadModel), so we don't have to worry about it.
The output of the transformer blocks is of the same size as its input, BxTxD, or a D-dimensional vector output_embeddings at each position. If we use a headless GPT-2, class GPT2Model, this is exactly its output called last_hidden_state, which can be used for downstream NLP tasks. However, we want to use GPT2LMHeadModel, the model with a generation head. In order to understand the generation, let’s try to generate a text without using the generate() method. If we run the model inference:
out = model(input_ids=input_ids, attention_mask=attention_mask)
we get a dict-like object out containing a BxTxV tensor logits, where V=50257 is the GPT-2 dictionary size. How does the generation work? GPT-2 is trained so that the generation head predicts the next token at each position. If z_tj is the logits tensor (t=1..T, j=1..V; we skip the batch dimension for simplicity), then token t+1 is predicted as argmax_j(z_tj). It is an integer token which can be decoded by the tokenizer.
But what is the meaning of logits at position T (the last position)? It is the prediction of the next token after the current T-sequence. We can add it to the end of the sequence, then repeat the process to generate as many tokens as we want. The important thing is that tokens are generated one at a time, so that in order to generate N tokens we need to run the model N times. Such models (which generate new data one step at a time) are called auto-regressive models.
Let’s see what Yann LeCun, one of Deep Learning’s founding fathers, says about them:
The full slide deck is here
Do you find his arguments persuasive?
The code for sequence generation is the following:
input_ids = enc['input_ids']
for i in range(20):
    attention_mask = torch.ones(input_ids.shape, dtype=torch.int64)
    logits = model(input_ids=input_ids,
                   attention_mask=attention_mask)['logits']
    new_id = logits[:, -1, :].argmax(dim=1)    # Pick the most likely next token
    input_ids = torch.cat([input_ids, new_id.unsqueeze(0)], dim=1)
    print(tokenizer.batch_decode(input_ids))
And it generates the sequence one word at a time.
This was a non-random (greedy) generation. Random generation differs in how each next token is generated from the logits. Instead of taking argmax_j(z_Tj), we randomly sample from the probability distribution p_j of generating token j=1..V:
p_j = (1/Z) exp(z_Tj / Θ) = softmax(z_Tj / Θ),  where Z = Σ_k exp(z_Tk / Θ),
z_Tj are the logits at the last position T, and Θ is the temperature.
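As a minimal sketch (ours, not from the repo), the sampling step can be written in PyTorch like this, reusing logits and input_ids from the loop above:
TEMPERATURE = 0.7                                           # our illustrative value
last_logits = logits[:, -1, :]                              # z_Tj: logits at the last position, shape BxV
probs = torch.softmax(last_logits / TEMPERATURE, dim=-1)    # p_j = softmax(z_Tj / temperature)
new_id = torch.multinomial(probs, num_samples=1)            # sample one token id per batch element, shape Bx1
input_ids = torch.cat([input_ids, new_id], dim=1)           # append the sampled token to the sequence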
How to Train and Fine-Tune GPT-2 with Hugging Face Transformers Trainer?
GPT models are trained in an unsupervised way on a large amount of text (a text corpus). The corpus is broken into sequences, usually of uniform size (e.g., 1024 tokens each). The model is trained to predict the next token (word) at each step of the sequence. For example (here, we write words instead of integer encodings for clarity):
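Take a made-up five-token sequence (a toy illustration of ours):
input (tokens):  The   elf    queen   was    fair
targets:         elf   queen  was     fair   ...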
The labels are identical to input_ids, but shifted by one position to the left. Note that for GPT-2 in Hugging Face transformers this shift happens automatically when the loss is calculated, so from the user's perspective, the tensor labels should be identical to input_ids. The training is demonstrated in the code train_gpt2_trainer1.py on a rather small toy corpus.
In the function break_text_to_pieces(), we load the corpus gpt1_paper.txt from disk, tokenize it and break it into 511-token pieces plus the [END] token, which brings the sequence length to 512. Next, we split the data into train and validation sets in train_val_split(), and in prepare_dsets() we wrap them into PyTorch datasets. The dataset class we use looks like this:
class MyDset(torch.utils.data.Dataset):
    """A custom dataset"""
    def __init__(self, data: list[list[int]]):
        self.data = []
        for d in data:
            input_ids = torch.tensor(d, dtype=torch.int64)
            attention_mask = torch.ones(len(d), dtype=torch.int64)
            self.data.append({'input_ids': input_ids,
                              'attention_mask': attention_mask,
                              'labels': input_ids})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx: int):
        return self.data[idx]
In the constructor, it preprocesses the tokenized data into a dict with keys input_ids, attention_mask and labels for each data sequence, with tensor labels being equivalent to input_ids as explained above. The method __getitem__() simply serves the element idx in the dataset. As all sequences are of the same length T=512, they can be collated into a batch by the standard PyTorch collator (which understands such dicts just fine), so there is no need for a custom collator with padding.
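As a quick sanity check (our own snippet, not from the repo), you can verify that the default PyTorch collator stacks these dicts into batched tensors:
from torch.utils.data import default_collate

batch = default_collate([dset_train[0], dset_train[1]])
print(batch['input_ids'].shape)        # torch.Size([2, 512])
print(batch['attention_mask'].shape)   # torch.Size([2, 512])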
There are two ways to train Hugging Face transformers models: with the Trainer class or with a standard PyTorch training loop. We start with Trainer. After loading our model, tokenizer and two datasets, we create the training config.
training_args = transformers.TrainingArguments(
output_dir="idiot_save/",
learning_rate=1e-3,
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
num_train_epochs=20,
evaluation_strategy='epoch',
save_strategy='no',
)
There are tons of customizable parameters here, see the docs. The only reason we use batch sizes of 1 is because our dataset is so small. For larger datasets, we would use larger batch sizes, as much as fits into the GPU RAM. Now, we create the trainer and train.
trainer = transformers.Trainer(
model=model,
args=training_args,
train_dataset=dset_train,
eval_dataset=dset_val,
)
trainer.train()
Pretty simple, isn’t it?
Once our model is trained, we can save it to disk if we want.
model.save_pretrained('./trained_model/')
tokenizer.save_pretrained('./trained_model/')
We can also use it for text generation to test the trained model in action.
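For instance (our own sketch, reusing the directory saved above), the fine-tuned model can be loaded back and sampled from just like the original gpt2:
model = transformers.GPT2LMHeadModel.from_pretrained('./trained_model/')
tokenizer = transformers.AutoTokenizer.from_pretrained('./trained_model/')
enc = tokenizer(['The elf queen'], return_tensors='pt')
out = model.generate(input_ids=enc['input_ids'],
                     attention_mask=enc['attention_mask'],
                     do_sample=True, max_length=40)
print(tokenizer.batch_decode(out))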
While the Trainer class is “nice” for beginners, if you try to use it “in real life”, questions arise, such as:
- How exactly does it work?
- Where is the loss function?
- Validation loss is printed at each epoch, but where is the training loss?
- What device (CPU or GPU) is training running on and how to change that?
- (and many many other questions)
Following the usual pattern we see again and again in software development, tools dumbed down for beginners actually become inconvenient for serious use. In fact, if we run the code train_gpt2_trainer1.py, the validation loss actually increases at each epoch. Is something wrong, or is the model just overfitting to the tiny training set? Who knows. In the next section, we'll show how to train this model in a much more controllable way.
How to Train and Fine-Tune GPT-2 with PyTorch Training Loop?
Note: this section requires minimal knowledge of PyTorch. If you don’t know any PyTorch, then you will have to believe us. In PyTorch, there is no equivalent to transformers.Trainer or the fit() method of Keras or scikit-learn. Instead, you are supposed to write a training loop yourself in order to have complete control over it. Don’t worry, it’s just a few lines of code. The typical PyTorch training loop looks like this (rather schematically):
for i_epoch in range(n_epochs):
    for x, y in loader_train:
        optimizer.zero_grad()
        out = model(x)
        loss = my_loss_function(out, y)
        loss.backward()
        optimizer.step()
In train_gpt2_torch1.py, we implement this approach for the training of GPT-2. The model, tokenizer, and two datasets are created identically to the previous chapter. Then we create the two data loaders. PyTorch data loaders compose individual dataset elements into batches.
loader_train = torch.utils.data.DataLoader(dset_train, batch_size=1)
loader_val = torch.utils.data.DataLoader(dset_val, batch_size=1)
Next, we move the model to the requested device (GPU); in PyTorch, such operations are always performed explicitly by the user. Then we create the Adam optimizer:
DEVICE = 'cuda' # or 'cpu'
model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
The training loop itself is:
for i_epoch in range(20):
    loss_train = train_one(model, loader_train, optimizer)
    loss_val = val_one(model, loader_val)
    print(f'{i_epoch} : loss_train={loss_train}, loss_val={loss_val}')
Here we run training and validation for each epoch and print both training and validation losses. The function train_one() is:
def train_one(model, loader, optimizer):
    """Standard PyTorch training, one epoch"""
    model.train()
    losses = []
    for batch in tqdm.tqdm(loader):
        for k, v in batch.items():
            batch[k] = v.to(DEVICE)
        optimizer.zero_grad()
        out = model(input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    labels=batch['labels'])
        loss = out['loss']
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return np.mean(losses)
This is very similar to the “generic” PyTorch training loop, but note a couple of things:
- Our batch is now a dict containing both input tensors and labels
- Every tensor in this dict must be moved to the DEVICE (e.g., GPU) by hand
- The loss is calculated by the model itself when labels are provided and returned in out['loss']. This is a convention of transformers and not the typical behavior of PyTorch models. The loss itself is the cross-entropy loss (a standard classification loss) over V=50257 classes, averaged over T positions.
The validation function val_one() is similar but with no backpropagation. If we run the code, we can clearly see that the training loss decreases while the validation loss increases, which means we are indeed overfitting to the tiny training corpus (far too small for a relatively large GPT-2 model).
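For reference, a val_one() along these lines might look as follows (our sketch; see train_gpt2_torch1.py for the actual code):
def val_one(model, loader):
    """Validation, one epoch: like train_one(), but without gradients or optimizer steps"""
    model.eval()
    losses = []
    with torch.no_grad():
        for batch in tqdm.tqdm(loader):
            for k, v in batch.items():
                batch[k] = v.to(DEVICE)
            out = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])
            losses.append(out['loss'].item())
    return np.mean(losses)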
What is the difference between training and fine-tuning? All training we performed so far was technically fine-tuning, as we started from a pre-trained GPT-2. Fine-tuning is a fine art (pun intended); by overfitting to a new dataset too much, you can easily forget the previous learning. For successful fine-tuning, you might want to limit the number of epochs and/or decrease the learning rate. Here we fine-tuned GPT-2 on a custom corpus by its native next-token-prediction task, the first type of fine-tuning.
In contrast, training (or training from scratch) means starting from a randomly-initialized model with no pre-trained weights.
model = transformers.GPT2LMHeadModel(transformers.GPT2Config())
We can train GPT-2 on our tiny corpus successfully, but it will take more than 20 epochs, a few hundred at least. You can try that.
Fine-tuning the GPT-2 backbone with a new head for downstream tasks is a second kind of fine-tuning GPT-2. We’ll discuss the third possible kind of fine-tuning below in the GPT-3 fine-tuning chapter.
How to use GPT-3 with OpenAI (Python, CLI) API?
Large OpenAI models such as GPT-3, ChatGPT, and GPT-4 are not publicly available and can be used only via a paid OpenAI subscription, through OpenAI websites (such as the GPT-3 playground) and the OpenAI API. The models themselves run on OpenAI servers. Before we proceed, please sign up for OpenAI if you haven't already and try the GPT-3 playground. Next, generate an OpenAI API key in your account settings. Store it safely somewhere on your computer. This cryptographic key will allow you to access the OpenAI API.
This API (see the documentation) is a web API that you can access directly via, e.g., curl utility or with various language bindings. In this article, we are going to use OpenAI Python API and command line interface (CLI). Both are installed via PIP as:
pip3 install openai
Now openai is available both as a Python package and as a shell command. Next, I recommend setting your API key as an environment variable. On Linux, type in the terminal:
export OPENAI_API_KEY="<OPENAI_API_KEY>"
Where <OPENAI_API_KEY> is your API key. You can also put this line into your ~/.profile file.
If you prefer not to use this environment variable, you will have to pass the -k <OPENAI_API_KEY> option every time you run openai in the terminal, or, in Python code, set
openai.api_key = key
where key is the string containing your key.
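For example (our own snippet; the file name is hypothetical), you could keep the key in a local text file and read it at startup:
import openai

with open('openai_key.txt') as f:   # hypothetical file containing only your API key
    openai.api_key = f.read().strip()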
The CLI API is pretty easy (though not sufficiently documented). To see all available models, type
openai api models.list
Or to get the info on one particular model
openai api models.get -i text-ada-001
Note: there are currently four main versions of GPT-3 (from small to large): Ada, Babbage, Curie and DaVinci. We are going to use the smallest (and cheapest) one, Ada (text-ada-001). To generate the text from a prompt, type
openai api completions.create -m text-ada-001 -p "The elf queen"
and see the result. To see additional options, type
openai api completions.create -h
This was the CLI API. Now, how do we do the same in Python? It’s not much harder:
prompt = 'The elf queen'
response = openai.Completion.create(model="text-ada-001",
prompt=prompt, temperature=0.7, max_tokens=256)
print(response['choices'][0]['text'])
But what if we don’t want text generation but text embeddings? It’s also possible; however, generating embeddings is not allowed for common models such as text-ada-001. Instead, we have to use specialized models like text-embedding-ada-002. The Python code is:
text = 'The Elf queen'
res = openai.Embedding.create(model='text-embedding-ada-002',
input=text)
emb = res["data"][0]["embedding"]
print(emb)
print(type(emb), len(emb))
The result emb is a Python list of length 1536, presumably the raw transformer output embedding at the last position.
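Such embeddings are typically used for semantic search, clustering, or similarity scoring. As a small example of ours, the cosine similarity between two embeddings measures how semantically related the texts are:
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    res = openai.Embedding.create(model='text-embedding-ada-002', input=text)
    return np.array(res['data'][0]['embedding'])

a = embed('The elf queen')
b = embed('The dwarf king')
print('cosine similarity =', np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))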
How to fine-tune GPT-3 with OpenAI API?
GPT-3 fine-tuning is described here. However, you are not allowed to train a model from scratch. Neither are you allowed to fine-tune on a text corpus or fine-tune with additional heads. The only type of fine-tuning allowed is fine-tuning on prompt+completion pairs, represented in JSONL format, for example:
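A toy prompt/completion file, in the spirit of the colors dataset we use below (our own example), might look like this:
{"prompt": "banana is", "completion": "yellow"}
{"prompt": "grass is", "completion": "green"}
{"prompt": "snow is", "completion": "white"}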
Note: classification can be emulated by using “yes”/”no” completions. Of course, it is NOT a proper and efficient way to use GPT for classification.
How exactly is GPT-3 trained on such examples? We are not exactly sure (OpenAI is very secretive), but perhaps the two sequences of tokens are concatenated together, then GPT-3 is trained on such examples, but the loss is only calculated in the “completion” part.
Let’s try fine-tuning in action! We created a small JSONL dataset file colors.jsonl with a few pairs such as ‘banana is’ : ‘yellow’. Next, we run the (optional) utility to analyze the file and pre-process it if necessary
openai tools fine_tunes.prepare_data -f colors.jsonl
We got two suggestions from this tool:
- Add a suffix ending `\n` to all completions
- Add a whitespace character to the beginning of the completion
We said “Yes” to both suggestions and received the updated file colors_prepared.jsonl. Time for actual training.
openai api fine_tunes.create -t colors_prepared.jsonl -m ada
You can choose the model (ada, babbage, curie or davinci, without "text") and the file to train on. The file will be uploaded to the OpenAI cloud and will get a unique ID of the form file-<ID>. You can use such a file for subsequent fine-tunings by writing -t file-<ID> instead of the file name. You can list uploaded files with
openai api files.list
and delete an unneeded file with
openai api files.delete -i file-<ID>
Back to our training. Upon submission, your job (called a fine-tune) is assigned its own unique ID of the form ft-<ID>. The client disconnects soon (with a "Stream interrupted (client disconnected)" message), but that's OK. Your job is queued for half an hour or so and then starts training. You can check the status of all your fine-tunes (including completed ones) with:
openai api fine_tunes.list
You can check on a particular job (aka follow) with
openai api fine_tunes.follow -i ft-<ID>
Once the job is complete, you will get a detailed report of the type
The fine-tuned model gets its own unique ID ada:ft-personal-<TIME>. You can list all your personal models with
openai api models.list | grep personal
As suggested, you can now try your model for text generation with
openai api completions.create -m ada:ft-personal-<TIME> -p <YOUR_PROMPT>
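The same can be done from Python with the Completion API shown earlier (our snippet; the model ID placeholder is yours to fill in):
response = openai.Completion.create(model='ada:ft-personal-<TIME>',
    prompt='orange is', temperature=0.7, max_tokens=16)
print(response['choices'][0]['text'])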
The result for me was rather ambivalent. On one hand, I succeeded in teaching the model to “think colors”. On the other hand, for some reason the model hallucinated in haikus and produced a longer output than desired, for example
Even for a training set prompt “orange is” the result was still a three-line haiku (something we definitely did NOT train the model to do).
You can try to train GPT-3 for more epochs or on your own dataset. Enjoy GPT models!