Assignment 1: Build a Toy Llama-2 Language Model¶
CISC7021 Applied Natural Language Processing (2024/2025)
In this assignment, we will prepare a toy language model that employs the Llama-2 architecture and evaluate its perplexity on the provided datasets.
We will learn how to perform continual pre-training of a base language model using the PyTorch and Hugging Face libraries. Detailed instructions for building this language model can be found in the attached notebook file.
Acknowledgement: The base model checkpoint is converted from the llama2.c project. The data instances were sampled from the TinyStories dataset.
🚨 Please note that running this on CPU may be slow. If running on Google Colab or Kaggle, you can avoid this by going to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4. This should be included within the free tier of Colab.
We start by doing a `pip install` of all required libraries.

- 🤗 `transformers`, `datasets`, and `accelerate` are Hugging Face libraries.
- By default, Colab has the `transformers` and `pytorch` libraries installed. If you are using a local machine, please install them via `pip` or `conda`.
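A typical install cell might look like the following (a sketch; the exact package list pinned in the provided notebook may differ):

```python
# In a Colab/Kaggle notebook cell; on a local machine, use pip or conda directly.
!pip install transformers datasets accelerate
```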
(Optional) Uploading the model/data to Google Colab or Kaggle.¶
Please upload your dataset and model to computational platforms if you are using Colab or Kaggle environments.
For Colab users, you can mount your Google Drive files by running the following code snippet:
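A minimal version of that snippet, using the standard Colab API (Kaggle users attach datasets through the platform UI instead):

```python
from google.colab import drive

# Mount your Google Drive so the model checkpoint and data become visible
# under /content/drive inside the Colab runtime.
drive.mount('/content/drive')
```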
Mounted at /content/drive
Necessary Packages, Environment Setups¶
Please set the correct file path based on your environment.
- If you are using Colab, the path may be: `/content/drive/MyDrive/xxxxxx`
- If you are using Kaggle, the path may be: `/kaggle/input/xxxxxx`
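For example, you might collect these locations in a couple of variables near the top of the notebook; the variable names and the `xxxxxx` placeholders below are illustrative:

```python
# Illustrative path variables; replace the placeholders with your own folders.
MODEL_PATH = "/content/drive/MyDrive/xxxxxx"   # Colab
DATA_PATH  = "/content/drive/MyDrive/xxxxxx"
# MODEL_PATH = "/kaggle/input/xxxxxx"          # Kaggle
```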
Load the model checkpoint onto either a GPU or the CPU (training will be slow on a CPU, but decoding speed will still be acceptable).
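A minimal loading sketch, assuming the path variables above (the provided notebook cell may differ in its details):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device type:", device.type)

# MODEL_PATH is the (illustrative) checkpoint folder defined earlier.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).to(device)
```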
Device type: cuda
As we can see from the statistics, this model is much smaller than Llama-2 but shares the same decoder-only architecture.
😄 You do not need to check the complex details! We just present the architecture and the number of parameters here.
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 512)
    (layers): ModuleList(
      (0-7): 8 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=512, out_features=512, bias=False)
          (k_proj): Linear(in_features=512, out_features=512, bias=False)
          (v_proj): Linear(in_features=512, out_features=512, bias=False)
          (o_proj): Linear(in_features=512, out_features=512, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=512, out_features=1376, bias=False)
          (up_proj): Linear(in_features=512, out_features=1376, bias=False)
          (down_proj): Linear(in_features=1376, out_features=512, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=512, out_features=32000, bias=False)
)
#Parameters: 41.69M
Task 1: Decoding¶
If you are familiar with the usage of the `model.generate()` function in the `transformers` library, please feel free to jump to the Task 1 Playground.
💡Tutorials: model.generate() function.¶
Minimal example:
prompt = "Once upon a time, " # Input, prefix of generation
Step 1: Encode raw text using tokenizer model.
tokenized_input = tokenizer.encode(prompt, return_tensors='pt').to(device)
Step 2: Set decoding hyper-parameters. Get the model output.
output_ids = model.generate(tokenized_input, do_sample=True, max_new_tokens=300, temperature=0.6)
Important parameters:
- `max_new_tokens`: The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
- `temperature`: The value used to modulate the next-token probabilities. Higher temperature -> more diverse text; lower temperature -> more deterministic text.
- `do_sample`: `do_sample=False` uses the greedy decoding strategy. To enable greedy decoding, we also need to set the other sampling parameters `top_p` and `temperature` to `None`.
- If you are interested in other decoding algorithms, please refer to this link for setting parameters.
Step 3: Convert model outputs into raw text.
output_text = tokenizer.decode(output_ids[0])
or (when input instances >=1)
output_text = tokenizer.batch_decode(output_ids)
Important parameters:
- Setting `skip_special_tokens=True` will prevent special tokens, such as `<s>`, from appearing in the results.
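Putting the three steps together, a minimal end-to-end sketch looks like this (it assumes the `model`, `tokenizer`, and `device` objects loaded earlier):

```python
# Step 1: encode the prompt into token IDs.
prompt = "Once upon a time, "
tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)

# Step 2: sample up to 300 new tokens with temperature 0.6.
output_ids = model.generate(
    tokenized_input,
    do_sample=True,
    max_new_tokens=300,
    temperature=0.6,
)

# Step 3: decode back into raw text, dropping special tokens such as <s>.
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
```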
To understand the outputs of each step, let us do a simple generation task step by step! (Note: the base model is only able to produce fluent story text).
tensor([[ 1, 9038, 2501, 263, 931, 29892, 624, 3547, 4562, 750, 263, 12561, 29889]], device='cuda:0')
====================Token IDs==================== tensor([[ 1, 9038, 2501, 263, 931, 29892, 624, 3547, 4562, 750, 263, 12561, 29889, 2296, 5131, 304, 367, 263, 12456, 985, 29889, 2296, 5131, 304, 19531, 263, 9560, 10714, 322, 263, 528, 4901, 20844, 29889, 1205, 1183, 471, 2086, 2319, 322, 278, 10714, 471, 2086, 4802, 29889, 13, 6716, 2462, 29892, 624, 3547, 4446, 263, 4802, 29892, 528, 4901, 10714, 297, 263, 3787, 29889, 2296, 4433, 902, 16823, 565, 1183, 1033, 505, 372, 29889, 2439, 16823, 1497, 4874, 322, 18093, 372, 363, 902, 29889, 13, 855, 3547, 471, 577, 9796, 29889, 2296, 1925, 373, 278, 10714, 322, 3252, 381, 839, 2820, 29889, 2296, 7091, 763, 263, 1855, 12456, 985, 29889, 13, 6246, 769, 29892, 1554, 8515, 9559, 29889, 624, 3547, 4687, 304, 4459, 270, 466, 1537, 29889, 2296, 8496, 29915, 29873, 2317, 701, 7812, 29889, 2296, 7091, 763, 1183, 471, 10917, 1076, 2820, 322, 2820, 29889, 13, 855, 3547, 29915, 29879, 16823, 4446, 902, 322, 1497, 29892, 376, 855, 3547, 29892, 366, 817, 304, 2125, 263, 2867, 29889, 887, 1106, 270, 466, 1537, 1213, 13, 855, 3547, 3614, 1283, 278, 10714, 322, 6568, 1623, 373, 278, 11904, 29889, 2296, 5764, 902, 5076, 322, 3614, 263, 6483, 16172, 29889, 2860, 263, 2846, 6233, 29892, 1183, 7091, 2253, 29889, 13, 855, 3547, 25156, 322, 1497, 29892, 376, 29924, 290, 29892, 306, 29915, 29885, 7960, 304, 367, 263, 12456, 985, 1449, 3850, 1]], device='cuda:0')
====================Decoded Results==================== Once upon a time, Stella Lou had a dream. She wanted to be a princess. She wanted to wear a beautiful dress and a shiny crown. But she was too small and the dress was too big. One day, Stella saw a big, shiny dress in a store. She asked her mom if she could have it. Her mom said yes and bought it for her. Stella was so happy. She put on the dress and twirled around. She felt like a real princess. But then, something strange happened. Stella started to feel dizzy. She couldn't stand up straight. She felt like she was spinning around and around. Stella's mom saw her and said, "Stella, you need to take a break. You look dizzy." Stella took off the dress and lay down on the floor. She closed her eyes and took a deep breath. After a few minutes, she felt better. Stella smiled and said, "Mom, I'm ready to be a princess again!"
Another pipeline example: Sampling decoding with temperature.¶
<s> Once upon a time, Stella Lou had a dream. She wanted to be the most popular girl in the world. Everywhere she went, people would smile and say how much they liked her. One day, Stella was walking down the street when she saw a little girl. The girl was wearing a pretty dress and had a big smile on her face. Stella was so excited that she ran up to the girl and said, "Hi! I'm Stella Lou. What's your name?" The little girl smiled and said, "My name is Sarah. I'm so happy to meet you!" Stella Lou was so happy that she started to dance around Sarah. She said, "Let's go to the park and play together!" So, Stella and Sarah went to the park and played all day. They laughed and had so much fun. Stella Lou was so happy that she had made a new friend. At the end of the day, Stella Lou said goodbye to Sarah and thanked her for being her friend. She went home and fell asleep with a big smile on her face. She dreamt of all the fun she had with Sarah the next day.<s>
Task 1 Playground¶
📚 Task 1: Please generate English stories using various prompts and decoding settings. Please feel free to explore any interesting phenomena, such as the impact of different prompts and the effects of various decoding algorithms and parameters. For example, quantify the text properties using linguistic-driven metrics like story length and Type-Token Ratio (TTR). In addition to objective metrics, you are encouraged to discuss your findings based on subjective case studies.
We provide two types of skeleton code: one that takes a single prompt as input and another that can process batched inputs and decoding. Please use the version that best fits your preferences and data types.
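If you also want to experiment outside the provided skeletons, a rough sketch of a batched decoding loop might look like this (this is not the provided skeleton code; the prompts and padding setup are illustrative):

```python
# Batched decoding needs a pad token; left padding works best for
# decoder-only generation.
prompts = ["Once upon a time, ", "Tom is a cute kitty. "]
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
output_ids = model.generate(
    **batch,
    do_sample=True,
    max_new_tokens=300,
    temperature=0.6,
)
for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)
    print("=" * 40)
```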
Once upon a time, and her mommy went to the park. They saw a big puddle and mommy said, "Let's splash in it!" But the puddle was too deep and mommy said, "No, we can't. We will get all wet!" Suddenly, a big, scary dog came running towards them. Mommy said, "Oh no! That dog is dangerous!" But the dog just wanted to play and jumped into the puddle. The dog splashed and splashed until mommy was all wet. Mommy laughed and said, "That was fun! Let's go home now." And they walked away, leaving the dangerous dog behind. Tom is a cute kitty. He likes to play with his ball and his mouse. He also likes to eat fish and chicken. But he does not like to sleep in his crib. His crib is big and soft and has a soft blanket. One night, Tom's mom comes to his crib. She says, "Tom, it is time to sleep. You need to rest your eyes and your ears. Sleeping is good for you. It helps you grow and learn and be happy." Tom does not want to sleep in his crib. He wants to play with his ball and his mouse. He says, "No, mom, I do not want to sleep. I want to play. Sleeping is boring." Tom's mom says, "Tom, you have to sleep in your crib. It is safe and cozy and has your toys. You can play with your ball and your mouse later. Sleeping is important. It makes you strong and smart and brave. It also helps you dream and dream and dream. You can dream of anything you want." Tom thinks for a moment. He looks at his ball and his mouse. He looks at his crib. He looks at his mom. He says, "Okay, mom, I will sleep in my crib. But can I have a hug and a kiss?" Tom's mom smiles and hugs him. She says,
What about other languages?¶
Oops! This English language model cannot generate stories in other languages!
Why? Let us evaluate the perplexity of different languages in the next task.
<s> 从前有一只小兔子乖乖ons. They were very excited to go to the park. When they got there, they saw a big, red slide. They ran over to it and started to slide down. They laughed and giggled as they slid down. When they got to the bottom, they saw a big, blue ball. They both wanted to play with it. "Let's play with the ball," said Sammy. "No, let's play with the ball," said Mommy. They argued for a while, but then Mommy had an idea. "Why don't we take turns? You can play with the ball first, and then Sammy can play with it." Sammy and Mommy agreed. Sammy played with the ball first, then Mommy played with the ball. Sammy had lots of fun. When it was Sammy's turn, he was so excited. He ran over to the ball and started to play with it. Mommy watched him and smiled. "That's a good idea Sammy," said Mommy. "It's important to share and take turns." Sammy nodded and smiled. He was happy that Mommy was so understanding.<s>
Task 2: Perplexity Evaluation¶
Background¶
The perplexity serves as a key metric for evaluating language models. It quantifies how well a model predicts a sample, with lower perplexity indicating better performance. For a tokenized sequence $X = (x_0, x_1, \ldots, x_t)$, the perplexity is defined as

$$\mathrm{PPL}(X) = \exp\left\{-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right\}$$

Here, $\log p_\theta(x_i \mid x_{<i})$ is the log-likelihood of the $i$-th token conditioned on the preceding tokens $x_{<i}$.
⚠️ Please make sure to run the following cell first to define the evaluation function.
😄 You do not need to check these complex details! Too hard for beginners! However, if you are interested, you can compare the following code with the explanations above to better understand how to implement PPL evaluation using PyTorch.
💡Tutorials: compute_ppl() function.¶
Minimal example:
test_dataset = ["Once upon a time,"]
compute_ppl(
    model=model,
    tokenizer=tokenizer,
    device=device,
    inputs=test_dataset,
    batch_size=16
)
Important parameters:
- `inputs`: list of input texts; each separate text snippet is one list entry.
- `batch_size`: the batch size to run the evaluation.
Returns:
- `perplexity`: a `{"perplexities": [x.x, x.x, ...], "mean_perplexity": x.x}` dictionary containing the perplexity score for each text in the input list, as well as the mean perplexity.
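For the curious, here is a rough sketch of how such a `compute_ppl` function could be implemented in PyTorch; it follows the formula above but is not necessarily identical to the evaluation cell provided in the notebook:

```python
import torch
import torch.nn.functional as F
from tqdm import tqdm

def compute_ppl(model, tokenizer, device, inputs, batch_size=16):
    """Sketch: per-text perplexity from the causal-LM negative log-likelihood."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model.eval()
    ppls = []
    for start in tqdm(range(0, len(inputs), batch_size)):
        batch = inputs[start:start + batch_size]
        enc = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        input_ids, attn_mask = enc["input_ids"], enc["attention_mask"]
        with torch.no_grad():
            logits = model(input_ids, attention_mask=attn_mask).logits
        # Shift so that tokens < i predict token i.
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        shift_mask = attn_mask[:, 1:]
        nll = F.cross_entropy(
            shift_logits.transpose(1, 2), shift_labels, reduction="none"
        )
        # Mean NLL per sequence (ignoring padding), then exponentiate.
        seq_nll = (nll * shift_mask).sum(dim=1) / shift_mask.sum(dim=1)
        ppls.extend(torch.exp(seq_nll).tolist())
    return {"perplexities": ppls, "mean_perplexity": sum(ppls) / len(ppls)}
```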
Task 2 Playground¶
📚 Task 2: Evaluate the perplexity. Ensure that you evaluate both the English and Chinese test data we provided. You are encouraged to collect more diverse text data and discuss your findings regarding the language understanding capacity of the base model.
Note: If you want to reuse the evaluation code for JSONL data, please structure the content as follows:
{"text": "one data"}
{"text": "two data."}
...
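One way to read such a JSONL file into the `inputs` list expected by `compute_ppl` (the file name below is a placeholder):

```python
from datasets import load_dataset

# "my_stories.jsonl" is a placeholder; point this at your own file.
dataset = load_dataset("json", data_files="my_stories.jsonl", split="train")
texts = dataset["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device,
                      inputs=texts, batch_size=16)
print(f"Perplexity: {results['mean_perplexity']:.2f}")
```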
You may find that the PPL value for Chinese text is significantly higher than that for English text. This explains why the base model could not generate a Chinese story at the end of the previous task.
Perplexity: 10.68
(English Text) Test Perplexity: 4.14
(Chinese Text) Test Perplexity: 70030.42
Task 3: Continual Pre-training (in Chinese or in another language you are proficient in)¶
Currently, our base LM is proficient in English but lacks the capability to generate or comprehend other languages (e.g., Chinese). The objective of this task is to enhance this base English LM by continually pre-training it on text in another language. This process aims to enable the model to understand and generate mini-stories in that language.
We have provided 10,000 Chinese training samples. The training process for any language is the same. We have included useful resource links (in Assignment description PDF) to help you create additional data. If you encounter any issues in creating a dataset in another language, please do not hesitate to contact us.
We have implemented data preprocessing and the training pipeline, so you are not required to optimize these components. Instead, focus on tuning the training hyperparameters and observe the changes in model performance.
⚠️ Please make sure to run the following cell first to pre-process data.
😄 You do not need to check the details of the whole pipeline construction! Please pay attention to the hyper-parameters of the `trainer`.
Preprocess Data¶
Here, we preprocess (tokenize and group) the text for the subsequent evaluation and pre-training phases.
Load the prepared Chinese dataset from Google Drive (or local disk).
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 500
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1000
    })
})

从前,有一个小女孩名叫莉莉。她喜欢和家人一起去度假。有一天,她的家人决定去海边旅行。莉莉非常兴奋,她跳起来又跳下去,像发了疯一样。 当他们到达海滩时,他们搭起了遮阳伞和毯子。莉莉想立刻去游泳,但她的父母告诉她要等吃完午餐再说。莉莉感到很不耐烦,她说:“我现在就想去游泳!”她妈妈回答:“莉莉,我们需要先吃东西。游泳需要能量。” 莉莉意识到妈妈说得对,于是耐心地等待午餐结束。她学会了有时候要克制自己的激动情绪,并听从父母的意见。从那天起,莉莉变得更善于倾听,也更加享受她的假期时光。
We tokenize the raw text using Llama-2's tokenizer and group the tokenized text as inputs.
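This follows the usual causal-LM recipe: tokenize every story, concatenate the token streams, and cut them into fixed-length blocks whose `input_ids` also serve as `labels`. A rough sketch, assuming the DatasetDict above is held in a variable such as `raw_datasets` and using an illustrative `block_size` (the provided preprocessing cell may differ in details):

```python
block_size = 512  # illustrative; must not exceed the model's context length

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token lists, then split them into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])
lm_datasets = tokenized.map(group_texts, batched=True)
```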
💡Tutorials: TrainingArguments().¶
Important Training Hyper-parameters
- learning_rate: The initial learning rate for the optimizer.
- num_train_epochs: Total number of training epochs to perform (if not an integer, the decimal part is treated as the fraction of the final epoch to run before stopping training).
- *_strategy: The evaluation/saving strategy to adopt during training. Possible values are:
  - `"no"`: No evaluation/saving is done during training.
  - `"steps"`: Evaluation/saving is done (and logged) every `eval_steps`.
  - `"epoch"`: Evaluation/saving is done at the end of each epoch.
- per_device_train_batch_size: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
- per_device_eval_batch_size: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.
- save_total_limit: If a value is passed, will limit the total amount of checkpoints.
If you do not understand the `AdamW` optimizer and learning-rate scheduler, you may use the default settings.
Optimizer Hyper-parameters
- weight_decay: The weight decay to apply (if not zero) to all layers except bias and LayerNorm weights in the `AdamW` optimizer.
- adam_beta1: The beta1 hyperparameter for the `AdamW` optimizer.
- adam_beta2: The beta2 hyperparameter for the `AdamW` optimizer.
Learning schedule
- lr_scheduler_type: The scheduler type to use.
- warmup_ratio: Ratio of total training steps used for a linear warmup from 0 to `learning_rate`.
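To make the knobs above concrete, here is a hedged sketch of how they could be passed to `TrainingArguments` and wired into a `Trainer`; the values, output path, and dataset variable names are illustrative, not the exact settings used to produce the training log below:

```python
from transformers import Trainer, TrainingArguments, default_data_collator

training_args = TrainingArguments(
    output_dir="./ckpt-zh",            # hypothetical output directory
    learning_rate=5e-4,
    num_train_epochs=8,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy="steps",             # `evaluation_strategy` in older releases
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

# lm_datasets comes from the preprocessing sketch above; the provided
# notebook builds its own Trainer, so treat this only as an illustration.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=default_data_collator,
)
trainer.train()
```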
Task 3 Playground¶
📚 Please just run the following code to do continual pre-training. Please try your best to tune the hyperparameters or collect more data to improve model performance.
Step | Training Loss | Validation Loss |
---|---|---|
200 | 3.298300 | 3.243898 |
400 | 1.910000 | 1.898938 |
600 | 1.548700 | 1.571377 |
800 | 1.381600 | 1.436444 |
1000 | 1.313700 | 1.372024 |
1200 | 1.252300 | 1.322885 |
1400 | 1.179800 | 1.296618 |
1600 | 1.129800 | 1.278112 |
1800 | 1.088200 | 1.270051 |
2000 | 1.106700 | 1.262663 |
TrainOutput(global_step=2000, training_loss=1.7592032198905945, metrics={'train_runtime': 733.814, 'train_samples_per_second': 86.878, 'train_steps_per_second': 2.725, 'total_flos': 4956004181606400.0, 'train_loss': 1.7592032198905945, 'epoch': 8.0})
Load the continually pre-trained model and try to generate a mini-story in another language.
Device type: cuda
Evaluate the PPL on Chinese text (or another language) again.
You will notice that we actually achieve a much lower PPL after continual pre-training.
Validation Perplexity: 3.53 Test Perplexity: 3.52
The original English base model was pre-trained on 2 million data samples. Considering we are using only 10,000 training samples (0.5% of the original pre-training data), the model can generate a few fluent sentences but may still struggle with long-text generation or with commonsense knowledge in the new language. You can try using more data or more training steps, depending on your computational resources.
<s> 从前,有一只叫做汤姆的猫。汤姆和他的朋友们一起玩。他们喜欢在公园里玩耍。有一天,汤姆和他的朋友们决定去公园玩。 在公园里,汤姆看到了一个大滑梯。他想玩滑梯。他跑去滑梯,但不小心掉在了地上。汤姆很伤心。 汤姆很高兴他的朋友们能帮助他。他们一起玩滑梯,度过了很多快乐时光。