
Commit 943dd9a

Merge pull request #109 from ricj/master
Preliminary
2 parents c0f0622 + 5d35906 commit 943dd9a

2 files changed: +332 -135 lines


_pages/dat450/assignment2.md

Lines changed: 43 additions & 135 deletions
@@ -1,190 +1,99 @@
---
layout: page
-title: 'DAT450/DIT247: Programming Assignment 2: Generating text from a language model'
+title: 'DAT450/DIT247: Programming Assignment 2: Transformer language models'
permalink: /courses/dat450/assignment2/
description:
nav: false
nav_order: 4
---

-# DAT450/DIT247: Programming Assignment 2: Generating text from a language model
+# DAT450/DIT247: Programming Assignment 2: Transformer language models

In this assignment, we extend the models we investigated in the previous assignment in two different ways:
- In the previous assignment, we used a model that takes a fixed number of previous words into account. Now, we will use a model capable of considering a variable number of previous words: a *recurrent neural network*. (Optionally, you can also investigate *Transformers*.)
- In this assignment, we will also use our language model to generate texts.

### Pedagogical purposes of this assignment
-- Investigating more capable neural network architectures for language modeling.
+- Understanding the Transformer architecture in detail when it is used for language modeling.
- Understanding text-generating algorithms.

### Requirements

-Please submit your solution in [Canvas](https://chalmers.instructure.com/courses/31739/assignments/98455). **Submission deadline**: November 18.
+Please submit your solution in [Canvas](https://chalmers.instructure.com/courses/XX/assignments/YY). **Submission deadline**: November XX.

-Submit a notebook containing your solution to the programming tasks described below. This is a pure programming assignment and you do not have to write a technical report or explain details of your solution in the notebook: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.
+Submit a XX

## Step 0: Preliminaries

Make sure you have access to your solution for Programming Assignment 1 since you will reuse some parts.

-Copy the tokenization and integer encoding part into a new notebook.
+Copy the skeleton from SOMEWHERE.

-## Step 1: Adapting your code for RNNs
+## Step 1: Setting up a Transformer neural network

-### Adapting the preprocessing
+To be fully compatible with the Olmo 2 implementation, note that the `nn.Linear` modules inside all layers are bias-free (`bias=False`).

-In the previous assignment, you developed preprocessing tools that extracted fixed-length sequences from the training data. You will now adapt the preprocessing so that you can deal with inputs of variable length.
+![Olmo2 overview](olmo2_overview.svg)

-**Splitting**: While we will deal with longer sequences than in the previous assignment, we'll still have to control the maximal sequence length (or we'll run out of GPU memory). Define a hyperparameter `max_sequence_length` and split your sequences into pieces that are at most of that length. (Side note: in RNN training, limiting the sequence length is called <a href="https://d2l.ai/chapter_recurrent-neural-networks/bptt.html"><em>truncated backpropagation through time</em></a>.)

-**Padding**: In the previous assignment, you developed a tool that finds the most frequent words in order to build a vocabulary. In this vocabulary, you defined special symbols to cover a number of corner cases: the beginning and end of text passages, and when a word is previously unseen or too infrequent.
-Now, change your vocabulary builder to include a new special symbol that we will call *padding*: this will be used when our batches contain texts of different lengths.
+### Configuration
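
The configuration is not spelled out in this preliminary version, so the sketch below only illustrates the kind of hyperparameter container that the following sections refer to; all names and values are placeholders rather than the ones in the skeleton.

```
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Placeholder names and values; use whatever the skeleton provides.
    vocab_size: int = 32000
    hidden_size: int = 512            # width of embeddings and the residual stream
    intermediate_size: int = 2048     # width of the SwiGLU MLP layer
    num_attention_heads: int = 8
    num_layers: int = 4
    rms_norm_eps: float = 1e-6        # the `eps` passed to RMSNorm below

config = ModelConfig()
```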

-After these changes, preprocess the text and build the vocabulary as in the previous assignment. Store the integer-encoded paragraphs in two lists, corresponding to the training and validation sets.
+### MLP layer

-**Sanity check**: You should have around 147,000 training paragraphs and 18,000 validation paragraphs. However, since you split the sequences, you will in the end get a larger number of training and validation instances. (The exact numbers depend on `max_sequence_length`.)
+Olmo 2 uses an MLP architecture called SwiGLU, which was introduced in [this paper](https://arxiv.org/pdf/2002.05202).
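
As a rough guide, a SwiGLU layer can be implemented as in the sketch below (the attribute names and the two sizes are illustrative, and all linear layers are bias-free as noted above):

```
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward layer: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SiLU ("Swish") gate applied elementwise to the up-projected input.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```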

-### Adapting the batcher
+**Sanity check.**

-In the previous assignment, you implemented some function to create training batches: that is, to put some number of training instances into a PyTorch tensor.
+### Normalization

-Now, change your batching function so that it can deal with sequences of variable lengths.
-Since the output of the batching function are rectangular tensors, you need to *pad* sequences so they are of the same length.
-So for each instance that is shorter than the longest instance in the batch, you should append the padding symbol until it has the right length.
+To stabilize gradients during training, deep learning models with many layers often include some *normalization* (such as batch normalization or layer normalization). Transformers typically include normalization layers at several places in the stack.

-**Sanity check**: Inspect a few batches. Make sure that they are 2-dimensional integer tensors with *B* rows, where *B* is the batch size you defined. The number of columns probably varies from batch to batch, but should never be longer than `max_sequence_length` you defined previously.
-The integer-encoded padding symbol should only occur at the end of sequences.
+Olmo 2 uses a type of normalization called [Root Mean Square layer normalization](https://arxiv.org/pdf/1910.07467).

-## Step 2: Designing a language model using a recurrent neural network
+Here, you can either implement your own normalization layer, or use the built-in [`RMSNorm`](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html) from PyTorch. In the PyTorch implementation, `eps` corresponds to `rms_norm_eps` from our model configuration, while `normalized_shape` should be equal to the hidden layer size. The hyperparameter `elementwise_affine` should be set to `True`, meaning that we include some learnable weights in this layer rather than a pure normalization.

-### Setting up the neural network structure
+If you want to make your own layer, the PyTorch documentation shows the formula you will have to implement. (The $\gamma_i$ parameters are the learnable weights.)
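
A from-scratch version could look like the following sketch, which you can compare against the built-in `nn.RMSNorm`:

```
import torch
import torch.nn as nn

class MyRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))  # the learnable gamma parameters

    def forward(self, x):
        # Divide by the root mean square over the hidden dimension, then scale.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight

# Quick comparison with PyTorch's implementation:
x = torch.randn(2, 7, 16)
print(torch.allclose(MyRMSNorm(16)(x), nn.RMSNorm(16, eps=1e-6)(x), atol=1e-6))
```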

-Define a neural network that implements an RNN-based language model. It should include the following layers:
+**Sanity check.**

-- an *embedding layer* that maps token integers to floating-point vectors,
-- an *recurrent layer* implementing some RNN variant (we suggest [`nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) or [`nn.GRU`](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html)),
-- an *output layer* that computes (the logits of) a probability distribution over the vocabulary.
+### Multi-head attention

-You will have to define some hyperparameters such as the embedding size (as in the previous assignment) and the size of the RNN's hidden state.
+Let's take the trickiest part first!

-<details>
-<summary><b>Hint</b>: If you are doing the batching as recommended above, you should set <code>batch_first=True</code> when declaring the RNN.</summary>
-<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
-The input to an RNN is a 3-dimensional tensor. If we set <code>batch_first=True</code>, then we assume that the input tensor is arranged as (<em>B</em>, <em>N</em>, <em>E</em>) where <em>B</em> is the batch size, <em>N</em> is the sequence length, and <em>E</em> the embedding dimensionality. In this case, the RNN "walks" along the second dimension: that is, over the sequence of tokens.
-
-If on the other hand you set <code>batch_first=False</code>, then the RNN walks along the first dimension of the input tensor and it is assumed to be arranged as (<em>N</em>, <em>B</em>, <em>E</em>).
-</div>
-</details>
-
-<details>
-<summary><b>Hint</b>: How to apply RNNs in PyTorch.</summary>
-<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
-<p>
-Take a look at the documentation of one of the RNN types in PyTorch. For instance, here is the documentation of <a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html"><code>nn.LSTM</code></a>. In particular, look at the section called <b>Outputs</b>. It is important to note here that all types of RNNs return <b>two</b> outputs when you call them in the forward pass. In this assignment, you will need the <b>first</b> of these outputs, which correspond to the RNN's output for each <em>token</em>. (The other outputs are the <em>layer-wise</em> outputs.)
-</p>
-<p>
-As we discussed in the previous assignment, PyTorch allows users to set up neural networks in different ways: the more compact approach using <code>nn.Sequential</code>, and the more powerful approach by inheriting from <code>nn.Module</code>.
-</p>
-
-<p>
-If you implement your language model by inheriting from <code>nn.Module</code>, just remember that the RNN gives two outputs in the forward pass, and that you just need the first of them.
-</p>
-<pre>
-class MyRNNBasedLanguageModel(nn.Module):
-    def __init__(self, ... ):
-        super().__init__()
-        ... initialize model components here ...
-
-    def forward(self, batch):
-        embedded = ... apply the embedding layer ...
-        rnn_out, _ = self.rnn(embedded)
-        ... do the rest ...
-</pre>
-
-<p>
-If you define your model using a <code>nn.Sequential</code>, we need a workaround to deal with the complication that the RNN returns two outputs. Here is one way to do it.
-</p>
-<pre>
-class RNNOutputExtractor(nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, rnn_out):
-        return rnn_out[0]
-</pre>
-<p>
-The <code>RNNOutputExtractor</code> can then be put after the RNN in your list of layers.
-</p>
-</div>
-</details>
+It is OK to use PyTorch's [`scaled_dot_product_attention`](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) to compute the final step. (In that case, set `is_causal=True`.)

-**Sanity check**: carry out the following steps:
-- Create an integer tensor of shape 1x*N* where *N* is the length of the sequence. It doesn't matter what the integers are except that they should be less than the vocabulary size. (Alternatively, take one instance from your training set.)
-- Apply the model to this input tensor. It shouldn't crash here.
-- Make sure that the shape of the returned output tensor is 1x*N*x*V* where *V* is the size of the vocabulary. This output corresponds to the logits of the next-token probability distribution, but it is useless at this point because we haven't yet trained the model.
+If you want to use your own implementation, the [documentation of the PyTorch implementation](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) includes a piece of code that you can start from.
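
For orientation, a bare-bones causal multi-head attention layer built around this function might look like the sketch below; it leaves out the positional encoding and other refinements that the full Olmo 2 attention layer includes (see the overview figure), and the dimension names are illustrative.

```
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x):
        B, N, H = x.shape
        # Project and reshape to (batch, heads, tokens, head_dim).
        q = self.q_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention with a causal mask over the token positions.
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Back to (batch, tokens, hidden_size), then the output projection.
        return self.o_proj(attn_out.transpose(1, 2).reshape(B, N, H))
```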

-### Training the model
+**Sanity check.**

-Adapt your training loop from the previous assignment, with the following changes
+### The full Transformer block
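
As a sketch of how the pieces from the previous subsections could be combined (reusing the `CausalSelfAttention` and `SwiGLUMLP` sketches above): the version below normalizes the output of each sublayer before the residual addition, which is how we read the Olmo 2 design, but double-check the placement of the normalization layers against the overview figure.

```
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, hidden_size, intermediate_size, num_heads, rms_norm_eps):
        super().__init__()
        self.attention = CausalSelfAttention(hidden_size, num_heads)
        self.attention_norm = nn.RMSNorm(hidden_size, eps=rms_norm_eps)
        self.mlp = SwiGLUMLP(hidden_size, intermediate_size)
        self.mlp_norm = nn.RMSNorm(hidden_size, eps=rms_norm_eps)

    def forward(self, x):
        # Residual connection around the normalized attention sublayer...
        x = x + self.attention_norm(self.attention(x))
        # ...and around the normalized MLP sublayer.
        x = x + self.mlp_norm(self.mlp(x))
        return x
```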

-<details>
-<summary><b>Hint</b>: the output tensor is the input tensor, shifted one step to the right.</summary>
-<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
-For instance, let's say our training text is <em>This is great !</em> (in practice, the words will be integer-coded).
-That means that at the first word (<em>This</em>), we want the model to predict the second word (<em>is</em>). At the second word, the goal is to predict <em>great</em>, and so on.
+**Sanity check.**

-So when you process a batch in the training loop, you should probably split it into an input and an output part:
-<pre>
-input_tokens = batch[:, :-1]
-output_tokens = batch[:, 1:]
-</pre>
-</div>
-This means that the input consists of all the columns in the batch except the last one, and the output of all the columns except the first one.
-</details>
+### The complete Transformer stack

-<details>
-<summary><b>Hint</b>: how to apply the loss function when training a language model.</summary>
-<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
-The loss function (<a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"><code>CrossEntropyLoss</code></a>) expects two input tensors:
-<ul>
-<li>the <em>logits</em> (that is: the unnormalized log probabilities) of the predictions,</li>
-<li>the <em>targets</em>, that is the true output values we want the model to predict.</li>
-</ul>
+The embedding and unembedding layers will be identical to what you had in Programming Assignment 1 (except that the unembedding layer should be bias-free, as mentioned above).
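
Putting everything together, the complete stack could be organized roughly as below (a sketch that reuses the `TransformerBlock` above and the placeholder configuration names; a final normalization before the unembedding is assumed here, so check the overview figure):

```
import torch.nn as nn

class TransformerLanguageModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.blocks = nn.ModuleList(
            TransformerBlock(config.hidden_size, config.intermediate_size,
                             config.num_attention_heads, config.rms_norm_eps)
            for _ in range(config.num_layers))
        self.final_norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        # The unembedding layer is bias-free, as noted above.
        self.unembedding = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, token_ids):
        x = self.embedding(token_ids)          # (B, N) -> (B, N, hidden_size)
        for block in self.blocks:
            x = block(x)
        return self.unembedding(self.final_norm(x))   # logits of shape (B, N, vocab_size)
```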

-Here, the tensor is expected to be one-dimensional (of length <em>B</em>, where <em>B</em> is the batch size) and the logits tensor to be two-dimensional (of shape (<em>B</em>, <em>V</em>) where <em>V</em> is the number of choices).
+## Step 2: Training the language model

-In our case, the loss function's expected input format requires a small trick, since our targets tensor is two-dimensional (<em>B</em>, <em>N</em>) where <em>N</em> is the maximal text length in the batch. Analogously, the logits tensor is three-dimensional (<em>B</em>, <em>N</em>, <em>V</em>). To deal with this, you need to reshape the tensors before applying the loss function.
-<pre>
-targets = targets.view(-1) # 2-dimensional -> 1-dimensional
-logits = logits.view(-1, logits.shape[-1]) # 3-dimensional -> 2-dimensional
-</pre>
-</div>
-</details>
-
-<details>
-<summary><b>Hint</b>: take padding into account when defining the loss.</summary>
-<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
-When the loss is computed, we don't want to include the positions where we have inserted the dummy padding tokens.
-<a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"><code>CrossEntropyLoss</code></a> has a parameter <code>ignore_index</code> that you can set to the integer you use to represent the padding tokens.
-</div>
-</details>
+**Alternative solution.** Use a HuggingFace Trainer.

Run the training function and compute the perplexity on the validation set as in the previous assignment.
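
If you write the loop yourself, a single training step can follow the same shift-and-reshape pattern as in the previous assignment, sketched below; the targets at each position are simply the next tokens.

```
import torch.nn as nn

pad_index = 0  # placeholder: the integer code of your padding symbol, if batches are padded
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_index)  # don't count padded positions

def training_step(model, batch, optimizer):
    input_tokens = batch[:, :-1]     # all positions except the last
    target_tokens = batch[:, 1:]     # the next token at each position
    logits = model(input_tokens)                            # (B, N, V)
    loss = loss_fn(logits.reshape(-1, logits.shape[-1]),    # (B*N, V)
                   target_tokens.reshape(-1))               # (B*N,)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```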

## Step 3: Generating text

### Predicting the next word

-As a starting point, we'll repeat the exercise from the first assignment where we see what the model predicts as the next word of a given sequence. For instance, for the sequence `he lives in san`, a well-trained model will typically predic the word `francisco`. The steps will typically be something like the following:
+As a starting point, we'll repeat the exercise from the first assignment where we see what the model predicts as the next word of a given sequence. For instance, for the sequence `he lives in san`, a well-trained model will typically predict the word `francisco`. The steps will typically be something like the following (a short code sketch is given after the list):

- Apply the model to the integer-encoded input text.
-- Take the model's output at the last position.
+- Take the model's output at the last position (but make sure that you avoid an end-of-sentence dummy here).
- Use <a href="https://pytorch.org/docs/stable/generated/torch.argmax.html"><code>argmax</code></a> to find the index of the highest-scoring item.
-- Apply the inverse vocabulary encoder (that you created in Step 2) so that you can understand what words the model thinks are the most likely in this context.
+- Apply the inverse vocabulary encoder so that you can understand what words the model thinks are the most likely in this context.
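
A compact sketch of these steps, where `vocab` and `inv_vocab` stand in for your vocabulary encoder and its inverse:

```
import torch

def predict_next_word(model, vocab, inv_vocab, text):
    token_ids = torch.tensor([[vocab[w] for w in text.split()]])  # shape (1, N)
    with torch.no_grad():
        logits = model(token_ids)                 # shape (1, N, V)
    last_logits = logits[0, -1]                   # output at the last position
    best_index = torch.argmax(last_logits).item()
    return inv_vocab[best_index]                  # back to a word string

# For example: predict_next_word(model, vocab, inv_vocab, 'he lives in san')
```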

### Generating texts

-Implement a random sampling algorithm as described in the recording ([video](https://youtu.be/QtwpM-OGOew), [pdf](http://www.cse.chalmers.se/~richajo/dat450/lectures/l4/m4_3.pdf)). The function should take the following inputs:
+Implement a random sampling algorithm as described in the recording ([video](https://youtu.be/QtwpM-OGOew), [pdf](http://www.cse.chalmers.se/~richajo/dat450/lectures/l3/l3_generating.pdf)). The function should take the following inputs:

- `model`: the language model that we use to predict the next token.
- `prompt`: the prompt that initializes the text generation.
@@ -226,16 +135,15 @@ Run your generation algorithm with some different prompts and input parameters,

**Sanity check**: There are two ways to make this random sampling algorithm behave like *greedy decoding* (that is: there is no randomness, and the most likely next word is selected in each step). Run the function in these two ways and make sure you get the same output in both cases.
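
For reference, a minimal sampling loop along these lines is sketched below; `max_new_tokens` and `temperature` are stand-ins for whatever parameters your function actually takes, and `vocab`/`inv_vocab` are again your vocabulary encoder and its inverse.

```
import torch

def generate(model, prompt, vocab, inv_vocab, max_new_tokens=20, temperature=1.0):
    token_ids = [vocab[w] for w in prompt.split()]
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(torch.tensor([token_ids]))[0, -1]      # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)       # temperature scaling
        next_id = torch.multinomial(probs, num_samples=1).item()  # random sampling
        token_ids.append(next_id)
    return ' '.join(inv_vocab[i] for i in token_ids)
```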

-## Optional tasks
-
-These tasks can be done if you are curious but will not affect your score.
-
-### Dealing with repetition
-
-As you might have observed, it is a common problem when generating from an autoregressive language model that some words or phrases are repeated over and over, in particular if you use greedy decoding (or beam search) or random sampling with a low temperature.
+### Comparing to a pre-trained Transformer

-Implement some trick to try to reduce the amount of repetition, for instance by penalizing the generation algorithm if it wants to generate words that it has already generated.
+```
+from transformers import AutoTokenizer, AutoModelForCausalLM
+local_dir = '/data/courses/2025_dat450_dit247/models/OLMo-2-0425-1B'
+tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
+model = AutoModelForCausalLM.from_pretrained(local_dir, local_files_only=True)
+```
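
Once loaded, the pre-trained model can be run on the same kinds of prompts as your own model, for example:

```
prompt = 'He lives in San'
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```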

-### Transformer language models
+Note that this

-Compare the RNN-based language model to an autoregressive Transformer. See the PyTorch tutorial for an example of how to set up a Transformer-based language model using PyTorch's Transformer implementation.
+**Optional task.** To verify that your implementation is identical to the Olmo 2 model, copy the weight tensors from the pre-trained model into an instance of your own implementation, and verify that you get exactly the same results.
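
One possible way to approach this, sketched below: build a mapping from the parameter names in your implementation to those in the pre-trained checkpoint (the names here are hypothetical), copy the tensors, and compare the logits on a test input.

```
import torch

my_model = TransformerLanguageModel(config)

# Hypothetical name mapping: adapt it to your own attribute names and to the
# names you see in the pre-trained model's state_dict().
name_map = {
    'embedding.weight': 'model.embed_tokens.weight',
    # ... one entry per parameter tensor in your model ...
}

pretrained_state = model.state_dict()   # `model` is the pre-trained Olmo 2 model loaded above
my_model.load_state_dict(
    {mine: pretrained_state[theirs] for mine, theirs in name_map.items()},
    strict=False)

test_ids = tokenizer('He lives in San', return_tensors='pt')['input_ids']
with torch.no_grad():
    print(torch.allclose(my_model(test_ids), model(test_ids).logits, atol=1e-4))
```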
