### Requirements
Please submit your solution in [Canvas](https://canvas.chalmers.se/courses/36909/assignments/117615). **Submission deadline**: November 17.
Submit Python files containing your solution to the programming tasks described below. In addition, to save time for the people who grade your submission, please submit a text file containing the outputs printed out by your Python program; read the instructions carefully so that the right outputs are included. (Most importantly: the perplexity evaluated on the validation set, and the generated texts you have created in the last section.)
This is a pure programming assignment and you do not have to write a technical report or explain details of your solution: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.
## Step 0: Preliminaries
Make sure you have access to your solution for Programming Assignment 1 since you will reuse the tokenization and the training loop. (Optionally, use HuggingFace's `Trainer` instead.)
On Minerva, copy the skeleton from `/data/courses/2025_dat450_dit247/assignments/a2/A2_skeleton.py`.
This skeleton contains stub classes for all Transformer components, as well as a complete implementation of the RoPE positional representation (copied and somewhat simplified from the HuggingFace library).
## Step 1: Setting up a Transformer neural network
The figure below shows the design of the OLMo 2 Transformer.
**Implementation note:** To be 100% compatible with the OLMo 2 implementation, note that all the `nn.Linear` modules inside all layers are created without bias terms (`bias=False`). This includes the query, key, value, and output projections inside attention layers, all parts of the MLP layers, and the unembedding layer. If you solve the optional task at the end, where you copy the weights of a pre-trained model into your implementation, it is important that all layers are identical in structure.
### Configuration
Similarly to Assignment 1, the model hyperparameters you need for this assignment will be stored in a configuration object `A2ModelConfig`, which inherits from HuggingFace's `PretrainedConfig`. This configuration will be passed into `__init__` of all the Transformer's components.
### MLP layer
OLMo 2 uses an MLP architecture called SwiGLU, which was introduced in [this paper](https://arxiv.org/pdf/2002.05202). (In the paper, this type of network is referred to as FFN<sub>SwiGLU</sub>, described on page 2, Equation 6. Swish<sub>1</sub> corresponds to PyTorch's [SiLU](https://docs.pytorch.org/docs/stable/generated/torch.nn.SiLU.html) activation.) The figure below shows the architecture visually; in the diagram, the ⊗ symbol refers to element-wise multiplication.
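
In case it helps to see the overall shape, here is a minimal sketch of such a SwiGLU MLP module. The class name and the configuration field names (`hidden_size`, `intermediate_size`) are assumptions for illustration; use the names given in the skeleton.

```
import torch.nn as nn

class A2MLP(nn.Module):
    """Sketch of a SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, config):
        super().__init__()
        d, h = config.hidden_size, config.intermediate_size
        # All projections are bias-free, as in OLMo 2.
        self.gate_proj = nn.Linear(d, h, bias=False)
        self.up_proj = nn.Linear(d, h, bias=False)
        self.down_proj = nn.Linear(h, d, bias=False)
        self.act = nn.SiLU()   # Swish_1 in the SwiGLU paper

    def forward(self, x):
        # Element-wise product of the activated gate and the up projection.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```
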
The figure below shows what we will have to implement.
**Defining MHA components.** In `__init__`, define the `nn.Linear` components (square matrices) that compute query, key, and value representations, and the final outputs. (They correspond to what we called $W_Q$, $W_K$, $W_V$, and $W_O$ in [the lecture on Transformers](https://www.cse.chalmers.se/~richajo/dat450/lectures/l4/m4_2.pdf).) OLMo 2 also applies layer normalizers after the query and key representations.
**MHA computation, step 1.** The `forward` method takes two inputs, `hidden_states` and `rope_rotations`. The latter contains the precomputed rotations required for RoPE; how to compute them is explained in the last step, where you put together the full Transformer.
Continuing to work in `forward`, now compute query, key, and value representations; don't forget the normalizers after the query and key representations.
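
As a rough illustration of the steps so far, the attention module could start out along these lines. The class and attribute names are assumptions, and the normalizer type and the handling of attention heads should follow the skeleton and the OLMo 2 design rather than this sketch.

```
import torch.nn as nn

class A2MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        d = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = d // self.num_heads
        # W_Q, W_K, W_V, W_O: square, bias-free projections.
        self.q_proj = nn.Linear(d, d, bias=False)
        self.k_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)
        self.o_proj = nn.Linear(d, d, bias=False)
        # Layer normalizers applied to the query and key representations.
        self.q_norm = nn.RMSNorm(d)
        self.k_norm = nn.RMSNorm(d)

    def forward(self, hidden_states, rope_rotations):
        # Step 1: project, then normalize the queries and keys.
        queries = self.q_norm(self.q_proj(hidden_states))
        keys = self.k_norm(self.k_proj(hidden_states))
        values = self.v_proj(hidden_states)
        # ... continue with RoPE, splitting into heads, and the attention itself ...
```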
<details>
<summary><b>Hint</b>: Residual connections in PyTorch.</summary>
Put all the Transformer blocks in a <code>ModuleList</code> instead of a plain Python list. The <code>ModuleList</code> makes sure your parameters are registered so that they are included when you compute the gradients.
</details>
<details>
<summary><b>Hint</b>: Creating and applying the RoPE embeddings.</summary>
Create the <code>A2RotaryEmbedding</code> in <code>__init__</code>, as already indicated in the code skeleton. Then in <code>forward</code>, first create the rotations (again, already included in the skeleton). Then pass the rotations when you apply each Transformer layer.
</details>
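
Putting the two hints above together, the language model class might end up with roughly the following structure. This is a sketch with assumed class and field names; in particular, the exact way the rotations are created with `A2RotaryEmbedding` is already indicated in the skeleton and may differ from the call shown here.

```
import torch
import torch.nn as nn

class A2LanguageModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        # A ModuleList (not a plain list) so that the layers' parameters are registered.
        self.layers = nn.ModuleList(
            [A2TransformerLayer(config) for _ in range(config.num_hidden_layers)])
        self.rotary_emb = A2RotaryEmbedding(config)   # provided in the skeleton
        self.final_norm = nn.RMSNorm(config.hidden_size)
        self.unembedding = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids):
        hidden = self.embedding(input_ids)
        # Create the RoPE rotations once, then pass them to every Transformer layer.
        positions = torch.arange(input_ids.shape[1], device=input_ids.device).unsqueeze(0)
        rope_rotations = self.rotary_emb(hidden, positions)
        for layer in self.layers:
            hidden = layer(hidden, rope_rotations)
        return self.unembedding(self.final_norm(hidden))
```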
**Sanity check.** Now, the language model should be complete and you can test this in the same way as in Programming Assignment 1. Create a 2-dimensional *integer* tensor and apply your Transformer to it. The result should be a 3-dimensional tensor where the last dimension is equal to the vocabulary size.
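
For example, something along these lines (with placeholder names and arbitrary sizes):

```
import torch

model = A2LanguageModel(config)                            # your model and configuration
dummy_ids = torch.randint(0, config.vocab_size, (2, 16))   # 2 sequences of 16 token ids
logits = model(dummy_ids)
print(logits.shape)   # expected: torch.Size([2, 16, vocab_size])
```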
## Step 2: Training the language model
In Assignment 1, you implemented utilities to tokenize the text, load the documents, and handle training and validation. It should be possible to use your Transformer language model as a drop-in replacement for the RNN-based model you had in that assignment.
**Alternative solution.** Use a HuggingFace Trainer.
Select some suitable hyperparameters (number of Transformer layers, hidden layer size, number of attention heads).
For this assignment, we recommend using a small Transformer (e.g. a couple of layers).
Then run the training function and compute the perplexity on the validation set as in the previous assignment.
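
As an illustration only, a small setup could look like the following; the configuration field names are assumptions, the values are not prescribed, and `A2LanguageModel`, `tokenizer` and `mean_validation_loss` stand in for your own model class, tokenizer, and computed validation loss. As in Assignment 1, the perplexity is the exponential of the mean cross-entropy loss over the validation set.

```
import math

config = A2ModelConfig(
    vocab_size=len(tokenizer),     # vocabulary size from your Assignment 1 tokenizer
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=2,
    num_attention_heads=4,
)
model = A2LanguageModel(config)

# ... run your training and validation loop from Assignment 1 here ...

val_perplexity = math.exp(mean_validation_loss)   # mean cross-entropy on the validation set
print(val_perplexity)
```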
## Step 3: Generating text
### Predicting the next word
Run your generation algorithm with some different prompts and input parameters, and try to investigate the effects. In the reflection questions, you will be asked to summarize your impression of how texts are generated with different prompts and input parameters.
**Sanity check**: There are two ways to make this random sampling algorithm behave like *greedy decoding* (that is: there is no randomness, and the most likely next word is selected in each step). Run the function in these two ways and make sure you get the same output in both cases.
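
In case it is useful when experimenting with the parameters, here is a minimal sketch of what one top-*k* sampling step can look like. It is not necessarily identical to the function you implemented above, and the names are placeholders; it assumes the model returns a plain logits tensor.

```
import torch

@torch.no_grad()
def sample_next_token(model, input_ids, k=10, temperature=1.0):
    logits = model(input_ids)[:, -1, :] / temperature   # scores for the next token
    top_scores, top_ids = torch.topk(logits, k)         # keep the k best candidates
    probs = torch.softmax(top_scores, dim=-1)           # renormalize over those candidates
    choice = torch.multinomial(probs, num_samples=1)    # sample one of them
    return top_ids.gather(-1, choice)                   # map back to vocabulary indices
```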
Here are a few example prompts that could be interesting to try:
<pre>
'In natural language processing, a Transformer'
'Is Stockholm the capital of Sweden? Answer yes or no. The answer is'
'Write a Python program that reverses a list.'
</pre>
### Comparing to a pre-trained Transformer
Your language model will probably be able to generate texts that look somewhat like English, but they will be rather bland and nonsensical. As an alternative, let's load the pre-trained OLMo 2 model (the 1 billion-parameter version). We have downloaded a copy to Minerva to save you some download time. Here is how it can be loaded:
```
from transformers import AutoTokenizer, AutoModelForCausalLM

# local_dir should point to the downloaded copy of OLMo 2 on Minerva.
tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(local_dir, local_files_only=True)
```
**Note:** when you apply this model, the return value is a [`CausalLMOutputWithPast`](https://huggingface.co/docs/transformers/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) object, not just the logits. This object has a field called `logits`. Otherwise, you should be able to use the pre-trained model in your generation algorithm.
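
For example, if your generation code expects a plain logits tensor, you can unpack it like this:

```
outputs = model(input_ids)      # a CausalLMOutputWithPast object, not a tensor
logits = outputs.logits         # shape: (batch_size, sequence_length, vocab_size)
```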
Try the test examples once again with the pre-trained model and note the differences. In the reflection questions, there will be some questions about these differences.
Note that this is a pure language model (like the one you trained) and it has not been *instruction-tuned*. That is: it has not been post-trained to allow interactive chatting.
**Optional task.** To verify that your implementation is identical to the OLMo 2 model, copy the weight tensors from the pre-trained model into an instance of your own implementation, and verify that you get exactly the same results.
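
One possible way to approach this is sketched below, assuming your parameter names line up with the HuggingFace ones; if they do not, you will need to translate the names when copying. Here, `my_model` and `dummy_ids` are placeholders for your own model instance and a test input.

```
import torch

# Copy the pre-trained weights; strict=False reports mismatching parameter
# names instead of raising an error.
missing, unexpected = my_model.load_state_dict(model.state_dict(), strict=False)
print('missing:', missing)
print('unexpected:', unexpected)

# Compare the outputs of the two models on the same input.
with torch.no_grad():
    reference_logits = model(dummy_ids).logits
    my_logits = my_model(dummy_ids)
print(torch.allclose(reference_logits, my_logits, atol=1e-4))
```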