Commit ffa2943

Merge pull request #121 from ricj/master

Completed Assignment 2.

2 parents b69b8f0 + 6d7dcf4, commit ffa2943

1 file changed: +27 -14 lines changed

_pages/dat450/assignment2.md (27 additions & 14 deletions)
@@ -19,15 +19,17 @@ In this assignment, we extend the models we investigated in the previous assignm
### Requirements

-Please submit your solution in [Canvas](https://chalmers.instructure.com/courses/XX/assignments/YY). **Submission deadline**: November 17.
+Please submit your solution in [Canvas](https://canvas.chalmers.se/courses/36909/assignments/117615). **Submission deadline**: November 17.

-Submit a XX
+Submit Python files containing your solution to the programming tasks described below. In addition, to save time for the people grading your submission, please submit a text file containing the outputs printed by your Python program; read the instructions carefully so that the right outputs are included. (Most importantly: the perplexity evaluated on the validation set, and the generated texts you created in the last section.)

+This is a pure programming assignment and you do not have to write a technical report or explain the details of your solution: there will be a separate individual assignment where you will answer some conceptual questions about what you have done here.

## Step 0: Preliminaries

-Make sure you have access to your solution for Programming Assignment 1 since you will reuse the training loop. (Optionally, use HuggingFace's `Trainer` instead.)
+Make sure you have access to your solution for Programming Assignment 1, since you will reuse the tokenization and the training loop. (Optionally, use HuggingFace's `Trainer` instead.)

-Copy the skeleton from SOMEWHERE.
+On Minerva, copy the skeleton from `/data/courses/2025_dat450_dit247/assignments/a2/A2_skeleton.py`.
+This skeleton contains stub classes for all the Transformer components, as well as a complete implementation of the RoPE positional representation (copied and somewhat simplified from the HuggingFace library).

## Step 1: Setting up a Transformer neural network

@@ -37,10 +39,12 @@ The figure below shows the design of the OLMo 2 Transformer. We will reimplement
<img src="https://raw.githubusercontent.com/ricj/dsai-nlp.github.io/refs/heads/master/_pages/dat450/olmo2_overview.svg" alt="Olmo2 overview" style="width:10%; height:auto;">

-**Implementation note:** To be fully compatible with the OLMo 2 implementation, note that all the `nn.Linear` inside of all layers are bias-free (`bias=False`). This includes Q, K, V, and O projections inside attention layers, all parts of the MLP layers, and the unembedding layer. If you solve the optional task at the end where you copy the weights of a pre-trained model into your implementation, then it is important that all layers are identical in structure.
+**Implementation note:** To be fully compatible with the OLMo 2 implementation, all the `nn.Linear` modules in all layers must have no bias terms (`bias=False`). This includes the query, key, value, and output projections inside the attention layers, all parts of the MLP layers, and the unembedding layer. If you solve the optional task at the end, where you copy the weights of a pre-trained model into your own implementation, it is important that all layers are identical in structure.

### Configuration

+Similarly to Assignment 1, the model hyperparameters you need for this assignment are stored in a configuration object `A2ModelConfig`, which inherits from HuggingFace's `PretrainedConfig`. This configuration is passed into the `__init__` of all the Transformer's components.
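As a rough illustration of the pattern: each component receives the configuration object and reads the hyperparameters it needs. The attribute names below (`hidden_size`, `num_attention_heads`, and so on) are placeholders; the real names and defaults are defined by `A2ModelConfig` in the skeleton.

```
import torch.nn as nn
from transformers import PretrainedConfig

# Hypothetical configuration: the skeleton's A2ModelConfig defines the real attributes.
class ToyConfig(PretrainedConfig):
    def __init__(self, hidden_size=256, num_attention_heads=4,
                 intermediate_size=1024, num_hidden_layers=2,
                 vocab_size=8000, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.vocab_size = vocab_size

# Each Transformer component is constructed from the configuration.
class ToyComponent(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
```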
### MLP layer

OLMo 2 uses an MLP architecture called SwiGLU, which was introduced in [this paper](https://arxiv.org/pdf/2002.05202). (In the paper, this type of network is referred to as FFN<sub>SwiGLU</sub>, described on page 2, Equation 6. Swish<sub>1</sub> corresponds to PyTorch's [SiLU](https://docs.pytorch.org/docs/stable/generated/torch.nn.SiLU.html) activation.) The figure below shows the architecture visually; in the diagram, the ⊗ symbol refers to element-wise multiplication.
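To make the equation concrete, here is a minimal sketch of such an MLP layer. The attribute names (`gate_proj`, `up_proj`, `down_proj`) are our own choice, not something the skeleton requires.

```
import torch.nn as nn

class SwiGLUSketch(nn.Module):
    """FFN_SwiGLU: down( SiLU(gate(x)) * up(x) ), with all projections bias-free."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()   # Swish_1 in the paper

    def forward(self, x):
        # element-wise product of the gated and ungated projections (the ⊗ in the figure)
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```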
@@ -80,7 +84,7 @@ The figure below shows what we will have to implement.
**Defining MHA components.** In `__init__`, define the `nn.Linear` components (square matrices) that compute query, key, and value representations, and the final outputs. (They correspond to what we called $W_Q$, $W_K$, $W_V$, and $W_O$ in [the lecture on Transformers](https://www.cse.chalmers.se/~richajo/dat450/lectures/l4/m4_2.pdf).) OLMo 2 also applies layer normalizers after the query and key representations.

-**MHA computation, step 1.** The `forward` method takes two inputs `hidden_states` and `position_embedding`.
+**MHA computation, step 1.** The `forward` method takes two inputs, `hidden_states` and `rope_rotations`. The latter contains the precomputed rotations required for RoPE; the section on the complete Transformer stack explains how to compute them.

Continuing to work in `forward`, now compute the query, key, and value representations; don't forget the normalizers after the query and key representations.
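To fix ideas, a rough sketch of the attention module is shown below. All names are placeholders, the RoPE rotations are deliberately *not* applied (in your solution, rotate the queries and keys using the helper provided in the skeleton), and `scaled_dot_product_attention` is just one convenient way to compute the masked attention itself; you can also implement the attention formula from the lecture directly.

```
import torch.nn as nn
import torch.nn.functional as F

class AttentionSketch(nn.Module):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # W_Q, W_K, W_V, W_O: square, bias-free projections
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        # normalizers applied to the query and key representations
        self.q_norm = nn.RMSNorm(hidden_size)
        self.k_norm = nn.RMSNorm(hidden_size)

    def forward(self, hidden_states, rope_rotations=None):
        B, T, H = hidden_states.shape
        q = self.q_norm(self.q_proj(hidden_states))
        k = self.k_norm(self.k_proj(hidden_states))
        v = self.v_proj(hidden_states)
        # split the hidden dimension into heads: (B, T, H) -> (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # NOTE: apply the RoPE rotations to q and k here, using the skeleton's helper.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # merge the heads again and apply the output projection
        return self.o_proj(attn.transpose(1, 2).reshape(B, T, H))
```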

@@ -146,15 +150,15 @@ As shown in the figure, a Transformer layer should include an attention layer an
<details>
<summary><b>Hint</b>: Residual connections in PyTorch.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
-Assuming your
+Assuming your input is called <code>h_old</code>, a residual connection is implemented as a straightforward addition:
<pre>
h_new = do_something(h_old)
out = h_new + h_old
</pre>
</div>
</details>
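Putting the pieces together, a Transformer layer could look roughly like the sketch below. It reuses the attention and MLP sketches from earlier, and the placement of the two normalizers is our reading of the OLMo 2 diagram, so check it against the figure before adopting the pattern.

```
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    def __init__(self, hidden_size, num_heads, intermediate_size):
        super().__init__()
        self.attention = AttentionSketch(hidden_size, num_heads)   # sketch above
        self.mlp = SwiGLUSketch(hidden_size, intermediate_size)    # sketch above
        self.attention_norm = nn.RMSNorm(hidden_size)
        self.mlp_norm = nn.RMSNorm(hidden_size)

    def forward(self, hidden_states, rope_rotations=None):
        # attention sub-layer, normalized output, residual connection
        h = hidden_states + self.attention_norm(
            self.attention(hidden_states, rope_rotations))
        # MLP sub-layer, normalized output, residual connection
        return h + self.mlp_norm(self.mlp(h))
```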

-**Sanity check.** Carry out the usual sanity check to see that the shapes are right and there are no crashes.
+**Sanity check.** Carry out the usual sanity check to see that the shapes are correct and there are no crashes.

### The complete Transformer stack

@@ -165,30 +169,28 @@ The embedding and unembedding layers will be identical to what you had in Progra
<summary><b>Hint</b>: Use a <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.ModuleList.html"><code>ModuleList</code></a>.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
Put all the Transformer blocks in a <code>ModuleList</code> instead of a plain Python list. The <code>ModuleList</code> makes sure your parameters are registered so that they are included when you compute the gradients.
-</pre>
</div>
</details>

<details>
<summary><b>Hint</b>: Creating and applying the RoPE embeddings.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
Create the <code>A2RotaryEmbedding</code> in <code>__init__</code>, as already indicated in the code skeleton. Then in <code>forward</code>, first create the rotations (again, already included in the skeleton). Then pass the rotations when you apply each Transformer layer.
-</pre>
</div>
</details>

**Sanity check.** Now, the language model should be complete and you can test this in the same way as in Programming Assignment 1. Create a 2-dimensional *integer* tensor and apply your Transformer to it. The result should be a 3-dimensional tensor where the last dimension is equal to the vocabulary size.
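For instance, a stripped-down stack and the corresponding sanity check could look like the sketch below. It reuses the layer sketch from earlier, omits the RoPE rotations, and adds a final normalizer before the unembedding; in your real model, create the `A2RotaryEmbedding` and pass its rotations to every layer as described in the hint.

```
import torch
import torch.nn as nn

class LMSketch(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_heads, intermediate_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        # ModuleList so that the layers' parameters are registered
        self.layers = nn.ModuleList(
            TransformerLayerSketch(hidden_size, num_heads, intermediate_size)
            for _ in range(num_layers))
        self.final_norm = nn.RMSNorm(hidden_size)
        self.unembedding = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids):
        h = self.embedding(input_ids)
        # in the real model: compute the RoPE rotations here and pass them to each layer
        for layer in self.layers:
            h = layer(h)
        return self.unembedding(self.final_norm(h))

# Sanity check: batch of 2 sequences of length 5, with token ids below the vocabulary size.
model = LMSketch(vocab_size=100, hidden_size=64, num_heads=4,
                 intermediate_size=256, num_layers=2)
logits = model(torch.randint(0, 100, (2, 5)))
print(logits.shape)   # expected: torch.Size([2, 5, 100])
```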

## Step 2: Training the language model

-In Assignment 1, you implemented a utility to handle training and validation. Your Transformer language model should be possible to use as a drop-in replacement for the RNN-based model you had in that assignment.
+In Assignment 1, you implemented utilities to tokenize the text, load the documents, and handle training and validation. Your Transformer language model should work as a drop-in replacement for the RNN-based model you had in that assignment.

**Alternative solution.** Use a HuggingFace `Trainer`.

Select some suitable hyperparameters (number of Transformer layers, hidden layer size, number of attention heads).
+For this assignment, we recommend using a small Transformer (e.g. a couple of layers).
Then run the training function and compute the perplexity on the validation set as in the previous assignment.
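If you compute the perplexity yourself rather than via the `Trainer`, the recipe is the same as in Assignment 1: exponentiate the mean per-token cross-entropy over the validation set. A sketch, assuming a hypothetical `val_loader` that yields integer tensors of shape (batch_size, sequence_length) and ignoring any padding handling from your own pipeline:

```
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_perplexity(model, val_loader, device='cpu'):
    total_loss, total_tokens = 0.0, 0
    for batch in val_loader:                 # batch: (batch_size, seq_len) token ids
        batch = batch.to(device)
        logits = model(batch[:, :-1])        # predict each next token
        targets = batch[:, 1:]
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), reduction='sum')
        total_loss += loss.item()
        total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)   # exp of the mean negative log-likelihood
```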

## Step 3: Generating text

### Predicting the next word
@@ -242,17 +244,28 @@ This function takes a tensor as input and returns the <em>k</em> highest scores
Run your generation algorithm with some different prompts and input parameters, and try to investigate the effects. In the reflection questions, you will be asked to summarize your impression of how texts are generated with different prompts and input parameters.

-**Sanity check**: There are two ways to make this random sampling algorithm behave like *greedy decoding* (that is: there is no randomness, and the most likely next word is selected in each step). Run the function in these two ways and make sure you get the same output in both cases.
+Here are a few example prompts that could be interesting to try:
+<pre>
+'In natural language processing, a Transformer'
+'Is Stockholm the capital of Sweden? Answer yes or no. The answer is'
+'Write a Python program that reverses a list.'
+</pre>
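For reference, a rough sketch of top-*k* sampling with a temperature parameter is shown below. The tokenizer interface (`encode`/`decode`) and the parameter names are placeholders that you should adapt to your own Assignment 1 code.

```
import torch

@torch.no_grad()
def sample_text(model, tokenizer, prompt, max_new_tokens=50, k=20, temperature=1.0):
    ids = torch.tensor([tokenizer.encode(prompt)])         # shape: (1, prompt_length)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                      # scores for the next token
        top_scores, top_ids = logits.topk(k, dim=-1)       # keep the k best candidates
        probs = torch.softmax(top_scores / temperature, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)   # sample among the k candidates
        next_id = top_ids.gather(-1, choice)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0].tolist())
```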

### Comparing to a pre-trained Transformer

+Your language model will probably be able to generate texts that look somewhat like English, but they will be rather bland and nonsensical. As an alternative, let's load the pre-trained OLMo 2 model (the 1 billion-parameter version). We have downloaded a copy to Minerva to save you some download time. Here is how to load it:
```
from transformers import AutoTokenizer, AutoModelForCausalLM
local_dir = '/data/courses/2025_dat450_dit247/models/OLMo-2-0425-1B'
tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(local_dir, local_files_only=True)
```

-Note that this
+**Note:** When you apply this model, the return value is a [`CausalLMOutputWithPast`](https://huggingface.co/docs/transformers/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) object, not just the logits; the logits are stored in its `logits` field. Apart from that, you should be able to use the pre-trained model in your generation algorithm.
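For instance, getting the scores for the next token could look like this:

```
inputs = tokenizer('In natural language processing, a Transformer', return_tensors='pt')
outputs = model(**inputs)                     # a CausalLMOutputWithPast object
next_token_logits = outputs.logits[:, -1, :]  # shape: (batch_size, vocab_size)
```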
+Try the example prompts once again with the pre-trained model and note the differences. In the reflection questions, you will be asked about these differences.

+Note that this is a pure language model (like the one you trained) and it has not been *instruction-tuned*. That is: it has not been post-trained to allow interactive chatting.

**Optional task.** To verify that your implementation is identical to the Olmo 2 model, copy the weight tensors from the pre-trained model into an instance of your own implementation, and verify that you get exactly the same results.
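One possible way to approach this (a sketch only; `rename_parameter` and `my_model` are hypothetical stand-ins): first print the pre-trained model's parameter names and shapes, then build a mapping from those names to the names used in your own implementation, and finally load the renamed state dict.

```
# Inspect the pre-trained parameters to see what needs to be mapped.
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))

# rename_parameter is a hypothetical helper that translates a pre-trained parameter
# name (e.g. something like 'model.layers.0.self_attn.q_proj.weight') into the
# corresponding name in your own model.
renamed = {rename_parameter(name): tensor for name, tensor in model.state_dict().items()}
my_model.load_state_dict(renamed)
```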
