
Commit e862e8a

Merge pull request #115 from ricj/master
preliminary
2 parents c0b7397 + 1e7ae0e commit e862e8a

File tree

1 file changed: +52 -6 lines changed


_pages/dat450/assignment2.md

Lines changed: 52 additions & 6 deletions
@@ -109,7 +109,9 @@ $$
 \alpha(q, k) = \frac{q \cdot k^{\top}}{\sqrt{d_h}}
 $$
 
-Second, add a *causal mask* to the pre-activations. This mask is necessary for autoregressive (left-to-right) language models: it ensures that the attention heads can only consider tokens before the current one. The mask should have the shape $(m, m)$; its lower triangle including the diagonal should be 0 and the upper triangle $-\infty$. PyTorch's <a href="https://docs.pytorch.org/docs/stable/generated/torch.tril.html"><code>tril</code></a> can be convenient here.
+The transposition of the key tensor can be carried out by calling <code>k.transpose(-2, -1)</code>.
+
+Second, add a *causal mask* to the pre-activations. This mask is necessary for autoregressive (left-to-right) language models: it ensures that the attention heads can only consider tokens before the current one. The mask should have the shape $(m, m)$; its lower triangle including the diagonal should be 0 and the upper triangle $-\infty$. PyTorch's <a href="https://docs.pytorch.org/docs/stable/generated/torch.tril.html"><code>tril</code></a> or <a href="https://docs.pytorch.org/docs/stable/generated/torch.triu.html"><code>triu</code></a> can be convenient here.
 
 Then apply the softmax to get the attention weights.
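For reference, a minimal sketch of this step might look as follows; the tensor names and shapes are assumptions, not taken from the assignment's starter code.

```
import torch

# Sketch only: scaled dot-product pre-activations with a causal mask,
# assuming q and k have shape (batch, n_heads, m, d_h).
b, n_heads, m, d_h = 2, 4, 5, 8
q = torch.randn(b, n_heads, m, d_h)
k = torch.randn(b, n_heads, m, d_h)

# alpha(q, k) = q @ k^T / sqrt(d_h)
scores = q @ k.transpose(-2, -1) / d_h ** 0.5   # (b, n_heads, m, m)

# Causal mask: 0 on and below the diagonal, -inf above it.
mask = torch.triu(torch.full((m, m), float("-inf")), diagonal=1)
scores = scores + mask

weights = torch.softmax(scores, dim=-1)         # attention weights
```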

@@ -125,21 +127,65 @@ $$
 </div>
 </details>
 
-**Sanity check step 2.**
+**MHA computation, step 3.** Now we need to combine the results from the individual attention heads. We first flip the second and third dimensions of the tensor (so that the first two dimensions correspond to the batch size and the text length), and then reshape it into the right shape.
+```
+attn_out = attn_out.transpose(1, 2).reshape(b, m, d)
+```
+Then compute the final output representation (by applying the linear layer we called $W_O$ above) and return the result.
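As an illustration of this combination step, a hedged sketch is shown below; the tensor shapes and the variable name W_O are assumptions rather than the assignment's actual names.

```
import torch
from torch import nn

# Sketch only: merge the per-head outputs and apply the output projection.
b, n_heads, m, d_h = 2, 4, 5, 16
d = n_heads * d_h                                     # model dimension
attn_out = torch.randn(b, n_heads, m, d_h)            # per-head results

# Flip the head and position dimensions, then merge the heads back into d.
attn_out = attn_out.transpose(1, 2).reshape(b, m, d)  # (b, m, d)

W_O = nn.Linear(d, d)                                 # the linear layer called W_O
out = W_O(attn_out)                                   # final MHA output, (b, m, d)
```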
 
-### The full Transformer block
+**Sanity check steps 2 and 3.**
+Once again, create an MHA layer for testing and apply it to an input tensor of the same shape as before. Assuming you don't get any crashes here, the output should have the same shape as the input. If it crashes or your output has the wrong shape, insert `print` statements along the way, or use an editor with step-by-step debugging, to check the shapes at each step.
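A sketch of such a check follows, assuming the module implemented above is a class called MultiHeadAttention with a (d_model, n_heads) constructor; the name and signature are assumptions.

```
import torch

# Hypothetical shape check; MultiHeadAttention stands for the module
# implemented above (its name and constructor are assumptions).
b, m, d = 2, 7, 64
mha = MultiHeadAttention(d_model=d, n_heads=8)
x = torch.randn(b, m, d)
out = mha(x)
print(out.shape)                 # expected: torch.Size([2, 7, 64])
assert out.shape == x.shape
```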
 
-**Sanity check.**
+### The full Transformer decoder layer
+
+After coding up the multi-head attention, everything else is just a simple assembly of pieces!
+
+In the constructor `__init__`, create the components in this block, taking the model configuration into account.
+As shown in the figure, a Transformer layer should include an attention layer and an MLP, with normalizers. In `forward`, connect the components to each other; remember to put the residual connections in the right places.
+
+<details>
+<summary><b>Hint</b>: Residual connections in PyTorch.</summary>
+<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
+Assuming your code computes a new representation from the old one, the residual connection simply adds the two:
+<pre>
+h_new = do_something(h_old)
+out = h_new + h_old
+</pre>
+</div>
+</details>
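To make the assembly concrete, here is a hedged sketch of one possible layer; the pre-norm placement of the normalizers, the GELU MLP, and the module names (including the MultiHeadAttention class) are assumptions, not the assignment's prescribed design.

```
import torch
from torch import nn

# Sketch only: one Transformer decoder layer with residual connections.
# MultiHeadAttention is assumed to be the module built earlier; the
# pre-norm LayerNorm placement and the GELU MLP are assumptions.
class TransformerLayer(nn.Module):
    def __init__(self, d, n_heads, d_ff):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d)
        self.attn = MultiHeadAttention(d_model=d, n_heads=n_heads)
        self.mlp_norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d),
        )

    def forward(self, h):
        h = h + self.attn(self.attn_norm(h))   # residual around the attention
        h = h + self.mlp(self.mlp_norm(h))     # residual around the MLP
        return h
```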
+
+**Sanity check.** Carry out the usual sanity check to see that the shapes are right and there are no crashes.
 
 ### The complete Transformer stack
 
-The embedding and unembedding layers will be identical to what you had in Programming Assignment 1 (except that the unembedding layer should be bias-free, as mentioned above).
+Now, set up the complete Transformer stack including the embedding and unembedding layers.
+The embedding and unembedding layers will be identical to what you had in Programming Assignment 1 (except that the unembedding layer should be bias-free, as mentioned in the beginning).
+
+<details>
+<summary><b>Hint</b>: Use a <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.ModuleList.html"><code>ModuleList</code></a>.</summary>
+<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
+Put all the Transformer blocks in a <code>ModuleList</code> instead of a plain Python list. The <code>ModuleList</code> makes sure your parameters are registered so that they are included when you compute the gradients.
+</div>
+</details>
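A hedged sketch of such a stack, reusing the hypothetical TransformerLayer from the sketch above; the class name and configuration fields are assumptions.

```
from torch import nn

# Sketch only: embedding, a ModuleList of Transformer layers, and a
# bias-free unembedding layer. TransformerLayer is the hypothetical
# layer class sketched earlier; all names here are assumptions.
class TransformerLM(nn.Module):
    def __init__(self, vocab_size, d, n_heads, d_ff, n_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d)
        # ModuleList (unlike a plain list) registers the layers' parameters.
        self.layers = nn.ModuleList(
            [TransformerLayer(d, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.unembedding = nn.Linear(d, vocab_size, bias=False)

    def forward(self, token_ids):
        h = self.embedding(token_ids)
        for layer in self.layers:
            h = layer(h)
        return self.unembedding(h)   # logits over the vocabulary
```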
+
+<details>
+<summary><b>Hint</b>: Creating the RoPE embeddings.</summary>
+<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
+Xxx.
+</div>
+</details>
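The hint body above is still a placeholder in this commit. As a generic illustration only (not necessarily the approach the assignment intends), the rotation angles for RoPE are typically precomputed as cosine and sine tables, for example:

```
import torch

# Generic sketch of precomputing RoPE cos/sin tables; the base 10000
# and all names here are assumptions, not taken from the assignment.
def rope_tables(max_len, d_h, base=10000.0):
    inv_freq = base ** (-torch.arange(0, d_h, 2).float() / d_h)  # (d_h/2,)
    positions = torch.arange(max_len).float()                    # (max_len,)
    angles = torch.outer(positions, inv_freq)                    # (max_len, d_h/2)
    return torch.cos(angles), torch.sin(angles)

cos, sin = rope_tables(max_len=128, d_h=32)
print(cos.shape, sin.shape)   # torch.Size([128, 16]) torch.Size([128, 16])
```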
 
 ## Step 2: Training the language model
 
+In Assignment 1, you implemented a utility to handle training and validation. It should be possible to use your Transformer language model as a drop-in replacement for the RNN-based model you had in that assignment.
+
 **Alternative solution.** Use a HuggingFace Trainer.
 
-Run the training function and compute the perplexity on the validation set as in the previous assignment.
+Select some suitable hyperparameters (number of Transformer layers, hidden layer size, number of attention heads).
+Then run the training function and compute the perplexity on the validation set as in the previous assignment.
+
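As a reminder, perplexity is the exponential of the mean token-level cross-entropy; a small sketch follows (the variable name is an assumption about your training utility).

```
import math

# Sketch only: perplexity from the mean cross-entropy (in nats) on the
# validation set; mean_val_loss is an assumed name from your utility.
mean_val_loss = 4.2                      # e.g. returned by your validation loop
perplexity = math.exp(mean_val_loss)
print(f"Validation perplexity: {perplexity:.1f}")
```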
 
 ## Step 3: Generating text
 