\alpha(q, k) = \frac{q \cdot k^{\top}}{\sqrt{d_h}}
$$

The transposition of the key tensor can be carried out by calling <code>k.transpose(-2, -1)</code>.

Second, add a *causal mask* to the pre-activations. This mask is necessary for autoregressive (left-to-right) language models: it ensures that each attention head can only consider the current token and those before it, never future tokens. The mask should have the shape $(m, m)$; its lower triangle including the diagonal should be 0 and the upper triangle $-\infty$. PyTorch's <a href="https://docs.pytorch.org/docs/stable/generated/torch.tril.html"><code>tril</code></a> or <a href="https://docs.pytorch.org/docs/stable/generated/torch.triu.html"><code>triu</code></a> can be convenient here.
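
To make this concrete, here is a minimal sketch of the masked pre-activations. The variable names and toy dimensions are only for illustration, and we assume the queries and keys have already been arranged into one tensor per head:

```
import math
import torch

# Toy dimensions: batch size 2, 4 heads, text length 5, head dimension 8.
b, n_heads, m, d_h = 2, 4, 5, 8
q = torch.randn(b, n_heads, m, d_h)
k = torch.randn(b, n_heads, m, d_h)

# Scaled dot-product pre-activations, shape (b, n_heads, m, m).
scores = q @ k.transpose(-2, -1) / math.sqrt(d_h)

# Causal mask: 0 on and below the diagonal, -inf above it.
mask = torch.full((m, m), float("-inf")).triu(diagonal=1)
scores = scores + mask  # broadcasts over the batch and head dimensions
```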

Then apply the softmax to get the attention weights.

</div>
</details>

**MHA computation, step 3.** Now, we need to combine the results from the individual attention heads. We first flip the second and third dimensions of the tensor (so that the first two dimensions correspond to the batch size and the text length), and then reshape the result back to the shape $(b, m, d)$:
```
attn_out = attn_out.transpose(1, 2).reshape(b, m, d)
```
Then compute the final output representation (by applying the linear layer we called $W_O$ above) and return the result.
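
As a sketch, the end of `forward` could then look like this, assuming the output projection created in the constructor is stored in an attribute such as `self.w_o` (the attribute name is just an example):

```
# Merge the heads back into a single d-dimensional representation per token ...
attn_out = attn_out.transpose(1, 2).reshape(b, m, d)
# ... and apply the output projection we called W_O above (here stored as self.w_o).
return self.w_o(attn_out)
```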

**Sanity check steps 2 and 3.**
Once again, create an MHA layer for testing and apply it to an input tensor of the same shape as before. Assuming you don't get any crashes here, the output should have the same shape as the input. If it crashes or your output has the wrong shape, insert `print` statements along the way, or use an editor with step-by-step debugging, to check the shapes at each step.
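
For instance, a check along the following lines should run without errors; the class name `MultiHeadAttention` and the hidden size are placeholders for whatever you used:

```
import torch

d = 512                            # whatever hidden size your configuration specifies
mha = MultiHeadAttention(config)   # your MHA class and configuration object
x = torch.zeros(2, 5, d)           # dummy batch: 2 sequences of length 5
out = mha(x)
print(out.shape)                   # should print torch.Size([2, 5, 512])
```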

### The full Transformer decoder layer

After coding up the multi-head attention, everything else is just a matter of assembling the pieces!

In the constructor `__init__`, create the components of this block, taking the model configuration into account.
As shown in the figure, a Transformer layer should include an attention layer and an MLP, together with normalization layers. In `forward`, connect the components to each other; remember to put residual connections in the right places.

<details>
<summary><b>Hint</b>: Residual connections in PyTorch.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
Assuming your sublayer computes a new representation <code>h_new</code> from an old one <code>h_old</code>, a residual connection is simply an addition:
<pre>
h_new = do_something(h_old)
out = h_new + h_old
</pre>
</div>
</details>
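
To give an idea of the overall wiring, here is one possible sketch of the layer. It assumes a pre-norm arrangement and invents the attribute names (`self.attn`, `self.mlp`, `self.norm1`, `self.norm2`); follow the figure for the exact placement of the normalization layers, and thread the RoPE embeddings through to the attention layer if yours takes them as an argument:

```
import torch.nn as nn

class TransformerDecoderLayer(nn.Module):
    """Sketch of one decoder block: attention and MLP, each wrapped in a residual connection."""

    def __init__(self, config):
        super().__init__()
        d = config.hidden_size                  # assumed name of the configuration field
        self.norm1 = nn.LayerNorm(d)
        self.attn = MultiHeadAttention(config)  # the MHA module you wrote above
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, 4 * d),
            nn.GELU(),
            nn.Linear(4 * d, d),
        )

    def forward(self, h):
        h = h + self.attn(self.norm1(h))  # residual connection around the attention
        h = h + self.mlp(self.norm2(h))   # residual connection around the MLP
        return h
```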

**Sanity check.** Carry out the usual sanity check to see that the shapes are right and there are no crashes.

### The complete Transformer stack

Now, set up the complete Transformer stack, including the embedding and unembedding layers.
The embedding and unembedding layers will be identical to what you had in Programming Assignment 1 (except that the unembedding layer should be bias-free, as mentioned at the beginning).

<details>
<summary><b>Hint</b>: Use a <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.ModuleList.html"><code>ModuleList</code></a>.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
Put all the Transformer blocks in a <code>ModuleList</code> instead of a plain Python list. The <code>ModuleList</code> makes sure your parameters are registered so that they are included when you compute the gradients.
</div>
</details>

<details>
<summary><b>Hint</b>: Creating the RoPE embeddings.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
Since every layer uses the same rotary position embeddings, it is easiest to create the RoPE module once in the top-level model and pass it (or its precomputed rotation tensors) on to each Transformer layer, so that the attention heads can apply it to their queries and keys.
</div>
</details>
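
As a rough sketch of how the pieces fit together (again with invented names, and with the RoPE plumbing omitted for brevity):

```
import torch.nn as nn

class TransformerLM(nn.Module):
    """Sketch of the full stack: embedding, decoder layers, bias-free unembedding."""

    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        # A ModuleList (not a plain list) so that the layers' parameters are registered.
        self.layers = nn.ModuleList(
            [TransformerDecoderLayer(config) for _ in range(config.n_layers)]
        )
        self.unembedding = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, token_ids):
        h = self.embedding(token_ids)
        for layer in self.layers:
            h = layer(h)
        return self.unembedding(h)
```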

## Step 2: Training the language model

In Assignment 1, you implemented a utility to handle training and validation. It should be possible to use your Transformer language model as a drop-in replacement for the RNN-based model you had in that assignment.

**Alternative solution.** If you prefer, you can use a HuggingFace `Trainer` instead of your own training utility.

Select some suitable hyperparameters (number of Transformer layers, hidden layer size, number of attention heads).
Then run the training function and compute the perplexity on the validation set as in the previous assignment.
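
Purely as an illustration (not a tuned recommendation), a small starting point could be something like the following; adjust based on validation perplexity and your compute budget:

```
hparams = {
    "n_layers": 4,       # number of Transformer layers
    "hidden_size": 256,  # hidden layer size
    "n_heads": 8,        # number of attention heads
}
```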

## Step 3: Generating text
