
Commit b570140

Merge pull request #122 from ricj/master
Figures and typos.
2 parents ffa2943 + 70b8198 commit b570140

File tree

4 files changed: +1231 -237 lines changed


_pages/dat450/assignment2.md

Lines changed: 11 additions & 6 deletions
@@ -64,7 +64,7 @@ OLMo 2 uses a type of normalization called [Root Mean Square layer normalization

You can either implement your own normalization layer, or use the built-in [`RMSNorm`](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html) from PyTorch. In the PyTorch implementation, `eps` corresponds to `rms_norm_eps` from our model configuration, while `normalized_shape` should be equal to the hidden layer size. The hyperparameter `elementwise_affine` should be set to `True`, meaning that we include some learnable weights in this layer instead of a pure normalization.

- If you want to make your own layer, the PyTorch documentation shows the formula you should implement. (The $\gamma_i$ parameters are the learnable weights.)
+ If you want to make your own layer, the PyTorch documentation shows the formula you should implement. (The $$\gamma_i$$ parameters are the learnable weights.)
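
For instance, a self-written version could look roughly like the sketch below. This is only an illustration: the class name and the default `eps` value are placeholders, and in practice you would pass in `rms_norm_eps` from the model configuration.

```python
import torch
import torch.nn as nn

class MyRMSNorm(nn.Module):
    """RMS layer normalization with learnable elementwise weights (the gammas)."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.eps = eps                                        # corresponds to rms_norm_eps
        self.weight = nn.Parameter(torch.ones(hidden_size))   # the gamma_i parameters

    def forward(self, x):
        # Root mean square over the hidden dimension, then rescale by the weights.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight
```

A quick way to check such a layer is to compare its output to the built-in `nn.RMSNorm` on a random tensor.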

**Sanity check.**

@@ -76,15 +76,17 @@ Now, let's turn to the tricky part!

The smaller versions of the OLMo 2 model, which we will follow here, use the same implementation of *multi-head attention* as the original Transformer, plus a couple of additional normalizers. (The bigger OLMo 2 models use [grouped-query attention](https://sebastianraschka.com/llms-from-scratch/ch04/04_gqa/) rather than standard MHA; GQA is also used in various Llama and Qwen models, as well as some other popular LLMs.)

- The figure below shows what we will have to implement.
+ The figure below shows a high-level overview of the pieces we will have to put together. (In the figure, the four *W* blocks are `nn.Linear`, and RN means RMSNorm.)
+
+ <img src="https://raw.githubusercontent.com/ricj/dsai-nlp.github.io/refs/heads/master/_pages/dat450/mha.svg" alt="MHA" style="width:10%; height:auto;">

**Hyperparameters:** The hyperparameters you will need to consider when implementing the MHA are
`hidden_size` which defines the input dimensionality as in the MLP and normalizer above, and
- `num_attention_heads` which defines the number of attention heads. **Note** that `hidden_size` has to be evenly divisible by `num_attention_heads`. (Below, we will refer to `hidden_size // num_attention_heads` as the head dimensionality $d_h$.)
+ `num_attention_heads` which defines the number of attention heads. **Note** that `hidden_size` has to be evenly divisible by `num_attention_heads`. (Below, we will refer to `hidden_size // num_attention_heads` as the head dimensionality $$d_h$$.)

- **Defining MHA components.** In `__init__`, define the `nn.Linear` components (square matrices) that compute query, key, and value representations, and the final outputs. (They correspond to what we called $W_Q$, $W_K$, $W_V$, and $W_O$ in [the lecture on Transformers](https://www.cse.chalmers.se/~richajo/dat450/lectures/l4/m4_2.pdf).) OLMo 2 also applies layer normalizers after the query and key representations.
+ **Defining MHA components.** In `__init__`, define the `nn.Linear` components (square matrices) that compute query, key, and value representations, and the final outputs. (They correspond to what we called $$W_Q$$, $$W_K$$, $$W_V$$, and $$W_O$$ in [the lecture on Transformers](https://www.cse.chalmers.se/~richajo/dat450/lectures/l4/m4_2.pdf).) OLMo 2 also applies layer normalizers after the query and key representations.
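
As an illustration, the constructor could be organized roughly as in the sketch below. The attribute names (`q_proj`, `k_norm`, ...) and the dictionary-style configuration access are only illustrative choices, not requirements; `bias=False` follows OLMo 2's use of bias-free linear layers.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        hidden_size = config['hidden_size']
        self.num_heads = config['num_attention_heads']
        # hidden_size must be evenly divisible by num_attention_heads.
        assert hidden_size % self.num_heads == 0
        self.head_dim = hidden_size // self.num_heads          # d_h

        # The square projection matrices W_Q, W_K, W_V and the output matrix W_O.
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

        # Normalizers applied to the query and key representations.
        self.q_norm = nn.RMSNorm(hidden_size, eps=config['rms_norm_eps'])
        self.k_norm = nn.RMSNorm(hidden_size, eps=config['rms_norm_eps'])
```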

- **MHA computation, step 1.** The `forward` method takes two inputs `hidden_states` and `rope_rotations`. The latter contains the precomputed rotations required for RoPE; the last step of the transformer explains how to compute them.
+ **MHA computation, step 1.** The `forward` method takes two inputs `hidden_states` and `rope_rotations`. The latter contains the precomputed rotations required for RoPE. (The section **The complete Transformer stack** below explains where they come from.)

Continuing to work in `forward`, now compute query, key, and value representations; don't forget the normalizers after the query and key representations.
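
Continuing the constructor sketch above, the beginning of `forward` might then look roughly like this. The reshaping into separate heads at the end is one common way of arranging the tensors; the RoPE rotations and the attention computation itself are left out here.

```python
    def forward(self, hidden_states, rope_rotations):
        batch_size, seq_len, hidden_size = hidden_states.shape

        # Query, key, and value representations, with normalizers on queries and keys.
        q = self.q_norm(self.q_proj(hidden_states))
        k = self.k_norm(self.k_proj(hidden_states))
        v = self.v_proj(hidden_states)

        # Split the hidden dimension into heads: (batch, num_heads, seq_len, d_h).
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # ... apply rope_rotations to q and k, compute the scaled dot-product
        # attention, and map the result back through self.o_proj ...
```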

@@ -143,6 +145,9 @@ Once again create a MHA layer for testing and apply it to an input tensor of the

### The full Transformer decoder layer

After coding up the multi-head attention, everything else is just a simple assembly of pieces!
+ The figure below shows the required components in a single Transformer decoder layer.
+
+ <img src="https://raw.githubusercontent.com/ricj/dsai-nlp.github.io/refs/heads/master/_pages/dat450/fullblock.svg" alt="fullblock" style="width:10%; height:auto;">

In the constructor `__init__`, create the components in this block, taking the model configuration into account.
As shown in the figure, a Transformer layer should include an attention layer and an MLP, with normalizers. In `forward`, connect the components to each other; remember to put residual connections at the right places.
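
For example, the layer could be wired up roughly as in the sketch below. It assumes the OLMo 2-style arrangement where the normalizer is applied to the output of each sub-block before the residual addition; check the figure for the exact placement. `MultiHeadAttention` and `MLP` stand for the components built earlier, and all names are illustrative.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self_attn = MultiHeadAttention(config)   # the MHA module from above
        self.mlp = MLP(config)                        # the MLP from earlier in the assignment
        self.post_attention_norm = nn.RMSNorm(config['hidden_size'], eps=config['rms_norm_eps'])
        self.post_mlp_norm = nn.RMSNorm(config['hidden_size'], eps=config['rms_norm_eps'])

    def forward(self, hidden_states, rope_rotations):
        # Attention sub-block with a residual connection.
        h_old = hidden_states
        h_new = self.post_attention_norm(self.self_attn(h_old, rope_rotations))
        hidden_states = h_new + h_old

        # MLP sub-block with a residual connection.
        h_old = hidden_states
        h_new = self.post_mlp_norm(self.mlp(h_old))
        out = h_new + h_old
        return out
```
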
@@ -162,7 +167,7 @@ out = h_new + h_old

### The complete Transformer stack

- Now, set up the complete Transformer stack including embedding, top-level normalizer, and unembedding layers.
+ Now, set up the complete Transformer stack including embedding, top-level normalizer, and unembedding layers. (You may look at the figure presented previously.)
The embedding and unembedding layers will be identical to what you had in Programming Assignment 1 (except that the unembedding layer should not use bias terms, as mentioned in the beginning).
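
To make the overall structure concrete, a sketch of the full stack is shown below. The hyperparameter names `vocab_size` and `num_hidden_layers`, and the placeholder for the precomputed RoPE rotations, are assumptions for illustration; use whatever your configuration and RoPE implementation actually provide.

```python
import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config['vocab_size'], config['hidden_size'])
        self.layers = nn.ModuleList(
            [DecoderLayer(config) for _ in range(config['num_hidden_layers'])]
        )
        self.final_norm = nn.RMSNorm(config['hidden_size'], eps=config['rms_norm_eps'])
        # Unembedding without bias terms, as mentioned in the beginning.
        self.unembedding = nn.Linear(config['hidden_size'], config['vocab_size'], bias=False)

    def forward(self, input_ids):
        hidden_states = self.embedding(input_ids)
        rope_rotations = ...  # precompute the RoPE rotations here (see the RoPE section)
        for layer in self.layers:
            hidden_states = layer(hidden_states, rope_rotations)
        hidden_states = self.final_norm(hidden_states)
        return self.unembedding(hidden_states)
```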

<details>
