
Commit 88989e5

Merge pull request #113 from ricj/master
MLP section
2 parents 8693580 + f6ba468

File tree

1 file changed (+8, -2 lines changed)


_pages/dat450/assignment2.md

Lines changed: 8 additions & 2 deletions
@@ -43,24 +43,30 @@ The figure below shows the design of the OLMo 2 Transformer. We will reimplement

### MLP layer

- Olmo 2 uses an MLP architecture called SwiGLU, which was introduced in [this paper](https://arxiv.org/pdf/2002.05202). (In the paper, this type of network is referred to as FFN<sub>SwiGLU</sub>, described on page 2, Equation 6. Swish<sub>1</sub> corresponds to PyTorch's [SiLU](https://docs.pytorch.org/docs/stable/generated/torch.nn.SiLU.html) activation.) The figure below shows the architecture visually.
+ OLMo 2 uses an MLP architecture called SwiGLU, which was introduced in [this paper](https://arxiv.org/pdf/2002.05202). (In the paper, this type of network is referred to as FFN<sub>SwiGLU</sub>, described on page 2, Equation 6. Swish<sub>1</sub> corresponds to PyTorch's [SiLU](https://docs.pytorch.org/docs/stable/generated/torch.nn.SiLU.html) activation.) The figure below shows the architecture visually; in the diagram, the ⊗ symbol refers to element-wise multiplication.

<img src="https://raw.githubusercontent.com/ricj/dsai-nlp.github.io/refs/heads/master/_pages/dat450/swiglu.svg" alt="SwiGLU" style="width:10%; height:auto;">

+ The relevant hyperparameters you need to take into account here are `hidden_size` (the dimension of the input and output) and `intermediate_size` (the dimension of the intermediate representations).
+
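As a point of reference, here is a minimal PyTorch sketch of a SwiGLU block built from these two hyperparameters. The class and attribute names (`SwiGLUMLP`, `gate_proj`, `up_proj`, `down_proj`) are illustrative rather than prescribed by the assignment, and the bias-free linear layers follow the paper's formulation:

```python
import torch
from torch import nn

class SwiGLUMLP(nn.Module):
    """Sketch of an FFN_SwiGLU block: down(SiLU(gate(x)) * up(x))."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Two projections up to intermediate_size and one back down;
        # the paper's formulation has no bias terms.
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()  # Swish_1 in the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product (the ⊗ in the figure) of the gated and
        # ungated branches, then a projection back to hidden_size.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```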
**Sanity check.**

+ Create an untrained MLP layer. Create a 3-dimensional tensor whose last dimension has the same size as `hidden_size` in your MLP. If you apply the MLP to the test tensor, the output should have the same shape as the input.
+
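For instance, assuming the `SwiGLUMLP` sketch above and arbitrary toy sizes:

```python
import torch

mlp = SwiGLUMLP(hidden_size=64, intermediate_size=256)  # untrained, toy sizes
x = torch.randn(2, 7, 64)        # (batch, sequence, hidden_size)
y = mlp(x)
assert y.shape == x.shape        # the MLP preserves the input shape
```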
### Normalization

To stabilize gradients during training, deep learning models with many layers often include some *normalization* (such as batch normalization or layer normalization). Transformers typically include normalization layers at several places in the stack.

- Olmo 2 uses a type of normalization called [Root Mean Square layer normalization](https://arxiv.org/pdf/1910.07467).
+ OLMo 2 uses a type of normalization called [Root Mean Square layer normalization](https://arxiv.org/pdf/1910.07467).

Here, you can either implement your own normalization layer, or use the built-in [`RMSNorm`](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html) from PyTorch. In the PyTorch implementation, `eps` corresponds to `rms_norm_eps` from our model configuration, while `normalized_shape` should be equal to the hidden layer size. The hyperparameter `elementwise_affine` should be set to `True`, meaning that we include some learnable weights in this layer instead of a pure normalization.

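For example, constructing the built-in layer from the configuration values could look like this (the two variables are placeholders for whatever your configuration object provides):

```python
from torch import nn

hidden_size = 64       # placeholder: hidden_size from your model configuration
rms_norm_eps = 1e-6    # placeholder: rms_norm_eps from your model configuration

norm = nn.RMSNorm(normalized_shape=hidden_size,
                  eps=rms_norm_eps,
                  elementwise_affine=True)
```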
If you want to make your own layer, the PyTorch documentation shows the formula you will have to implement. (The $\gamma_i$ parameters are the learnable weights.)

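A hand-rolled version might look like the sketch below; it only mirrors the formula from the PyTorch documentation, with `weight` playing the role of the $\gamma_i$ parameters:

```python
import torch
from torch import nn

class MyRMSNorm(nn.Module):
    """Sketch of Root Mean Square layer normalization over the last dimension."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))  # the gamma_i weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root mean square over the hidden dimension
        # (eps added for numerical stability), then rescale element-wise.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
```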
**Sanity check.**

+ You can test this in the same way as you tested the MLP previously.
+
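For instance, with the same toy sizes as in the MLP check:

```python
import torch
from torch import nn

norm = nn.RMSNorm(64, eps=1e-6)   # or your own RMSNorm implementation
x = torch.randn(2, 7, 64)         # last dimension = hidden_size
assert norm(x).shape == x.shape   # normalization preserves the shape
```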
### Multi-head attention

Let's take the trickiest part first!
