
Commit ab09617 (parent: ae153ac)

mlp section

File tree

1 file changed: +3 −1 lines

_pages/dat450/assignment2.md

Lines changed: 3 additions & 1 deletion
@@ -43,10 +43,12 @@ The figure below shows the design of the OLMo 2 Transformer. We will reimplement
 
 ### MLP layer
 
-Olmo 2 uses an MLP architecture called SwiGLU, which was introduced in [this paper](https://arxiv.org/pdf/2002.05202). (In the paper, this type of network is referred to as FFN<sub>SwiGLU</sub>, described on page 2, Equation 6. Swish<sub>1</sub> corresponds to PyTorch's [SiLU](https://docs.pytorch.org/docs/stable/generated/torch.nn.SiLU.html) activation.) The figure below shows the architecture visually.
+Olmo 2 uses an MLP architecture called SwiGLU, which was introduced in [this paper](https://arxiv.org/pdf/2002.05202). (In the paper, this type of network is referred to as FFN<sub>SwiGLU</sub>, described on page 2, Equation 6. Swish<sub>1</sub> corresponds to PyTorch's [SiLU](https://docs.pytorch.org/docs/stable/generated/torch.nn.SiLU.html) activation.) The figure below shows the architecture visually; in the diagram, the ⓧ symbol refers to element-wise multiplication.
 
 <img src="https://raw.githubusercontent.com/ricj/dsai-nlp.github.io/refs/heads/master/_pages/dat450/swiglu.svg" alt="SwiGLU" style="width:10%; height:auto;">
 
+The relevant hyperparameters you need to take into account here are `hidden_size` (the dimension of the input and output) and `intermediate_size` (the dimension of the intermediate representations).
+
 **Sanity check.**
 
 ### Normalization
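The SwiGLU block described in the diff above could be sketched in PyTorch as follows. This is a minimal illustration of FFN<sub>SwiGLU</sub> (Equation 6 of the linked paper), not code from the OLMo 2 repository; the class and layer names (`gate_proj`, `up_proj`, `down_proj`) are placeholders I chose.

```python
import torch
import torch.nn as nn


class SwiGLUMLP(nn.Module):
    """Sketch of a SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)).

    Names are illustrative, not taken from the OLMo 2 codebase.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Two parallel projections from hidden_size to intermediate_size ...
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        # ... and one projection back down to hidden_size.
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()  # Swish_1 in the paper's notation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product (the ⓧ in the figure) of the SiLU-gated
        # branch and the plain linear branch, then project back down.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```

Note that the input and output dimensions are both `hidden_size`, so the block can be dropped into a residual stream, while `intermediate_size` only affects the width of the two parallel branches.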
