Commit 237d55d

Merge pull request #116 from ricj/master
tried to fix math
2 parents e862e8a + fed538d commit 237d55d

1 file changed: +4 -4 lines changed

_pages/dat450/assignment2.md

Lines changed: 4 additions & 4 deletions
@@ -84,7 +84,7 @@ The figure below shows what we will have to implement.
Continuing to work in `forward`, now compute query, key, and value representations; don't forget the normalizers after the query and key representations.

-Now, we need to reshape the query, key, and value tensors so that the individual attention heads are stored separately. Assume your tensors have the shape $(b, m, d)$, where $b$ is the batch size, $m$ the text length, and $d$ the hidden layer size. We now need to reshape and transpose so that we get $(b, n_h, m, d_h)$ where $n_h$ is the number of attention heads and $d_h$ the attention head dimensionality. Your code could be something like the following (apply this to queries, keys, and values):
+Now, we need to reshape the query, key, and value tensors so that the individual attention heads are stored separately. Assume your tensors have the shape \((b, m, d)\), where \(b\) is the batch size, \(m\) the text length, and \(d\) the hidden layer size. We now need to reshape and transpose so that we get \((b, n_h, m, d_h)\) where \(n_h\) is the number of attention heads and \(d_h\) the attention head dimensionality. Your code could be something like the following (apply this to queries, keys, and values):

```
q = q.view(b, m, n_h, d_h).transpose(1, 2)
```
@@ -103,15 +103,15 @@ We will explain the exact computations in the hint below, but conveniently enoug
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
In that case, the <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html">documentation of the PyTorch implementation</a> includes a piece of code that can give you some inspiration and that you can simplify somewhat.

-Assuming your query, key, and value tensors are called $q$, $k$, and $v$, then the computations you should carry out are the following. First, we compute the *attention pre-activations* by multiplying the query and key representations and scaling the result:
+Assuming your query, key, and value tensors are called \(q\), \(k\), and \(v\), then the computations you should carry out are the following. First, we compute the *attention pre-activations* by multiplying the query and key representations and scaling the result:

$$
\alpha(q, k) = \frac{q \cdot k^{\top}}{\sqrt{d_h}}
$$

The transposition of the key tensor can be carried out by calling <code>k.transpose(-2, -1)</code>.
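
As a rough illustration of the scaled pre-activations, here is a minimal sketch on random tensors. The sizes and the variable name `scores` are made up for this example; they are not taken from the assignment's reference solution.

```python
import math
import torch

# Hypothetical sizes, chosen only for illustration.
b, n_h, m, d_h = 2, 4, 5, 8

torch.manual_seed(0)
q = torch.randn(b, n_h, m, d_h)
k = torch.randn(b, n_h, m, d_h)

# k.transpose(-2, -1) swaps the last two axes: (b, n_h, m, d_h) -> (b, n_h, d_h, m).
# The matrix product then gives one (m, m) score matrix per batch item and head.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_h)

print(scores.shape)
```

Entry `scores[i, h, s, t]` is the scaled dot product between the query at position `s` and the key at position `t`, for batch item `i` and head `h`.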

-Second, add a *causal mask* to the pre-activations. This mask is necessary for autoregressive (left-to-right) language models: it ensures that each attention head can only consider the current token and the tokens before it. The mask should have the shape $(m, m)$; its lower triangle including the diagonal should be 0 and the upper triangle $-\infty$. PyTorch's <a href="https://docs.pytorch.org/docs/stable/generated/torch.tril.html"><code>tril</code></a> or <a href="https://docs.pytorch.org/docs/stable/generated/torch.triu.html"><code>triu</code></a> can be convenient here.
+Second, add a *causal mask* to the pre-activations. This mask is necessary for autoregressive (left-to-right) language models: it ensures that each attention head can only consider the current token and the tokens before it. The mask should have the shape \((m, m)\); its lower triangle including the diagonal should be 0 and the upper triangle \(-\infty\). PyTorch's <a href="https://docs.pytorch.org/docs/stable/generated/torch.tril.html"><code>tril</code></a> or <a href="https://docs.pytorch.org/docs/stable/generated/torch.triu.html"><code>triu</code></a> can be convenient here.
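
One possible way to build such a mask is sketched below. It uses `triu`; the text length and the all-zero stand-in for the pre-activations are assumptions made only for this illustration, and a `tril`-based construction would work just as well.

```python
import torch

m = 4  # hypothetical text length

# Start from a matrix filled with -inf and keep only the part strictly above
# the diagonal; triu sets everything on or below the diagonal to 0.
mask = torch.triu(torch.full((m, m), float("-inf")), diagonal=1)
print(mask)

# The (m, m) mask broadcasts over the leading batch and head dimensions.
scores = torch.zeros(2, 3, m, m)  # stand-in for real pre-activations
masked = scores + mask
print(masked.shape)
```

Because the mask is added rather than multiplied, the allowed positions keep their original scores and the disallowed ones become \(-\infty\).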

Then apply the softmax to get the attention weights.
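
The effect of the softmax on masked pre-activations can be checked on a small example. This is only a sketch with made-up sizes and random stand-in scores; the softmax is taken over the last axis, which indexes the key positions.

```python
import torch

m = 4  # hypothetical text length
torch.manual_seed(0)
scores = torch.randn(1, 1, m, m)  # stand-in pre-activations
mask = torch.triu(torch.full((m, m), float("-inf")), diagonal=1)

# Positions holding -inf receive weight 0, and each row sums to 1.
weights = torch.softmax(scores + mask, dim=-1)

print(weights[0, 0])
```

The first row has a single non-masked entry, so all of its weight falls on the first token.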

@@ -131,7 +131,7 @@ $$
```
attn_out = attn_out.transpose(1, 2).reshape(b, m, d)
```
-Then compute the final output representation (by applying the linear layer we called $W_O$ above) and return the result.
+Then compute the final output representation (by applying the linear layer we called \(W_O\) above) and return the result.
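
To see that splitting into heads and merging them again are inverse operations, the following sketch runs the round trip on a random tensor and then applies an output projection. The sizes are made up, and the `nn.Linear` layer without bias is only a hypothetical stand-in for \(W_O\); your layer may be configured differently.

```python
import torch
from torch import nn

b, m, n_h, d_h = 2, 5, 4, 8  # hypothetical sizes
d = n_h * d_h

torch.manual_seed(0)
x = torch.randn(b, m, d)

# Split into heads, then merge back.
heads = x.view(b, m, n_h, d_h).transpose(1, 2)    # (b, n_h, m, d_h)
merged = heads.transpose(1, 2).reshape(b, m, d)   # (b, m, d)
print(torch.equal(merged, x))

# Hypothetical stand-in for the output projection W_O (bias=False is an assumption).
W_O = nn.Linear(d, d, bias=False)
out = W_O(merged)
print(out.shape)
```

Note that `reshape` is used for the merge rather than `view`: after `transpose` the tensor is no longer contiguous in memory, so `view` would raise an error here, while `reshape` copies the data when it has to.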

**Sanity check steps 2 and 3.**
Once again create an MHA layer for testing and apply it to an input tensor of the same shape as before. If nothing crashes, the output should have the same shape as the input. If it crashes or your output has the wrong shape, insert `print` statements along the way, or use an editor with step-by-step debugging, to check the shapes at each step.
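
Beyond checking shapes, one optional extra check is to compare a hand-written attention computation against PyTorch's built-in `scaled_dot_product_attention`. The sketch below assumes PyTorch 2.0 or later, where that function and its `is_causal` argument are available. The helper `manual_causal_attention` and all sizes are hypothetical, and the final multiplication of the attention weights with the values is the standard last step of scaled dot-product attention.

```python
import math
import torch
import torch.nn.functional as F

def manual_causal_attention(q, k, v):
    """Hypothetical helper: causal scaled dot-product attention written out by hand."""
    m, d_h = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_h)
    mask = torch.triu(torch.full((m, m), float("-inf")), diagonal=1)
    weights = torch.softmax(scores + mask, dim=-1)
    return weights @ v  # weighted sum of the value vectors

b, n_h, m, d_h = 2, 4, 5, 8  # hypothetical sizes
torch.manual_seed(0)
q, k, v = (torch.randn(b, n_h, m, d_h) for _ in range(3))

ours = manual_causal_attention(q, k, v)
reference = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(ours.shape == q.shape)
print(torch.allclose(ours, reference, atol=1e-5))
```

If the two results disagree, the usual suspects are a missing scaling factor, a mask whose diagonal is set to \(-\infty\) instead of 0, or a softmax taken over the wrong axis.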
