_pages/dat450/assignment2.md (2 additions & 2 deletions)
@@ -84,7 +84,7 @@ The figure below shows what we will have to implement.
Continuing to work in `forward`, now compute query, key, and value representations; don't forget the normalizers after the query and key representations.
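For illustration only: assuming your `__init__` defines linear projection layers and normalizer modules (the names `self.q_proj`, `self.q_norm`, and so on below are placeholders rather than the names used in the assignment skeleton, and RMSNorm is just one common choice of normalizer), this part of `forward` might look roughly like:

```
q = self.q_norm(self.q_proj(x))   # query projection followed by its normalizer
k = self.k_norm(self.k_proj(x))   # key projection followed by its normalizer
v = self.v_proj(x)                # value projection; no normalizer here
```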
-Now, we need to reshape the query, key, and value tensors so that the individual attention heads are stored separately. Assume your tensors have the shape \((b, m, d)\), where \(b\) is the batch size, \(m\) the text length, and \(d\) the hidden layer size. We now need to reshape and transpose so that we get \((b, n_h, m, d_h)\) where \(n_h\) is the number of attention heads and \(d_h\) the attention head dimensionality. Your code could be something like the following (apply this to queries, keys, and values):
+Now, we need to reshape the query, key, and value tensors so that the individual attention heads are stored separately. Assume your tensors have the shape \((b, m, d)\), where \( b \) is the batch size, \( m \) the text length, and \( d \) the hidden layer size. We now need to reshape and transpose so that we get \((b, n_h, m, d_h)\) where \(n_h\) is the number of attention heads and \(d_h\) the attention head dimensionality. Your code could be something like the following (apply this to queries, keys, and values):
```
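# (b, m, d) -> view -> (b, m, n_h, d_h) -> transpose -> (b, n_h, m, d_h)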
q = q.view(b, m, n_h, d_h).transpose(1, 2)
```
@@ -103,7 +103,7 @@ We will explain the exact computations in the hint below, but conveniently enough
In that case, the <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html">documentation of the PyTorch implementation</a> includes a piece of code that can give you some inspiration and that you can simplify somewhat.
-Assuming your query, key, and value tensors are called \(q\), \(k\), and \(v\), the computations you should carry out are the following. First, we compute the *attention pre-activations*, which are computed by multiplying the query and key representations and scaling:
+Assuming your query, key, and value tensors are called \(q\), \(k\), and \(v\), the computations you should carry out are the following. First, we compute the <em>attention pre-activations</em>, which are computed by multiplying the query and key representations and scaling:
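For instance, a rough sketch of this first step (variable names are illustrative, and \(d_h\) is the per-head dimensionality from above):

```
# q and k have shape (b, n_h, m, d_h)
att = (q @ k.transpose(-2, -1)) / d_h ** 0.5   # pre-activations, shape (b, n_h, m, m)
```

The masking, softmax, and multiplication by \(v\) then follow, roughly as in the sample code in the PyTorch documentation linked above.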