
Commit 4e32572

preliminary
1 parent 86bd6c1 commit 4e32572


_pages/dat450/assignment2.md

Lines changed: 5 additions & 5 deletions
@@ -10,7 +10,7 @@ nav_order: 4
 # DAT450/DIT247: Programming Assignment 2: Transformer language models
 
 In this assignment, we extend the models we investigated in the previous assignment in two different ways:
-- In the previous assignment, we used a model that takes a fixed number of previous words into account. Now, we will use a model capable of considering a variable number of previous words: a *recurrent neural network*. (Optionally, you can also investigate *Transformers*.)
+- We will now use a *Transformer* instead of the recurrent neural network we used previously.
 - In this assignment, we will also use our language model to generate texts.
 
 ### Pedagogical purposes of this assignment
@@ -19,13 +19,13 @@ In this assignment, we extend the models we investigated in the previous assignm
 
 ### Requirements
 
-Please submit your solution in [Canvas](https://chalmers.instructure.com/courses/XX/assignments/YY). **Submission deadline**: November XX.
+Please submit your solution in [Canvas](https://chalmers.instructure.com/courses/XX/assignments/YY). **Submission deadline**: November 17.
 
 Submit a XX
 
 ## Step 0: Preliminaries
 
-Make sure you have access to your solution for Programming Assignment 1 since you will reuse some parts.
+Make sure you have access to your solution for Programming Assignment 1 since you will reuse the training loop. (Optionally, use HuggingFace's `Trainer` instead.)
 
 Copy the skeleton from SOMEWHERE.
 
@@ -84,7 +84,7 @@ The figure below shows what we will have to implement.
 
 Continuing to work in `forward`, now compute query, key, and value representations; don't forget the normalizers after the query and key representations.
 
-Now, we need to reshape the query, key, and value tensors so that the individual attention heads are stored separately. Assume your tensors have the shape \( (b, m, d) \), where \( b \) is the batch size, \( m \) the text length, and \( d \) the hidden layer size. We now need to reshape and transpose so that we get \( (b, n_h, m, d_h) \) where \( n_h \) is the number of attention heads and \( d_h \) the attention head dimensionality. Your code could be something like the following (apply this to queries, keys, and values):
+Now, we need to reshape the query, key, and value tensors so that the individual attention heads are stored separately. Assume your tensors have the shape $$ (b, m, d) $$, where $$ b $$ is the batch size, $$ m $$ the text length, and $$ d $$ the hidden layer size. We now need to reshape and transpose so that we get $$ (b, n_h, m, d_h) $$ where $$ n_h $$ is the number of attention heads and $$ d_h $$ the attention head dimensionality. Your code could be something like the following (apply this to queries, keys, and values):
 
 ```
 q = q.view(b, m, n_h, d_h).transpose(1, 2)
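For context, a minimal self-contained sketch of the head-splitting reshape described in the changed paragraph above. The sizes (b = 2, m = 16, d = 128, n_h = 8) are illustrative assumptions, not values from the assignment skeleton:

```python
import torch

# Illustrative sizes (assumed for this sketch, not taken from the assignment)
b, m, d = 2, 16, 128        # batch size, text length, hidden layer size
n_h = 8                     # number of attention heads
d_h = d // n_h              # per-head dimensionality (16 here)

q = torch.randn(b, m, d)    # stand-in for the query projection output

# Split the hidden dimension into heads, then swap the length and head axes:
# (b, m, d) -> (b, m, n_h, d_h) -> (b, n_h, m, d_h)
q = q.view(b, m, n_h, d_h).transpose(1, 2)
print(q.shape)              # torch.Size([2, 8, 16, 16])
```

The same reshaping is applied to the key and value tensors; after attention, the heads are typically merged back with the inverse operation, `transpose(1, 2)` followed by `contiguous().view(b, m, d)`.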
@@ -103,7 +103,7 @@ We will explain the exact computations in the hint below, but conveniently enoug
 <div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
 In that case, the <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html">documentation of the PyTorch implementation</a> includes a piece of code that can give you some inspiration and that you can simplify somewhat.
 
-Assuming your query, key, and value tensors are called \(q\), \(k\), and \(v\), the computations you should carry out are the following. First, we compute the <em>attention pre-activations</em>, which are computed by multiplying query and key representations, and scaling:
+Assuming your query, key, and value tensors are called $$q$$, $$k$$, and $$v$$, the computations you should carry out are the following. First, we compute the <em>attention pre-activations</em>, which are computed by multiplying query and key representations, and scaling:
 
 $$
 \alpha(q, k) = \frac{q \cdot k^{\top}}{\sqrt{d_h}}
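To make the formula concrete, here is a minimal sketch of the pre-activation computation with toy tensors in the (b, n_h, m, d_h) layout from the previous step. The masking, softmax, and weighted-sum lines at the end are assumptions about the steps that follow (along the lines of the linked PyTorch documentation), not part of the visible diff:

```python
import math
import torch

# Toy tensors in the (b, n_h, m, d_h) layout produced by the reshape step
b, n_h, m, d_h = 2, 8, 16, 16
q = torch.randn(b, n_h, m, d_h)
k = torch.randn(b, n_h, m, d_h)
v = torch.randn(b, n_h, m, d_h)

# Attention pre-activations: scaled dot product of queries and keys
alpha = q @ k.transpose(-2, -1) / math.sqrt(d_h)  # shape (b, n_h, m, m)

# Assumed follow-up steps: a causal mask would be applied to alpha here,
# then the pre-activations are normalized and used to mix the values
weights = torch.softmax(alpha, dim=-1)            # normalize over key positions
out = weights @ v                                 # shape (b, n_h, m, d_h)
```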
