
Commit 3419dfd

Merge pull request #103 from ricj/master
First complete version.
2 parents: ec3f25d + d1ceca6

File tree: 1 file changed (+78, -22 lines)

_pages/dat450/assignment1.md

Lines changed: 78 additions & 22 deletions
@@ -9,14 +9,18 @@ nav_order: 4

# DAT450/DIT247: Programming Assignment 1: Introduction to language modeling

- ## <span style="color:red">[Still under construction as of Oct. 29]</span>
-
*Language modeling* is the foundation that recent advances in NLP technologies build on. In essence, language modeling means that we learn how to imitate the language that we observe in the wild. More formally, we want to train a system that models the statistical distribution of natural language. Solving this task is exactly what the famous commercial large language models do (with some additional post-hoc tweaking to make the systems more interactive and avoid generating provocative outputs).

In the course, we will cover a variety of technical solutions to this fundamental task (in most cases, various types of Transformers). In this first assignment, we are going to build a neural network-based language model that uses *recurrent* neural networks (RNNs) to model the interaction between words.

However, setting up the neural network itself is a small part of this assignment; the main focus is on all the other steps we have to carry out in order to train a language model. That is: we need to process the text files, manage the vocabulary, run the training loop, and evaluate the trained models.

+ ### About this document
+
+ The work for your submission is described in **Part 1&ndash;Part 4** below.
+
+ There are **Hints** at various places in the instructions. You can click on a **Hint** to expand it and see some additional advice.
+
### Pedagogical purposes of this assignment
- Introducing the task of language modeling,
- Getting experience with preprocessing text,
@@ -34,30 +38,77 @@ On the practical side, you will need to understand the basics of PyTorch such as

### Submission requirements

- Please submit your solution in [Canvas](https://chalmers.instructure.com/courses/YYYYY/assignments/XXXXX). **Submission deadline**: November XX.
+ Please submit your solution in [Canvas](https://canvas.chalmers.se/courses/36909/assignments/117614).
+
+ **Submission deadline: November 10**.

- Submit a XXXX containing your solution to the programming tasks described below. This is a pure programming assignment and you do not have to write a technical report or explain details of your solution in the XXX: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.
+ Submit Python files containing your solution to the programming tasks described below.
+ In addition, to save time for the people who grade your submission, please submit a text file containing the outputs printed by your Python program; read the instructions carefully so that the right outputs are included. (Most importantly: the perplexity evaluated on the validation set, and the next-word predictions.)
+
+ This is a pure programming assignment and you do not have to write a technical report or explain details of your solution: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.

## Part 0: Preliminaries

- ### Installing libraries
- If you are working on your own machine, make sure that the following libraries are installed:
- - [NLTK](https://www.nltk.org/install.html) or [SpaCy](https://spacy.io/usage) for tokenization,
- - [PyTorch](https://pytorch.org/get-started/locally/) for building and training the models,
- - [Transformers](https://pytorch.org/get-started/locally/) and Datasets from HuggingFace,
- - Optional: [Matplotlib](https://matplotlib.org/stable/users/getting_started/) and [scikit-learn](https://scikit-learn.org/stable/install.html) for the embedding visualization in the last step.
- If you are using a Colab notebook, these libraries are already installed.
+ ### Accessing the Minerva compute cluster
+
+ You can in principle solve this assignment on a regular laptop, but it will be boring to train the full language model on a machine that does not have a GPU available. For this reason, we recommend using the CSE department's compute cluster for education, called [Minerva](https://git.chalmers.se/karppa/minerva/-/blob/main/README.md). If you haven't used Minerva in previous courses, please read the instructions on the linked page.
+
+ In particular, read carefully the section called [**Python environments**](https://git.chalmers.se/karppa/minerva/-/blob/main/README.md#python-environments). For the assignments in this course, you can use an environment we have prepared: `/data/courses/2025_dat450_dit247/venvs/dat450_venv`. (To activate it, type `source /data/courses/2025_dat450_dit247/venvs/dat450_venv/bin/activate`.)
+
+ The directory `/data/courses/2025_dat450_dit247/assignments/a1` on Minerva contains two text files (`train.txt` and `val.txt`), which have been created from Wikipedia articles converted into raw text, with Wiki markup removed. In addition, there is a code skeleton (`A1_skeleton.py`) that contains stub implementations of the main pieces you need for your solution; you can copy this skeleton to your own directory.

- ### Downloading the files
+ ### Suggested working approach with the cluster
+
+ Note that GPUs cannot be accessed from the JupyterHub notebooks, so you must submit SLURM jobs for your final deliverable.
+
+ <details>
+ <summary><b>Hint</b>: If you like to use VS Code, you have the option of connecting it to the cluster.</summary>
+ <div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
+ <ul>
+ <li>Install the <a href="https://code.visualstudio.com/docs/remote/ssh">Remote SSH extension</a>.</li>
+ <li>In the bottom left corner, you should see a small green button. Press this button. Alternatively, press Ctrl+Shift+P (Cmd+Shift+P on Mac) to open the command palette.</li>
+ <li>Select <tt>Connect to Host...</tt> or <tt>Remote SSH: Connect to Host...</tt></li>
+ <li>Type <code>[email protected]</code> and press Enter. Enter your password if prompted.</li>
+ <li>Open your home folder from the menu File > Open folder. The home folder should be called <code>/data/users/YOUR_CID</code>.</li>
+ <li>If you want to use any extensions, they need to be installed separately on the VS Code server that is running on the cluster. Open the Extensions tab to install the extensions you need, e.g. the Python extension.</li>
+ </ul>
+ </div>
+ </details>

- TODO DESCRIBE HOW TO DOWNLOAD SKELETON
+ <details>
+ <summary><b>Hint</b>: While developing, you may optionally want to use interactive notebooks for a faster workflow. (But see the comment above about GPUs!)</summary>
+ <div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
+ <ul>
+ <li>Read about <a href="https://git.chalmers.se/karppa/minerva/-/blob/main/README.md?ref_type=heads#jupyterhub">Minerva's JupyterHub</a>.</li>
+ <li>To make the course's Python environment available in notebooks, take the following steps:
+ <ol>
+ <li>Log in on Minerva and activate the course environment.</li>
+ <li>Enter <code>python -m ipykernel install --user --name DAT450_venv --display-name "Python (DAT450_venv)"</code></li>
+ <li>If JupyterHub is running, restart it. Otherwise, start it now.</li>
+ <li>In the Launcher, you should now see an option called <code>Python (DAT450_venv)</code>.</li>
+ <li>If you create a notebook, you should be able to import libraries needed for the assignment, e.g. <code>import transformers</code>.</li>
+ </ol></li>
+ <li>If you keep your code in a Python file copied from <tt>A1_skeleton.py</tt>, then add the following somewhere in your notebook:
+ <pre>%load_ext autoreload
+ %autoreload 2
+ import your_a1_solution</pre>
+ By enabling auto-reloading, you won't have to restart the notebook every time you update the code in the Python file. Note that auto-reloading in notebooks does not work if you do <code>from your_a1_solution import ...</code>.
+ </li>
+ </ul>
+ </div>
+ </details>

- Download and extract [this archive](https://www.cse.chalmers.se/~richajo/diverse/lmdemo.zip), which contains three text files. The files have been created from Wikipedia articles converted into raw text, with all Wiki markup removed. (We'll actually just use the training and validation sets, and you can ignore the test file.)

- ### Accessing the compute cluster
+ If you have questions about how to work with the cluster, please ask in the related [discussion thread](https://canvas.chalmers.se/courses/36909/discussion_topics/221739).

- TODO DESCRIBE HOW TO ACCESS MINERVA VENV
+ ### Optional: Working on some other machine
+ If you are working on your own machine, make sure that the following libraries are installed:
+ - [NLTK](https://www.nltk.org/install.html) or [SpaCy](https://spacy.io/usage) for word splitting,
+ - [PyTorch](https://pytorch.org/get-started/locally/) for building and training the models,
+ - [Transformers](https://huggingface.co/docs/transformers/installation) and [Datasets](https://huggingface.co/docs/datasets/installation) from HuggingFace,
+ - Optional: [Matplotlib](https://matplotlib.org/stable/users/getting_started/) and [scikit-learn](https://scikit-learn.org/stable/install.html) for the embedding visualization in the last step.
+ If you are using a Colab notebook, these libraries are already installed.

+ Then download and extract [this archive](https://www.cse.chalmers.se/~richajo/dat450/assignments/a1/a1.zip). It contains the text files and the code skeleton mentioned above.

## Part 1: Tokenization

@@ -94,7 +145,6 @@ The total size of the vocabulary (including the 4 symbols) should be at most `ma
<summary><b>Hint</b>: A <a href="https://docs.python.org/3/library/collections.html#collections.Counter"><code>Counter</code></a> can be convenient when computing the frequencies.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">A <code>Counter</code> is like a regular Python dictionary, with some additional functionality for computing frequencies. For instance, you can go through each paragraph and call <a href="https://docs.python.org/3/library/collections.html#collections.Counter.update"><code>update</code></a>. After building the <code>Counter</code> on your dataset, <a href="https://docs.python.org/3/library/collections.html#collections.Counter.most_common"><code>most_common</code></a> gives the most frequent items.</div>
</details>
- &nbsp;

Also create some utility that allows you to go back from the integer to the original word token. This will only be used in the final part of the assignment, where we look at model outputs and word embedding neighbors.
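
To make this concrete, here is a minimal sketch of one way to combine the frequency counting from the hint above with the inverse mapping. The function name, the exact special symbols, and the variable names are illustrative assumptions, not the API of `A1_skeleton.py`:

```
from collections import Counter

def build_vocab(tokenized_paragraphs, max_voc_size):
    """Map the most frequent tokens to integers, reserving ids for 4 special symbols.
    Sketch only: the special-symbol names below are assumptions, not the skeleton's API."""
    freqs = Counter()
    for paragraph in tokenized_paragraphs:            # each paragraph: a list of token strings
        freqs.update(paragraph)

    specials = ["<pad>", "<unk>", "<bos>", "<eos>"]   # assumed names for the 4 symbols
    voc = {symbol: index for index, symbol in enumerate(specials)}
    for word, _ in freqs.most_common(max_voc_size - len(specials)):
        voc[word] = len(voc)

    inv_voc = {index: word for word, index in voc.items()}   # integer -> token string
    return voc, inv_voc
```

With `inv_voc`, mapping a predicted integer back to a readable token is a single dictionary lookup.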

@@ -147,6 +197,8 @@ Verify that at least the `input_ids` tensor corresponds to what you expect. (As

## Part 2: Loading the text files and creating batches

+ (This part just introduces some functionality you may find useful when processing the data: it serves as a stepping stone for what you will do in Part 4. You do not have to include solutions to this part in your submission.)
+
**Loading the texts.** We will use the [HuggingFace Datasets](https://huggingface.co/docs/datasets/index) library to load the texts from the training and validation text files. (You may feel that we are overdoing it, since these are simple text files, but once again we want to introduce you to the standard ecosystem used in NLP.)

```
@@ -209,7 +261,7 @@ Define a neural network that implements an RNN-based language model. Use the ske

- an *embedding layer* that maps token integers to floating-point vectors,
- a *recurrent layer* implementing some RNN variant (we suggest [`nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) or [`nn.GRU`](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html), and it is best to avoid the "basic" `nn.RNN`),
- - an *output layer* that computes (the logits of) a probability distribution over the vocabulary.
+ - an *output layer* (or *unembedding layer*) that computes (the logits of) a probability distribution over the vocabulary.
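
For orientation, here is a minimal plain-PyTorch sketch of these three components. Your actual solution should follow the HuggingFace-style structure of `A1_skeleton.py` described below; the class name, layer sizes, and the choice of `nn.LSTM` here are illustrative assumptions:

```
import torch.nn as nn

class SketchRNNLanguageModel(nn.Module):
    """Embedding -> LSTM -> linear output layer producing logits over the vocabulary."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)          # token ids -> vectors
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)   # recurrent layer
        self.unembedding = nn.Linear(hidden_dim, vocab_size)              # logits over the vocabulary

    def forward(self, input_ids):
        # input_ids: (batch_size, sequence_length) integer tensor
        embedded = self.embedding(input_ids)      # (batch, seq, embedding_dim)
        hidden_states, _ = self.rnn(embedded)     # (batch, seq, hidden_dim)
        return self.unembedding(hidden_states)    # (batch, seq, vocab_size)
```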

Once again, we base our implementation on the HuggingFace Transformers library, to exemplify how models are defined when we use this library. Specifically, note that
- The model hyperparameters are stored in a configuration object `A1RNNModelConfig` that inherits from HuggingFace's `PretrainedConfig`;
@@ -322,6 +374,8 @@ Take some example context window and use the model to predict the next word.
- Use <a href="https://pytorch.org/docs/stable/generated/torch.argmax.html"><code>argmax</code></a> to find the index of the highest-scoring item, or <a href="https://pytorch.org/docs/stable/generated/torch.topk.html"><code>topk</code></a> to find the indices and scores of the *k* highest-scoring items.
- Apply the inverse vocabulary encoder (that you created in Step 2) so that you can understand what words the model thinks are the most likely in this context.

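As an illustration of these two steps, here is a sketch; `model`, `encode`, and `inv_voc` are placeholders for whatever you built in the earlier parts, and a model defined from the skeleton may return an output object rather than a raw logits tensor:

```
import torch

context = ["the", "capital", "of", "sweden", "is"]    # some example context window
input_ids = torch.tensor([encode(context)])           # shape: (1, context_length); `encode` is assumed

model.eval()
with torch.no_grad():
    logits = model(input_ids)                         # assumed shape: (1, context_length, vocab_size)

last_logits = logits[0, -1]                           # scores for the word following the context
scores, top_ids = torch.topk(last_logits, k=5)        # the 5 highest-scoring vocabulary items
print([inv_voc[i] for i in top_ids.tolist()])         # map integers back to words
```
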
+ **Make sure that one or more examples of next-word prediction are printed by your Python program and included in the submitted output file.**
+
### Quantitative evaluation

The most common way to evaluate language models quantitatively is the [perplexity](https://huggingface.co/docs/transformers/perplexity) score on a test dataset. The better the model is at predicting the actually occurring words, the lower the perplexity. This quantity is formally defined as follows:
@@ -342,13 +396,15 @@ The perplexity is traditionally defined in terms of logarithms of base 2. Howeve
</div>
</details>

- If you have time for exploration, investigate the effect of the context window size *N* (and possibly other hyperparameters such as embedding dimensionality) on the model's perplexity.
+ If you have time for exploration, investigate the effect of model hyperparameters and training settings on the model's perplexity.
+
+ **Make sure that the perplexity computed on the validation set is printed by your Python program and included in the submitted output file.**

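In practice, an equivalent way to obtain this number is to exponentiate the average per-token cross-entropy over the validation data (the choice of log base cancels as long as the exponentiation matches it). A minimal sketch, where `model` and the `(input_ids, target_ids)` batch format are assumptions about your own code:

```
import math
import torch
import torch.nn.functional as F

def validation_perplexity(model, validation_batches):
    """Perplexity = exp(average per-token cross-entropy over the validation set)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for input_ids, target_ids in validation_batches:    # assumed batch format
            logits = model(input_ids)                        # (batch, seq, vocab_size)
            loss = F.cross_entropy(logits.flatten(0, 1),     # (batch*seq, vocab_size)
                                   target_ids.flatten(),     # (batch*seq,)
                                   reduction="sum")          # add ignore_index=... if you pad
            total_loss += loss.item()
            total_tokens += target_ids.numel()
    return math.exp(total_loss / total_tokens)
```

Print the returned value so that it ends up in the output file you submit.
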
- ### Inspecting the word embeddings
+ ### Optional task: Inspecting the learned word embeddings

It is common to say that neural networks are "black boxes" and that we cannot fully understand their internal mechanics, especially as they grow larger and structurally more complex. The research area of model interpretability aims to develop methods to help us reason about the high-level functions the models implement.

- In this assignment, we will briefly investigate the [embeddings](https://en.wikipedia.org/wiki/Word_embedding) that your model learned while you trained it.
+ We will briefly investigate the [embeddings](https://en.wikipedia.org/wiki/Word_embedding) that your model learned while you trained it.
If we have successfully trained a word embedding model, an embedding vector stores a crude representation of "word meaning", so we can reason about the learned meaning representations by investigating the geometry of the vector space of word embeddings.
The most common way to do this is to look at nearest neighbors in the vector space: intuitively, if we look at some example word, its neighbors should correspond to words that have a similar meaning.

@@ -381,7 +437,7 @@ def nearest_neighbors(emb, voc, inv_voc, word, n_neighbors=5):
</div>
</details>

- Optionally, you may visualize some word embeddings in a two-dimensional plot.
+ Optionally, you may visualize some word embeddings in a two-dimensional plot (use a notebook while plotting or save the generated plot to a file via `plt.savefig`).
<details>
<summary><b>Hint</b>: Example code for PCA-based embedding scatterplot.</summary>
<div style="margin-left: 10px; border-radius: 4px; background: #ddfff0; border: 1px solid black; padding: 5px;">
