
Commit c26183e

Merge pull request #130 from ricj/master
Updated for 2025.
2 parents b35c9d9 + 537ae72 commit c26183e

File tree

1 file changed: +191 -46 lines changed

_pages/dat450/assignment5.md

# DAT450/DIT247: Programming Assignment 5: Retrieval-augmented text generation

In this assignment we will build our own RAG pipeline using LangChain.

## Pedagogical purposes of this assignment

- Get an understanding of how RAG can be used within NLP.
- Learn how to use LangChain to build NLP applications.
- Get an understanding of the challenges and use cases of RAG.

## Requirements

Please submit your solution in Canvas. **Submission deadline: December 8.**

Submit Python files containing your solution to the programming tasks described below. In addition, to save time for the people who grade your submission, please submit a text file containing the outputs printed out by your Python program; read the instructions carefully so that the right outputs are included.

This is a pure programming assignment and you do not have to write a technical report: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.

## Step 0: Preliminaries

For the assignments in the course, you can use an environment we have prepared for this course: `/data/courses/2025_dat450_dit247/venvs/dat450_venv`. (To activate this environment, type `source /data/courses/2025_dat450_dit247/venvs/dat450_venv/bin/activate`.)

If you are running on Colab or your own environment, make sure the following packages are installed:

```bash
pip install langchain
pip install langchain-community
pip install langchain-huggingface
pip install langchain-core
pip install sentence_transformers
pip install langchain-chroma
```

## Step 1: Get the dataset

You will be working with the [PubMedQA dataset](https://github.com/pubmedqa/pubmedqa) described in this [paper](https://aclanthology.org/D19-1259.pdf). The dataset has been created from medical research papers from [PubMed](https://pubmed.ncbi.nlm.nih.gov/); you can read more about it in the linked paper.

Use the following code to get the dataset for the assignment.

If you are running on Minerva or in your own environment, run the following command on the command line. If you are using a notebook environment such as Colab, put the command in a code cell with a leading `!` and run the cell.

```bash
wget https://raw.githubusercontent.com/pubmedqa/pubmedqa/refs/heads/master/data/ori_pqal.json
```

### Collect two datasets
You will collect two datasets from the downloaded file:
- 'questions': the questions with corresponding gold long answer, gold document ID, and year.
- 'documents': the abstracts (contexts+long_answer concatenated), and year.

You can run the following code to collect these two datasets.

```python
import pandas as pd

tmp_data = pd.read_json("ori_pqal.json").T
# some labels have been defined as "maybe", only keep the yes/no answers
tmp_data = tmp_data[tmp_data.final_decision.isin(["yes", "no"])]

# ... (construction of the `documents` DataFrame omitted here) ...

questions = pd.DataFrame({"question": tmp_data.QUESTION,
                          # ... (other columns omitted here) ...
                          "gold_document_id": documents.index})
```

**Sanity check:** You can print out some of the data in the dataset.

An example of a question our RAG pipeline should answer:
```python
questions.iloc[0].question
```

An example of a document the pipeline can leverage to answer the questions:

```python
documents.iloc[0].abstract
```

## Step 2: Configure your LangChain LM

### Step 2.1: Find a language model from HuggingFace

Define a language model that will act as the generative model in your RAG pipeline. You can browse for different Hugging Face models on their [webpage](https://huggingface.co/models).

> Some interesting models (e.g. Llama 3.2) may require that you apply for access. This process is usually quite fast, but it may require that you create an account on Hugging Face (it is free). To use a gated model you need to generate a personal HF token and put it as a secret in your notebook (if using Colab). Make sure that the token has "Read access to contents of all public gated repos you can access" enabled.

<details>
<summary><b>Hint:</b> How to set up a HuggingFace token when using Minerva</summary>

If you need to use the HuggingFace token and you are using Minerva, one way to do it is to set an environment variable in your bash configuration file: `export HF_TOKEN=your_token`, and then read it in your Python code: `hf_token = os.getenv('HF_TOKEN')`. Also, to avoid your token being misused, remember to remove the actual token you are using from your submission.

</details>

### Step 2.2: Load the language model

You can load the HuggingFace language model using `HuggingFacePipeline.from_model_id`.

When calling `HuggingFacePipeline`, set `return_full_text=False` to only return the assistant's response, and call `model.invoke(your_prompt)` to retrieve the text of the output.
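
For example, a minimal sketch of how the loading could look (the model ID, device setting, and generation parameters below are only placeholders; pick any generative model you have access to):

```python
from langchain_huggingface import HuggingFacePipeline

# Load a generative model from the Hugging Face Hub (the model_id is just an example)
model = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-3.2-1B-Instruct",  # placeholder; any instruction-tuned model you can access
    task="text-generation",
    device=0,  # GPU index; use -1 to run on CPU
    pipeline_kwargs={
        "max_new_tokens": 128,
        "return_full_text": False,  # only return the newly generated text
    },
)

print(model.invoke("What is retrieval-augmented generation?"))
```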

**Sanity check:** Prompt your LangChain model and confirm that it returns a reasonable output.

**Include the prompt and the output of this model in your output file.**

## Step 3: Set up the document database

### Step 3.1: Embedding model
First, you need a model to embed the documents in the retrieval corpus. Here, we recommend using the [HuggingFaceEmbeddings](https://docs.langchain.com/oss/python/integrations/text_embedding/huggingfacehub) class.

**Sanity check:** Pass a text passage to the embedding model by calling `embed_query` and evaluate its shape. It should be of the shape (embedding_dim,).
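
A minimal sketch of this sanity check (the model name is only an example; any sentence-transformers model should work):

```python
from langchain_huggingface import HuggingFaceEmbeddings

# The model name is just an example of a small sentence-transformers model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vec = embedding_model.embed_query("Is aspirin effective against headaches?")
print(len(vec))  # should equal the embedding dimension of the model
```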

### Step 3.2: Chunking
Second, you need to chunk the documents in your retrieval corpus, as some are likely too long for the embedding model. Here, you can use the [RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter) as a start. The retrieval corpus is given by `documents.abstract`: use `create_documents` on the text splitter with the retrieval corpus to create LangChain `Document` objects, and then use `split_documents` to create the text chunks that will be used when creating the vector store.

For the evaluation in Step 5, we recommend saving the document ID in the `metadatas` when creating the documents:

```python
metadatas = [{"id": idx} for idx in documents.index]
docs = text_splitter.create_documents(texts=documents.abstract.tolist(), metadatas=metadatas)
```

**Sanity check:** Print some samples from the text chunks and check that they make sense. This way, you might be able to get a feeling for a good chunk size.
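
For instance, a sketch of the whole chunking step (the chunk size and overlap below are just starting points to experiment with):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk sizes are only a starting point; tune them based on your sanity checks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

metadatas = [{"id": idx} for idx in documents.index]
docs = text_splitter.create_documents(texts=documents.abstract.tolist(), metadatas=metadatas)
chunks = text_splitter.split_documents(docs)

# Inspect a few chunks to get a feeling for a good chunk size
for chunk in chunks[:3]:
    print(chunk.metadata, chunk.page_content[:200])
```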

### Step 3.3: Define a vector store
Third, you need a vector store to store the documents and corresponding embeddings. There are many document databases and retrievers to play around with. As a start, you can use the [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma/) vector store with cosine similarity as the distance metric.

When building your vector store, pass the embedding model from [Step 3.1](#step-31-embedding-model) as the embedding model and use the text chunks from [Step 3.2](#step-32-chunking) as the documents in the vector store. To add documents to the vector store, you can use `Chroma.from_documents` when creating the vector store, or `vector_store.add_documents` after creating it.
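
A sketch of how this could look, assuming the embedding model and chunks from the previous steps (setting `collection_metadata={"hnsw:space": "cosine"}` is one way to ask Chroma to use cosine distance):

```python
from langchain_chroma import Chroma

# Build the vector store from the text chunks, embedding them with the model from Step 3.1
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    collection_metadata={"hnsw:space": "cosine"},  # use cosine distance for similarity search
)
```

Alternatively, create the `Chroma` object first and then call `vector_store.add_documents(chunks)`.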

**Sanity check:** Query your vector store as follows and check that the results make sense:
```python
results = vector_store.similarity_search_with_score(your_query)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
```

## Step 4: Define the full RAG pipeline

In this and the following steps, we will gradually build a RAG chain.

There are two options for building the RAG chain, and you can choose either **one** of them to build your own RAG:

[Option A](#option-a-build-a-rag-agent-based-on-the-official-langchain-guide): Build a RAG agent based on the official LangChain guide: [here](https://docs.langchain.com/oss/python/langchain/rag). Here we will use a two-step chain, in which we will run a search in the vector store and incorporate the result as context for LLM queries.

[Option B](#option-b-build-a-rag-chain-based-on-langchain-open-tutorial): Build a RAG chain using LangChain Expression Language (LCEL) based on a LangChain Open Tutorial: [here](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/13-LangChain-Expression-Language/05-RunnableParallel.ipynb#scrollTo=635d8ebb). Here we will use the [RunnableParallel](https://reference.langchain.com/python/langchain_core/runnables/?h=runnablepara#langchain_core.runnables.base.RunnableParallel) class to build a RAG chain that will also return the retrieved document.

### Option A: Build a RAG agent based on the official LangChain guide

Here, we will define a custom prompt while incorporating the retrieval step.

In order to access the retrieved documents, we can set up the agent so that it will [return the source documents](https://docs.langchain.com/oss/python/langchain/rag#returning-source-documents).

```python
from typing import Any
from langchain_core.documents import Document
from langchain.agents.middleware import AgentMiddleware, AgentState


class State(AgentState):
    context: list[Document]


class RetrieveDocumentsMiddleware(AgentMiddleware[State]):
    state_schema = State

    def __init__(self, vector_store):
        self.vector_store = vector_store

    def before_model(self, state: AgentState) -> dict[str, Any] | None:
        last_message = state["messages"][-1]  # get the user input query
        retrieved_docs = self.vector_store.similarity_search(last_message.text)  # search for documents

        docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

        augmented_message_content = (
            # Put your prompt here
        )
        return {
            "messages": [last_message.model_copy(update={"content": augmented_message_content})],
            "context": retrieved_docs,
        }
```

As a start, you might want to fetch only one document per prompt.

<details>
<summary><b>Hint:</b> Prompt the model for classification later</summary>

In Step 5, we will be using the RAG agent to evaluate whether the model can correctly answer the questions with "Yes" or "No". For the evaluation, you may want to prompt the model so that it returns only "Yes" or "No", or at least starts its answer with "Yes" or "No".

</details>

We are now ready to create the RAG agent. In this step, we can use `create_agent` to build it, and pass a `RetrieveDocumentsMiddleware` object as the middleware.
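
A sketch of what this could look like, following the agent API from the LangChain guide linked above (the exact arguments may differ between LangChain versions):

```python
from langchain.agents import create_agent

# The middleware performs the retrieval step before the model is called
agent = create_agent(
    model,
    tools=[],  # no extra tools are needed; retrieval happens in the middleware
    middleware=[RetrieveDocumentsMiddleware(vector_store)],
)
```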

**Sanity check:** Take a question from your dataset and check whether the model seems to retrieve a relevant document, and answer in a reasonable fashion.

To print out the results in a readable way, you can use the solution given by LangChain:

```python
for step in agent.stream(
    {"messages": [{"role": "user", "content": your_query}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()
```

**Include the prompt and the output of this model in your output file.**

### Option B: Build a RAG chain based on LangChain Open Tutorial

Here, we will first define a retriever on the vector store to retrieve documents:

```python
retriever = vector_store.as_retriever()
```

As a start, you might want the retriever to fetch only one document per prompt.

Then, define your template and use `ChatPromptTemplate.from_template` to create a chat prompt.
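
For example, a template along these lines (the wording is only an illustration; the `{context}` and `{question}` placeholders will be filled in by the chain):

```python
from langchain_core.prompts import ChatPromptTemplate

# An illustrative template that steers the model toward Yes/No answers
template = """Answer the question based only on the following context. Answer only "Yes" or "No".

Context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
```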

With the retriever and the prompt, you should be able to define the RAG chain. In order to return the retrieved context as well as the answers for further evaluation, first define a `RunnableParallel` object that takes the context and the question, and then define a chain that generates only the text output, like this:

```python
from langchain_core.output_parsers import StrOutputParser

# Construct the generation part of the chain
chain = (
    prompt
    | model
    | StrOutputParser()
)
```

Lastly, combine the `RunnableParallel` object with the chain using the [`assign`](https://reference.langchain.com/python/langchain_core/runnables/?h=runnablepara#langchain_core.runnables.base.RunnableParallel.assign) method.

```python
rag_chain = runnable_parallel_object.assign(answer=chain)
```

When you invoke `rag_chain`, the output is a dictionary, so you can access the retrieved documents through its `"context"` key and the generated answer through its `"answer"` key.
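
Putting the pieces together, a sketch of how the `RunnableParallel` object and the final chain could be wired up (the variable names are just suggestions):

```python
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Retrieve documents and pass the raw question through in parallel
runnable_parallel_object = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
)

# Attach the generation chain so that its output is stored under the "answer" key
rag_chain = runnable_parallel_object.assign(answer=chain)

output = rag_chain.invoke(questions.iloc[0].question)
print(output["context"])  # the retrieved documents
print(output["answer"])   # the generated answer
```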

**Sanity check:** Take a question from your dataset and check whether the model seems to retrieve a relevant document, and answer in a reasonable fashion.

**Include the prompt and the output of this model in your output file.**

## Step 5: Evaluate RAG on the dataset

Here we will carry out four evaluation tasks to evaluate the RAG pipeline on the given dataset; a sketch of one possible evaluation loop is given after the list.

1. Evaluate your full RAG pipeline on the medical questions (`questions.question`) and corresponding gold labels (`questions.gold_label`).

   Since the gold labels can be cast to a binary variable (yes/no), you may use the F1 and/or accuracy metrics.

   We expect the model to answer "Yes" or "No", but it can happen that the model gives some other answer. In that case, one way to perform the evaluation is to keep track of the number of valid answers and do the evaluation only on the valid answers.

2. As a baseline, run the same LM without context and compare the performance of the two setups. You can use the same evaluation method as for the RAG evaluation. Did the retrieval help?

3. Also evaluate whether the gold document is fetched for each question. You can compare the retrieved document IDs with the gold document ID given by `questions.gold_document_id`.

4. Also, inspect some retrieved documents and corresponding model answers. Does the pipeline seem to work as intended?
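
A sketch of how such an evaluation loop could look for the Option B chain (the variable names are hypothetical, scikit-learn is assumed for the metrics, and the gold labels are assumed to be the strings "yes"/"no"; for Option A you would read the answer and the retrieved documents from the agent's output instead):

```python
from sklearn.metrics import accuracy_score, f1_score

predictions, gold = [], []
retrieval_hits = 0

for _, row in questions.iterrows():
    output = rag_chain.invoke(row.question)
    answer = output["answer"].strip().lower()

    # Check whether a chunk from the gold document was retrieved
    retrieved_ids = {doc.metadata["id"] for doc in output["context"]}
    if row.gold_document_id in retrieved_ids:
        retrieval_hits += 1

    # Only evaluate answers that start with a valid yes/no label
    if answer.startswith("yes") or answer.startswith("no"):
        predictions.append("yes" if answer.startswith("yes") else "no")
        gold.append(row.gold_label)

print("Valid answers:", len(predictions), "of", len(questions))
print("Accuracy:", accuracy_score(gold, predictions))
print("F1:", f1_score(gold, predictions, pos_label="yes"))
print("Gold document retrieved for", retrieval_hits, "of", len(questions), "questions")
```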
**Include the evaluation results in your output file.**
