
Commit dfd731c

remove-unused
1 parent cf3d1c0 commit dfd731c

File tree

5 files changed (+32, -71 lines)


PROJECT_EXPLANATION.md

Lines changed: 8 additions & 8 deletions

@@ -10,18 +10,18 @@ The primary goal is to provide students with a tool to explore and understand th

 ## Key Processes and Components

-1. **Data Ingestion and Processing (`pdf_chunker.py`, `embed_store.py`):**
+1. **Data Ingestion and Processing (`pdf_chunker.py`, `faiss_store.py`):**
     * **Loading:** The system first reads the `grade-11-history-text-book.pdf`.
     * **Chunking:** The text content is broken down into smaller, manageable chunks (paragraphs or sections). This is crucial for effective retrieval.
     * **Embedding:** Each text chunk is converted into a numerical representation called an "embedding" using a machine learning model (likely via `gemini_utils.py` or a dedicated embedding model). Embeddings capture the semantic meaning of the text.
     * **Vector Store Creation:** These embeddings (vectors) and their corresponding text chunks are stored in a specialized database called a vector store. This project uses FAISS (`faiss_store.py`), which allows for very fast searching of similar vectors. The store consists of `faiss_index.index` (for the vectors) and `faiss_metadata.pkl` (linking vectors back to text and metadata). This step only needs to be done once unless the source PDF changes.

-2. **User Interaction (Web: `app.py`, `templates/index.html`; CLI: `cli_chat.py`):**
+2. **User Interaction (Web: `web.py`, `templates/index.html`; CLI: `cli_chat.py`):**
     * Users can interact with the chatbot through either a web-based graphical interface or a simple command-line interface.
     * The user types a question (e.g., "What were the main causes of World War 1 according to the textbook?").

-3. **Retrieval-Augmented Generation (RAG) Pipeline (`query_answer.py`, `agents/` directory):**
-    * **Query Analysis (`agents/query_analyzer.py` - potentially):** The user's query might be analyzed or rephrased for better retrieval.
+3. **Retrieval-Augmented Generation (RAG) Pipeline (`agents/` directory):**
+    * **Query Analysis (`agents/query_analyzer.py`):** The user's query is analyzed using basic regex to extract keywords and entities.
     * **Query Embedding:** The user's question is also converted into an embedding using the same model as the document chunks.
     * **Retrieval (`agents/retriever.py`, `faiss_store.py`):** The system searches the FAISS vector store for the text chunks whose embeddings are most similar (closest in vector space) to the query embedding. These retrieved chunks are considered the most relevant context from the textbook.
     * **Context Expansion (`agents/context_expander.py` - potentially):** The retrieved context might be expanded or refined.
@@ -32,7 +32,7 @@ The primary goal is to provide students with a tool to explore and understand th
     * Instructions for the AI (e.g., "Answer the user's question using *only* the provided context.").
     * **Generation (`agents/generator.py`, `gemini_utils.py`):** The constructed prompt is sent to the Google Gemini API. Gemini reads the prompt, understands the question and the provided context, and generates a natural language answer.
     * **Reference Tracking (`agents/reference_tracker.py` - potentially):** The system might track which parts of the retrieved context were used to generate the answer, potentially for citation purposes (though this isn't explicitly shown in the UI).
-    * **Orchestration (`agents/orchestrator.py`):** This component likely manages the flow between the different agents (retriever, generator, etc.) ensuring they work together correctly.
+    * **Orchestration (`agents/orchestrator.py`):** This component manages the flow between the different agents (retriever, generator, etc.) ensuring they work together correctly.

 4. **Response Delivery:**
     * The generated answer is sent back to the user interface (web or CLI) and displayed.
@@ -41,9 +41,9 @@ The primary goal is to provide students with a tool to explore and understand th
 ## Technology Stack

 * **Language:** Python
-* **Web Framework:** Flask (`app.py`)
+* **Web Framework:** Flask (`web.py`)
 * **Generative AI:** Google Gemini (`gemini_utils.py`)
 * **Vector Store:** FAISS (`faiss_store.py`)
-* **PDF Processing:** PyPDF (likely used in `pdf_chunker.py`)
+* **PDF Processing:** PyMuPDF (via `fitz` in `pdf_chunker.py`)
 * **Frontend:** HTML, CSS, JavaScript (`templates/`, `static/`)
-* **Potential Libraries:** LangChain (often used for RAG orchestration), python-dotenv (for environment variables), NumPy.
+* **Core Dependencies:** `python-dotenv` (for environment variables), `numpy`.
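
The ingestion step this explanation walks through (chunk, embed, build the FAISS index plus pickle metadata) can be sketched in a few lines. This is a minimal illustration only: the helper names `embed_chunks` and `build_store` and the `models/embedding-001` model name are assumptions, not the project's actual code in `pdf_chunker.py`, `gemini_utils.py`, or `faiss_store.py`.

```python
# Minimal sketch of the ingestion step (assumed helper and model names).
# Requires: google-generativeai, faiss-cpu, numpy; genai.configure() must be called first.
import pickle

import faiss
import numpy as np
import google.generativeai as genai


def embed_chunks(chunks: list[str]) -> np.ndarray:
    """Embed each text chunk with the Gemini embedding API."""
    vectors = [
        genai.embed_content(model="models/embedding-001", content=chunk)["embedding"]
        for chunk in chunks
    ]
    return np.array(vectors, dtype="float32")


def build_store(chunks: list[str]) -> None:
    """Create faiss_index.index and faiss_metadata.pkl from the text chunks."""
    vectors = embed_chunks(chunks)
    index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the embeddings
    index.add(vectors)
    faiss.write_index(index, "faiss_index.index")
    with open("faiss_metadata.pkl", "wb") as f:
        pickle.dump({i: chunk for i, chunk in enumerate(chunks)}, f)
```

As the explanation notes, the store only needs to be rebuilt when the source PDF changes.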

README.md

Lines changed: 5 additions & 2 deletions

@@ -1,6 +1,6 @@
 # Yuhasa - History Tutor Chatbot

-This project implements a Retrieval-Augmented Generation (RAG) chatbot focused on answering questions about a Grade 11 History textbook. It uses Google's Gemini AI for language understanding and generation, and FAISS for efficient information retrieval from the textbook content.
+This project implements a Retrieval-Augmented Generation (RAG) chatbot focused on answering questions about a Grade 11 History textbook. It uses Google's Gemini AI for language understanding and generation, and FAISS for efficient information retrieval from the textbook content. The project intentionally uses minimal dependencies (Flask, FAISS, Gemini, PyPDF) for simplicity, speed, and maintainability.

 ## Prerequisites

@@ -28,7 +28,7 @@ This project implements a Retrieval-Augmented Generation (RAG) chatbot focused o
 ```
 pip install -r requirements.txt
 ```
-*(Note: A `requirements.txt` file might need to be created if it doesn't exist. Based on the project files, likely dependencies include: `flask`, `google-generativeai`, `faiss-cpu` or `faiss-gpu`, `langchain` (potentially), `pypdf`, `python-dotenv`, `numpy`, `spacy`)*
+*(Note: The `requirements.txt` file lists the necessary dependencies: `flask`, `google-generativeai`, `faiss-cpu`, `pypdf`, `python-dotenv`, `numpy`)*

 3. **Configure API Key:**
     * Create a file named `.env` in the root project directory.
@@ -76,3 +76,6 @@ There seem to be multiple ways to interact with the chatbot:
 * `chats/`: Stores conversation history (JSON files).
 * `grade-11-history-text-book.pdf`: The source document.
 * `faiss_index.index`, `faiss_metadata.pkl`: The generated vector store files.
+* `requirements.txt`: Lists the project dependencies.
+* `README.md`: This file.
+* `PROJECT_EXPLANATION.md`: Detailed explanation of the project architecture.
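
The "Configure API Key" step above pairs naturally with `python-dotenv`. Below is a minimal sketch of how the key could be loaded from `.env` and handed to `google-generativeai`; the variable name `GOOGLE_API_KEY` is an assumption, since the README does not state which name the code expects.

```python
# Minimal sketch of loading the Gemini API key from .env (variable name assumed).
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads the .env file in the project root
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("Add your Gemini API key to the .env file before starting the app.")
genai.configure(api_key=api_key)
```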

agents/query_analyzer.py

Lines changed: 19 additions & 38 deletions

@@ -1,49 +1,30 @@
 # agents/query_analyzer.py
-import spacy
-import logging # Added import
+import logging
 from .base import BaseAgent
 import re
 import time

-# Load the spaCy model once when the class is instantiated
-try:
-    nlp = spacy.load("en_core_web_sm")
-    print("✅ spaCy model 'en_core_web_sm' loaded successfully.")
-except OSError:
-    print("❌ Error loading spaCy model 'en_core_web_sm'.")
-    print(" Please run: python -m spacy download en_core_web_sm")
-    nlp = None # Set nlp to None if loading fails
-
 logger = logging.getLogger(__name__) # Get a logger for this module

 class QueryAnalyzerAgent(BaseAgent):
-    """Agent responsible for analyzing the user query."""
+    """Agent responsible for analyzing the user query using simple regex methods."""
     def run(self, query: str, chat_history: list = None) -> dict: # Add chat_history parameter
         start_time = time.time()
         logger.debug(f"Analyzing query: '{query}' with history: {chat_history is not None}") # Log if history is present
-        if not nlp:
-            logger.warning("spaCy model not loaded, falling back to basic analysis.")
-            # Fallback basic extraction (similar to previous web.py logic)
-            keywords = re.findall(r'"(.*?)"|\b[A-Z][a-zA-Z]+\b', query)
-            entities = re.findall(r'\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b', query)
-            keywords = list(set([k.strip().lower() for k in keywords if k]))
-            entities = list(set([e.strip() for e in entities if len(e.split()) > 1 or e in keywords]))
-        else:
-            # TODO: Incorporate chat_history into spaCy analysis if needed
-            # For now, just process the current query
-            doc = nlp(query)
-
-            # Extract Named Entities (GPE, PERSON, ORG, LOC, EVENT, DATE etc.)
-            entities = list(set([ent.text.strip() for ent in doc.ents if ent.label_ in ["GPE", "PERSON", "ORG", "LOC", "EVENT", "DATE", "FAC", "PRODUCT", "WORK_OF_ART"]]))
-
-            # Extract Keywords (Noun chunks and Proper Nouns)
-            keywords = list(set([chunk.text.lower().strip() for chunk in doc.noun_chunks]))
-            # Add proper nouns that might not be part of chunks or recognized entities
-            keywords.extend([token.text.lower().strip() for token in doc if token.pos_ == "PROPN" and token.text not in entities])
-            # Remove duplicates that might exist between entities and keywords after lowercasing
-            keywords = list(set(keywords))
-            # Optional: Remove very short keywords if needed
-            # keywords = [kw for kw in keywords if len(kw) > 2]
+
+        # Use basic regex extraction (similar to previous fallback logic)
+        # Extract potential keywords (quoted phrases or capitalized words)
+        keywords = re.findall(r'"(.*?)"|\b[A-Z][a-zA-Z]+\b', query)
+        # Extract potential entities (multi-word capitalized phrases)
+        entities = re.findall(r'\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b', query)
+
+        # Clean up keywords: lowercase and remove duplicates/empty strings
+        keywords = list(set([k.strip().lower() for k in keywords if k]))
+        # Clean up entities: remove duplicates and single words already in keywords
+        entities = list(set([e.strip() for e in entities if len(e.split()) > 1 or e in keywords]))
+        # Further refine keywords: remove any that are now part of multi-word entities
+        entity_words = set(word.lower() for entity in entities for word in entity.split())
+        keywords = [kw for kw in keywords if kw not in entity_words and kw not in [e.lower() for e in entities]]

         # Determine Query Type (Keep existing logic)
         query_lower = query.lower()
@@ -63,12 +44,12 @@ def run(self, query: str, chat_history: list = None) -> dict: # Add chat_history
             "entities": entities,
             "query_type": query_type,
             # Optionally include history info if used
-            # "history_considered": chat_history is not None
+            # "history_considered": chat_history is not None
         }
-
+
         end_time = time.time()
         # Log the extracted information
         logger.debug(f"Analysis Results: Keywords: {analysis['keywords']}, Entities: {analysis['entities']}, Query Type: {analysis['query_type']}")
         logger.debug(f"Analysis Time: {end_time - start_time:.4f}s")
-
+
         return analysis
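
To see what the simplified analyzer now produces, the regex extraction from the diff above can be run standalone, outside `BaseAgent` and the orchestrator. The sample query is illustrative only.

```python
# Standalone reproduction of the regex extraction introduced in this commit.
import re

query = "What were the main causes of World War 1 according to the textbook?"

keywords = re.findall(r'"(.*?)"|\b[A-Z][a-zA-Z]+\b', query)
entities = re.findall(r'\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b', query)

keywords = list(set([k.strip().lower() for k in keywords if k]))
entities = list(set([e.strip() for e in entities if len(e.split()) > 1 or e in keywords]))

entity_words = set(word.lower() for entity in entities for word in entity.split())
keywords = [kw for kw in keywords if kw not in entity_words
            and kw not in [e.lower() for e in entities]]

# Because the keyword pattern's only capture group is the quoted-phrase part,
# unquoted capitalised words come back from findall as empty strings and are
# filtered out, so only quoted phrases survive as keywords.
print("keywords:", keywords)   # -> []
print("entities:", entities)   # -> ['World War']
```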

requirements.txt

Lines changed: 0 additions & 4 deletions

@@ -1,10 +1,6 @@
 flask
 google-generativeai
 faiss-cpu
-langchain
 pypdf
 python-dotenv
 numpy
-spacy
-nltk
-gtts

web.py

Lines changed: 0 additions & 19 deletions

@@ -6,7 +6,6 @@
 import time
 from agents.orchestrator import OrchestratorAgent
 import traceback
-from gtts import gTTS
 from flask import send_file
 import io

@@ -177,24 +176,6 @@ def api_delete_chat(chat_id):
     os.remove(path)
     return jsonify({'success': True})

-@app.route('/tts', methods=['POST'])
-def text_to_speech():
-    data = request.get_json()
-    text = data.get('text', '')
-    if not text:
-        return jsonify({'error': 'No text provided'}), 400
-    try:
-        # Generate speech using gTTS
-        tts = gTTS(text=text, lang='en')
-        mp3_fp = io.BytesIO()
-        tts.write_to_fp(mp3_fp)
-        mp3_fp.seek(0)
-        # Return the audio file as response
-        return send_file(mp3_fp, mimetype='audio/mpeg', as_attachment=False, download_name='speech.mp3')
-    except Exception as e:
-        logger.error(f"Error generating speech: {e}", exc_info=True)
-        return jsonify({'error': 'Failed to generate speech'}), 500
-
 if __name__ == '__main__':
     if orchestrator:
         logger.info("Starting Flask development server...")
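
With the gTTS route removed, `web.py` serves the chat UI and hands questions to the orchestrator. For readers who want the end-to-end shape of that pipeline, here is a minimal retrieve-and-generate sketch. The helper names, the `models/embedding-001` embedding model, and the `gemini-1.5-flash` model name are assumptions; the real logic lives in `agents/` and `faiss_store.py`.

```python
# Minimal sketch of the retrieve-and-generate path the orchestrator drives
# (assumed helper and model names; genai.configure() must be called first).
import pickle

import faiss
import numpy as np
import google.generativeai as genai


def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Embed the query and return the top_k most similar textbook chunks."""
    index = faiss.read_index("faiss_index.index")
    with open("faiss_metadata.pkl", "rb") as f:
        metadata = pickle.load(f)  # assumed: maps vector position -> chunk text
    emb = genai.embed_content(model="models/embedding-001", content=query)
    query_vec = np.array([emb["embedding"]], dtype="float32")
    _, indices = index.search(query_vec, top_k)
    return [metadata[i] for i in indices[0] if i != -1]  # -1 marks missing neighbours


def answer(query: str) -> str:
    """Build a context-grounded prompt and ask Gemini to answer from it."""
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the user's question using only the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    return model.generate_content(prompt).text
```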
