
Commit dfd731c

remove-unused
1 parent cf3d1c0 commit dfd731c

File tree

5 files changed (+32, -71 lines)


PROJECT_EXPLANATION.md

Lines changed: 8 additions & 8 deletions

@@ -10,18 +10,18 @@ The primary goal is to provide students with a tool to explore and understand th

 ## Key Processes and Components

-1. **Data Ingestion and Processing (`pdf_chunker.py`, `embed_store.py`):**
+1. **Data Ingestion and Processing (`pdf_chunker.py`, `faiss_store.py`):**
     * **Loading:** The system first reads the `grade-11-history-text-book.pdf`.
     * **Chunking:** The text content is broken down into smaller, manageable chunks (paragraphs or sections). This is crucial for effective retrieval.
     * **Embedding:** Each text chunk is converted into a numerical representation called an "embedding" using a machine learning model (likely via `gemini_utils.py` or a dedicated embedding model). Embeddings capture the semantic meaning of the text.
     * **Vector Store Creation:** These embeddings (vectors) and their corresponding text chunks are stored in a specialized database called a vector store. This project uses FAISS (`faiss_store.py`), which allows for very fast searching of similar vectors. The store consists of `faiss_index.index` (for the vectors) and `faiss_metadata.pkl` (linking vectors back to text and metadata). This step only needs to be done once unless the source PDF changes.

-2. **User Interaction (Web: `app.py`, `templates/index.html`; CLI: `cli_chat.py`):**
+2. **User Interaction (Web: `web.py`, `templates/index.html`; CLI: `cli_chat.py`):**
     * Users can interact with the chatbot through either a web-based graphical interface or a simple command-line interface.
     * The user types a question (e.g., "What were the main causes of World War 1 according to the textbook?").

-3. **Retrieval-Augmented Generation (RAG) Pipeline (`query_answer.py`, `agents/` directory):**
-    * **Query Analysis (`agents/query_analyzer.py` - potentially):** The user's query might be analyzed or rephrased for better retrieval.
+3. **Retrieval-Augmented Generation (RAG) Pipeline (`agents/` directory):**
+    * **Query Analysis (`agents/query_analyzer.py`):** The user's query is analyzed using basic regex to extract keywords and entities.
     * **Query Embedding:** The user's question is also converted into an embedding using the same model as the document chunks.
     * **Retrieval (`agents/retriever.py`, `faiss_store.py`):** The system searches the FAISS vector store for the text chunks whose embeddings are most similar (closest in vector space) to the query embedding. These retrieved chunks are considered the most relevant context from the textbook.
     * **Context Expansion (`agents/context_expander.py` - potentially):** The retrieved context might be expanded or refined.
@@ -32,7 +32,7 @@ The primary goal is to provide students with a tool to explore and understand th
     * Instructions for the AI (e.g., "Answer the user's question using *only* the provided context.").
     * **Generation (`agents/generator.py`, `gemini_utils.py`):** The constructed prompt is sent to the Google Gemini API. Gemini reads the prompt, understands the question and the provided context, and generates a natural language answer.
     * **Reference Tracking (`agents/reference_tracker.py` - potentially):** The system might track which parts of the retrieved context were used to generate the answer, potentially for citation purposes (though this isn't explicitly shown in the UI).
-    * **Orchestration (`agents/orchestrator.py`):** This component likely manages the flow between the different agents (retriever, generator, etc.) ensuring they work together correctly.
+    * **Orchestration (`agents/orchestrator.py`):** This component manages the flow between the different agents (retriever, generator, etc.) ensuring they work together correctly.

 4. **Response Delivery:**
     * The generated answer is sent back to the user interface (web or CLI) and displayed.
@@ -41,9 +41,9 @@ The primary goal is to provide students with a tool to explore and understand th
 ## Technology Stack

 * **Language:** Python
-* **Web Framework:** Flask (`app.py`)
+* **Web Framework:** Flask (`web.py`)
 * **Generative AI:** Google Gemini (`gemini_utils.py`)
 * **Vector Store:** FAISS (`faiss_store.py`)
-* **PDF Processing:** PyPDF (likely used in `pdf_chunker.py`)
+* **PDF Processing:** PyMuPDF (via `fitz` in `pdf_chunker.py`)
 * **Frontend:** HTML, CSS, JavaScript (`templates/`, `static/`)
-* **Potential Libraries:** LangChain (often used for RAG orchestration), python-dotenv (for environment variables), NumPy.
+* **Core Dependencies:** `python-dotenv` (for environment variables), `numpy`.
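
The ingestion step this explanation walks through (chunk, embed, build the FAISS index plus pickle metadata) can be sketched in a few lines. This is a minimal illustration only: the helper names `embed_chunks` and `build_store` and the `models/embedding-001` model name are assumptions, not the project's actual code in `pdf_chunker.py`, `gemini_utils.py`, or `faiss_store.py`.

```python
# Minimal sketch of the ingestion step (assumed helper and model names).
# Requires: google-generativeai, faiss-cpu, numpy; genai.configure() must be called first.
import pickle

import faiss
import numpy as np
import google.generativeai as genai


def embed_chunks(chunks: list[str]) -> np.ndarray:
    """Embed each text chunk with the Gemini embedding API."""
    vectors = [
        genai.embed_content(model="models/embedding-001", content=chunk)["embedding"]
        for chunk in chunks
    ]
    return np.array(vectors, dtype="float32")


def build_store(chunks: list[str]) -> None:
    """Create faiss_index.index and faiss_metadata.pkl from the text chunks."""
    vectors = embed_chunks(chunks)
    index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the embeddings
    index.add(vectors)
    faiss.write_index(index, "faiss_index.index")
    with open("faiss_metadata.pkl", "wb") as f:
        pickle.dump({i: chunk for i, chunk in enumerate(chunks)}, f)
```

As the explanation notes, the store only needs to be rebuilt when the source PDF changes.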

README.md

Lines changed: 5 additions & 2 deletions

@@ -1,6 +1,6 @@
 # Yuhasa - History Tutor Chatbot

-This project implements a Retrieval-Augmented Generation (RAG) chatbot focused on answering questions about a Grade 11 History textbook. It uses Google's Gemini AI for language understanding and generation, and FAISS for efficient information retrieval from the textbook content.
+This project implements a Retrieval-Augmented Generation (RAG) chatbot focused on answering questions about a Grade 11 History textbook. It uses Google's Gemini AI for language understanding and generation, and FAISS for efficient information retrieval from the textbook content. The project intentionally uses minimal dependencies (Flask, FAISS, Gemini, PyPDF) for simplicity, speed, and maintainability.

 ## Prerequisites

@@ -28,7 +28,7 @@ This project implements a Retrieval-Augmented Generation (RAG) chatbot focused o
 ```
 pip install -r requirements.txt
 ```
-*(Note: A `requirements.txt` file might need to be created if it doesn't exist. Based on the project files, likely dependencies include: `flask`, `google-generativeai`, `faiss-cpu` or `faiss-gpu`, `langchain` (potentially), `pypdf`, `python-dotenv`, `numpy`, `spacy`)*
+*(Note: The `requirements.txt` file lists the necessary dependencies: `flask`, `google-generativeai`, `faiss-cpu`, `pypdf`, `python-dotenv`, `numpy`)*

 3. **Configure API Key:**
     * Create a file named `.env` in the root project directory.
@@ -76,3 +76,6 @@ There seem to be multiple ways to interact with the chatbot:
 * `chats/`: Stores conversation history (JSON files).
 * `grade-11-history-text-book.pdf`: The source document.
 * `faiss_index.index`, `faiss_metadata.pkl`: The generated vector store files.
+* `requirements.txt`: Lists the project dependencies.
+* `README.md`: This file.
+* `PROJECT_EXPLANATION.md`: Detailed explanation of the project architecture.
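
The "Configure API Key" step above pairs naturally with `python-dotenv`. Below is a minimal sketch of how the key could be loaded from `.env` and handed to `google-generativeai`; the variable name `GOOGLE_API_KEY` is an assumption, since the README does not state which name the code expects.

```python
# Minimal sketch of loading the Gemini API key from .env (variable name assumed).
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads the .env file in the project root
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("Add your Gemini API key to the .env file before starting the app.")
genai.configure(api_key=api_key)
```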

agents/query_analyzer.py

Lines changed: 19 additions & 38 deletions

@@ -1,49 +1,30 @@
 # agents/query_analyzer.py
-import spacy
-import logging # Added import
+import logging
 from .base import BaseAgent
 import re
 import time

-# Load the spaCy model once when the class is instantiated
-try:
-    nlp = spacy.load("en_core_web_sm")
-    print("✅ spaCy model 'en_core_web_sm' loaded successfully.")
-except OSError:
-    print("❌ Error loading spaCy model 'en_core_web_sm'.")
-    print(" Please run: python -m spacy download en_core_web_sm")
-    nlp = None # Set nlp to None if loading fails
-
 logger = logging.getLogger(__name__) # Get a logger for this module

 class QueryAnalyzerAgent(BaseAgent):
-    """Agent responsible for analyzing the user query."""
+    """Agent responsible for analyzing the user query using simple regex methods."""
     def run(self, query: str, chat_history: list = None) -> dict: # Add chat_history parameter
         start_time = time.time()
         logger.debug(f"Analyzing query: '{query}' with history: {chat_history is not None}") # Log if history is present
-        if not nlp:
-            logger.warning("spaCy model not loaded, falling back to basic analysis.")
-            # Fallback basic extraction (similar to previous web.py logic)
-            keywords = re.findall(r'"(.*?)"|\b[A-Z][a-zA-Z]+\b', query)
-            entities = re.findall(r'\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b', query)
-            keywords = list(set([k.strip().lower() for k in keywords if k]))
-            entities = list(set([e.strip() for e in entities if len(e.split()) > 1 or e in keywords]))
-        else:
-            # TODO: Incorporate chat_history into spaCy analysis if needed
-            # For now, just process the current query
-            doc = nlp(query)
-
-            # Extract Named Entities (GPE, PERSON, ORG, LOC, EVENT, DATE etc.)
-            entities = list(set([ent.text.strip() for ent in doc.ents if ent.label_ in ["GPE", "PERSON", "ORG", "LOC", "EVENT", "DATE", "FAC", "PRODUCT", "WORK_OF_ART"]]))
-
-            # Extract Keywords (Noun chunks and Proper Nouns)
-            keywords = list(set([chunk.text.lower().strip() for chunk in doc.noun_chunks]))
-            # Add proper nouns that might not be part of chunks or recognized entities
-            keywords.extend([token.text.lower().strip() for token in doc if token.pos_ == "PROPN" and token.text not in entities])
-            # Remove duplicates that might exist between entities and keywords after lowercasing
-            keywords = list(set(keywords))
-            # Optional: Remove very short keywords if needed
-            # keywords = [kw for kw in keywords if len(kw) > 2]
+
+        # Use basic regex extraction (similar to previous fallback logic)
+        # Extract potential keywords (quoted phrases or capitalized words)
+        keywords = re.findall(r'"(.*?)"|\b[A-Z][a-zA-Z]+\b', query)
+        # Extract potential entities (multi-word capitalized phrases)
+        entities = re.findall(r'\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b', query)
+
+        # Clean up keywords: lowercase and remove duplicates/empty strings
+        keywords = list(set([k.strip().lower() for k in keywords if k]))
+        # Clean up entities: remove duplicates and single words already in keywords
+        entities = list(set([e.strip() for e in entities if len(e.split()) > 1 or e in keywords]))
+        # Further refine keywords: remove any that are now part of multi-word entities
+        entity_words = set(word.lower() for entity in entities for word in entity.split())
+        keywords = [kw for kw in keywords if kw not in entity_words and kw not in [e.lower() for e in entities]]

         # Determine Query Type (Keep existing logic)
         query_lower = query.lower()
@@ -63,12 +44,12 @@ def run(self, query: str, chat_history: list = None) -> dict: # Add chat_history
             "entities": entities,
             "query_type": query_type,
             # Optionally include history info if used
-            # "history_considered": chat_history is not None
+            # "history_considered": chat_history is not None
         }
-
+
         end_time = time.time()
         # Log the extracted information
         logger.debug(f"Analysis Results: Keywords: {analysis['keywords']}, Entities: {analysis['entities']}, Query Type: {analysis['query_type']}")
         logger.debug(f"Analysis Time: {end_time - start_time:.4f}s")
-
+
         return analysis
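
To see what the simplified analyzer now produces, the regex extraction from the diff above can be run standalone, outside `BaseAgent` and the orchestrator. The sample query is illustrative only.

```python
# Standalone reproduction of the regex extraction introduced in this commit.
import re

query = "What were the main causes of World War 1 according to the textbook?"

keywords = re.findall(r'"(.*?)"|\b[A-Z][a-zA-Z]+\b', query)
entities = re.findall(r'\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b', query)

keywords = list(set([k.strip().lower() for k in keywords if k]))
entities = list(set([e.strip() for e in entities if len(e.split()) > 1 or e in keywords]))

entity_words = set(word.lower() for entity in entities for word in entity.split())
keywords = [kw for kw in keywords if kw not in entity_words
            and kw not in [e.lower() for e in entities]]

# Because the keyword pattern's only capture group is the quoted-phrase part,
# unquoted capitalised words come back from findall as empty strings and are
# filtered out, so only quoted phrases survive as keywords.
print("keywords:", keywords)   # -> []
print("entities:", entities)   # -> ['World War']
```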

requirements.txt

Lines changed: 0 additions & 4 deletions

@@ -1,10 +1,6 @@
 flask
 google-generativeai
 faiss-cpu
-langchain
 pypdf
 python-dotenv
 numpy
-spacy
-nltk
-gtts

web.py

Lines changed: 0 additions & 19 deletions

@@ -6,7 +6,6 @@
 import time
 from agents.orchestrator import OrchestratorAgent
 import traceback
-from gtts import gTTS
 from flask import send_file
 import io

@@ -177,24 +176,6 @@ def api_delete_chat(chat_id):
     os.remove(path)
     return jsonify({'success': True})

-@app.route('/tts', methods=['POST'])
-def text_to_speech():
-    data = request.get_json()
-    text = data.get('text', '')
-    if not text:
-        return jsonify({'error': 'No text provided'}), 400
-    try:
-        # Generate speech using gTTS
-        tts = gTTS(text=text, lang='en')
-        mp3_fp = io.BytesIO()
-        tts.write_to_fp(mp3_fp)
-        mp3_fp.seek(0)
-        # Return the audio file as response
-        return send_file(mp3_fp, mimetype='audio/mpeg', as_attachment=False, download_name='speech.mp3')
-    except Exception as e:
-        logger.error(f"Error generating speech: {e}", exc_info=True)
-        return jsonify({'error': 'Failed to generate speech'}), 500
-
 if __name__ == '__main__':
     if orchestrator:
         logger.info("Starting Flask development server...")
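
With the gTTS route removed, `web.py` serves the chat UI and hands questions to the orchestrator. For readers who want the end-to-end shape of that pipeline, here is a minimal retrieve-and-generate sketch. The helper names, the `models/embedding-001` embedding model, and the `gemini-1.5-flash` model name are assumptions; the real logic lives in `agents/` and `faiss_store.py`.

```python
# Minimal sketch of the retrieve-and-generate path the orchestrator drives
# (assumed helper and model names; genai.configure() must be called first).
import pickle

import faiss
import numpy as np
import google.generativeai as genai


def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Embed the query and return the top_k most similar textbook chunks."""
    index = faiss.read_index("faiss_index.index")
    with open("faiss_metadata.pkl", "rb") as f:
        metadata = pickle.load(f)  # assumed: maps vector position -> chunk text
    emb = genai.embed_content(model="models/embedding-001", content=query)
    query_vec = np.array([emb["embedding"]], dtype="float32")
    _, indices = index.search(query_vec, top_k)
    return [metadata[i] for i in indices[0] if i != -1]  # -1 marks missing neighbours


def answer(query: str) -> str:
    """Build a context-grounded prompt and ask Gemini to answer from it."""
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the user's question using only the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    return model.generate_content(prompt).text
```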
