Features • Architecture • Installation • Usage
A specialized cybersecurity chatbot that combines Google's Gemini 1.5 Flash model with a Retrieval-Augmented Generation (RAG) engine. Unlike standard AI models, this tool can ingest your specific security documents (PDFs, text files, OWASP guides) into a Pinecone vector database to provide accurate, source-cited answers about vulnerabilities, remediation, and company-specific policies.
The system moves away from purely local models to a hybrid cloud approach for performance and scalability:
- Ingestion: Python script (ingest.py) reads text files -> converts them to vectors (SentenceTransformers) -> uploads to Pinecone.
- Retrieval: User asks a question -> system finds the top 5 relevant chunks from Pinecone.
- Generation: Relevant chunks + user question are sent to Gemini 1.5 Flash to generate a unified, accurate answer.
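The retrieval and generation steps above can be sketched without any cloud services. This is a minimal illustration, not the project's actual code: the in-memory `index`, the three-dimensional vectors, the sample chunk texts, and the helper names `top_k` and `build_prompt` are all assumptions made for the example. In the real pipeline the vectors would come from SentenceTransformers, the search would be a Pinecone query, and the prompt would be sent to Gemini.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=5):
    """Return the k stored chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Combine the retrieved chunks and the user question into one prompt."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below and cite the sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy in-memory index standing in for Pinecone (texts and vectors are invented).
index = [
    {"source": "OWASP_Top_10.txt", "vector": [0.9, 0.1, 0.0],
     "text": "Injection flaws occur when untrusted data reaches an interpreter."},
    {"source": "Company_Policy.txt", "vector": [0.0, 0.2, 0.9],
     "text": "Passwords must be rotated on a fixed schedule."},
    {"source": "OWASP_Top_10.txt", "vector": [0.8, 0.3, 0.1],
     "text": "Use parameterized queries to prevent SQL injection."},
]

question = "How do I prevent SQL injection?"
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question
hits = top_k(query_vec, index, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

Tagging each chunk with its source file in the prompt is one simple way to let the model cite where an answer came from.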
OWASP_BERT/
├── app.py                 # Main Streamlit Web Application
├── ingest.py              # Script to upload your documents to Pinecone
├── chatbot_modules/       # Core Logic
│   ├── chatbot.py         # Orchestrator
│   ├── components.py      # Initializes Pinecone/Models
│   ├── retrieval.py       # RAG Logic (Search & Rerank)
│   ├── llm_service.py     # Google Gemini API Handler
│   ├── config.py          # Settings (API Keys, Paths)
│   └── ...
├── data/                  # Folder where you put your .txt files to be ingested
├── Prompt_Templates/      # System prompts for the LLM
└── legacy_code/           # (Optional) Old notebooks/BERT experiments
- Python 3.9+
- A Google Gemini API Key (Free tier available at aistudio.google.com)
- A Pinecone API Key (Free tier available at pinecone.io)
Clone the repository and install dependencies:
# Install required packages
pip install -r requirements.txt
# Ensure Pinecone and Streamlit are installed
pip install pinecone streamlit google-generativeai sentence-transformers python-dotenv
Create a .env file in the root directory:
GEMINI_API_KEY=your_gemini_key_here
PINECONE_API_KEY=your_pinecone_key_here
PINECONE_ENVIRONMENT=us-east-1
The bot starts with an empty brain. You must feed it data.
- Create a folder named data in the root directory.
- Add .txt files containing security info (e.g., OWASP_Top_10.txt, Company_Policy.txt).
- Run the ingestion script:
python ingest.py
Output: "✅ Ingestion Complete! Your bot now has knowledge."
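To give a feel for what an ingestion pass involves before anything is embedded or uploaded, here is a rough sketch of the file-reading and chunking stage. It is not the contents of ingest.py: the function names `chunk_text` and `build_records`, the 500-character chunk size, the 50-character overlap, and the record layout are all assumptions chosen for illustration. Embedding each chunk and uploading it to Pinecone would follow as separate steps.

```python
import tempfile
from pathlib import Path

def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of at most `size` characters that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

def build_records(data_dir):
    """Yield one record per chunk for every .txt file in data_dir."""
    for path in sorted(Path(data_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for n, chunk in enumerate(chunk_text(text)):
            # The id and metadata layout here is hypothetical.
            yield {"id": f"{path.stem}-{n}", "text": chunk, "source": path.name}

# Demo on a throwaway directory so the sketch runs without a real data/ folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "OWASP_Top_10.txt").write_text("A" * 1200, encoding="utf-8")
    records = list(build_records(tmp))

print(len(records), records[0]["id"])
```

Overlapping chunks are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk.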
Launch the web interface:
python -m streamlit run app.py
The app will open in your browser at http://localhost:8501.
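If the app does not start cleanly, a missing value in .env is one thing worth ruling out. The check below is a small sketch under stated assumptions: the helper name `missing_keys` is hypothetical, and it only inspects the three keys listed in the .env example. It reads from `os.environ`, so in practice the .env file would need to be loaded first (for instance with `load_dotenv()` from python-dotenv).

```python
import os

# The three settings shown in the .env example.
REQUIRED_KEYS = ("GEMINI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT")

def missing_keys(env=None):
    """Return the required keys that are unset or blank in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not str(env.get(key, "")).strip()]

# Assumes the .env values have already been loaded into the process environment.
missing = missing_keys()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```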
You can tweak the bot's behavior in chatbot_modules/config.py:
CONFIG = {
"USE_PINECONE": True, # Enable RAG
"USE_LOCAL_LLM": False, # Keep False (We are using Gemini)
"EMBEDDING_MODEL_PATH": "all-MiniLM-L6-v2",
"MAX_CONTEXT_TOKENS": 2500, # How much text to read from documents
"TEMPERATURE": 0.7, # Creativity (0.0 = Strict, 1.0 = Creative)
}
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request