Jira Knowledge Base Chatbot (EN/AR, RAG, Citations, Permissions)
A private, bilingual (English + Arabic) knowledge base chatbot that answers employee questions using your Jira tickets as the single source of truth. It pulls and cleans issues from Jira (or a CSV/XLSX export), indexes them for hybrid retrieval (keyword + semantic embeddings), and returns grounded answers with ticket citations while respecting user permissions. Built to be simple to deploy and easy to trust.
Highlights
Ask in natural language (EN/AR): “Resume CR”, “What are the fees?”, “ما هي الرسوم؟”
Grounded answers with sources: Each reply cites the exact Jira keys (e.g., SW-9491) and links to Jira.
Hybrid search: BM25 keyword + semantic embeddings + (optional) re-ranker.
Acronyms & typos: Handles CR → change request, pluralization, minor misspellings.
Attachments (OCR): Optional OCR of PDFs/images so content is discoverable.
Permissions-aware: Enforces Jira-based visibility (no leaks).
Bilingual UI + RTL: Arabic interface with right-side menu; English with left-side menu.
Two ingestion modes:
CSV/XLSX import (fastest to start)
Direct Jira Sync backend (secure token; webhooks for freshness)
Why not just use Jira search?
Chat answers vs. issue lists: The bot summarizes and cites; Jira lists require manual reading.
Smarter matching: Semantic understanding, synonyms, acronyms, typos, EN/AR.
Conversational refinement: “only resolved”, “filter to Vijandra”, “last 30 days”.
Broader coverage: Titles, descriptions, comments, and (optionally) attachment text.
Architecture
Ingestion: Pull issues, descriptions, comments, labels, custom fields, (optional) attachments→OCR.
Indexing: Normalize Jira wiki markup, chunk text, create embeddings, and build a hybrid-searchable corpus.
Answering: Retrieve top-k, fuse scores, enforce grounding & citations, apply permissions.
UI: Next.js/React (or Base44), Arabic RTL with right-anchored menu; English LTR with left menu.
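The "chunk text" step in Indexing can be sketched as a simple overlapping word window. This is a minimal illustration, not the project's actual chunker: the function name `chunk_text` and the `max_words` and `overlap` defaults are assumptions chosen for the example.

```python
def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into word windows that overlap by `overlap` words.

    Hypothetical helper; sizes are illustrative and should be tuned
    against your own tickets and embedding model.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of a slightly larger index.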
Data schema (CSV/XLSX import)
Required columns
jira_key (e.g., SW-9491)
project
summary
Optional
description, status, priority, assignee, labels
Cleaning rules
Remove Jira wiki image tags: ![...]!
Convert [text|url] → text (url)
Flatten newlines; trim whitespace.
Tip: Invalid keys (not matching ^[A-Z][A-Z0-9]+-\d+$) are skipped and logged.
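The cleaning rules and the key check above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `clean_text` and `is_valid_key` are hypothetical, and the image-tag pattern is the simple one quoted in Troubleshooting, which can over-match prose that contains two exclamation marks.

```python
import re

# Key pattern from the tip above; re.ASCII keeps \d to 0-9 so that
# Arabic-Indic digits are not accepted as part of a key.
KEY_RE = re.compile(r"^[A-Z][A-Z0-9]+-\d+$", re.ASCII)
# Simple image-tag pattern: anything between two exclamation marks.
IMAGE_RE = re.compile(r"![^!]*!")
# Jira wiki link: [text|url]
LINK_RE = re.compile(r"\[([^\]|]+)\|([^\]]+)\]")


def clean_text(raw: str) -> str:
    """Apply the three cleaning rules in order."""
    text = IMAGE_RE.sub("", raw)            # remove image tags
    text = LINK_RE.sub(r"\1 (\2)", text)    # [text|url] -> text (url)
    return " ".join(text.split())           # flatten newlines, trim whitespace


def is_valid_key(key: str) -> bool:
    """Return True if the key looks like SW-9491."""
    return bool(KEY_RE.match(key.strip()))
```

For example, `clean_text("See !shot.png|thumbnail! and [docs|https://example.com]")` yields `See and docs (https://example.com)`, and rows whose key fails `is_valid_key` would be skipped and logged.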
🔐 Direct Jira Sync backend (optional)
For secure, automated syncing from Jira (pagination, retries, cleaning), use the included FastAPI backend:
Endpoints
POST /sync/start { jql?: string } → starts sync job
GET /sync/status?job_id=... → job state, rows, csv_url
GET /export/csv → latest export
POST /webhook/jira → accept Jira webhooks (for deltas)
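A client typically starts a job with POST /sync/start and then polls GET /sync/status until the job finishes. The sketch below shows only the polling loop and takes the HTTP call as an injected function so it stays independent of any HTTP library. The state values `"done"` and `"failed"` and the helper name `wait_for_sync` are assumptions; check the backend for the real job states.

```python
import time
from typing import Callable


def wait_for_sync(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a sync job until it finishes, fails, or times out.

    `get_status` is expected to wrap GET /sync/status?job_id=... and
    return the decoded JSON body. State names here are hypothetical.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_id)
        state = status.get("state")
        if state == "done":
            return status  # expected to carry rows and csv_url
        if state == "failed":
            raise RuntimeError(f"sync job {job_id} failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"sync job {job_id} did not finish in {timeout_s}s")
        sleep(interval_s)
```

Injecting `get_status` and `sleep` also makes the loop easy to unit test with canned responses.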
Env
JIRA_BASE = https://your-domain.atlassian.net
JIRA_EMAIL = <your_email>
JIRA_TOKEN = <api_token> # keep secret; never expose to frontend
EXPORT_CSV_PATH = ./data/export.csv
EXPORT_CSV_FIELDS = jira_key,project,summary,description,status,priority,assignee,labels
Security: protect the backend (VPN, IP allowlist, or API key), store tokens in a secret manager, and consider OAuth 3LO.
🌍 Internationalization & RTL
Language switch (EN/AR) persists (cookie/localStorage).
Sets dir="rtl" for Arabic: side menu on the right; flips directional icons.
Logical CSS utilities (text-start/end, ps/pe, ltr:/rtl:) to avoid left/right bugs.
Numbers/dates localized via Intl (ar-EG/en-US).
Bot replies in the current UI language; citations (ticket keys) remain unchanged.
🧠 Retrieval & Answering
Hybrid search:
BM25 over summary, labels, description (field boosts, phrase queries).
Embeddings (OpenAI text-embedding-3-large or your choice).
Fused score: e.g., 0.6 * bm25 + 0.4 * cosine.
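Raw BM25 scores are unbounded while cosine similarity is not, so the two usually need to be brought onto a common scale before the weighted sum. The sketch below uses min-max normalization over the candidate set, which is an assumption for illustration rather than the project's confirmed method; the names `min_max` and `fuse` are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1] across the current candidate set."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie; treat them as equally strong.
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse(
    bm25: dict[str, float],
    cosine: dict[str, float],
    w_bm25: float = 0.6,
    w_cos: float = 0.4,
) -> list[tuple[str, float]]:
    """Return (jira_key, fused_score) pairs, best first.

    A key missing from one retriever contributes 0 for that retriever.
    """
    nb, nc = min_max(bm25), min_max(cosine)
    fused = {
        key: w_bm25 * nb.get(key, 0.0) + w_cos * nc.get(key, 0.0)
        for key in set(nb) | set(nc)
    }
    # Sort by score descending, then by key for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Reciprocal rank fusion is a common alternative when score scales are hard to compare; it uses ranks instead of raw scores.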
Query rewrite:
Acronyms & synonyms (e.g., “resume” → resume|reactivate|restore|reopen).
Stemming/pluralization (fee↔fees) and Arabic equivalents (e.g., رسوم ↔ fees).
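Query rewrite can start as a plain lookup table applied per token before the query reaches BM25. The table below only contains the examples mentioned in this README; the structure and the name `expand_query` are assumptions for illustration.

```python
# Hypothetical synonym table seeded with the README's own examples.
SYNONYMS: dict[str, list[str]] = {
    "resume": ["reactivate", "restore", "reopen"],
    "cr": ["change request"],
    "fee": ["fees", "رسوم"],
    "fees": ["fee", "رسوم"],
    "رسوم": ["fees", "fee"],
}


def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the query tokens plus their synonyms, in order, without duplicates."""
    terms: list[str] = []
    for token in query.lower().split():
        for term in [token, *synonyms.get(token, [])]:
            if term not in terms:
                terms.append(term)
    return terms
```

With this table, `expand_query("Resume CR")` returns `["resume", "reactivate", "restore", "reopen", "cr", "change request"]`, which can then be OR-ed together in the keyword query.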
Guardrails:
“Answer from snippets only; always cite keys”.
Confidence gating → if low, show “Closest matches” list (not a fake answer).
Per-user filtering via Jira permissions mapping / RLS.
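Confidence gating can be expressed as a small decision on the fused ranking: answer only from candidates that clear a threshold, otherwise fall back to a "Closest matches" list. The threshold value, the return shape, and the name `gate` are assumptions for this sketch.

```python
def gate(
    ranked: list[tuple[str, float]],
    threshold: float = 0.5,
    top_k: int = 3,
) -> dict:
    """Decide between a grounded answer and a closest-matches list.

    `ranked` holds (jira_key, score) pairs sorted best first.
    """
    top = ranked[:top_k]
    confident = [key for key, score in top if score >= threshold]
    if confident:
        # Only confident tickets are passed on as citable snippets.
        return {"mode": "answer", "keys": confident}
    # Nothing is strong enough: show candidates instead of inventing an answer.
    return {"mode": "closest_matches", "keys": [key for key, _ in top]}
```

The answer prompt would then receive only the snippets for the returned keys, together with the instruction to cite them.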
🚀 Quick Start
A) CSV/XLSX mode (fastest)
Prepare an export with the schema above.
Start the app (Base44/Next.js) and upload the file.
Ask: “Resume CR” → you should see top tickets with citations.
B) Direct Jira mode (secure & fresh)
Run the FastAPI backend (see /backend or provided zip).
From the UI, open Sync from Jira → enter JQL (e.g., ORDER BY updated DESC).
Poll progress; when ready, the UI imports the generated CSV automatically.
(Optional) Configure Jira webhook to /webhook/jira for near-real-time updates.
🧪 Pilot & Quality
Suggested pilot (2 weeks)
Users: Testing team.
Must-pass queries: “Resume CR”, “fees”, “UBO”, Arabic variants, typos.
Metrics (targets):
Top-3 hit rate ≥ 85%
“No info” rate ≤ 10%
P95 latency ≤ 2.0s
100% of answers include ≥ 1 citation
0 permission leaks
RAG eval set
Add verify/eval/qrels.jsonl with ~40 Q→expected keys and a script to measure Top-k accuracy.
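A Top-k script for the eval set could look like the sketch below. The JSONL shape (one object per line with `query` and `expected` fields) is an assumption, since the file format is not specified here, and `search` stands in for whatever function returns ranked Jira keys in your setup.

```python
import json
from typing import Callable, Iterable


def load_qrels(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines such as {"query": "Resume CR", "expected": ["SW-9491"]}."""
    return [json.loads(line) for line in lines if line.strip()]


def top_k_hit_rate(
    qrels: list[dict],
    search: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of queries with at least one expected key in the top k results."""
    if not qrels:
        return 0.0
    hits = 0
    for item in qrels:
        retrieved = search(item["query"])[:k]
        if set(retrieved) & set(item["expected"]):
            hits += 1
    return hits / len(qrels)
```

In practice you would open verify/eval/qrels.jsonl, pass the file object to `load_qrels`, and compare the result with the Top-3 target of 85%.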
🛠️ Local Development
pnpm install
pnpm dev
pip install -r backend/requirements.txt
export JIRA_BASE=...
export JIRA_EMAIL=...
export JIRA_TOKEN=...
uvicorn app.main:app --host 0.0.0.0 --port 8080
Config
BACKEND_URL = http://localhost:8080
EMBEDDING_MODEL = text-embedding-3-large
TOP_K = 8
🧰 Troubleshooting
“Invalid Jira key format in row X”: The row has a missing or invalid jira_key, or a multiline description broke CSV parsing. Fix: clean with regex ![^!]*! (remove image tags), flatten newlines, and ensure fields are properly quoted.
“No information found” but the ticket exists: Add synonyms/acronyms, enable phrase queries, increase the field boost for summary, and check language detection.
Arabic layout wrong: Ensure dir="rtl" is set and use logical CSS (ltr:/rtl: variants, text-start/end).
Permission leaks: Verify that per-user/project filters or RLS policies are applied before retrieval.
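Applying the filter before retrieval means restricted tickets never enter the candidate set, so they cannot surface in scores, snippets, or citations. The sketch below assumes project-level visibility only; real Jira permissions can be finer grained (for example issue security levels), and the `Chunk` type and `visible_chunks` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    jira_key: str
    project: str
    text: str


def visible_chunks(chunks: list[Chunk], allowed_projects: set[str]) -> list[Chunk]:
    """Keep only chunks from projects the current user may browse.

    Run this (or the equivalent database filter / RLS policy) before
    BM25 and embedding search, never as a post-filter on the answer.
    """
    return [chunk for chunk in chunks if chunk.project in allowed_projects]
```

A useful regression test is to query as a user with no project access and assert that the candidate list is empty.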
🗺️ Roadmap
Optional direct-to-vector mode (skip CSV; write to pgvector).
Re-ranker (cross-encoder) for tighter Top-1.
Admin synonym editor (EN/AR).
Per-project dashboards (usage, accuracy, gaps).
OAuth 3LO for delegated Jira access.
🤝 Contributing
Fork & branch from main.
Use clear commit messages.
Add tests for retrieval and Arabic/RTL.
Open a PR with a brief description and screenshots (if UI).
📄 License
MIT (or your preferred license)
📷 Screenshots (placeholders)
Chat (EN) – LTR with left menu
Chat (AR) – RTL with right menu
Sync from Jira modal
“Why shown” panel with field matches & scores