DAVE is an AI-powered framework designed for assisted analysis of document collections in knowledge-intensive domains. It enables domain experts to efficiently explore and analyze large text corpora through a combination of:
- Entity-driven faceted search for structured information retrieval
- Conversational AI interface for natural language queries
- Interactive entity annotation and correction for improved knowledge management
DAVE is particularly useful in domains such as law, healthcare, finance, and real estate, where factual data is closely tied to entities and their relationships.
- Search & Filter: Retrieve documents using keyword-based and entity-driven faceted search.
- Explore: Navigate documents based on extracted entities and metadata.
- Conversational AI: Ask natural language questions and receive relevant document-based answers.
- Knowledge Consolidation: Review and refine extracted annotations with user corrections.
- Human-in-the-loop (HITL) Approach: Users can continuously refine system-generated annotations.
Before you begin, ensure you have the following installed on your system:
- Docker (version 20.10 or higher)
- Docker Compose (version 1.29 or higher)
- Python (version 3.8 or higher)
- NVIDIA GPU (optional, but recommended for text generation and vectorization services)
On Linux:

```
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
```

On macOS: download and install Docker Desktop for Mac.
On Windows: download and install Docker Desktop for Windows.
Verify installation:

```
docker --version
docker compose version
```

The project is organized as follows:

```
DAVE/
├── backend/
│   ├── documents/          # Document service
│   ├── qavectorizer/       # Question-answering vectorizer service
│   └── text-generation/    # Text generation service
├── frontend/               # Next.js UI application
├── scripts/
│   ├── input_data/         # Place your GateNLP format documents here
│   ├── upload_mongo.py     # Script to upload documents to MongoDB
│   └── insert_elastic.py   # Script to index documents in Elasticsearch
├── mongo/                  # MongoDB data and initialization scripts
├── elasticsearch/          # Elasticsearch data
├── models/                 # ML models directory
├── docker-compose.yml      # Docker Compose configuration
├── requirements.txt        # Python dependencies
└── .env                    # Environment variables (create from .env.sample)
```
Copy the sample environment file and configure it:

```
cp .env.sample .env
```

Edit the `.env` file and populate the following variables:

**UI service:**
- `UI_ACCESS_USERNAME` - Username for UI authentication (default: `admin`)
- `UI_ACCESS_PASSWORD` - Password for UI authentication (default: `password`)
- `UI_NEXTAUTH_SECRET` - Secret key for NextAuth.js session encryption (generate a random string)
- `UI_NEXTAUTH_URL` - NextAuth callback URL (default: `http://127.0.0.1:3000/dave/api/auth`)
- `UI_NEXT_PUBLIC_BASE_PATH` - Base path for the UI (default: `/dave`)
- `UI_NEXT_PUBLIC_FULL_PATH` - Full path URL for the UI (default: `http://127.0.0.1:3000/dave`)
- `UI_API_LLM` - Internal URL for the text generation service
- `UI_API_INDEXER` - Internal URL for the indexer service
- `UI_VARIANT` - UI variant configuration (default: `default`)
- `LISTEN_UI` - Port for the UI service (default: `3000`)

**MongoDB:**
- `MONGO_ROOT_PASSWORD` - Root password for MongoDB (change this!)
- `MONGO_PASSWORD` - Application user password for MongoDB (change this!)
- `MONGO` - MongoDB connection string (uses the variables above)

**Elasticsearch:**
- `ELASTIC_INDEX` - Name of the Elasticsearch index (default: `dave`)

**Text generation:**
- `TEXT_GENERATION_ADDR` - Internal URL for the text generation service
- `TEXT_GENERATION_GPU_LAYERS` - Number of GPU layers to use (default: `35`)

**QA vectorizer:**
- `HOST_BASE_URL` - Base URL for the host (default: `http://0.0.0.0`)
- `QAVECTORIZER_ADDR` - Port for the QA vectorizer service (default: `7863`)
- `SENTENCE_TRANSFORMER_EMBEDDING_MODEL` - Hugging Face model for embeddings
- `SENTENCE_TRANSFORMER_DEVICE` - Device for inference (`cuda` for GPU, `cpu` for CPU)
- `OGG2NAME_INDEX` - Index name for object-to-name mapping

**Docker:**
- `RESTART_POLICY` - Docker restart policy (default: `unless-stopped`)
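For local testing, a minimal `.env` might look like the sketch below. Values shown are the documented defaults plus `change-me` placeholders; replace the placeholders with real secrets and fill in the remaining variables from `.env.sample`:

```shell
# Illustrative local-testing values only -- do not use in production
UI_ACCESS_USERNAME=admin
UI_ACCESS_PASSWORD=password
UI_NEXTAUTH_SECRET=change-me-to-a-random-string
UI_NEXTAUTH_URL=http://127.0.0.1:3000/dave/api/auth
UI_NEXT_PUBLIC_BASE_PATH=/dave
UI_NEXT_PUBLIC_FULL_PATH=http://127.0.0.1:3000/dave
UI_VARIANT=default
LISTEN_UI=3000
MONGO_ROOT_PASSWORD=change-me
MONGO_PASSWORD=change-me
ELASTIC_INDEX=dave
TEXT_GENERATION_GPU_LAYERS=35
QAVECTORIZER_ADDR=7863
SENTENCE_TRANSFORMER_DEVICE=cpu
RESTART_POLICY=unless-stopped
```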
From the root folder of the project, install the required Python packages:

```
pip install -r requirements.txt
```

Or, using a virtual environment (recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Launch all services using Docker Compose:

```
docker compose up -d
```

This command will:
- Build and start MongoDB
- Build and start Elasticsearch
- Build and start the document service
- Build and start the QA vectorizer service
- Build and start the text generation service
- Build and start the UI service
Check that all services are running:

```
docker compose ps
```

Wait for all services to be healthy (especially Elasticsearch, which may take a minute to initialize).
Place your GateNLP format documents in the following directory:

```
scripts/input_data/
```

From the root folder, run the MongoDB upload script:

```
python scripts/upload_mongo.py
```

This script will:
- Read GateNLP documents from `scripts/input_data/`
- Parse and validate the document format
- Upload documents to MongoDB
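The exact validation performed by `upload_mongo.py` is not shown here, but the kind of check involved can be sketched as follows. The document shape (`text` plus `annotation_sets` with `start`/`end` offsets) follows the GateNLP bdoc JSON serialization; your export may differ:

```python
import json

def validate_gatenlp_doc(raw: str) -> dict:
    """Parse a GateNLP-style (bdoc JSON) document and run basic sanity checks.
    Assumed shape: {"text": ..., "annotation_sets": {name: {"annotations": [...]}}}."""
    doc = json.loads(raw)
    if "text" not in doc:
        raise ValueError("missing 'text' field")
    for set_name, ann_set in doc.get("annotation_sets", {}).items():
        for ann in ann_set.get("annotations", []):
            # Annotation offsets must fall inside the document text
            if not (0 <= ann["start"] <= ann["end"] <= len(doc["text"])):
                raise ValueError(f"annotation offsets out of range in set {set_name!r}")
    return doc

# A minimal example document in the assumed shape:
sample = json.dumps({
    "text": "Alice rents a flat in Milan.",
    "annotation_sets": {
        "": {"annotations": [
            {"type": "Person", "start": 0, "end": 5, "id": 0, "features": {}},
            {"type": "Location", "start": 22, "end": 27, "id": 1, "features": {}},
        ]}
    },
})
doc = validate_gatenlp_doc(sample)
print(len(doc["annotation_sets"][""]["annotations"]))  # 2
```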
From the root folder, run the Elasticsearch indexing script:

```
python scripts/insert_elastic.py
```

This script will:
- Retrieve documents from MongoDB
- Process and vectorize the documents
- Index documents in Elasticsearch for fast searching
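For intuition, the indexing step amounts to translating each stored document into Elasticsearch bulk-API actions. This is only a sketch: the field names (`text`, `annotations`, `entities`) and the mapping used by `insert_elastic.py` are assumptions:

```python
def build_bulk_actions(docs, index_name="dave"):
    """Turn MongoDB-style documents into Elasticsearch bulk-API action pairs:
    one action line ({"index": ...}) followed by one source line per document.
    Field names here are illustrative, not the project's actual mapping."""
    actions = []
    for doc in docs:
        actions.append({"index": {"_index": index_name, "_id": str(doc["_id"])}})
        actions.append({
            "text": doc["text"],
            "entities": [a["type"] for a in doc.get("annotations", [])],
        })
    return actions

docs = [
    {"_id": 1, "text": "Lease agreement for office space",
     "annotations": [{"type": "Contract"}]},
]
actions = build_bulk_actions(docs)
print(actions[0])  # {'index': {'_index': 'dave', '_id': '1'}}
```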
Once all services are running and data is ingested, you can access the UI at:
http://127.0.0.1:3000/dave
Note: If you experience connection issues, use 127.0.0.1 instead of localhost. Some systems have DNS resolution issues with localhost that can prevent proper connectivity.
The following services are exposed on these ports:
- UI: `3000` (http://127.0.0.1:3000/dave)
- MongoDB: `27018` (internal: `27017`)
- Document Service: `3001`
- Elasticsearch: `9200`
- QA Vectorizer: `7863`
- Text Generation: `7862`, `8000`
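As a quick connectivity check, the listed host ports can be probed with a short script. This is a sketch using only the standard library; the service names and ports mirror the table above:

```python
import socket

SERVICES = {  # host ports from the table above
    "UI": 3000,
    "MongoDB": 27018,
    "Document Service": 3001,
    "Elasticsearch": 9200,
    "QA Vectorizer": 7863,
    "Text Generation": 7862,
}

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in SERVICES.items():
    status = "up" if port_open("127.0.0.1", port) else "down"
    print(f"{name:20s} {port:5d} {status}")
```

Note that an open port only means the container is listening, not that the service is healthy; use `docker compose ps` for health status.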
Check logs for a specific service:

```
docker compose logs -f <service_name>
```

Example:

```
docker compose logs -f ui
docker compose logs -f es
docker compose logs -f mongo
```

If Elasticsearch fails to start, you may need to increase Docker's memory allocation:
- Docker Desktop: Go to Settings → Resources → Memory (set to at least 8GB)
Verify MongoDB is running and accepting connections:

```
docker compose exec mongo mongosh -u root -p <MONGO_ROOT_PASSWORD>
```

Ensure you're running scripts from the root folder:

```
# ✓ Correct
python scripts/upload_mongo.py

# ✗ Incorrect
cd scripts && python upload_mongo.py
```

If you have an NVIDIA GPU but it's not being used:
- Install the NVIDIA Docker runtime
- Verify with:

```
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```
To stop all services:

```
docker compose down
```

To stop and remove all data (volumes):

```
docker compose down -v
```

To update the services after code changes:

```
docker compose down
docker compose build --no-cache
docker compose up -d
```

For issues or questions, please refer to the project documentation or contact the development team.
DAVE is open-source and released under the Apache-2.0 License.
Check out our demonstration video:
Agazzi, R., Alva Principe, R., Pozzi, R., Ripamonti, M., & Palmonari, M. (2025). DAVE: A Framework for Assisted Analysis of Document Collections in Knowledge-Intensive Domains. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25), Demo Track. https://doi.org/10.24963/ijcai.2025/1246
