DAVE: Document Assistant for Validation and Exploration

DAVE is an AI-powered framework designed for assisted analysis of document collections in knowledge-intensive domains. It enables domain experts to efficiently explore and analyze large text corpora through a combination of:

Entity-driven faceted search for structured information retrieval
Conversational AI interface for natural language queries
Interactive entity annotation and correction for improved knowledge management

DAVE is particularly useful in domains such as law, healthcare, finance, and real estate, where factual data is closely tied to entities and their relationships.

Top Features

Search & Filter: Retrieve documents using keyword-based and entity-driven faceted search.
Explore: Navigate documents based on extracted entities and metadata.
Conversational AI: Ask natural language questions and receive relevant document-based answers.
Knowledge Consolidation: Review and refine extracted annotations with user corrections.
Human-in-the-loop (HITL) Approach: Users can continuously refine system-generated annotations.

Architecture

Prerequisites

Before you begin, ensure you have the following installed on your system:

Docker (version 20.10 or higher)
Docker Compose (version 1.29 or higher)
Python (version 3.8 or higher)
NVIDIA GPU (optional, but recommended for text generation and vectorization services)

Installing Docker and Docker Compose

On Linux:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER

On macOS:

Download and install Docker Desktop for Mac

On Windows:

Download and install Docker Desktop for Windows

Verify installation:

docker --version
docker compose version

Project Structure

DAVE/
├── backend/
│   ├── documents/          # Document service
│   ├── qavectorizer/       # Question-answering vectorizer service
│   └── text-generation/    # Text generation service
├── frontend/               # Next.js UI application
├── scripts/
│   ├── input_data/         # Place your GateNLP format documents here
│   ├── upload_mongo.py     # Script to upload documents to MongoDB
│   └── insert_elastic.py   # Script to index documents in Elasticsearch
├── mongo/                  # MongoDB data and initialization scripts
├── elasticsearch/          # Elasticsearch data
├── models/                 # ML models directory
├── docker-compose.yml      # Docker Compose configuration
├── requirements.txt        # Python dependencies
└── .env                    # Environment variables (create from .env.sample)

Setup Instructions

1. Configure Environment Variables

Copy the sample environment file and configure it:

cp .env.sample .env

Edit the .env file and populate the following variables:

UI Configuration

UI_ACCESS_USERNAME - Username for UI authentication (default: admin)
UI_ACCESS_PASSWORD - Password for UI authentication (default: password)
UI_NEXTAUTH_SECRET - Secret key for NextAuth.js session encryption (generate a random string)
UI_NEXTAUTH_URL - NextAuth callback URL (default: http://127.0.0.1:3000/dave/api/auth)
UI_NEXT_PUBLIC_BASE_PATH - Base path for the UI (default: /dave)
UI_NEXT_PUBLIC_FULL_PATH - Full path URL for the UI (default: http://127.0.0.1:3000/dave)
UI_API_LLM - Internal URL for text generation service
UI_API_INDEXER - Internal URL for indexer service
UI_VARIANT - UI variant configuration (default: default)
LISTEN_UI - Port for UI service (default: 3000)

MongoDB Configuration

MONGO_ROOT_PASSWORD - Root password for MongoDB (change this!)
MONGO_PASSWORD - Application user password for MongoDB (change this!)
MONGO - MongoDB connection string (uses variables above)

Elasticsearch Configuration

ELASTIC_INDEX - Name of the Elasticsearch index (default: dave)

Text Generation Configuration

TEXT_GENERATION_ADDR - Internal URL for text generation service
TEXT_GENERATION_GPU_LAYERS - Number of GPU layers to use (default: 35)

QA Vectorizer Configuration

HOST_BASE_URL - Base URL for the host (default: http://0.0.0.0)
QAVECTORIZER_ADDR - Port for QA vectorizer service (default: 7863)
SENTENCE_TRANSFORMER_EMBEDDING_MODEL - Hugging Face model for embeddings
SENTENCE_TRANSFORMER_DEVICE - Device for inference (cuda for GPU, cpu for CPU)
OGG2NAME_INDEX - Index name for object-to-name mapping

General Configuration

RESTART_POLICY - Docker restart policy (default: unless-stopped)

2. Install Python Dependencies

From the root folder of the project, install the required Python packages:

pip install -r requirements.txt

Or using a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

3. Start Docker Services

Launch all services using Docker Compose:

docker compose up -d

This command will:

Build and start MongoDB
Build and start Elasticsearch
Build and start the document service
Build and start the QA vectorizer service
Build and start the text generation service
Build and start the UI service

Check that all services are running:

docker compose ps

Wait for all services to be healthy (especially Elasticsearch, which may take a minute to initialize).

Data Ingestion

4. Prepare Your Documents

Place your GateNLP format documents in the following directory:

scripts/input_data/

5. Upload Documents to MongoDB

From the root folder, run the MongoDB upload script:

python scripts/upload_mongo.py

This script will:

Read GateNLP documents from scripts/input_data/
Parse and validate the document format
Upload documents to MongoDB

6. Index Documents in Elasticsearch

From the root folder, run the Elasticsearch indexing script:

python scripts/insert_elastic.py

This script will:

Retrieve documents from MongoDB
Process and vectorize the documents
Index documents in Elasticsearch for fast searching

Accessing the Application

Once all services are running and data is ingested, you can access the UI at:

http://127.0.0.1:3000/dave

Important Note on Localhost

Note: If you experience connection issues, use 127.0.0.1 instead of localhost. Some systems have DNS resolution issues with localhost that can prevent proper connectivity.

Service Ports

The following services are exposed on these ports:

UI: 3000 (http://127.0.0.1:3000/dave)
MongoDB: 27018 (internal: 27017)
Document Service: 3001
Elasticsearch: 9200
QA Vectorizer: 7863
Text Generation: 7862, 8000

Troubleshooting

Services Not Starting

Check logs for a specific service:

docker compose logs -f <service_name>

Example:

docker compose logs -f ui
docker compose logs -f es
docker compose logs -f mongo

Elasticsearch Memory Issues

If Elasticsearch fails to start, you may need to increase Docker's memory allocation:

Docker Desktop: Go to Settings → Resources → Memory (set to at least 8GB)

MongoDB Connection Issues

Verify MongoDB is running and accepting connections:

docker compose exec mongo mongosh -u root -p <MONGO_ROOT_PASSWORD>

Python Script Errors

Ensure you're running scripts from the root folder:

# ✓ Correct
python scripts/upload_mongo.py

# ✗ Incorrect
cd scripts && python upload_mongo.py

GPU Not Detected

If you have an NVIDIA GPU but it's not being used:

Install NVIDIA Docker runtime
Verify with: docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Stopping Services

To stop all services:

docker compose down

To stop and remove all data (volumes):

docker compose down -v

Updating

To update the services after code changes:

docker compose down
docker compose build --no-cache
docker compose up -d

Support

For issues or questions, please refer to the project documentation or contact the development team.

License

DAVE is open-source and released under the Apache-2.0 License.

Additional Links

Check out our demonstration video:

Cite this Work

Agazzi, R., Alva Principe, R., Pozzi, R., Ripamonti, M., & Palmonari, M. (2025). DAVE: A Framework for Assisted Analysis of Document Collections in Knowledge-Intensive Domains. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25), Demo Track. https://doi.org/10.24963/ijcai.2025/1246

Name		Name	Last commit message	Last commit date
Latest commit History 377 Commits
backend		backend
docs		docs
frontend		frontend
scripts		scripts
.env.sample		.env.sample
.gitignore		.gitignore
README.md		README.md
docker-compose-dev.yml		docker-compose-dev.yml
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

unimib-datAI/DAVE

Folders and files

Latest commit

History

Repository files navigation

DAVE: Document Assistant for Validation and Exploration

Top Features

Architecture

Prerequisites

Installing Docker and Docker Compose

On Linux:

On macOS:

On Windows:

Project Structure

Setup Instructions

1. Configure Environment Variables

UI Configuration

MongoDB Configuration

Elasticsearch Configuration

Text Generation Configuration

QA Vectorizer Configuration

General Configuration

2. Install Python Dependencies

3. Start Docker Services

Data Ingestion

4. Prepare Your Documents

5. Upload Documents to MongoDB

6. Index Documents in Elasticsearch

Accessing the Application

Important Note on Localhost

Service Ports

Troubleshooting

Services Not Starting

Elasticsearch Memory Issues

MongoDB Connection Issues

Python Script Errors

GPU Not Detected

Stopping Services

Updating

Support

License

Additional Links

Cite this Work

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages