Legal Document Analyser is a NestJS service that ingests legal PDFs/HTML files, extracts structured metadata with OpenAI + LangChain, stores semantic vectors in Milvus, and exposes both REST and GraphQL APIs for querying processed documents.
- Upload HTML or PDF files over REST, with server-side validation and disk storage in `_tmp/`.
- Background processing pipeline (BullMQ + Redis) that publishes two jobs per upload: metadata extraction and long-form summarisation.
- AI-powered jobs use LangChain, OpenAI models, and Milvus (vector store) to run Retrieval-Augmented Generation (RAG) workflows.
- Prisma-powered PostgreSQL database keeps document records and job history.
- GraphQL schema and REST controllers expose the same read models, including pagination and job status.
```
src
├─ app.module.ts   # Bootstraps Config, GraphQL, Prisma, Docs, Queue modules
├─ common/         # Shared DTOs (pagination helpers, constants)
├─ config/         # zod-based env validation + helpers
├─ docs/           # REST + GraphQL endpoints, DTOs, pipes, resolvers
├─ prisma/         # Prisma module + service
├─ queue/          # BullMQ queues, processors, helpers, interfaces
├─ schema.gql      # Auto-generated GraphQL schema (kept in repo)
└─ main.ts         # Nest bootstrap file
```
```mermaid
flowchart LR
  subgraph Client Apps
    REST[REST upload\n/docs/upload]
    GQL[GraphQL queries\n/documents]
  end
  REST -->|PDF/HTML + metadata flag| C[DocsController]
  C -->|persist file info| DB[(PostgreSQL)]
  C -->|publish jobs| Q[BullMQ + Redis]
  Q -->|METADATA job| M[Metadata Processor\nLangChain + Milvus]
  Q -->|SUMMARY job| S[Summarizer Processor\nLangChain]
  M -->|vectors| V[(Milvus)]
  M -->|metadata| DB
  S -->|summary| DB
  DB -->|read models| GQL
  DB -->|REST listings| REST
```
- Upload flow: `DocsController` validates the file (PDF/HTML), stores it on disk, creates `Document` + `Jobs` rows via Prisma, then publishes two BullMQ jobs through `QueueService`.
- Metadata job (`MetadataExtractorProcessor`): loads the file, chunks it, creates embeddings with OpenAI, stores them in Milvus, runs a RAG prompt to extract fields (title, court, etc.), and updates the document plus job status in PostgreSQL.
- Summary job (`DocumentSummarizerProcessor`): runs a MapReduce summarisation prompt over large chunks and stores the final summary back in PostgreSQL.
- Read APIs: `DocsController` (REST) and `DocsResolver` (GraphQL) both call `DocsService`, which pages through documents, maps DTOs, and returns job history so clients can track processing state.
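The chunking step in the metadata job can be sketched as a simple sliding window. This is an illustrative stand-in, not the project's actual code; the `chunkSize` and `overlap` defaults are assumptions, and the real pipeline delegates this to LangChain text splitters:

```typescript
// Sliding-window text chunking, as done before embedding (sketch).
// chunkSize/overlap values are illustrative assumptions.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= overlap) throw new Error("chunkSize must exceed overlap");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final window reached the end
    start += chunkSize - overlap; // step forward, keeping overlap for context
  }
  return chunks;
}
```

The overlap keeps sentence context intact across chunk boundaries, which improves retrieval quality when the RAG prompt later pulls chunks back from Milvus.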
- Node.js 20+ and npm 10+
- Docker (recommended) to start PostgreSQL, Redis, Milvus (etcd + MinIO).
- OpenAI API key with access to `gpt-4o-mini`.
- Local ports (defaults): `3006` for Nest, `5401` for Postgres, `6301` for Redis, `19530`/`9091` for Milvus, `9000`/`9001` for MinIO.
| Name | Description | Example |
|---|---|---|
| `NODE_ENV` | `development`, `test`, or `production` | `development` |
| `PORT` | HTTP port for Nest | `3000` |
| `ENABLE_SWAGGER` | Enable REST docs (`true` / `false`) | `true` |
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://postgres:postgres@localhost:5401/legal-doc-assistant` |
| `MILVUS_URL` | Milvus gRPC endpoint | `http://localhost:19530` |
| `REDIS_HOST` | Redis hostname | `localhost` |
| `REDIS_PORT` | Redis port | `6301` |
| `OPENAI_API_KEY` | OpenAI key used by LangChain | `sk-...` |
Create a `.env` file in the project root (same level as `package.json`) from the example file, providing every required variable. `ConfigModule` validates it at startup using Zod; missing or invalid values stop the app early.
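A minimal `.env` following the table above might look like this (values mirror the documented examples and local default ports; adjust for your environment):

```shell
NODE_ENV=development
PORT=3006
ENABLE_SWAGGER=true
DATABASE_URL=postgresql://postgres:postgres@localhost:5401/legal-doc-assistant
MILVUS_URL=http://localhost:19530
REDIS_HOST=localhost
REDIS_PORT=6301
OPENAI_API_KEY=sk-...
```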
- Install dependencies: `npm install`
- Start infrastructure (recommended): `docker compose up -d`
- Generate the Prisma client: `npx prisma generate`
- Apply database migrations: `npx prisma migrate dev`
- Run the Nest app: `npm run start:dev`
- Open the APIs:
  - GraphQL Playground: `http://localhost:3006/graphql`
  - Swagger UI (optional): `http://localhost:3006/api/swagger` when `ENABLE_SWAGGER=true`
- `POST /docs/upload`
  - Form-data fields: `file` (PDF/HTML), `is_scanned_document` (optional boolean).
  - Response includes `id`, `job_id` (metadata job), and `created_at`.
- `GET /docs`
  - Query params: `page`, `limit` (defaults `1`/`10`).
  - Returns paginated documents with metadata + jobs.
- `GET /docs/:id`
  - Returns a single document including job history.
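The `page`/`limit` behaviour of `GET /docs` can be sketched with an in-memory stand-in. This is illustrative only: the real `DocsService` queries PostgreSQL via Prisma, and the `meta` field names here follow the GraphQL example below:

```typescript
// Illustrative pagination over an in-memory list; the real service
// pages through the database rather than an array.
interface PageMeta {
  currentPage: number;
  itemsPerPage: number;
  hasNextPage: boolean;
}

function paginate<T>(items: T[], page = 1, limit = 10): { data: T[]; meta: PageMeta } {
  const start = (page - 1) * limit; // offset of the first item on this page
  return {
    data: items.slice(start, start + limit),
    meta: {
      currentPage: page,
      itemsPerPage: limit,
      hasNextPage: start + limit < items.length, // more items remain past this page
    },
  };
}
```

Clients can keep requesting pages until `hasNextPage` is `false`.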
Open the playground (`/graphql`) and run:
```graphql
query Documents($page: Int = 1, $limit: Int = 10) {
  documents(page: $page, limit: $limit) {
    data {
      id
      title
      summary
      jobs {
        id
        status
        type
      }
    }
    meta {
      currentPage
      itemsPerPage
      hasNextPage
    }
  }
}
```

```graphql
query Document($id: ID!) {
  document(id: $id) {
    id
    title
    court
    summary
    jobs {
      id
      status
    }
  }
}
```

- BullMQ queues are configured in `QueueModule`; each job retries up to 3 times with exponential backoff.
- LangChain helpers live in `src/queue/helpers/queue.helper.ts`, handling file validation, PDF/HTML parsing, and OpenAI client creation.
- Milvus maintenance: use `npm run milvus:reset` (runs `ts-node scripts/reset-milvus.ts`) to clear the vector collection during local development.
- Temporary files: uploads are stored in `_tmp/` by `multer`. Clean it periodically if disk usage grows.
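The retry schedule follows BullMQ-style exponential backoff (delay doubles on each attempt). A sketch of the resulting delays; the 1-second base delay is an assumption, not the project's configured value:

```typescript
// Delays for exponential backoff: baseMs * 2^attemptIndex.
// attempts = 3 matches the queue configuration; baseMs is illustrative.
function backoffDelays(attempts: number, baseMs: number): number[] {
  return Array.from({ length: attempts }, (_, attempt) => baseMs * 2 ** attempt);
}
```

With a 1 s base, retries would fire after roughly 1 s, 2 s, and 4 s, giving transient failures (e.g. an OpenAI rate limit) time to clear.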
- Unit tests: `npm test` runs Jest with the default config against `src/**/*.spec.ts`.
- E2E tests: `npm run test:e2e` uses `tests/jest-e2e.json` to boot the full Nest app. Ensure the `.env` (or exported vars) points to reachable services — Postgres, Redis, etc.
- LLM-as-judge test: `npm run llm:test:metadata` runs an e2e-style metadata extraction scenario (LLM-as-judge) against the real `MetadataExtractorProcessor`. Requirements:
  - `OPENAI_API_KEY` with access to the chosen models
  - `MILVUS_URL` pointing to a running Milvus instance (the test spins up a temporary collection and drops it afterwards; override its name via `LLM_TEST_MILVUS_COLLECTION`)
  - `DATABASE_URL` for Prisma (only used to satisfy the processor's dependencies; data is not mutated)

  The script calls OpenAI twice (app model + judge). It fails if the judge returns `verdict !== "pass"` or `score < 85`.
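The pass/fail gate can be expressed directly. The result shape here is an assumption about what the judge returns; the thresholds come from the failure conditions above:

```typescript
// LLM-as-judge gate: a run passes only when the judge's verdict is "pass"
// AND the score meets the 85 threshold. The JudgeResult shape is assumed.
interface JudgeResult {
  verdict: string; // e.g. "pass" | "fail"
  score: number;   // 0-100
}

function judgePasses(result: JudgeResult): boolean {
  return result.verdict === "pass" && result.score >= 85;
}
```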
- Handle scanned documents: the API cannot currently process scanned documents. OCR or vision models could extract their text, but neither approach is implemented yet.
- Observability stack: integrate Langfuse (or similar LLM tracing) plus OpenTelemetry exporters so prompts, token usage, and background job spans can be monitored centrally.
- CI/CD pipeline: add a GitHub Actions workflow that runs lint, unit, e2e, and metadata judge tests, builds/pushes Docker images.
- Infrastructure as code: create Kubernetes manifests for the Nest API, workers, Redis, Milvus, and PostgreSQL, including horizontal pod autoscaling, secrets management, and per-environment overlays.
- Move file storage to S3/R2: local disk storage is fragile and ties uploads to a single host; migrate uploads to AWS S3 or Cloudflare R2 for durability.
- Security enhancements: add authentication, API tokens, and rate limiting to prevent abuse and protect sensitive documents.
