Legal Document Analyser is a NestJS service that ingests legal PDFs/HTML files, extracts structured metadata with OpenAI + LangChain, stores semantic vectors in Milvus, and exposes both REST and GraphQL APIs for querying processed documents.
- Upload HTML or PDF files over REST, with server-side validation and disk storage in `_tmp/`.
- Background processing pipeline (BullMQ + Redis) that publishes two jobs per upload: metadata extraction and long-form summarisation.
- AI-powered jobs use LangChain, OpenAI models, and Milvus (vector store) to run Retrieval-Augmented Generation (RAG) workflows.
- Prisma-powered PostgreSQL database keeps document records and job history.
- GraphQL schema and REST controllers expose the same read models, including pagination and job status.
```
src
├─ app.module.ts   # Bootstraps Config, GraphQL, Prisma, Docs, Queue modules
├─ common/         # Shared DTOs (pagination helpers, constants)
├─ config/         # zod-based env validation + helpers
├─ docs/           # REST + GraphQL endpoints, DTOs, pipes, resolvers
├─ prisma/         # Prisma module + service
├─ queue/          # BullMQ queues, processors, helpers, interfaces
├─ schema.gql      # Auto-generated GraphQL schema (kept in repo)
└─ main.ts         # Nest bootstrap file
```
```mermaid
flowchart LR
  subgraph Client Apps
    REST[REST upload\n/docs/upload]
    GQL[GraphQL queries\n/documents]
  end
  REST -->|PDF/HTML + metadata flag| C[DocsController]
  C -->|persist file info| DB[(PostgreSQL)]
  C -->|publish jobs| Q[BullMQ + Redis]
  Q -->|METADATA job| M[Metadata Processor\nLangChain + Milvus]
  Q -->|SUMMARY job| S[Summarizer Processor\nLangChain]
  M -->|vectors| V[(Milvus)]
  M -->|metadata| DB
  S -->|summary| DB
  DB -->|read models| GQL
  DB -->|REST listings| REST
```
- Upload flow: `DocsController` validates the file (PDF/HTML), stores it on disk, creates `Document` + `Jobs` rows via Prisma, then publishes two BullMQ jobs through `QueueService`.
- Metadata job (`MetadataExtractorProcessor`): loads the file, chunks it, creates embeddings with OpenAI, stores them in Milvus, runs a RAG prompt to extract fields (title, court, etc.), and updates the document plus job status in PostgreSQL.
- Summary job (`DocumentSummarizerProcessor`): runs a MapReduce summarisation prompt over large chunks and stores the final summary back in PostgreSQL.
- Read APIs: `DocsController` (REST) and `DocsResolver` (GraphQL) both call `DocsService`, which pages through documents, maps DTOs, and returns job history so clients can track processing state.
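The chunking step in the metadata job can be sketched as a simple sliding window. This is an illustrative stand-in, not the project's actual code; the `chunkSize` and `overlap` defaults are assumptions, and the real pipeline delegates this to LangChain text splitters:

```typescript
// Sliding-window text chunking, as done before embedding (sketch).
// chunkSize/overlap values are illustrative assumptions.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= overlap) throw new Error("chunkSize must exceed overlap");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final window reached the end
    start += chunkSize - overlap; // step forward, keeping overlap for context
  }
  return chunks;
}
```

The overlap keeps sentence context intact across chunk boundaries, which improves retrieval quality when the RAG prompt later pulls chunks back from Milvus.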
- Node.js 20+ and npm 10+
- Docker (recommended) to start PostgreSQL, Redis, Milvus (etcd + MinIO).
- OpenAI API key with access to `gpt-4o-mini`.
- Local ports (defaults): `3006` for Nest, `5401` for Postgres, `6301` for Redis, `19530`/`9091` for Milvus, `9000`/`9001` for MinIO.
| Name | Description | Example |
|---|---|---|
| `NODE_ENV` | `development`, `test`, or `production` | `development` |
| `PORT` | HTTP port for Nest | `3000` |
| `ENABLE_SWAGGER` | Enable REST docs (`true` / `false`) | `true` |
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://postgres:postgres@localhost:5401/legal-doc-assistant` |
| `MILVUS_URL` | Milvus gRPC endpoint | `http://localhost:19530` |
| `REDIS_HOST` | Redis hostname | `localhost` |
| `REDIS_PORT` | Redis port | `6301` |
| `OPENAI_API_KEY` | OpenAI key used by LangChain | `sk-...` |
Create a `.env` file in the project root (same level as `package.json`) from the example file, providing every required variable. `ConfigModule` validates it at startup using Zod; missing or invalid values stop the app early.
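A minimal `.env` following the table above might look like this (values mirror the documented examples and local default ports; adjust for your environment):

```shell
NODE_ENV=development
PORT=3006
ENABLE_SWAGGER=true
DATABASE_URL=postgresql://postgres:postgres@localhost:5401/legal-doc-assistant
MILVUS_URL=http://localhost:19530
REDIS_HOST=localhost
REDIS_PORT=6301
OPENAI_API_KEY=sk-...
```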
- Install dependencies: `npm install`
- Start infrastructure (recommended): `docker compose up -d`
- Generate the Prisma client: `npx prisma generate`
- Apply database migrations: `npx prisma migrate dev`
- Run the Nest app: `npm run start:dev`
- Open the APIs:
  - GraphQL Playground: `http://localhost:3006/graphql`
  - Swagger UI (optional): `http://localhost:3006/api/swagger` when `ENABLE_SWAGGER=true`
- `POST /docs/upload`
  - Form-data fields: `file` (PDF/HTML), `is_scanned_document` (optional boolean).
  - Response includes `id`, `job_id` (metadata job), and `created_at`.
- `GET /docs`
  - Query params: `page`, `limit` (defaults `1`/`10`).
  - Returns paginated documents with metadata + jobs.
- `GET /docs/:id`
  - Returns a single document including job history.
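The `page`/`limit` behaviour of `GET /docs` can be sketched with an in-memory stand-in. This is illustrative only: the real `DocsService` queries PostgreSQL via Prisma, and the `meta` field names here follow the GraphQL example below:

```typescript
// Illustrative pagination over an in-memory list; the real service
// pages through the database rather than an array.
interface PageMeta {
  currentPage: number;
  itemsPerPage: number;
  hasNextPage: boolean;
}

function paginate<T>(items: T[], page = 1, limit = 10): { data: T[]; meta: PageMeta } {
  const start = (page - 1) * limit; // offset of the first item on this page
  return {
    data: items.slice(start, start + limit),
    meta: {
      currentPage: page,
      itemsPerPage: limit,
      hasNextPage: start + limit < items.length, // more items remain past this page
    },
  };
}
```

Clients can keep requesting pages until `hasNextPage` is `false`.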
Open the playground (`/graphql`) and run:
```graphql
query Documents($page: Int = 1, $limit: Int = 10) {
  documents(page: $page, limit: $limit) {
    data {
      id
      title
      summary
      jobs {
        id
        status
        type
      }
    }
    meta {
      currentPage
      itemsPerPage
      hasNextPage
    }
  }
}
```

```graphql
query Document($id: ID!) {
  document(id: $id) {
    id
    title
    court
    summary
    jobs {
      id
      status
    }
  }
}
```

- BullMQ queues are configured in `QueueModule`; each job retries up to 3 times with exponential backoff.
- LangChain helpers live in `src/queue/helpers/queue.helper.ts`, handling file validation, PDF/HTML parsing, and OpenAI client creation.
- Milvus maintenance: use `npm run milvus:reset` (runs `ts-node scripts/reset-milvus.ts`) to clear the vector collection during local development.
- Temporary files: uploads are stored in `_tmp/` by `multer`. Clean it periodically if disk usage grows.
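The retry schedule follows BullMQ-style exponential backoff (delay doubles on each attempt). A sketch of the resulting delays; the 1-second base delay is an assumption, not the project's configured value:

```typescript
// Delays for exponential backoff: baseMs * 2^attemptIndex.
// attempts = 3 matches the queue configuration; baseMs is illustrative.
function backoffDelays(attempts: number, baseMs: number): number[] {
  return Array.from({ length: attempts }, (_, attempt) => baseMs * 2 ** attempt);
}
```

With a 1 s base, retries would fire after roughly 1 s, 2 s, and 4 s, giving transient failures (e.g. an OpenAI rate limit) time to clear.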
- Unit tests: `npm test` runs Jest with the default config against `src/**/*.spec.ts`.
- E2E tests: `npm run test:e2e` uses `tests/jest-e2e.json` to boot the full Nest app. Ensure the `.env` (or exported vars) points to reachable services — Postgres, Redis, etc.
- LLM-as-judge test: `npm run llm:test:metadata` runs an e2e-style metadata extraction scenario (LLM-as-judge) against the real `MetadataExtractorProcessor`. Requirements:
  - `OPENAI_API_KEY` with access to the chosen models
  - `MILVUS_URL` pointing to a running Milvus instance (the test spins up a temporary collection and drops it afterwards; override its name via `LLM_TEST_MILVUS_COLLECTION`)
  - `DATABASE_URL` for Prisma (only used to satisfy the processor's dependencies; data is not mutated)

  The script calls OpenAI twice (app model + judge). It fails if the judge returns `verdict !== "pass"` or `score < 85`.
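The pass/fail gate can be expressed directly. The result shape here is an assumption about what the judge returns; the thresholds come from the failure conditions above:

```typescript
// LLM-as-judge gate: a run passes only when the judge's verdict is "pass"
// AND the score meets the 85 threshold. The JudgeResult shape is assumed.
interface JudgeResult {
  verdict: string; // e.g. "pass" | "fail"
  score: number;   // 0-100
}

function judgePasses(result: JudgeResult): boolean {
  return result.verdict === "pass" && result.score >= 85;
}
```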
- Handle scanned documents: the API cannot currently process scanned documents. OCR or vision models could extract their text, but neither approach is implemented yet.
- Observability stack: integrate Langfuse (or similar LLM tracing) plus OpenTelemetry exporters so prompts, token usage, and background job spans can be monitored centrally.
- CI/CD pipeline: add a GitHub Actions workflow that runs lint, unit, e2e, and metadata judge tests, builds/pushes Docker images.
- Infrastructure as code: create Kubernetes manifests for the Nest API, workers, Redis, Milvus, and PostgreSQL, including horizontal pod autoscaling, secrets management, and per-environment overlays.
- Move file storage to S3/R2: local disk storage is fragile and ties uploads to a single host; migrate uploads to AWS S3 or Cloudflare R2 for durability.
- Security enhancements: add authentication, API tokens, and rate limiting to prevent abuse and protect sensitive documents.
