A NestJS-powered AI API (REST & GraphQL) that ingests legal documents, extracts key data with LLMs, and stores everything in a vector database for fast, intelligent search.

dev3mike/legal-doc-assistant-ai

Legal Document Analyser

Legal Document Analyser is a NestJS service that ingests legal PDF and HTML files, extracts structured metadata with OpenAI and LangChain, stores semantic vectors in Milvus, and exposes both REST and GraphQL APIs for querying processed documents.


Key Features

  • Upload HTML or PDF files over REST with server-side validation and disk storage in _tmp/.
  • Background processing pipeline (BullMQ + Redis) that publishes two jobs per upload: metadata extraction and long-form summarisation.
  • AI-powered jobs use LangChain, OpenAI LLMs, and Milvus (vector store) to run Retrieval-Augmented Generation (RAG) workflows.
  • Prisma-powered PostgreSQL database keeps document records and job history.
  • GraphQL schema and REST controllers expose the same read models, including pagination and job status.

Folder Structure

src
├─ app.module.ts        # Bootstraps Config, GraphQL, Prisma, Docs, Queue modules
├─ common/              # Shared DTOs (pagination helpers, constants)
├─ config/              # zod-based env validation + helpers
├─ docs/                # REST + GraphQL endpoints, DTOs, pipes, resolvers
├─ prisma/              # Prisma module + service
├─ queue/               # BullMQ queues, processors, helpers, interfaces
├─ schema.gql           # Auto-generated GraphQL schema (kept in repo)
└─ main.ts              # Nest bootstrap file

System Diagram

flowchart LR
  subgraph Client Apps
    REST[REST upload\n/docs/upload]
    GQL[GraphQL queries\n/documents]
  end

  REST -->|PDF/HTML + metadata flag| C[DocsController]
  C -->|persist file info| DB[(PostgreSQL)]
  C -->|publish jobs| Q[BullMQ + Redis]
  Q -->|METADATA job| M[Metadata Processor\nLangChain + Milvus]
  Q -->|SUMMARY job| S[Summarizer Processor\nLangChain]
  M -->|vectors| V[(Milvus)]
  M -->|metadata| DB
  S -->|summary| DB
  DB -->|read models| GQL
  DB -->|REST listings| REST

Execution Flow

  • Upload flow: DocsController validates the file (PDF/HTML), stores it on disk, creates Document and Job rows via Prisma, then publishes two BullMQ jobs through QueueService.
  • Metadata job (MetadataExtractorProcessor): loads the file, chunks it, creates embeddings with OpenAI, stores them in Milvus, runs a RAG prompt to extract fields (title, court, etc.), and updates the document plus job status in PostgreSQL.
  • Summary job (DocumentSummarizerProcessor): runs a MapReduce summarisation prompt on large chunks and stores the final summary back in PostgreSQL.
  • Read APIs: DocsController (REST) and DocsResolver (GraphQL) both call DocsService, which pages through documents, maps DTOs, and returns job history so clients can track processing state.
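The upload step fans out into two queue jobs per document. A rough TypeScript sketch of that fan-out (the job names and payload shape below are illustrative; the project's real identifiers live in src/queue and may differ):

```typescript
// Illustrative job types and payload builder for the upload flow:
// one METADATA job and one SUMMARY job per uploaded document.
enum JobType {
  METADATA = "METADATA",
  SUMMARY = "SUMMARY",
}

interface DocJob {
  type: JobType;
  payload: { documentId: string; filePath: string };
}

// Build the two jobs the upload controller would publish for a document.
function buildJobsForUpload(documentId: string, filePath: string): DocJob[] {
  return [
    { type: JobType.METADATA, payload: { documentId, filePath } },
    { type: JobType.SUMMARY, payload: { documentId, filePath } },
  ];
}

const jobs = buildJobsForUpload("doc-123", "_tmp/contract.pdf");
console.log(jobs.map((j) => j.type)); // [ 'METADATA', 'SUMMARY' ]
```

In the real service these descriptors would be handed to BullMQ, and each processor updates the corresponding Job row as it runs.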

Requirements

  • Node.js 20+ and npm 10+
  • Docker (recommended) to start PostgreSQL, Redis, Milvus (etcd + MinIO).
  • OpenAI API key with access to gpt-4o-mini.
  • Local ports (defaults): 3006 for Nest, 5401 for Postgres, 6301 for Redis, 19530/9091 for Milvus, 9000/9001 for MinIO.

Environment Variables

| Name | Description | Example |
| ---- | ----------- | ------- |
| NODE_ENV | development, test, or production | development |
| PORT | HTTP port for Nest | 3000 |
| ENABLE_SWAGGER | Enable REST docs (true / false) | true |
| DATABASE_URL | PostgreSQL connection string | postgresql://postgres:postgres@localhost:5401/legal-doc-assistant |
| MILVUS_URL | Milvus gRPC endpoint | http://localhost:19530 |
| REDIS_HOST | Redis hostname | localhost |
| REDIS_PORT | Redis port | 6301 |
| OPENAI_API_KEY | OpenAI key used by LangChain | sk-... |

Create a .env file in the project root (same level as package.json) from the example file, providing every required variable. The ConfigModule validates the file at startup using Zod; missing or invalid values stop the app early.
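For instance, a local .env matching the default ports listed under Requirements might look like this (example values only, not secrets):

```
NODE_ENV=development
PORT=3006
ENABLE_SWAGGER=true
DATABASE_URL=postgresql://postgres:postgres@localhost:5401/legal-doc-assistant
MILVUS_URL=http://localhost:19530
REDIS_HOST=localhost
REDIS_PORT=6301
OPENAI_API_KEY=sk-...
```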


Getting Started

  1. Install dependencies
    npm install
  2. Start infrastructure (recommended)
    docker compose up -d
  3. Generate Prisma client & schema
    npx prisma generate
  4. Apply database migrations
    npx prisma migrate dev
  5. Run the Nest app
    npm run start:dev
  6. Open the APIs
    • GraphQL Playground: http://localhost:3006/graphql
    • (Optional) Swagger UI: http://localhost:3006/api/swagger when ENABLE_SWAGGER=true

REST Endpoints

  • POST /docs/upload
    • Form-data fields: file (PDF/HTML), is_scanned_document (optional boolean).
    • Response includes id, job_id (metadata job), created_at.
  • GET /docs
    • Query params: page, limit (defaults 1/10).
    • Returns paginated documents with metadata + jobs.
  • GET /docs/:id
    • Returns a single document including job history.

GraphQL Queries

Open the playground (/graphql) and run:

query Documents($page: Int = 1, $limit: Int = 10) {
  documents(page: $page, limit: $limit) {
    data {
      id
      title
      summary
      jobs {
        id
        status
        type
      }
    }
    meta {
      currentPage
      itemsPerPage
      hasNextPage
    }
  }
}
query Document($id: ID!) {
  document(id: $id) {
    id
    title
    court
    summary
    jobs {
      id
      status
    }
  }
}
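Outside the playground, the same query is just a standard GraphQL POST body. A minimal sketch of building it (endpoint and query taken from above):

```typescript
// Build the JSON body for a GraphQL POST to /graphql.
const DOCUMENTS_QUERY = `
  query Documents($page: Int = 1, $limit: Int = 10) {
    documents(page: $page, limit: $limit) {
      data { id title summary }
      meta { currentPage hasNextPage }
    }
  }
`;

function buildGraphqlBody(page: number, limit: number): string {
  return JSON.stringify({ query: DOCUMENTS_QUERY, variables: { page, limit } });
}

// Then send it with:
// fetch("http://localhost:3006/graphql", {
//   method: "POST",
//   headers: { "content-type": "application/json" },
//   body: buildGraphqlBody(1, 10),
// })
const body = JSON.parse(buildGraphqlBody(1, 10));
console.log(body.variables); // { page: 1, limit: 10 }
```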

Background Processing & Tools

  • BullMQ queues are configured in QueueModule; each job retries up to 3 times with exponential backoff.
  • LangChain helpers live in src/queue/helpers/queue.helper.ts, handling file validation, PDF/HTML parsing, and OpenAI client creation.
  • Milvus maintenance: use npm run milvus:reset (runs ts-node scripts/reset-milvus.ts) to clear the vector collection during local development.
  • Temporary files: uploads are stored in _tmp/ by multer. Clean it periodically if disk usage grows.
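The retry policy above corresponds to a standard BullMQ job-options object. A sketch of what such options look like (the delay value is an assumed example; only the 3 attempts and exponential backoff come from this README):

```typescript
// BullMQ-style job options: 3 attempts with exponential backoff.
// Local interfaces mirror the relevant shape rather than importing bullmq.
interface BackoffOptions {
  type: "exponential" | "fixed";
  delay: number; // base delay in milliseconds
}

interface JobOptions {
  attempts: number;
  backoff: BackoffOptions;
  removeOnComplete?: boolean;
}

const defaultJobOptions: JobOptions = {
  attempts: 3,
  backoff: { type: "exponential", delay: 1_000 }, // retry delays double: 1s, 2s, ...
};

console.log(defaultJobOptions.attempts); // 3
```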

Testing

  • Unit tests

    npm test

    Runs Jest with the default config under src/**/*.spec.ts.

  • E2E tests

    npm run test:e2e

    Uses tests/jest-e2e.json to boot the full Nest app. Ensure the .env (or exported vars) points to reachable services—Postgres, Redis, etc.

  • LLM-as-judge test

    npm run llm:test:metadata

    Runs an e2e-style metadata extraction scenario (LLM-as-judge) against the real MetadataExtractorProcessor. Requirements:

    • OPENAI_API_KEY with access to the chosen models
    • MILVUS_URL pointing to a running Milvus instance (the test spins up a temporary collection and drops it afterwards; override its name via LLM_TEST_MILVUS_COLLECTION)
    • DATABASE_URL for Prisma (only used to satisfy the processor’s dependencies—data is not mutated)

    The script calls OpenAI twice (app model + judge). It fails if the judge returns verdict !== pass or score < 85.
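The pass criterion can be expressed as a small predicate (illustrative names; the real script's identifiers are not shown in this README):

```typescript
// LLM-as-judge pass criterion: verdict must be "pass" AND score >= 85.
interface JudgeResult {
  verdict: string;
  score: number;
}

function judgePasses(result: JudgeResult): boolean {
  return result.verdict === "pass" && result.score >= 85;
}

console.log(judgePasses({ verdict: "pass", score: 90 })); // true
console.log(judgePasses({ verdict: "pass", score: 80 })); // false
```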


Missing Features

  • Handle scanned documents: this API cannot yet process scanned documents. They could be handled with OCR or vision models to extract the data, but these methods are not implemented.

Next Steps Toward Production

  • Observability stack: integrate Langfuse (or similar LLM tracing) plus OpenTelemetry exporters so prompts, token usage, and background job spans can be monitored centrally.
  • CI/CD pipeline: add a GitHub Actions workflow that runs lint, unit, e2e, and metadata judge tests, builds/pushes Docker images.
  • Infrastructure as code: create Kubernetes manifests for the Nest API, workers, Redis, Milvus, and PostgreSQL, including horizontal pod autoscaling, secrets management, and per-environment overlays.
  • Move file storage to S3/R2: Local disk storage is fragile; migrate uploads to AWS S3 or Cloudflare R2 to align with best practices.
  • Security Enhancements: Add auth, API tokens, and introduce rate limiting to prevent abuse and protect sensitive docs.
