
Vectorize Iris Node.js SDK

Document text extraction for Node.js & TypeScript

Extract text, tables, and structured data from PDFs, images, and documents with a single async function. Built on Vectorize Iris, the industry-leading AI extraction service.

License: MIT

Why Iris?

Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to deliver:

  • High accuracy - Even with poor quality or complex documents
  • 📊 Structure preservation - Maintains tables, lists, and formatting
  • 🎯 Smart chunking - Semantic splitting perfect for RAG pipelines
  • 🔍 Metadata extraction - Extract specific fields using natural language
  • 🚀 TypeScript native - Full type safety with built-in types
  • Async-first - Promise-based API for modern Node.js

Quick Start

Installation

npm install @vectorize-io/iris

Authentication

Set your credentials (get them at vectorize.io):

export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"

Basic Usage

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('document.pdf');
console.log(result.text);

That's it! Iris handles file upload, extraction, and polling automatically.

Features

Basic Text Extraction

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('document.pdf');
console.log(result.text);

Output:

This is the extracted text from your PDF document.
All formatting and structure is preserved.

Tables, lists, and other elements are properly extracted.

Extract from Buffer

import { extractText } from '@vectorize-io/iris';
import * as fs from 'fs';

const fileBuffer = fs.readFileSync('document.pdf');
const result = await extractText(fileBuffer, 'document.pdf');

console.log(`Extracted ${result.text.length} characters`);

Output:

Extracted 5536 characters

Chunking for RAG

import { extractTextFromFile } from '@vectorize-io/iris';
import type { ExtractionOptions } from '@vectorize-io/iris';

const options: ExtractionOptions = {
  chunkSize: 512
};

const result = await extractTextFromFile('long-document.pdf', options);

result.chunks?.forEach((chunk, i) => {
  console.log(`Chunk ${i+1}: ${chunk.substring(0, 100)}...`);
});

Output:

Chunk 1: # Introduction
This document covers the basics of machine learning...

Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...

Chunk 3: ### Training Process
The training process involves adjusting weights...

Custom Parsing Instructions

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('report.pdf', {
  parsingInstructions: 'Extract only tables and numerical data, ignore narrative text'
});

console.log(result.text);

Output:

Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000

Region    | Sales  | Growth
----------|--------|-------
North     | $500K  | +12%
South     | $380K  | +8%
East      | $420K  | +15%
West      | $380K  | +10%

Inferred Metadata Schema

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('invoice.pdf', {
  inferMetadataSchema: true
});

const metadata = JSON.parse(result.metadata!);
console.log(JSON.stringify(metadata, null, 2));

Output:

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.00,
  "currency": "USD",
  "vendor": "Acme Corp"
}

Express.js Integration

import express from 'express';
import multer from 'multer';
import { extractText } from '@vectorize-io/iris';
import * as fs from 'fs';

const app = express();
const upload = multer({ dest: 'uploads/' });

app.post('/extract', upload.single('file'), async (req, res) => {
  try {
    const fileBuffer = fs.readFileSync(req.file!.path);
    const result = await extractText(fileBuffer, req.file!.originalname);

    res.json({
      success: true,
      text: result.text,
      charCount: result.text?.length || 0
    });
  } catch (error) {
    // In TypeScript the catch variable is unknown, so narrow it
    // before reading .message.
    const message = error instanceof Error ? error.message : String(error);
    res.status(500).json({
      success: false,
      error: message
    });
  }
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});

Request:

curl -F "file=@document.pdf" http://localhost:3000/extract

Response:

{
  "success": true,
  "text": "This is the extracted text...",
  "charCount": 5536
}

Batch Processing

import { extractTextFromFile } from '@vectorize-io/iris';
import * as fs from 'fs/promises';
import * as path from 'path';

async function processDirectory(dirPath: string) {
  const files = await fs.readdir(dirPath);
  const pdfFiles = files.filter(f => f.endsWith('.pdf'));

  for (const file of pdfFiles) {
    const filePath = path.join(dirPath, file);
    console.log(`Processing ${file}...`);

    const result = await extractTextFromFile(filePath);
    const outputPath = filePath.replace('.pdf', '.txt');

    await fs.writeFile(outputPath, result.text!);
    console.log(`  ✓ Saved to ${path.basename(outputPath)}`);
  }
}

processDirectory('./documents').catch(console.error);

Output:

Processing report-q1.pdf...
  ✓ Saved to report-q1.txt
Processing report-q2.pdf...
  ✓ Saved to report-q2.txt
Processing report-q3.pdf...
  ✓ Saved to report-q3.txt

Parallel Processing

import { extractTextFromFile } from '@vectorize-io/iris';

const files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];

const results = await Promise.all(
  files.map(file => extractTextFromFile(file))
);

results.forEach((result, i) => {
  console.log(`${files[i]}: ${result.text?.length || 0} chars`);
});

Output:

doc1.pdf: 3421 chars
doc2.pdf: 5892 chars
doc3.pdf: 2156 chars

Error Handling

import { extractTextFromFile, VectorizeIrisError } from '@vectorize-io/iris';

try {
  const result = await extractTextFromFile('document.pdf');
  console.log(result.text);
} catch (error) {
  if (error instanceof VectorizeIrisError) {
    console.error('Extraction failed:', error.message);
  } else {
    console.error('Unexpected error:', error);
  }
}

Output:

Extraction failed: File not found: document.pdf

TypeScript Types

import type {
  ExtractionOptions,
  ExtractionResultData,
  MetadataExtractionStrategySchema
} from '@vectorize-io/iris';

// Type-safe options with a metadata schema. ExtractionOptions
// declares schema as a string, so serialize the structured
// (OpenAPI spec format) schema to JSON.
const options: ExtractionOptions = {
  chunkSize: 512,
  parsingInstructions: 'Extract code blocks',
  metadataSchemas: [{
    id: 'doc-meta',
    schema: JSON.stringify({
      title: 'string',
      author: 'string',
      date: 'string'
    })
  }],
  pollInterval: 2000,
  timeout: 300000
};

// Type-safe result
const result: ExtractionResultData = await extractTextFromFile('doc.pdf', options);

if (result.success) {
  console.log('Text:', result.text);
  console.log('Chunks:', result.chunks?.length);
  console.log('Metadata:', result.metadata);
}

API Reference

extractTextFromFile(filePath, options?)

Extract text from a file.

Parameters:

  • filePath (string): Path to the file
  • options (ExtractionOptions, optional): Extraction options

Returns: Promise<ExtractionResultData>

extractText(fileBuffer, fileName, options?)

Extract text from a buffer.

Parameters:

  • fileBuffer (Buffer): File content
  • fileName (string): File name
  • options (ExtractionOptions, optional): Extraction options

Returns: Promise<ExtractionResultData>

ExtractionOptions

interface ExtractionOptions {
  apiToken?: string;              // Override env var
  orgId?: string;                 // Override env var
  pollInterval?: number;          // ms between checks (default: 2000)
  timeout?: number;               // max ms to wait (default: 300000)
  type?: 'iris';                  // Extraction type
  chunkSize?: number;             // Chunk size (default: 256)
  metadataSchemas?: Array<{       // Metadata schemas
    id: string;
    schema: string;
  }>;
  inferMetadataSchema?: boolean;  // Auto-detect metadata
  parsingInstructions?: string;   // Custom instructions
}

ExtractionResultData

interface ExtractionResultData {
  success: boolean;
  text?: string;                  // Extracted text
  chunks?: string[];              // Text chunks
  metadata?: string;              // JSON metadata
  metadataSchema?: string;        // Schema ID
  chunksMetadata?: (string|null)[]; // Per-chunk metadata
  chunksSchema?: (string|null)[];   // Per-chunk schemas
  error?: string;                 // Error message
}

📚 Full Documentation | 🏠 Back to Main README