SQLite Document Store Plugin

OpenClaw plugin project for buffered document storage on top of SQLite.

Goal

Provide a strongly typed storage/processing layer for the trading-agent project:

save normalized documents
chunk them for retrieval
index them with SQLite FTS5
generate optional chunk embeddings through an OpenAI-compatible API
keep a path open for sqlite-vec integration
expose a reusable API that agents can call through a future runtime tool layer

Current scope

Implemented now:

strict TypeScript project scaffold
SQLite schema for documents, document_chunks, ingest_runs, and ingest_errors
normalization and chunking
document save / dedupe by content hash
FTS-based search over documents and chunks
optional OpenAI-compatible embeddings on save/backfill
JSONL batch ingestion script
URL ingestion script
embeddings backfill script
tests for chunking, repository behavior, embedding generation/failure handling, and ingest run logging
strict ESLint + TypeScript config

Planned next:

runtime-facing OpenClaw tool registration
sqlite-vec integration
vector / hybrid retrieval helpers
richer reranking helpers
source-specific ingestion helpers above the shared pipeline

Storage model

documents

Stores document-level metadata and normalized content.

document_chunks

Stores retrieval-sized chunks derived from documents.text_clean. Embeddings are stored in:

embedding_json
embedding_model

Document metadata model (v2)

The plugin now stores three practical metadata layers:

1) Core document columns

Used for filtering, ranking, provenance, and reprocessing:

source_type
document_type
published_at
collected_at
language
country
content_hash
status
embedding_status
embedding_model
chunk_count
token_count_estimate
processing_error
last_processed_at

2) Provenance / trust columns

Used to separate official sources from secondary ones:

source_url_canonical
source_domain
source_priority
is_official_source
source_publisher
source_section
retrieved_via
http_status
content_type
etag
last_modified
fetch_run_id
trust_score

3) Chunk metadata

Used for retrieval quality and later vector search:

char_count
embedding_status
starts_at_char
ends_at_char
chunk_kind

FTS indexes

documents_fts
document_chunks_fts

Embeddings config

The plugin can call an OpenAI-compatible embeddings endpoint when saving documents. If enabled, embeddings are generated for each new chunk and stored in SQLite. If the embedding request fails, the document is still saved and the save result reports an embedding failure.

Example config payload for plugins.entries.sqlite-doc-store.config:

{
  "dbPath": "/config/.openclaw/sqlite-doc-store/documents.sqlite",
  "enableFts": true,
  "vectorMode": "disabled",
  "embedding": {
    "enabled": true,
    "apiUrl": "https://api.openai.com/v1/embeddings",
    "apiKey": "YOUR_TOKEN",
    "model": "text-embedding-3-small",
    "timeoutMs": 30000,
    "batchSize": 32
  }
}

Supported embedding fields:

enabled
apiUrl
apiKey
model
timeoutMs
batchSize
dimensions (optional)

Ingestion entrypoints

JSONL batch ingest

npm run ingest:jsonl -- ./reports/trading/ingest/cbr.jsonl

URL ingest

npm run ingest:url -- \
  --url https://cbr.ru/press/pr/?file=13022026_133000key.htm

ingest:url now supports source profiles. For known domains like cbr.ru, minfin.gov.ru, moex.com / iss.moex.com, econs.online, acra-ratings.ru, raexpert.ru, bofit.fi, and cbonds.ru, the plugin can auto-fill:

sourceType
sourceName
default documentType
extraction selectors
provenance defaults (sourcePriority, isOfficialSource, trustScore)

You can still override extraction manually with:

--content-selector
--title-selector
--remove-selector

Embeddings backfill

npm run embeddings:backfill -- --limit 100

Development

npm install
npm run check

Design notes

SQLite is the buffered storage layer
agents should not own low-level chunking/indexing logic
the plugin is intended to become a domain processing layer, not just a thin DB driver
embeddings are stored now so sqlite-vec / hybrid retrieval can be layered in next without reworking ingestion

Status

This repository is an actively usable storage scaffold. The core storage layer, chunking, FTS, and embedding generation are implemented and tested; runtime tool wiring is the next step.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
docs		docs
scripts		scripts
skills/sqlite-doc-store-tools		skills/sqlite-doc-store-tools
sql		sql
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
eslint.config.js		eslint.config.js
index.ts		index.ts
openclaw.plugin.json		openclaw.plugin.json
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SQLite Document Store Plugin

Goal

Current scope

Storage model

documents

document_chunks

Document metadata model (v2)

1) Core document columns

2) Provenance / trust columns

3) Chunk metadata

FTS indexes

Embeddings config

Ingestion entrypoints

JSONL batch ingest

URL ingest

Embeddings backfill

Development

Design notes

Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SQLite Document Store Plugin

Goal

Current scope

Storage model

documents

document_chunks

Document metadata model (v2)

1) Core document columns

2) Provenance / trust columns

3) Chunk metadata

FTS indexes

Embeddings config

Ingestion entrypoints

JSONL batch ingest

URL ingest

Embeddings backfill

Development

Design notes

Status

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages