OpenClaw plugin project for buffered document storage on top of SQLite.
It provides a strongly typed storage/processing layer for the trading-agent project:
- save normalized documents
- chunk them for retrieval
- index them with SQLite FTS5
- generate optional chunk embeddings through an OpenAI-compatible API
- keep a path open for sqlite-vec integration
- expose a reusable API that agents can call through a future runtime tool layer
Implemented now:
- strict TypeScript project scaffold
- SQLite schema for `documents`, `document_chunks`, `ingest_runs`, and `ingest_errors`
- normalization and chunking
- document save / dedupe by content hash
- FTS-based search over documents and chunks
- optional OpenAI-compatible embeddings on save/backfill
- JSONL batch ingestion script
- URL ingestion script
- embeddings backfill script
- tests for chunking, repository behavior, embedding generation/failure handling, and ingest run logging
- strict ESLint + TypeScript config
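
The save/dedupe-by-content-hash step listed above can be sketched as follows. This is an illustrative sketch, not the plugin's actual API: the helper names (`contentHash`, `shouldInsert`) are hypothetical; only the idea of hashing normalized text into `content_hash` comes from the README.

```typescript
import { createHash } from "node:crypto";

// Hash the normalized document text; the result would populate `content_hash`.
function contentHash(textClean: string): string {
  return createHash("sha256").update(textClean, "utf8").digest("hex");
}

// A save routine could skip the insert when the hash has been seen before.
// (In the real plugin the lookup would hit SQLite, not an in-memory Set.)
function shouldInsert(seenHashes: Set<string>, textClean: string): boolean {
  const hash = contentHash(textClean);
  if (seenHashes.has(hash)) return false;
  seenHashes.add(hash);
  return true;
}
```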
Planned next:
- runtime-facing OpenClaw tool registration
- sqlite-vec integration
- vector / hybrid retrieval helpers
- richer reranking helpers
- source-specific ingestion helpers above the shared pipeline
The `documents` table stores document-level metadata and normalized content.
The `document_chunks` table stores retrieval-sized chunks derived from `documents.text_clean`.
Chunk embeddings are stored in:
- `embedding_json`
- `embedding_model`
The plugin now stores three practical metadata layers:
Used for filtering, ranking, provenance, and reprocessing:
`source_type`, `document_type`, `published_at`, `collected_at`, `language`, `country`, `content_hash`, `status`, `embedding_status`, `embedding_model`, `chunk_count`, `token_count_estimate`, `processing_error`, `last_processed_at`
Used to separate official sources from secondary ones:
`source_url_canonical`, `source_domain`, `source_priority`, `is_official_source`, `source_publisher`, `source_section`, `retrieved_via`, `http_status`, `content_type`, `etag`, `last_modified`, `fetch_run_id`, `trust_score`
Used for retrieval quality and later vector search:
`char_count`, `embedding_status`, `starts_at_char`, `ends_at_char`, `chunk_kind`
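
The offset fields above (`starts_at_char`, `ends_at_char`, `char_count`) suggest offset-tracked chunking. The sketch below assumes a fixed character window with overlap; the plugin's real splitter may use a different strategy, and the function and parameter names are illustrative.

```typescript
// Hypothetical offset-tracked chunker over `documents.text_clean`.
interface ChunkRow {
  starts_at_char: number;
  ends_at_char: number;
  char_count: number;
  text: string;
}

function chunkText(textClean: string, windowSize = 1200, overlap = 200): ChunkRow[] {
  const chunks: ChunkRow[] = [];
  let start = 0;
  while (start < textClean.length) {
    const end = Math.min(start + windowSize, textClean.length);
    const text = textClean.slice(start, end);
    chunks.push({ starts_at_char: start, ends_at_char: end, char_count: text.length, text });
    if (end === textClean.length) break;
    start = end - overlap; // overlap preserves retrieval context across boundaries
  }
  return chunks;
}
```

Storing character offsets rather than only the chunk text means chunks can be re-derived or highlighted against the original document later.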
Full-text search is backed by the FTS5 virtual tables `documents_fts` and `document_chunks_fts`.
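
A chunk search against FTS5 might look like the sketch below. The table names come from the README; the column list, SQL shape, and the `toFtsMatch` helper are assumptions, shown mainly to illustrate quoting user input as FTS5 phrases so raw query syntax cannot break the `MATCH` expression.

```typescript
// Hypothetical helper: turn free text into a safe FTS5 MATCH expression by
// quoting each whitespace-separated term as a phrase (doubling embedded quotes).
function toFtsMatch(query: string): string {
  return query
    .split(/\s+/)
    .filter(Boolean)
    .map((term) => `"${term.replace(/"/g, '""')}"`)
    .join(" ");
}

// Assumed query shape; bm25() is the built-in FTS5 ranking function
// (lower scores rank better, hence ORDER BY score ascending).
const chunkSearchSql = `
  SELECT rowid, bm25(document_chunks_fts) AS score
  FROM document_chunks_fts
  WHERE document_chunks_fts MATCH ?
  ORDER BY score
  LIMIT ?`;
```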
The plugin can call an OpenAI-compatible embeddings endpoint when saving documents. If enabled, embeddings are generated for each new chunk and stored in SQLite. If the embedding request fails, the document is still saved and the save result reports an embedding failure.
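
The "save succeeds even if embeddings fail" behavior described above can be sketched like this. All names here (`EmbeddingOutcome`, `embedChunks`) are illustrative, not the plugin's actual API; the request/response shape assumes a standard OpenAI-compatible `/v1/embeddings` endpoint.

```typescript
interface EmbeddingConfig {
  apiUrl: string;
  apiKey: string;
  model: string;
  timeoutMs: number;
}

type EmbeddingOutcome =
  | { status: "succeeded"; vectors: number[][] }
  | { status: "failed"; error: string };

// Request vectors for a batch of chunk texts. Any failure is captured and
// returned, never thrown, so the caller can save the document regardless
// and record the failure in `embedding_status` / the save result.
async function embedChunks(
  texts: string[],
  cfg: EmbeddingConfig,
  fetchImpl: typeof fetch = fetch,
): Promise<EmbeddingOutcome> {
  try {
    const res = await fetchImpl(cfg.apiUrl, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${cfg.apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: cfg.model, input: texts }),
      signal: AbortSignal.timeout(cfg.timeoutMs),
    });
    if (!res.ok) return { status: "failed", error: `HTTP ${res.status}` };
    const body = (await res.json()) as { data: { embedding: number[] }[] };
    return { status: "succeeded", vectors: body.data.map((d) => d.embedding) };
  } catch (err) {
    return { status: "failed", error: String(err) };
  }
}
```

Injecting `fetchImpl` keeps the failure path easy to exercise in tests without a live endpoint.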
Example config payload for `plugins.entries.sqlite-doc-store.config`:

```json
{
  "dbPath": "/config/.openclaw/sqlite-doc-store/documents.sqlite",
  "enableFts": true,
  "vectorMode": "disabled",
  "embedding": {
    "enabled": true,
    "apiUrl": "https://api.openai.com/v1/embeddings",
    "apiKey": "YOUR_TOKEN",
    "model": "text-embedding-3-small",
    "timeoutMs": 30000,
    "batchSize": 32
  }
}
```

Supported embedding fields: `enabled`, `apiUrl`, `apiKey`, `model`, `timeoutMs`, `batchSize`, `dimensions` (optional)
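
A typed view of that embedding block could look like the sketch below. The field names mirror the config; the interface name, the defaults helper, and the default values (taken from the example payload) are assumptions, not the plugin's actual types.

```typescript
// Hypothetical typing of the `embedding` config block.
interface EmbeddingSettings {
  enabled: boolean;
  apiUrl: string;
  apiKey: string;
  model: string;
  timeoutMs?: number;
  batchSize?: number;
  dimensions?: number; // optional, for models that support reduced dimensions
}

// Fill optional fields with the values used in the README's example payload.
function withEmbeddingDefaults(cfg: EmbeddingSettings): EmbeddingSettings {
  return { ...cfg, timeoutMs: cfg.timeoutMs ?? 30_000, batchSize: cfg.batchSize ?? 32 };
}
```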
```shell
npm run ingest:jsonl -- ./reports/trading/ingest/cbr.jsonl
```

```shell
npm run ingest:url -- \
  --url https://cbr.ru/press/pr/?file=13022026_133000key.htm
```

`ingest:url` now supports source profiles. For known domains like cbr.ru, minfin.gov.ru, moex.com / iss.moex.com, econs.online, acra-ratings.ru, raexpert.ru, bofit.fi, and cbonds.ru, the plugin can auto-fill:

- `sourceType` and `sourceName`
- default `documentType`
- extraction selectors
- provenance defaults (`sourcePriority`, `isOfficialSource`, `trustScore`)

You can still override extraction manually with `--content-selector`, `--title-selector`, and `--remove-selector`.
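
The domain-keyed auto-fill described above could be modeled like this. The field names come from the README; the profile values shown for cbr.ru, the table itself, and the subdomain-fallback lookup are illustrative assumptions.

```typescript
interface SourceProfile {
  sourceType: string;
  sourceName: string;
  documentType: string;
  sourcePriority: number;
  isOfficialSource: boolean;
  trustScore: number;
}

// Hypothetical profile table; values here are illustrative, not the plugin's data.
const SOURCE_PROFILES: Record<string, SourceProfile> = {
  "cbr.ru": {
    sourceType: "central_bank",
    sourceName: "Bank of Russia",
    documentType: "press_release",
    sourcePriority: 1,
    isOfficialSource: true,
    trustScore: 1.0,
  },
};

// Match a URL to a profile, falling back through subdomains
// (e.g. iss.moex.com would fall back to a moex.com entry).
function profileFor(url: string): SourceProfile | undefined {
  const parts = new URL(url).hostname.split(".");
  for (let i = 0; i < parts.length - 1; i++) {
    const hit = SOURCE_PROFILES[parts.slice(i).join(".")];
    if (hit) return hit;
  }
  return undefined;
}
```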
```shell
npm run embeddings:backfill -- --limit 100
```

Setup:

```shell
npm install
npm run check
```

Design notes:

- SQLite is the buffered storage layer
- agents should not own low-level chunking/indexing logic
- the plugin is intended to become a domain processing layer, not just a thin DB driver
- embeddings are stored now so sqlite-vec / hybrid retrieval can be layered in next without reworking ingestion
This repository is an actively usable storage scaffold. The core storage layer, chunking, FTS, and embedding generation are implemented and tested; runtime tool wiring is the next step.