
Sqawk In-Memory Database Documentation

Overview

Sqawk implements a lightweight in-memory database engine that supports SQL operations on delimiter-separated data (CSV, TSV, etc.). The database is designed for performance and simplicity, focusing on providing essential SQL functionality for data analysis tasks directly on delimiter-separated files without requiring an external database engine.

Table of Contents

  1. Data Model
  2. Architecture
  3. File I/O Patterns
  4. Storage Backend Architecture
  5. Performance Considerations

Data Model

Tables

Tables in Sqawk are in-memory representations of delimiter-separated files. Each table has:

  • A unique name (derived from the file's filename or explicitly provided)
  • A set of columns with names (derived from the file's header row)
  • Zero or more rows of data
  • Optional metadata including the source file path and delimiter information

Tables maintain an internal mapping of column names to their indices for efficient access.
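A table with such a column-index map might be modeled as follows. This is an illustrative sketch using only the standard library; the field and method names are assumptions, not Sqawk's actual definitions.

```rust
use std::collections::HashMap;

// Hypothetical in-memory table: column names map to indices for O(1) lookup.
struct Table {
    name: String,
    columns: Vec<String>,
    column_index: HashMap<String, usize>, // column name → position
    rows: Vec<Vec<String>>,               // one element per column
    source_path: Option<String>,          // original file, if loaded from disk
    delimiter: u8,                        // preserved for writeback
}

impl Table {
    fn new(name: &str, columns: Vec<String>, delimiter: u8) -> Self {
        // Build the name → index map once, at load time.
        let column_index = columns
            .iter()
            .enumerate()
            .map(|(i, c)| (c.clone(), i))
            .collect();
        Table {
            name: name.to_string(),
            columns,
            column_index,
            rows: Vec::new(),
            source_path: None,
            delimiter,
        }
    }

    // Look up a cell by row number and column name.
    fn get(&self, row: usize, column: &str) -> Option<&String> {
        let idx = *self.column_index.get(column)?;
        self.rows.get(row)?.get(idx)
    }
}
```

Keeping the map alongside the column vector trades a little memory for constant-time name resolution during query evaluation.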

Rows and Columns

  • Rows: Represented as vectors of values with one element per column
  • Columns: Identified by name, with automatic type inference based on content
  • Schema: Dynamically determined from the file's header row
  • Column Types: Not explicitly declared; inferred at runtime from cell contents
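Runtime type inference of this kind can be sketched as a fallback chain over parse attempts. The `Value` enum and the integer → float → text precedence below are assumptions for illustration, not Sqawk's documented rules.

```rust
// Hypothetical inferred value type for a cell.
#[derive(Debug, PartialEq)]
enum Value {
    Integer(i64),
    Float(f64),
    Text(String),
}

// Infer the most specific type a raw cell string can represent.
fn infer(raw: &str) -> Value {
    if let Ok(i) = raw.parse::<i64>() {
        Value::Integer(i)
    } else if let Ok(f) = raw.parse::<f64>() {
        Value::Float(f)
    } else {
        Value::Text(raw.to_string())
    }
}
```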

Table Lifecycle

Tables are loaded from delimiter-separated files at startup and can be modified through SQL operations. Sqawk uses a safe, sed-like writeback model:

  1. Default Behavior: All modifications remain in memory only
  2. Tracking Changes: The system tracks which tables have been modified by any operation
  3. Write on Exit: Modified tables are only written back to their source files if:
    • The --write (or -w) flag is explicitly provided
    • The table was actually modified by an SQL operation (INSERT, UPDATE, DELETE)
  4. Safe Execution: Without the --write flag, source files remain untouched regardless of operations performed
  5. Write Only Modified: Only tables that were changed are written; unmodified tables are not rewritten
  6. Format Preservation: Original file formats and delimiters are preserved during writeback

This design ensures that users can experiment with data manipulations while maintaining the integrity of source files. Verbose mode (-v) additionally reports whether changes were saved.

Architecture

The in-memory database system consists of these primary components:

Table Module

The Table struct represents an in-memory table with:

  • Column metadata
  • Row data
  • Methods for accessing and manipulating rows
  • Projection capabilities (selecting subsets of columns)

SQL Executor

The SqlExecutor implements SQL parsing and execution:

  • Uses sqlparser crate to parse SQL statements
  • Converts parsed AST to operations on in-memory tables
  • Handles WHERE clause evaluation
  • Maintains a set of modified table names (modified_tables) to track changes
  • Provides save_modified_tables() method that only writes back tables that were actually modified
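The modified-tables tracking described above can be sketched with a `HashSet` of table names that mutating statements add to and that writeback consults. The struct and field names below are illustrative stand-ins, not Sqawk's actual `SqlExecutor`.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical executor skeleton: only tables touched by a mutating
// statement enter `modified_tables`, so SELECTs never trigger writeback.
struct Executor {
    tables: HashMap<String, Vec<Vec<String>>>, // table name → rows
    modified_tables: HashSet<String>,
}

impl Executor {
    fn insert_row(&mut self, table: &str, row: Vec<String>) {
        if let Some(rows) = self.tables.get_mut(table) {
            rows.push(row);
            // Record the mutation for later writeback.
            self.modified_tables.insert(table.to_string());
        }
    }

    // Return the tables that would be written back: none without the
    // --write flag, and only the modified ones with it.
    fn save_modified_tables(&self, write_flag: bool) -> Vec<&str> {
        if !write_flag {
            return Vec::new();
        }
        self.modified_tables.iter().map(|s| s.as_str()).collect()
    }
}
```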

File Handlers

The file handling system consists of multiple components:

FileHandler Trait

  • Defines a common interface for different file format handlers
  • Provides abstraction for loading and saving tables from various file formats
  • Allows for consistent treatment of different delimiter-separated formats
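A trait of this shape, with a custom-delimiter implementation, might look like the following. The method names, signatures, and string-based I/O are assumptions for the sketch (the real handlers work with files and the csv crate); this version also ignores quoting/escaping for brevity.

```rust
use std::io;

// Minimal table shape for the sketch.
struct Table {
    columns: Vec<String>,
    rows: Vec<Vec<String>>,
}

// Common interface every format handler implements.
trait FileHandler {
    // Parse raw file contents into an in-memory table.
    fn load(&self, contents: &str) -> io::Result<Table>;
    // Serialize a table back to its on-disk representation.
    fn save(&self, table: &Table) -> io::Result<String>;
}

// Handler for custom single-character delimiters (tab, colon, pipe, ...).
struct DelimHandler {
    delimiter: char,
}

impl FileHandler for DelimHandler {
    fn load(&self, contents: &str) -> io::Result<Table> {
        let mut lines = contents.lines();
        let header = lines
            .next()
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "empty file"))?;
        let columns = header.split(self.delimiter).map(str::to_string).collect();
        let rows = lines
            .map(|l| l.split(self.delimiter).map(str::to_string).collect())
            .collect();
        Ok(Table { columns, rows })
    }

    fn save(&self, table: &Table) -> io::Result<String> {
        let d = self.delimiter.to_string();
        let mut out = table.columns.join(&d);
        for row in &table.rows {
            out.push('\n');
            out.push_str(&row.join(&d));
        }
        Ok(out)
    }
}
```

Because both load and save go through the same trait, every delimiter-separated format gets the same writeback and format-preservation behavior for free.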

CsvHandler Implementation

  • Specialized handler for standard CSV files
  • Uses the csv crate for parsing and writing
  • Handles comma-separated values with standard CSV escaping rules

DelimHandler Implementation

  • Handles files with custom delimiters (e.g., TSV, colon-separated, etc.)
  • Configured via the -F command-line option
  • Supports tab, colon, pipe, and other custom separators
  • Preserves the original delimiter and format during writeback

File I/O Patterns

Atomic File Writes

When saving modified tables back to disk (via the --write flag), Sqawk uses an atomic write pattern to ensure data integrity:

  1. Write to Temporary File: Data is first written to a temporary file in the same directory as the target file
  2. Sync to Disk: The temporary file is explicitly synced (fsync) to ensure data is durably stored on the physical device
  3. Atomic Rename: The temporary file is renamed to the target filename using an atomic rename operation

This pattern provides several safety guarantees:

  • Crash Safety: If the process crashes during writing, the original file remains intact
  • Disk Full Protection: If the disk runs out of space, the original file is preserved
  • All-or-Nothing Semantics: Either the complete new file exists, or the original file is unchanged

The temporary file is created in the same directory as the target file to ensure the rename operation is atomic (same filesystem requirement on POSIX systems).
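The three-step pattern can be expressed with only the standard library. This is a simplified sketch: a real implementation would use a unique temporary filename (the fixed `.sqawk_tmp` name here is an assumption) and may also fsync the containing directory.

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Atomic write: temp file in the same directory → fsync → rename.
fn atomic_write(target: &Path, data: &[u8]) -> std::io::Result<()> {
    // 1. Write to a temporary file in the same directory as the target,
    //    so the final rename stays within one filesystem (the POSIX
    //    requirement for rename atomicity).
    let dir = target.parent().unwrap_or_else(|| Path::new("."));
    let tmp = dir.join(".sqawk_tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(data)?;
        // 2. fsync so the bytes are durable before the rename.
        f.sync_all()?;
    }
    // 3. Atomically replace the target. If anything above failed, the
    //    original file was never touched.
    fs::rename(&tmp, target)
}
```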

Delimiter Consistency

Sqawk maintains delimiter consistency between input and output:

  • When a file is loaded with a specific delimiter (via -F flag or file extension detection), that delimiter is preserved in the table metadata
  • Query results are output using the same delimiter as the source table
  • When files are written back, the original delimiter format is preserved

This ensures that TSV files remain TSV, colon-separated files remain colon-separated, and so on, following the principle of least surprise: output format matches input format unless explicitly overridden.

Output Modes

Sqawk supports two primary output modes:

  1. Stdout Output: Query results (SELECT) are written to standard output using the source table's delimiter
  2. File Writeback: Modified tables (via INSERT, UPDATE, DELETE) are written back to their source files when --write is specified

For stdout output, the data is streamed directly without buffering concerns. For file writeback, the atomic write pattern described above is used.

Storage Backend Architecture

Sqawk supports multiple storage backends for table row data, enabling efficient handling of different data sources:

Memory Storage (Default for stdin)

The MemoryStorage backend stores all row data in heap-allocated memory:

  • Used when reading from stdin or pipes
  • All string values are owned (heap-allocated)
  • Supports all operations including INSERT, UPDATE, DELETE
  • Data is mutable throughout the session

Mmap Storage (Default for on-disk files)

The MmapStorage backend uses memory-mapped files for zero-copy access:

  • Automatically used when loading files from disk
  • String values borrow directly from the mmap'd region (zero allocation)
  • Read-only by default to maintain zero-copy efficiency
  • Converts to MemoryStorage on first write operation

Automatic Backend Selection

The storage backend is selected automatically based on the data source:

  • On-disk files: Uses mmap for zero-copy loading
  • Stdin/pipes: Uses traditional in-memory storage
  • Write operations: Mmap tables convert to memory on first modification

Write Operation Behavior

When a table with mmap storage is modified (INSERT, UPDATE, DELETE):

  1. All rows are copied to heap memory (converting Cow::Borrowed to Cow::Owned)
  2. The storage backend switches from Mmap to Memory
  3. Subsequent operations use the in-memory copy
  4. The original file remains unchanged until --write is specified
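The borrowed-to-owned promotion in steps 1 and 2 maps naturally onto `std::borrow::Cow`. The enum and method names below are an illustrative model of the Mmap → Memory switch, not Sqawk's actual storage API.

```rust
use std::borrow::Cow;

// Two backends holding the same row shape: Mmap rows borrow string
// slices from the mapped file; Memory rows own their strings.
enum Storage<'a> {
    Mmap(Vec<Vec<Cow<'a, str>>>),
    Memory(Vec<Vec<Cow<'a, str>>>),
}

impl<'a> Storage<'a> {
    // Promote to owned, heap-backed storage before any mutation
    // (Cow::Borrowed → Cow::Owned for every cell).
    fn make_mutable(&mut self) {
        if let Storage::Mmap(rows) = self {
            let owned = rows
                .iter()
                .map(|r| r.iter().map(|c| Cow::Owned(c.to_string())).collect())
                .collect();
            *self = Storage::Memory(owned);
        }
    }

    // Any write operation first triggers the conversion.
    fn push_row(&mut self, row: Vec<Cow<'a, str>>) {
        self.make_mutable();
        match self {
            Storage::Memory(rows) | Storage::Mmap(rows) => rows.push(row),
        }
    }
}
```

Read-only queries never call `make_mutable`, so large files stay zero-copy; the heap cost is paid only when a table is actually modified.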

This design provides:

  • Zero-allocation loading for read-only queries
  • Transparent handling of write operations
  • Memory efficiency for large read-only datasets

Performance Considerations

Sqawk's in-memory database is optimized for:

  • Fast Loading: Delimiter-separated files are parsed directly into memory
  • Zero-Copy Access: Memory-mapped files enable direct access without allocation overhead
  • Format Flexibility: Support for CSV, TSV, and custom-delimited files
  • Efficient Filtering: WHERE clauses are applied in a single pass
  • Low Memory Overhead: Simple data structures minimize memory usage
  • Zero Configuration: No setup required, works directly with files in various formats

For larger datasets, keep the following in mind:

  • The entire dataset must fit in memory (or be addressable via mmap)
  • Complex queries may require multiple passes over the data
  • Write operations on mmap tables trigger a copy to memory
  • Custom delimiters may have slightly different performance characteristics than standard CSV
  • Read-only operations on large files benefit most from mmap storage

Join Implementation

The database engine supports combining data from multiple tables through SQL-standard join operations.

Supported Join Types

  • INNER JOIN: Returns rows when there is a match in both tables
  • LEFT OUTER JOIN: Returns all rows from the left table, with matched rows from the right (or NULLs)
  • RIGHT OUTER JOIN: Returns all rows from the right table, with matched rows from the left (or NULLs)
  • FULL OUTER JOIN: Returns all rows from both tables, with NULLs where no match exists
  • CROSS JOIN: Returns the Cartesian product of both tables

Join Syntax

  • ON constraints: Supported for specifying join conditions (JOIN t2 ON t1.id = t2.id)
  • USING constraints: Not supported
  • Multi-table joins: Supports chaining 3+ tables in a single query

Technical Implementation

  • Joins are compiled to VM bytecode for efficient execution
  • Nested loop algorithm with appropriate NULL handling for outer joins
  • Column references use table qualifiers for disambiguation in result sets
  • Type coercion rules are applied consistently in join conditions
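The nested-loop algorithm with NULL padding for outer joins can be sketched directly. This is a minimal LEFT OUTER JOIN over string-valued rows for illustration, not the VM-bytecode implementation; `Option<String>` stands in for nullable values.

```rust
// Nested-loop LEFT OUTER JOIN: every left row appears at least once,
// padded with NULLs (None) when no right row matches on the key column.
fn left_outer_join(
    left: &[Vec<String>],
    right: &[Vec<String>],
    left_key: usize,
    right_key: usize,
    right_width: usize, // number of columns in the right table
) -> Vec<Vec<Option<String>>> {
    let mut out = Vec::new();
    for l in left {
        let mut matched = false;
        for r in right {
            if l[left_key] == r[right_key] {
                matched = true;
                let mut row: Vec<Option<String>> =
                    l.iter().cloned().map(Some).collect();
                row.extend(r.iter().cloned().map(Some));
                out.push(row);
            }
        }
        if !matched {
            // No match on the right: emit the left row with NULL padding.
            let mut row: Vec<Option<String>> =
                l.iter().cloned().map(Some).collect();
            row.extend(std::iter::repeat(None).take(right_width));
            out.push(row);
        }
    }
    out
}
```

INNER JOIN is the same loop without the `!matched` branch; RIGHT and FULL OUTER variants additionally track which right rows went unmatched.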

Current System Limitations

The database engine has several architectural limitations:

  • No Index Structure: All operations perform full table scans
  • No Transaction Support: Operations are applied immediately with no rollback capability
  • No Schema Enforcement: Column types are inferred rather than declared or enforced
  • No Constraints System: Referential integrity not enforced

This documentation describes the current state of the Sqawk in-memory database as of the project's current version. Future versions may add additional capabilities.