Sqawk implements a lightweight in-memory database engine that supports SQL operations on delimiter-separated data (CSV, TSV, etc.). The database is designed for performance and simplicity, focusing on providing essential SQL functionality for data analysis tasks directly on delimiter-separated files without requiring an external database engine.
Tables in Sqawk are in-memory representations of delimiter-separated files. Each table has:
- A unique name (derived from the file's filename or explicitly provided)
- A set of columns with names (derived from the file's header row)
- Zero or more rows of data
- Optional metadata including the source file path and delimiter information
Tables maintain an internal mapping of column names to their indices for efficient access.
- Rows: Represented as vectors of values with one element per column
- Columns: Identified by name, with automatic type inference based on content
- Schema: Dynamically determined from the file's header row
- Column Types: Column types are not explicitly declared but inferred at runtime
Tables are loaded from delimiter-separated files at startup and can be modified through SQL operations. Sqawk uses a safe, sed-like writeback model:
- Default Behavior: All modifications remain in memory only
- Tracking Changes: The system tracks which tables have been modified by any operation
- Write on Exit: Modified tables are only written back to their source files if:
- The
--write(or-w) flag is explicitly provided - The table was actually modified by an SQL operation (INSERT, UPDATE, DELETE)
- The
- Safe Execution: Without the
--writeflag, source files remain untouched regardless of operations performed - Write Only Modified: Only tables that were changed are written; unmodified tables are not rewritten
- Format Preservation: Original file formats and delimiters are preserved during writeback
This design ensures that users can experiment with data manipulations while maintaining the integrity of source files. The verbose mode (-v) provides additional confirmation about whether changes were saved or not.
The in-memory database system consists of these primary components:
The Table struct represents an in-memory table with:
- Column metadata
- Row data
- Methods for accessing and manipulating rows
- Projection capabilities (selecting subsets of columns)
The SqlExecutor implements SQL parsing and execution:
- Uses
sqlparsercrate to parse SQL statements - Converts parsed AST to operations on in-memory tables
- Handles WHERE clause evaluation
- Maintains a set of modified table names (
modified_tables) to track changes - Provides
save_modified_tables()method that only writes back tables that were actually modified
The file handling system consists of multiple components:
- Defines a common interface for different file format handlers
- Provides abstraction for loading and saving tables from various file formats
- Allows for consistent treatment of different delimiter-separated formats
- Specialized handler for standard CSV files
- Uses the csv crate for parsing and writing
- Handles comma-separated values with standard CSV escaping rules
- Handles files with custom delimiters (e.g., TSV, colon-separated, etc.)
- Configured via the
-Fcommand-line option - Supports tab, colon, pipe, and other custom separators
- Preserves the original delimiter and format during writeback
When saving modified tables back to disk (via the --write flag), Sqawk uses an atomic write pattern to ensure data integrity:
- Write to Temporary File: Data is first written to a temporary file in the same directory as the target file
- Sync to Disk: The temporary file is explicitly synced (
fsync) to ensure data is durably stored on the physical device - Atomic Rename: The temporary file is renamed to the target filename using an atomic rename operation
This pattern provides several safety guarantees:
- Crash Safety: If the process crashes during writing, the original file remains intact
- Disk Full Protection: If the disk runs out of space, the original file is preserved
- All-or-Nothing Semantics: Either the complete new file exists, or the original file is unchanged
The temporary file is created in the same directory as the target file to ensure the rename operation is atomic (same filesystem requirement on POSIX systems).
Sqawk maintains delimiter consistency between input and output:
- When a file is loaded with a specific delimiter (via
-Fflag or file extension detection), that delimiter is preserved in the table metadata - Query results are output using the same delimiter as the source table
- When files are written back, the original delimiter format is preserved
This ensures that TSV files remain TSV, colon-separated files remain colon-separated, and so on. The principle of least surprise: output format matches input format unless explicitly overridden.
Sqawk supports two primary output modes:
- Stdout Output: Query results (SELECT) are written to standard output using the source table's delimiter
- File Writeback: Modified tables (via INSERT, UPDATE, DELETE) are written back to their source files when
--writeis specified
For stdout output, the data is streamed directly without buffering concerns. For file writeback, the atomic write pattern described above is used.
Sqawk supports multiple storage backends for table row data, enabling efficient handling of different data sources:
The MemoryStorage backend stores all row data in heap-allocated memory:
- Used when reading from stdin or pipes
- All string values are owned (heap-allocated)
- Supports all operations including INSERT, UPDATE, DELETE
- Data is mutable throughout the session
The MmapStorage backend uses memory-mapped files for zero-copy access:
- Automatically used when loading files from disk
- String values borrow directly from the mmap'd region (zero allocation)
- Read-only by default to maintain zero-copy efficiency
- Converts to MemoryStorage on first write operation
The storage backend is selected automatically based on the data source:
- On-disk files: Uses mmap for zero-copy loading
- Stdin/pipes: Uses traditional in-memory storage
- Write operations: Mmap tables convert to memory on first modification
When a table with mmap storage is modified (INSERT, UPDATE, DELETE):
- All rows are copied to heap memory (converting Cow::Borrowed to Cow::Owned)
- The storage backend switches from Mmap to Memory
- Subsequent operations use the in-memory copy
- The original file remains unchanged until
--writeis specified
This design provides:
- Zero-allocation loading for read-only queries
- Transparent handling of write operations
- Memory efficiency for large read-only datasets
Sqawk's in-memory database is optimized for:
- Fast Loading: Delimiter-separated files are parsed directly into memory
- Zero-Copy Access: Memory-mapped files enable direct access without allocation overhead
- Format Flexibility: Support for CSV, TSV, and custom-delimited files
- Efficient Filtering: WHERE clauses are applied in a single pass
- Low Memory Overhead: Simple data structures minimize memory usage
- Zero Configuration: No setup required, works directly with files in various formats
For larger datasets, consider:
- The entire dataset must fit in memory (or be addressable via mmap)
- Complex queries may require multiple passes over the data
- Write operations on mmap tables trigger a copy to memory
- Custom delimiters may have slightly different performance characteristics than standard CSV
- Read-only operations on large files benefit most from mmap storage
The database engine supports combining data from multiple tables through SQL-standard join operations.
- INNER JOIN: Returns rows when there is a match in both tables
- LEFT OUTER JOIN: Returns all rows from the left table, with matched rows from the right (or NULLs)
- RIGHT OUTER JOIN: Returns all rows from the right table, with matched rows from the left (or NULLs)
- FULL OUTER JOIN: Returns all rows from both tables, with NULLs where no match exists
- CROSS JOIN: Returns the Cartesian product of both tables
- ON constraints: Supported for specifying join conditions (
JOIN t2 ON t1.id = t2.id) - USING constraints: Not supported
- Multi-table joins: Supports chaining 3+ tables in a single query
- Joins are compiled to VM bytecode for efficient execution
- Nested loop algorithm with appropriate NULL handling for outer joins
- Column references use table qualifiers for disambiguation in result sets
- Type coercion rules are applied consistently in join conditions
The database engine has several architectural limitations:
- No Index Structure: All operations perform full table scans
- No Transaction Support: Operations are applied immediately with no rollback capability
- Schema Flexibility: Types are inferred rather than enforced
- No Constraints System: Referential integrity not enforced
This documentation describes the current state of the Sqawk in-memory database as of the project's current version. Future versions may add additional capabilities.