171 changes: 124 additions & 47 deletions README.md
@@ -1,97 +1,174 @@
# benchtop


Benchtop is a framework for storing large JSON documents as blobs directly on disk, with indexing provided by the key-value database PebbleDb.

## Command line

Build:

```
make
```

### Load data

```
benchtop load test.data embeddings test.ndjson
```
- `test.data` : name of archive
- `embeddings` : name of table
- `test.ndjson` : file to be loaded


### List tables

```
benchtop tables test.data
```

### Get keys

```
benchtop keys test.data embeddings
```

### Get records

```
benchtop get test.data embeddings <key1> <key2> ...
```


## Format

Data is stored in large binary files and indexed using [Pebble Key Value storage](https://github.com/cockroachdb/pebble).

### Key/Value format
Written using [Pebble](https://github.com/cockroachdb/)


#### Table Entries

**Key**
|bytes|0|5:... |
|-|-|---------|
|type|t|<[]byte> |
|Desc|prefix|user ID|
Benchtop KV Store Key Structure

This document outlines the binary key structure used by the benchtop package for storing and indexing data in a key-value (KV) store like PebbleDB. The structure is designed for efficient lookups, scans, and indexing of tabular or graph-like data by leveraging key prefixes and a consistent binary layout.
Core Concepts

1. Key Prefixes

All keys begin with a single-byte prefix to denote the type of data they represent. This allows different types of data to coexist in the same keyspace and enables efficient prefix scans (e.g., "find all position keys").

T (TablePrefix): Keys related to table metadata.

P (PosPrefix): Keys that map a row ID to its physical location.

F (FieldPrefix): Keys that form a secondary index on specific field values.

R (RFieldPrefix): Keys that form a reverse index for efficient index deletion.
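A prefix scan over one of these key types needs a lower and an upper bound. The sketch below shows one common way to derive the upper bound from a prefix. The constant names follow the list above, but `prefixUpperBound` is a hypothetical helper written for illustration, not a function confirmed to exist in benchtop.

```go
package main

import "fmt"

// Single-byte prefixes as described above.
const (
	TablePrefix  = byte('T')
	PosPrefix    = byte('P')
	FieldPrefix  = byte('F')
	RFieldPrefix = byte('R')
)

// prefixUpperBound (hypothetical helper) returns the smallest key that sorts
// after every key starting with prefix. It returns nil when no such key
// exists, which happens when the prefix is empty or consists only of 0xFF.
func prefixUpperBound(prefix []byte) []byte {
	end := make([]byte, len(prefix))
	copy(end, prefix)
	for i := len(end) - 1; i >= 0; i-- {
		end[i]++ // a byte wraps to 0 on overflow, so keep carrying left
		if end[i] != 0 {
			return end[:i+1]
		}
	}
	return nil
}

func main() {
	lower := []byte{PosPrefix}
	upper := prefixUpperBound(lower)
	// Every position key falls in the half-open range [lower, upper).
	fmt.Printf("scan range: [%q, %q)\n", lower, upper)
}
```

The resulting pair can be passed as the lower and upper bounds of an iterator in a store that supports bounded range scans.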

2. Field Separator

A special byte separator, FieldSep (ASCII 0x1F - Unit Separator), is used as a delimiter within compound keys (like the field indexes). This character is chosen because it is a non-printable control character that is not expected to appear in standard string data, ensuring reliable splitting of key components.
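As a minimal sketch of how such a separator behaves, the example below joins and splits key components on the 0x1F byte. `joinParts` and `splitParts` are hypothetical names used only for illustration; the actual helpers in benchtop may differ.

```go
package main

import (
	"bytes"
	"fmt"
)

// FieldSep is the ASCII Unit Separator (0x1F) described above.
const FieldSep = byte(0x1F)

// joinParts (hypothetical) concatenates components with FieldSep between them.
func joinParts(parts ...[]byte) []byte {
	return bytes.Join(parts, []byte{FieldSep})
}

// splitParts (hypothetical) recovers the components of a compound key.
// It is only reliable when no component contains a raw 0x1F byte.
func splitParts(key []byte) [][]byte {
	return bytes.Split(key, []byte{FieldSep})
}

func main() {
	key := joinParts([]byte("F"), []byte("city"), []byte("user"))
	for _, part := range splitParts(key) {
		fmt.Printf("%q\n", part)
	}
}
```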
Key Types

1. Table Keys

Purpose: To store metadata or identifiers for data tables.

Structure: T | TableId

T: The literal character 'T' (TablePrefix).

TableId: The unique byte slice identifier for the table.

Functions:

NewTableKey(id []byte): Creates a new table key.

ParseTableKey(key []byte): Extracts the TableId from a table key.

2. Position (Row Location) Keys

Purpose: These keys are the primary index, mapping a unique row/vertex ID to its physical location (offset and size) in a data file.

Structure: P | TableId | RowId

P: The literal character 'P' (PosPrefix).

TableId: A 2-byte uint16 (little-endian) identifying the table the row belongs to.

RowId: The unique byte slice identifier for the row/vertex.

Associated Value: The value stored for this key is an encoded RowLoc struct (see below).

Functions:

NewPosKey(table uint16, name []byte): Creates a new position key.

ParsePosKey(key []byte): Extracts the TableId and RowId from a key.

NewPosKeyPrefix(table uint16): Creates a key prefix for scanning all rows within a specific table.
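The layout above can be sketched as follows. This is an illustrative re-implementation under the stated layout (`P`, then a 2-byte little-endian table ID, then the row ID), not the package's actual code. The lowercase function names are hypothetical stand-ins for `NewPosKey`, `ParsePosKey`, and `NewPosKeyPrefix`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const PosPrefix = byte('P')

// newPosKey builds P | TableId (uint16, little-endian) | RowId.
func newPosKey(table uint16, rowID []byte) []byte {
	key := make([]byte, 3+len(rowID))
	key[0] = PosPrefix
	binary.LittleEndian.PutUint16(key[1:3], table)
	copy(key[3:], rowID)
	return key
}

// parsePosKey extracts the table ID and row ID, rejecting malformed keys.
func parsePosKey(key []byte) (uint16, []byte, error) {
	if len(key) < 3 || key[0] != PosPrefix {
		return 0, nil, fmt.Errorf("not a position key: %q", key)
	}
	return binary.LittleEndian.Uint16(key[1:3]), key[3:], nil
}

// newPosKeyPrefix returns the 3-byte prefix shared by all rows of a table.
func newPosKeyPrefix(table uint16) []byte {
	return newPosKey(table, nil)
}

func main() {
	key := newPosKey(7, []byte("row-1"))
	table, row, err := parsePosKey(key)
	fmt.Println(table, string(row), err)
	fmt.Printf("table prefix: %v\n", newPosKeyPrefix(7))
}
```

Because the table ID has a fixed width, the row ID needs no separator and may contain arbitrary bytes.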

3. Field Index Keys

Purpose: To create a secondary index on specific field values. This allows for fast lookups of all rows that have a certain value for a given field (e.g., find all users where city == 'New York').

Structure: F<sep>Field<sep>Label<sep>Value<sep>RowId

F: The literal character 'F' (FieldPrefix).

<sep>: The FieldSep byte.

Field: The name of the indexed field (e.g., "city").

Label: The label or type of the row (e.g., "user").

Value: The JSON-encoded value of the field (e.g., "New York").

RowId: The unique ID of the row that contains this field value.

Functions:

FieldKey(field, label string, value any, rowID []byte): Creates a full field index key.

FieldKeyParse(key []byte): Parses a field key back into its components.

FieldLabelKey(field, label string): Creates a key prefix for scanning all indexed values for a specific field and label.
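A minimal sketch of the field index key follows, assuming the components are joined exactly as in the structure line above. `fieldKey` and `fieldLabelPrefix` are hypothetical stand-ins for `FieldKey` and `FieldLabelKey`. Whether the real prefix ends with a trailing separator is an assumption made here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

const (
	FieldPrefix = byte('F')
	FieldSep    = byte(0x1F)
)

// fieldKey builds F<sep>Field<sep>Label<sep>Value<sep>RowId, with the value
// JSON-encoded. json.Marshal escapes control characters inside strings, so
// the encoded value never contains a raw 0x1F byte.
func fieldKey(field, label string, value any, rowID []byte) ([]byte, error) {
	encoded, err := json.Marshal(value)
	if err != nil {
		return nil, err
	}
	parts := [][]byte{{FieldPrefix}, []byte(field), []byte(label), encoded, rowID}
	return bytes.Join(parts, []byte{FieldSep}), nil
}

// fieldLabelPrefix builds F<sep>Field<sep>Label<sep>. The trailing separator
// (an assumption) keeps label "user" from also matching label "user2".
func fieldLabelPrefix(field, label string) []byte {
	parts := [][]byte{{FieldPrefix}, []byte(field), []byte(label), nil}
	return bytes.Join(parts, []byte{FieldSep})
}

func main() {
	key, err := fieldKey("city", "user", "New York", []byte("u1"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", key)
	fmt.Println(bytes.HasPrefix(key, fieldLabelPrefix("city", "user")))
}
```

Keys built this way sort by field, then label, then encoded value, so all rows sharing a value are adjacent in the keyspace.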

4. Reverse Field Index Keys

Purpose: To enable the efficient deletion of a row's entries from the field indexes. When a row is deleted, this reverse index is used to quickly find all the Field Index Keys that point to it, without having to scan the entire index.

Structure: R<sep>Label<sep>Field<sep>RowId

R: The literal character 'R' (RFieldPrefix).

<sep>: The FieldSep byte.

Label: The label of the row.

Field: The name of the indexed field.

The user ID is provided by the user, but should be checked to ensure it is unique.
RowId: The unique ID of the row.

**Value**
|bytes|0:4|4:...|
|-|-|-------|
|type|[]byte|
|Desc|Json formatted Column definitions|
Functions:

First is the Table system ID, which is used as a prefix during key lookup. Then rest
of the bytes describe a list of columns and their data types.
RFieldKey(label, field, rowID string): Creates a new reverse field key.

#### Table ID
**Key**
|bytes|0|5:... |
|-|-|---------|
|type|T|uint32|
|Desc|prefix|system table ID|
Value Structures
RowLoc

The generated ID for a table.
Purpose: Represents the physical location of a data record, acting as a "pointer" to the full data object stored elsewhere. It is the value component for a Position Key.

**Value**
|bytes|0:4|4:...|
|-|-|-------|
|type|[]byte|
|Desc|User ID of table|
Structure: A fixed 10-byte binary layout.

Section (Bytes 0-1): A uint16 identifying the file or section where the data is stored.

#### ID Entries
These map the user specified ID to a data block specified with offset and size.
Offset (Bytes 2-5): A uint32 representing the starting byte offset within the section.

**Key**
|bytes|0|1:5|1:... |
|-|-|-|--------|
|type|k|uint32|<[]byte> |
|Desc|prefix|system table ID|user row ID|
Size (Bytes 6-9): A uint32 representing the length of the data in bytes.

**Value**
|bytes|0:8|8:16|
|-|-|---------|
|type|uint64|uint64|
|Desc|offset|size|
Functions:

EncodeRowLoc(loc *RowLoc): Encodes a RowLoc struct into a 10-byte slice.

### Data file format
Sequentially written [JSON](https://www.json.org/json-en.html/) entries.
DecodeRowLoc(v []byte): Decodes a 10-byte slice back into a RowLoc struct.
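The 10-byte layout can be sketched as below. The text does not state the byte order of the RowLoc fields; little-endian is assumed here to match the TableId encoding in position keys. The type and function names are lowercase stand-ins, not the package's actual `RowLoc`, `EncodeRowLoc`, and `DecodeRowLoc`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// rowLoc mirrors the fields described above.
type rowLoc struct {
	Section uint16 // bytes 0-1
	Offset  uint32 // bytes 2-5
	Size    uint32 // bytes 6-9
}

// encodeRowLoc writes the fixed 10-byte layout (little-endian is an assumption).
func encodeRowLoc(loc rowLoc) []byte {
	buf := make([]byte, 10)
	binary.LittleEndian.PutUint16(buf[0:2], loc.Section)
	binary.LittleEndian.PutUint32(buf[2:6], loc.Offset)
	binary.LittleEndian.PutUint32(buf[6:10], loc.Size)
	return buf
}

// decodeRowLoc reads the layout back, rejecting slices of the wrong length.
func decodeRowLoc(v []byte) (rowLoc, error) {
	if len(v) != 10 {
		return rowLoc{}, fmt.Errorf("row location must be 10 bytes, got %d", len(v))
	}
	return rowLoc{
		Section: binary.LittleEndian.Uint16(v[0:2]),
		Offset:  binary.LittleEndian.Uint32(v[2:6]),
		Size:    binary.LittleEndian.Uint32(v[6:10]),
	}, nil
}

func main() {
	encoded := encodeRowLoc(rowLoc{Section: 1, Offset: 4096, Size: 512})
	decoded, err := decodeRowLoc(encoded)
	fmt.Println(len(encoded), decoded, err)
}
```

With 32-bit offset and size fields, a single section can address at most 4 GiB of data.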
6 changes: 2 additions & 4 deletions cmdline/benchtop/cmds/get/main.go
@@ -34,18 +34,16 @@ var Cmd = &cobra.Command{

TS, _ := driver.(*jsontable.JSONDriver)
for _, key := range keys {
val, closer, err := TS.Pb.Db.Get([]byte(key))
val, closer, err := TS.Pkv.Get([]byte(key))
if err != nil {
if err != pebble.ErrNotFound {
log.Errorf("Err on dr.Pb.Get for key %s in CacheLoader: %v", key, err)
}
log.Errorln("ERR: ", err)
}
fmt.Println("VAL: ", val)
offset, size := benchtop.ParsePosValue(val)
closer.Close()

data, err := table.GetRow(benchtop.RowLoc{Offset: offset, Size: size})
data, err := table.GetRow(benchtop.DecodeRowLoc(val))
if err == nil {
out, err := json.Marshal(data)
if err != nil {
6 changes: 5 additions & 1 deletion cmdline/benchtop/cmds/keys/main.go
@@ -4,6 +4,8 @@ import (
"fmt"

"github.com/bmeg/benchtop/jsontable"
jTable "github.com/bmeg/benchtop/jsontable/table"

"github.com/spf13/cobra"
)

@@ -27,7 +29,9 @@ var Cmd = &cobra.Command{
return err
}

keys, err := table.Keys()
jT, _ := table.(*jTable.JSONTable)

keys, err := driver.ListTableKeys(jT.TableId)
if err != nil {
return err
}
71 changes: 0 additions & 71 deletions cmdline/benchtop/cmds/load/main.go

This file was deleted.

2 changes: 0 additions & 2 deletions cmdline/benchtop/cmds/root.go
@@ -5,7 +5,6 @@ import (

"github.com/bmeg/benchtop/cmdline/benchtop/cmds/get"
"github.com/bmeg/benchtop/cmdline/benchtop/cmds/keys"
"github.com/bmeg/benchtop/cmdline/benchtop/cmds/load"
"github.com/bmeg/benchtop/cmdline/benchtop/cmds/tables"

"github.com/spf13/cobra"
@@ -20,7 +19,6 @@ var RootCmd = &cobra.Command{

func init() {
RootCmd.AddCommand(keys.Cmd)
RootCmd.AddCommand(load.Cmd)
RootCmd.AddCommand(tables.Cmd)
RootCmd.AddCommand(get.Cmd)

12 changes: 0 additions & 12 deletions distqueue/distances.go

This file was deleted.
