Commit 7aedf12

KyleAMathews and claude authored
Hash small Uint8Arrays (≤128 bytes) by content rather than reference (#779)
* fix: compare Uint8Arrays by content for proper binary ID equality

  Fixes `eq` function and hash indexing to compare Uint8Arrays/Buffers by content instead of reference, enabling proper ULID comparisons in WHERE clauses.

  Changes:
  - Hash small Uint8Arrays (≤128 bytes) by content in db-ivm for better indexing
  - Compare Uint8Arrays by content in eq operator via areValuesEqual() function
  - Add comprehensive tests for Uint8Array equality comparison

* test: add tests for Uint8Array equality with zero-filled arrays

  Add tests that specifically cover the user's reproduction case where Uint8Arrays are created with a length (e.g., new Uint8Array(5)), resulting in zero-filled arrays. Confirms that content comparison works correctly.

* test: add tests for primitive equality to verify no regression

  Add explicit tests for string and number equality to ensure that the areValuesEqual function doesn't break primitive comparisons. All tests pass, confirming the implementation correctly handles both Uint8Arrays and primitives.

* fix: compare Uint8Arrays by content for proper binary ID equality

  Fixes `eq` function and hash indexing to compare Uint8Arrays/Buffers by content instead of reference, enabling proper ULID comparisons in WHERE clauses.

  The issue was that the comparison treated Uint8Arrays by reference. It now compares Uint8Arrays byte-by-byte.

  Changes:
  - Hash small Uint8Arrays (≤128 bytes) by content in db-ivm for better indexing
  - Compare Uint8Arrays by content in eq operator via areValuesEqual() function
  - Made writeByte() public in MurmurHashStream
  - Add comprehensive tests for Uint8Array equality comparison
  - Add integration test reproducing the user's exact scenario

  All tests pass (84/84 evaluator tests, 1/1 integration test).

* fix: normalize Uint8Arrays for Map key usage in indexes

  The previous fix handled Uint8Array comparison at the expression evaluation level, but index lookups still failed because JavaScript Maps use reference equality for object keys.

  Updated normalizeValue() to convert Uint8Arrays/Buffers to string representations that can be used as Map keys with content-based equality. This enables proper index lookups for binary IDs like ULIDs when auto-indexing is enabled (the default behavior).

  Also updated the integration test to verify the fix works with auto-indexing enabled.

* fix: add 128-byte threshold to prevent large Uint8Array string duplication

  Applied the same 128-byte threshold to normalizeValue() as used in the hashing function. This prevents creating giant strings in memory when indexing large Uint8Arrays (> 128 bytes).

  Arrays larger than 128 bytes will fall back to reference equality, which is acceptable as the fix is primarily for ID use cases (ULIDs are 16 bytes, UUIDs are 16 bytes).

  Added test coverage to verify the threshold behavior works as expected.

---------

Co-authored-by: Claude <[email protected]>
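
To make the underlying problem concrete, here is a minimal sketch (not part of the commit; `idA`, `idB`, and `rowsById` are illustrative names) of why two Uint8Arrays holding identical bytes fail both `===` and `Map` lookups in plain JavaScript/TypeScript:

```typescript
// Two distinct Uint8Array instances holding the same 16 bytes (the size of a ULID).
const idA = new Uint8Array(16).fill(7)
const idB = new Uint8Array(16).fill(7)

// Strict equality compares object identity, not contents.
const sameReference: boolean = idA === idB

// Map keys are also matched by identity when the key is an object.
const rowsById = new Map<Uint8Array, string>()
rowsById.set(idA, `row-1`)
const foundByOtherInstance: boolean = rowsById.has(idB)
const foundBySameInstance: boolean = rowsById.has(idA)
```

Only the lookup with the original instance succeeds, which is the behavior the commits above work around for small binary IDs.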
1 parent 1515a23 commit 7aedf12

8 files changed: +438 -32 lines changed
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+---
+"@tanstack/db": patch
+"@tanstack/db-ivm": patch
+---
+
+Fix Uint8Array/Buffer comparison to work by content instead of reference. This enables proper equality checks for binary IDs like ULIDs in WHERE clauses using the `eq` function.

packages/db-ivm/src/hashing/hash.ts

Lines changed: 36 additions & 11 deletions
@@ -17,6 +17,12 @@ const OBJECT_MARKER = randomHash()
 const ARRAY_MARKER = randomHash()
 const MAP_MARKER = randomHash()
 const SET_MARKER = randomHash()
+const UINT8ARRAY_MARKER = randomHash()
+
+// Maximum byte length for Uint8Arrays to hash by content instead of reference
+// Arrays smaller than this will be hashed by content, allowing proper equality comparisons
+// for small arrays like ULIDs (16 bytes) while still avoiding performance costs for large arrays
+const UINT8ARRAY_CONTENT_HASH_THRESHOLD = 128

 const hashCache = new WeakMap<object, number>()

@@ -35,6 +41,24 @@ function hashObject(input: object): number {
   let valueHash: number | undefined
   if (input instanceof Date) {
     valueHash = hashDate(input)
+  } else if (
+    // Check if input is a Uint8Array or Buffer
+    (typeof Buffer !== `undefined` && input instanceof Buffer) ||
+    input instanceof Uint8Array
+  ) {
+    // For small Uint8Arrays/Buffers (e.g., ULIDs, UUIDs), hash by content
+    // to enable proper equality comparisons. For large arrays, hash by reference
+    // to avoid performance costs.
+    if (input.byteLength <= UINT8ARRAY_CONTENT_HASH_THRESHOLD) {
+      valueHash = hashUint8Array(input)
+    } else {
+      // Deeply hashing large arrays would be too costly
+      // so we track them by reference and cache them in a weak map
+      return cachedReferenceHash(input)
+    }
+  } else if (input instanceof File) {
+    // Files are always hashed by reference due to their potentially large size
+    return cachedReferenceHash(input)
   } else {
     let plainObjectInput = input
     let marker = OBJECT_MARKER
@@ -53,17 +77,6 @@ function hashObject(input: object): number {
       plainObjectInput = [...input.entries()]
     }

-    if (
-      (typeof Buffer !== `undefined` && input instanceof Buffer) ||
-      input instanceof Uint8Array ||
-      input instanceof File
-    ) {
-      // Deeply hashing these objects would be too costly
-      // but we also don't want to ignore them
-      // so we track them by reference and cache them in a weak map
-      return cachedReferenceHash(input)
-    }
-
     valueHash = hashPlainObject(plainObjectInput, marker)
   }

@@ -78,6 +91,18 @@ function hashDate(input: Date): number {
   return hasher.digest()
 }

+function hashUint8Array(input: Uint8Array): number {
+  const hasher = new MurmurHashStream()
+  hasher.update(UINT8ARRAY_MARKER)
+  // Hash the byte length first to differentiate arrays of different sizes
+  hasher.update(input.byteLength)
+  // Hash each byte in the array
+  for (let i = 0; i < input.byteLength; i++) {
+    hasher.writeByte(input[i]!)
+  }
+  return hasher.digest()
+}
+
 function hashPlainObject(input: object, marker: number): number {
   const hasher = new MurmurHashStream()
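
The branching in `hashObject` above can be illustrated with a self-contained sketch. This is not the library's code: FNV-1a stands in for `MurmurHashStream`, a simple counter stands in for `cachedReferenceHash`, and the names `contentHash`, `referenceHash`, and `hashBytes` are hypothetical. Only the 128-byte cutoff and the "length first, then each byte" order are taken from the diff.

```typescript
const CONTENT_HASH_THRESHOLD = 128 // mirrors the 128-byte cutoff in the diff

// Reference hashes: one arbitrary but stable number per object instance.
const referenceHashes = new WeakMap<object, number>()
let nextReferenceHash = 1

function referenceHash(input: object): number {
  let cached = referenceHashes.get(input)
  if (cached === undefined) {
    cached = nextReferenceHash++
    referenceHashes.set(input, cached)
  }
  return cached
}

// FNV-1a (32-bit) used as a stand-in hash; this is an assumption, not the real hasher.
function contentHash(bytes: Uint8Array): number {
  let h = 0x811c9dc5
  // Mix in the length first, then every byte, as hashUint8Array does.
  h = Math.imul(h ^ (bytes.byteLength & 0xff), 0x01000193)
  for (let i = 0; i < bytes.byteLength; i++) {
    h = Math.imul(h ^ bytes[i]!, 0x01000193)
  }
  return h >>> 0
}

// Small arrays hash by content; large arrays hash by identity.
function hashBytes(bytes: Uint8Array): number {
  return bytes.byteLength <= CONTENT_HASH_THRESHOLD
    ? contentHash(bytes)
    : referenceHash(bytes)
}

const smallHashA = hashBytes(Uint8Array.of(1, 2, 3))
const smallHashB = hashBytes(Uint8Array.of(1, 2, 3))
const largeArray = new Uint8Array(300)
const largeHashA = hashBytes(largeArray)
const largeHashCopy = hashBytes(new Uint8Array(300))
```

In this sketch, equal small arrays produce equal hashes regardless of instance, while two 300-byte arrays with the same contents get different hashes unless they are the same object.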

packages/db-ivm/src/hashing/murmur.ts

Lines changed: 15 additions & 15 deletions
@@ -51,7 +51,7 @@ export class MurmurHashStream implements Hasher {
     this.hash = Math.imul(this.hash, 5) + 0xe6546b64
   }

-  private _writeByte(byte: number): void {
+  writeByte(byte: number): void {
     this.carry |= (byte & 0xff) << (8 * this.carryBytes)
     this.carryBytes++
     this.length++
@@ -74,29 +74,29 @@ export class MurmurHashStream implements Hasher {

         for (let i = 0; i < description.length; i++) {
           const code = description.charCodeAt(i)
-          this._writeByte(code & 0xff)
-          this._writeByte((code >>> 8) & 0xff)
+          this.writeByte(code & 0xff)
+          this.writeByte((code >>> 8) & 0xff)
         }
         return
       }
       case `string`:
         this.update(STRING_MARKER)
         for (let i = 0; i < chunk.length; i++) {
           const code = chunk.charCodeAt(i)
-          this._writeByte(code & 0xff)
-          this._writeByte((code >>> 8) & 0xff)
+          this.writeByte(code & 0xff)
+          this.writeByte((code >>> 8) & 0xff)
         }
         return
       case `number`:
         dv.setFloat64(0, chunk, true) // fixed little-endian
-        this._writeByte(u8[0]!)
-        this._writeByte(u8[1]!)
-        this._writeByte(u8[2]!)
-        this._writeByte(u8[3]!)
-        this._writeByte(u8[4]!)
-        this._writeByte(u8[5]!)
-        this._writeByte(u8[6]!)
-        this._writeByte(u8[7]!)
+        this.writeByte(u8[0]!)
+        this.writeByte(u8[1]!)
+        this.writeByte(u8[2]!)
+        this.writeByte(u8[3]!)
+        this.writeByte(u8[4]!)
+        this.writeByte(u8[5]!)
+        this.writeByte(u8[6]!)
+        this.writeByte(u8[7]!)
         return
       case `bigint`: {
         let value = chunk
@@ -107,10 +107,10 @@ export class MurmurHashStream implements Hasher {
           this.update(BIG_INT_MARKER)
         }
         while (value > 0n) {
-          this._writeByte(Number(value & 0xffn))
+          this.writeByte(Number(value & 0xffn))
           value >>= 8n
         }
-        if (chunk === 0n) this._writeByte(0)
+        if (chunk === 0n) this.writeByte(0)
         return
       }
       default:
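
The first line of `writeByte` shown above ORs each incoming byte into a carry word at an offset of `8 * carryBytes` bits, so bytes accumulate least-significant first. The sketch below isolates that packing step. The four-byte block size is an assumption (the rest of the class is not shown in this diff), and `packLittleEndian` is a hypothetical helper, not a method of `MurmurHashStream`.

```typescript
// Packs up to four bytes into one 32-bit word, least-significant byte first.
// Assumption: a 4-byte block, which the diff does not show explicitly.
function packLittleEndian(bytes: ReadonlyArray<number>): number {
  let carry = 0
  let carryBytes = 0
  for (const byte of bytes.slice(0, 4)) {
    // Same expression as in writeByte: mask to 8 bits, shift into position.
    carry |= (byte & 0xff) << (8 * carryBytes)
    carryBytes++
  }
  return carry >>> 0 // present the word as an unsigned 32-bit value
}

const packedWord = packLittleEndian([0x01, 0x02, 0x03, 0x04])
```

Here `packedWord` is `0x04030201`: the first byte written ends up in the lowest eight bits.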

packages/db-ivm/tests/utils.test.ts

Lines changed: 40 additions & 3 deletions
@@ -299,7 +299,8 @@ describe(`hash`, () => {
     expect(hash4).not.toBe(hash6) // Different Symbol content should have different hash
   })

-  it(`should hash Buffers, Uint8Arrays and File objects by reference`, () => {
+  it(`should hash small Buffers and Uint8Arrays by content`, () => {
+    // Small buffers (≤128 bytes) are hashed by content for proper equality comparisons
     const buffer1 = Buffer.from([1, 2, 3])
     const buffer2 = Buffer.from([1, 2, 3])
     const buffer3 = Buffer.from([1, 2, 3, 4])
@@ -309,7 +310,7 @@ describe(`hash`, () => {
     const hash3 = hash(buffer3)

     expect(typeof hash1).toBe(hashType)
-    expect(hash1).not.toBe(hash2) // Same content but different buffer instances have a different hash because it would be too costly to deeply hash buffers
+    expect(hash1).toBe(hash2) // Same content = same hash for small buffers
     expect(hash1).not.toBe(hash3) // Different Buffer content should have different hash
     expect(hash1).toBe(hash(buffer1)) // Hashing same buffer should return same hash

@@ -322,10 +323,46 @@ describe(`hash`, () => {
     const hash6 = hash(uint8Array3)

     expect(typeof hash4).toBe(hashType)
-    expect(hash4).not.toBe(hash5) // Same content but different uint8Array instances have a different hash because it would be too costly to deeply hash uint8Arrays
+    expect(hash4).toBe(hash5) // Same content = same hash for small Uint8Arrays
     expect(hash4).not.toBe(hash6) // Different uint8Array content should have different hash
     expect(hash4).toBe(hash(uint8Array1)) // Hashing same uint8Array should return same hash
+  })
+
+  it(`should hash large Buffers, Uint8Arrays and File objects by reference`, () => {
+    // Large buffers (>128 bytes) are hashed by reference to avoid performance costs
+    const largeBuffer1 = Buffer.alloc(300)
+    const largeBuffer2 = Buffer.alloc(300)
+
+    // Fill with same content
+    for (let i = 0; i < 300; i++) {
+      largeBuffer1[i] = i % 256
+      largeBuffer2[i] = i % 256
+    }
+
+    const hash1 = hash(largeBuffer1)
+    const hash2 = hash(largeBuffer2)
+
+    expect(typeof hash1).toBe(hashType)
+    expect(hash1).not.toBe(hash2) // Same content but different instances = different hash for large buffers
+    expect(hash1).toBe(hash(largeBuffer1)) // Hashing same buffer should return same hash
+
+    const largeUint8Array1 = new Uint8Array(300)
+    const largeUint8Array2 = new Uint8Array(300)
+
+    // Fill with same content
+    for (let i = 0; i < 300; i++) {
+      largeUint8Array1[i] = i % 256
+      largeUint8Array2[i] = i % 256
+    }
+
+    const hash3 = hash(largeUint8Array1)
+    const hash4 = hash(largeUint8Array2)
+
+    expect(typeof hash3).toBe(hashType)
+    expect(hash3).not.toBe(hash4) // Same content but different instances = different hash for large Uint8Arrays
+    expect(hash3).toBe(hash(largeUint8Array1)) // Hashing same uint8Array should return same hash

+    // Files are always hashed by reference regardless of size
     const file1 = new File([`Hello, world!`], `test.txt`)
     const file2 = new File([`Hello, world!`], `test.txt`)
     const file3 = new File([`Hello, world!`], `test.txt`)

packages/db/src/query/compiler/evaluators.ts

Lines changed: 3 additions & 2 deletions
@@ -3,7 +3,7 @@ import {
   UnknownExpressionTypeError,
   UnknownFunctionError,
 } from "../../errors.js"
-import { normalizeValue } from "../../utils/comparison.js"
+import { areValuesEqual, normalizeValue } from "../../utils/comparison.js"
 import type { BasicExpression, Func, PropRef } from "../ir.js"
 import type { NamespacedRow } from "../../types.js"

@@ -172,7 +172,8 @@ function compileFunction(func: Func, isSingleRow: boolean): (data: any) => any {
         if (isUnknown(a) || isUnknown(b)) {
           return null
         }
-        return a === b
+        // Use areValuesEqual for proper Uint8Array/Buffer comparison
+        return areValuesEqual(a, b)
       }
     }
     case `gt`: {

packages/db/src/utils/comparison.ts

Lines changed: 70 additions & 1 deletion
@@ -112,11 +112,80 @@ export const defaultComparator = makeComparator({
 })

 /**
- * Normalize a value for comparison
+ * Compare two Uint8Arrays for content equality
+ */
+function areUint8ArraysEqual(a: Uint8Array, b: Uint8Array): boolean {
+  if (a.byteLength !== b.byteLength) {
+    return false
+  }
+  for (let i = 0; i < a.byteLength; i++) {
+    if (a[i] !== b[i]) {
+      return false
+    }
+  }
+  return true
+}
+
+/**
+ * Threshold for normalizing Uint8Arrays to string representations.
+ * Arrays larger than this will use reference equality to avoid memory overhead.
+ * 128 bytes is enough for common ID formats (ULIDs are 16 bytes, UUIDs are 16 bytes)
+ * while avoiding excessive string allocation for large binary data.
+ */
+const UINT8ARRAY_NORMALIZE_THRESHOLD = 128
+
+/**
+ * Normalize a value for comparison and Map key usage
+ * Converts values that can't be directly compared or used as Map keys
+ * into comparable primitive representations
  */
 export function normalizeValue(value: any): any {
   if (value instanceof Date) {
     return value.getTime()
   }
+
+  // Normalize Uint8Arrays/Buffers to a string representation for Map key usage
+  // This enables content-based equality for binary data like ULIDs
+  const isUint8Array =
+    (typeof Buffer !== `undefined` && value instanceof Buffer) ||
+    value instanceof Uint8Array
+
+  if (isUint8Array) {
+    // Only normalize small arrays to avoid memory overhead for large binary data
+    if (value.byteLength <= UINT8ARRAY_NORMALIZE_THRESHOLD) {
+      // Convert to a string representation that can be used as a Map key
+      // Use a special prefix to avoid collisions with user strings
+      return `__u8__${Array.from(value).join(`,`)}`
+    }
+    // For large arrays, fall back to reference equality
+    // Users working with large binary data should use a derived key if needed
+  }
+
   return value
 }
+
+/**
+ * Compare two values for equality, with special handling for Uint8Arrays and Buffers
+ */
+export function areValuesEqual(a: any, b: any): boolean {
+  // Fast path for reference equality
+  if (a === b) {
+    return true
+  }
+
+  // Check for Uint8Array/Buffer comparison
+  const aIsUint8Array =
+    (typeof Buffer !== `undefined` && a instanceof Buffer) ||
+    a instanceof Uint8Array
+  const bIsUint8Array =
+    (typeof Buffer !== `undefined` && b instanceof Buffer) ||
+    b instanceof Uint8Array
+
+  // If both are Uint8Arrays, compare by content
+  if (aIsUint8Array && bIsUint8Array) {
+    return areUint8ArraysEqual(a, b)
+  }
+
+  // Different types or not Uint8Arrays
+  return false
+}
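
As a rough illustration of how the normalized string form enables content-based index lookups, here is a simplified sketch. `toMapKey` is a hypothetical stand-in for `normalizeValue` restricted to `Uint8Array` input, and `rowsByKey` is an illustrative index. Only the `__u8__` prefix, the comma-joined byte format, and the 128-byte threshold come from the diff.

```typescript
const NORMALIZE_THRESHOLD = 128 // same cutoff as UINT8ARRAY_NORMALIZE_THRESHOLD

// Simplified stand-in for normalizeValue, restricted to Uint8Array input.
function toMapKey(value: Uint8Array): string | Uint8Array {
  if (value.byteLength <= NORMALIZE_THRESHOLD) {
    return `__u8__${Array.from(value).join(`,`)}`
  }
  return value // large arrays keep reference semantics
}

const rowsByKey = new Map<string | Uint8Array, string>()
rowsByKey.set(toMapKey(Uint8Array.of(1, 2, 3)), `row-1`)

// A different instance with the same bytes now finds the entry.
const smallHit = rowsByKey.get(toMapKey(Uint8Array.of(1, 2, 3)))

// Large arrays still miss unless the same instance is used.
const bigKey = new Uint8Array(300)
rowsByKey.set(toMapKey(bigKey), `row-2`)
const bigMissWithCopy = rowsByKey.get(toMapKey(new Uint8Array(300)))
const bigHitWithSame = rowsByKey.get(toMapKey(bigKey))
```

String keys are matched by value in a `Map`, which is why the small-array lookup succeeds with a fresh instance while the 300-byte lookup only succeeds with the original object.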
