Conversation
Code Review for Classic ALP Encoding SupportSummaryThis PR adds classic ALP (Adaptive Lossless floating-Point) encoding for f32/f64 data as a miniblock compressor. The implementation includes exponent selection, delta-based bitpacking, and exception handling for non-encodable values. P1 Issues1. Potential integer overflow in The "sweet spot" rounding technique can produce incorrect results for values outside the representable integer range: let rounded = fast_round_f32(scaled);
let enc = rounded as i32; // Can overflow if scaled is outside i32 rangeWhile you check 2. Missing feature gate consistency The
3. Exception position overflow risk Exception positions are stored as exception_positions.push(i as u16); // Safe only if i < 65536Consider adding a debug_assert or making the assumption explicit in the code. Suggestions (Non-blocking)
Overall this is a well-documented implementation with proper fallback handling and bitwise-lossless guarantees. |
wjones127
left a comment
There was a problem hiding this comment.
This is exciting!
Have you benchmarked the throughput of this and done any profiling? The sampling of e and f seems pretty heavy, so maybe worth spending some time optimizing.
Also, could you run the fuzz_test.rs on this, and do it for a longer period to check correctness? I think that's just running against 2.1 right now.
Also, I'm not sure it's from this PR, but cargo check -p lance-encoding --no-default-features does currently fail due to some bitpacking code not being gated properly.
| if num_values <= sample_size { | ||
| return (0..num_values).collect(); | ||
| } |
There was a problem hiding this comment.
I wonder if you should have some value like None or empty Vec that should be interpreted as "sample all rows", that way you can avoid the allocation in this common case.
| if data.len() != 4 { | ||
| return Err(Error::invalid_input( | ||
| format!("ALP decompression expects 4 buffers, got {}", data.len()), | ||
| location!(), | ||
| )); | ||
| } | ||
|
|
||
| let n = usize::try_from(num_values) | ||
| .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?; | ||
|
|
||
| let mut iter = data.into_iter(); | ||
| let packed_deltas = iter.next().unwrap(); | ||
| let header = iter.next().unwrap(); | ||
| let exception_positions = iter.next().unwrap(); | ||
| let exception_bits = iter.next().unwrap(); |
There was a problem hiding this comment.
nit: you should be able to do the length check and unpack the array in one go. The upshot of this is that the compile doesn't need to put a panic message in the binary for each call to .next().unwrap().
| if data.len() != 4 { | |
| return Err(Error::invalid_input( | |
| format!("ALP decompression expects 4 buffers, got {}", data.len()), | |
| location!(), | |
| )); | |
| } | |
| let n = usize::try_from(num_values) | |
| .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?; | |
| let mut iter = data.into_iter(); | |
| let packed_deltas = iter.next().unwrap(); | |
| let header = iter.next().unwrap(); | |
| let exception_positions = iter.next().unwrap(); | |
| let exception_bits = iter.next().unwrap(); | |
| let [packed_deltas, header, exception_positions, exception_bits] = | |
| data.try_into().map_err(|_| Error::invalid_input( | |
| format!("ALP decompression expects 4 buffers, got {}", data.len()), | |
| location!(), | |
| ))?; | |
| let n = usize::try_from(num_values) | |
| .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?; |
|
@Xuanwo is there any progress regarding this? love to see this in lance |
This PR will add classic alp encoding support
Parts of this PR were drafted with assistance from Codex (with
gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.