Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds hash support for Union data types, enabling group by, distinct, hash joins, and aggregations on union-typed columns

hash_union_array hashes each child array once. Then for every row, it uses the type id and offset to retrieve the appropriate hash value

@github-actions github-actions bot added the common Related to common crate label Nov 14, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/hash-union-array branch from 71fab8e to e8fb251 Compare November 14, 2025 19:39
Comment on lines 351 to 361
#[allow(clippy::needless_range_loop)]
for i in 0..array.len() {
let type_id = array.type_id(i);
let child_offset = array.value_offset(i);

let child_hash = &child_hashes[type_id as usize]
.as_ref()
.expect("invalid type_id");

hashes_buffer[i] = child_hash[child_offset];
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that clippy's suggestion hurt readability

error: the loop variable `i` is used to index `hashes_buffer`
   --> datafusion/common/src/hash_utils.rs:351:14
    |
351 |     for i in 0..array.len() {
    |              ^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#needless_range_loop
    = note: `-D clippy::needless-range-loop` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::needless_range_loop)]`
help: consider using an iterator and enumerate()
    |
351 -     for i in 0..array.len() {
351 +     for (i, <item>) in hashes_buffer.iter_mut().enumerate().take(array.len()) {
    |

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/hash-union-array branch from e8fb251 to a68f0d8 Compare November 14, 2025 20:59
unreachable!()
};

let mut child_hashes = vec![None; 128];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 128 the max? I feel like just using a HashMap and then doing a get() below instead would be more readable and the same performance (maybe unless we 128 single item arrays or something...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Type ids are i8s, and negative values should not be valid so we bound the total number of spaces to 128

In terms of using a direct-indexed Vec by type id over a hashmap, I just chose not to overcomplicate things. Fwiw, I've yet to write benchmarks for this (I'll plan on doing that later on) and if this part shows up in a profile, I'll happily update it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that. I'll use a hashmap

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/hash-union-array branch from a68f0d8 to 55ddb39 Compare November 15, 2025 01:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make UnionArrays hashable

2 participants