Conversation

@Shnatsel commented Jun 16, 2025

I came across a challenge to remove bounds checks from this function in https://flexineering.com/posts/lz4-011/

This is the code currently in tree: https://godbolt.org/z/W3zqcbnd5

This is the proposed code: https://godbolt.org/z/an1sYhc4c

As you can see, all the bounds checks are gone from the hot loop. On the flip side, there is more assembly in total. I'm not sure why; perhaps the compiler unrolls the loop more aggressively?

It's possible to get the number of bounds checks down to two, but that produces even more assembly: https://godbolt.org/z/dKfbcv98r
You can try this version by checking out commit dd23da8.

I have not benchmarked whether this is actually faster or slower in practice. It's possible that the increase in the total amount of assembly loaded into instruction cache will offset the gains from removing the already perfectly predictable branches of the bounds checks.
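One crude way to sanity-check this tradeoff outside the crate (the workload below is my own assumption, not the project's benchmark suite) is a minimal `Instant`-based harness over many short overlapping copies, which is the pattern LZ4 decompression tends to produce:

```rust
use std::time::Instant;

// The byte-by-byte copy under discussion, reproduced so the harness is
// self-contained.
fn extend_from_within_overlapping(output: &mut [u8], pos: &mut usize, start: usize, num_bytes: usize) {
    let offset = *pos - start;
    for i in start + offset..start + offset + num_bytes {
        output[i] = output[i - offset];
    }
    *pos += num_bytes;
}

fn main() {
    let mut buf = vec![0u8; 1 << 16];
    buf[0] = 42;
    let t = Instant::now();
    for _ in 0..1_000 {
        let mut pos = 1;
        // LZ4 matches are typically short, so time many small copies.
        while pos + 8 <= buf.len() {
            let start = pos - 1;
            extend_from_within_overlapping(&mut buf, &mut pos, start, 8);
        }
    }
    println!("elapsed: {:?}", t.elapsed());
    // With offset 1, the seed byte is replicated through the buffer.
    assert!(buf[..1024].iter().all(|&b| b == 42));
}
```

A real comparison should of course use the crate's own benchmarks, since timing in isolation misses the instruction-cache effects on surrounding code.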

This is likely going to run into all the same issues as #141, but I wanted to give it another try.

@PSeitz (Owner) commented Jul 17, 2025

Thanks for the PR! Sadly this seems to be 1-2% slower.

I think it's a promising approach.

pub fn extend_from_within_overlapping(output: &mut [u8], pos: &mut usize, start: usize, num_bytes: usize) {
    let offset = *pos - start;
    for i in start + offset..start + offset + num_bytes {
        output[i] = output[i - offset];
    }
    *pos += num_bytes;
}
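The overlapping semantics of this loop are load-bearing: with `offset == 1` each iteration reads the byte written by the previous one, so a single seed byte gets replicated, which is exactly the LZ4 overlapping-match case. A small self-contained demo:

```rust
fn extend_from_within_overlapping(output: &mut [u8], pos: &mut usize, start: usize, num_bytes: usize) {
    let offset = *pos - start;
    for i in start + offset..start + offset + num_bytes {
        output[i] = output[i - offset];
    }
    *pos += num_bytes;
}

fn main() {
    // offset = 1: each copied byte reads the byte just written,
    // so the single seed byte is replicated.
    let mut buf = vec![0u8; 5];
    buf[0] = b'a';
    let mut pos = 1;
    extend_from_within_overlapping(&mut buf, &mut pos, 0, 4);
    assert_eq!(buf, b"aaaaa".to_vec());
    assert_eq!(pos, 5);
}
```

This is also why a plain `memcpy` cannot be substituted when the match overlaps the output position.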

If offset is large enough, we can switch to copying larger chunks and the assembly indeed does seem to do this.

The main difference, I think, is that the original version does a 16-byte check and copy, while the Cell version has a 32-byte variant with some additional checks.

For this function the tradeoff probably favors fewer checks, since we usually don't copy that much.
I wish we had a way to hint that information to the compiler.
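On stable Rust there is no direct "inputs are usually small" hint, but one conventional approximation is to mark the bulk-copy path `#[cold]` and `#[inline(never)]`, which biases code layout and inlining toward the small path. A hedged sketch (function names are illustrative, not the crate's actual API):

```rust
pub fn extend_from_within_overlapping(
    output: &mut [u8],
    pos: &mut usize,
    start: usize,
    num_bytes: usize,
) {
    let offset = *pos - start;
    if offset >= 8 && num_bytes >= 8 {
        return copy_large(output, pos, start, num_bytes);
    }
    // Hot path: short overlapping copies, kept small and inlinable.
    for i in start + offset..start + offset + num_bytes {
        output[i] = output[i - offset];
    }
    *pos += num_bytes;
}

#[cold]
#[inline(never)]
fn copy_large(output: &mut [u8], pos: &mut usize, start: usize, num_bytes: usize) {
    // Same semantics; a real implementation would copy wider chunks here.
    let offset = *pos - start;
    for i in start + offset..start + offset + num_bytes {
        output[i] = output[i - offset];
    }
    *pos += num_bytes;
}

fn main() {
    let mut buf = vec![0u8; 12];
    buf[0] = 9;
    let mut pos = 1;
    extend_from_within_overlapping(&mut buf, &mut pos, 0, 4); // small path
    assert_eq!(&buf[..5], &[9, 9, 9, 9, 9]);
    assert_eq!(pos, 5);
}
```

Whether this actually helps here would still need benchmarking; `#[cold]` only influences layout and branch-weight heuristics, not the checks themselves.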

@Shnatsel (Author) commented

Sadly there doesn't seem to be a way to tell the compiler that the expected inputs are going to be small, other than profile-guided optimization, which the user of the library would have to do themselves and is therefore impractical here.

@Shnatsel closed this Jul 17, 2025
@PSeitz (Owner) commented Jul 17, 2025

There is the nightly `#[optimize]` attribute, which is already in use and could be applied here.
If the generated assembly is small, it is usually well suited to small inputs.

#![feature(optimize_attribute)]

use std::cell::Cell;

#[optimize(size)]
pub fn extend_from_within_overlapping(
    slice: &mut [u8],
    pos: &mut usize,
    start: usize,
    num_bytes: usize,
) {
    let offset = *pos - start;

    if offset >= 8 && num_bytes >= 8 {
        // Large copy: dispatch to a chunked implementation.
        extend_from_within_overlapping_large(slice, pos, start, num_bytes)
    } else {
        // Fallback: byte-by-byte via Cell, which allows writing through
        // shared references even when source and destination overlap.
        let cells: &[Cell<u8>] = Cell::from_mut(slice).as_slice_of_cells();
        let src = &cells[start..];
        let dst = &src[offset..];
        for (s, d) in src.iter().zip(dst.iter()).take(num_bytes) {
            d.set(s.get());
        }
        *pos += num_bytes;
    }
}

#[inline]
pub fn extend_from_within_overlapping_large(
    slice: &mut [u8],
    pos: &mut usize,
    start: usize,
    num_bytes: usize,
) {
    // ...
}
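The large-copy body is elided in the comment above. One possible sketch (my own, not from the PR; spelling of the name normalized): since `offset >= 8`, copying in chunks of at most 8 bytes keeps each chunk's source fully inside the already-written region, preserving the byte-replication semantics of overlapping matches while letting the compiler emit wider loads and stores.

```rust
// Hypothetical sketch of the elided large-copy path (not from the PR).
// Precondition: offset >= 8, so each 8-byte chunk's source range ends at
// or before its destination and never reads bytes written in the same step.
pub fn extend_from_within_overlapping_large(
    slice: &mut [u8],
    pos: &mut usize,
    start: usize,
    num_bytes: usize,
) {
    let offset = *pos - start;
    debug_assert!(offset >= 8);
    let mut copied = 0;
    while copied < num_bytes {
        let n = (num_bytes - copied).min(8);
        let src = start + copied;
        let dst = *pos + copied;
        // copy_within has memmove semantics and bounds-checks the ranges.
        slice.copy_within(src..src + n, dst);
        copied += n;
    }
    *pos += num_bytes;
}

fn main() {
    let mut buf: Vec<u8> = (1..=8).collect();
    buf.resize(24, 0);
    let mut pos = 8;
    // offset = 8, num_bytes = 16: the 8-byte pattern is repeated twice.
    extend_from_within_overlapping_large(&mut buf, &mut pos, 0, 16);
    assert_eq!(pos, 24);
    assert_eq!(&buf[8..16], &buf[0..8]);
    assert_eq!(&buf[16..24], &buf[0..8]);
}
```

Whether chunked `copy_within` beats a hand-rolled loop here would again come down to benchmarking, since each `copy_within` call carries its own bounds checks.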
