Skip to content

Conversation

@JakeDern
Copy link
Contributor

@JakeDern JakeDern commented Dec 6, 2025

This PR addresses a few problems that I ran into getting df_engine running in my kubernetes environment. The end result of all this is that you can now run a docker build like this: docker build --build-arg FEATURES="azure" --build-arg RUSTFLAGS="-C target-feature=-gfni" --build-arg DEBUG=1 --build-context otel-arrow=../../ -f Dockerfile -t df_engine .

TLS

Some of the Azure SDK crates + reqwest have a lot of default features enabled that we are not using and which rely on linking with openssl for TLS. This was giving me trouble with our musl build.

I've disabled default features for these packages at the workspace level and opted for using rustls-tls with reqwest in the otap crate.

Configuring the docker build

  • Added the FEATURES build arg to the dockerfile and the cross platform build script so that you can turn on features like azure when building
  • Added the ability to specify RUSTFLAGS for the docker build so that we can turn some features on or off when compiling which brings me to my next item...

Adding some optional debug utils to the image

When running on k8s I was getting very mysterious crashes after a couple of seconds with 0 logs to stdout. Remoting into the container and running the binary gave me the "Illegal instruction (core dumped)" message, which was a start.

After checking the obvious vector extensions like SSE/AVX/AVX512, I was stumped. So, I added gdb to the container and found the exact instruction which is the longest mnemonic I have ever seen (comes from arrow's buffer crate):

(gdb)  x/i $pc
=> 0x7ffff745fb4f <new+463>:    vgf2p8affineqb $0x0,%ymm3,%ymm5,%ymm5

Turns out my laptop has support for a specific feature called gfni, which the nodes with Intel Xeon processors that I was running did not.

I think this will not be the last time someone needs to debug a native code/cross-compilation problem, so I added an option to the build called DEBUG that will include gdb (and maybe other debug utilities in the future) in the image.

@github-actions github-actions bot added the rust Pull requests that update Rust code label Dec 6, 2025
@codecov
Copy link

codecov bot commented Dec 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.53%. Comparing base (b3ee4b0) to head (de50e47).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1540      +/-   ##
==========================================
+ Coverage   83.44%   83.53%   +0.09%     
==========================================
  Files         428      428              
  Lines      119043   119720     +677     
==========================================
+ Hits        99338   100014     +676     
- Misses      19171    19172       +1     
  Partials      534      534              
Components Coverage Δ
otap-dataflow 84.71% <ø> (+0.10%) ⬆️
query_abstraction 80.61% <ø> (ø)
query_engine 90.30% <ø> (+0.03%) ⬆️
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 53.50% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JakeDern JakeDern changed the title Add more options to otap_df docker build, slim down features for some packages Add more options to df_engine docker build, slim down features for some packages Dec 6, 2025
@JakeDern JakeDern marked this pull request as ready for review December 6, 2025 21:11
@JakeDern JakeDern requested a review from a team as a code owner December 6, 2025 21:11
azure_identity = { workspace = true, optional = true }
flate2 = { workspace = true, optional = true }
reqwest = { workspace = true, optional = true }
reqwest = { workspace = true, optional = true, default-features = false, features = ["rustls-tls"] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments here -

  • No need to repeat [default-features = false] in the crate if the workspace already sets it.
  • It’s better to let the crate manage its own reqwest features instead of forcing them at the workspace level.
  • Pick one of rustls-tls or rustls-tls-native-roots. They both pull in their own CA bundles, so using both at once doesn’t help.

Copy link
Contributor Author

@JakeDern JakeDern Dec 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review!

No need to repeat [default-features = false] in the crate if the workspace already sets it.

Good to know! Removed it in the otap crate.

It’s better to let the crate manage its own reqwest features instead of forcing them at the workspace level.

Makes perfect sense, removed at the workspace level.

Pick one of rustls-tls or rustls-tls-native-roots. They both pull in their own CA bundles, so using both at once doesn’t help.

Good catch, I meant to pick rustls-tls not rustls-tls-native-roots and forgot to update both places. Removed the workspace level feature anyways now, so we just have rustls-tls in the otap crate.

@lalitb
Copy link
Member

lalitb commented Dec 7, 2025

One thing to keep in mind: the PR switches the TLS backend from native-tls (OpenSSL or BoringSSL) to rustls with ring. That solves the musl build issue, but ring isn’t FIPS 140-2 or 140-3 certified. If FIPS compliance becomes important later, we might want to look at rustls-tls-aws-lc-rs instead since aws-lc is FIPS validated, or potentially move back to native-tls. Not a blocker, just something to be aware of.

@JakeDern
Copy link
Contributor Author

JakeDern commented Dec 7, 2025

One thing to keep in mind: the PR switches the TLS backend from native-tls (OpenSSL or BoringSSL) to rustls with ring. That solves the musl build issue, but ring isn’t FIPS 140-2 or 140-3 certified. If FIPS compliance becomes important later, we might want to look at rustls-tls-aws-lc-rs instead since aws-lc is FIPS validated, or potentially move back to native-tls. Not a blocker, just something to be aware of.

Yeah absolutely, my goal here is just to try to get things working in the context of the existing docker build. We actually have a couple of other challenges on the Microsoft side related to crypto with this project, because rustls-tls-aws-lc-rs is not an approved library internally either, despite the FIPS compliance, and there's no plan for it to be.

I started trying to get arrow-rs-object-store to allow for pluggable crypto a few months ago rather than just swap from ring to rustls-tls-aws-lc-rs, but I haven't had time to revisit the work: apache/arrow-rs-object-store#462.

Comment on lines +25 to +28
# RUSTFLAGS can be used to pass argument to rustc e.g.
# --build-arg RUSTFLAGS="-C target-feature=-gfni"
ARG RUSTFLAGS
ENV RUSTFLAGS=${RUSTFLAGS}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that we have the ability to set the rust flags for the docker build!

However, I think the impact of this is that we'll override the default override the default rustflags from ./.cargo/config.toml even if --build-arg RUSTFLAGS... isn't set.

FYI, we set -C target-cpu=native in https://github.com/open-telemetry/otel-arrow/blob/main/rust/otap-dataflow/.cargo/config.toml which is probably why you're seeing the AVX512 instructions which your build machine supports.

You'll see the rust flags getting set in the build if you add a -v to the cargo build step in rust/otap-dataflow/cross-arch-build.sh and run the docker build with --progress plain. On main, we'll see output like this:

#18 15.90      Running `/usr/local/rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/rustc <...snip...> -C target-cpu=native

On this PR's branch, we don't see -C target-cpu=native.

Without target-cpu=native, we wouldn't get micro-architecture specific instructions in the final binary, which we would probably want for best performance in many cases.

That said, the way we're setting the target-cpu in ./cargo/config.toml is probably not correct if we eventually want to publish otap-dataflow on crates.io (which I think we eventually do #1340). I think this will cause our library compiled w/ native instructions, which users who depend on it might not expect.

One thing we could do is:

  • a) remove the rustflags configuration from ./.cargo/config.toml
  • b) set the default value for the RUSTFLAGS docker ARG to -C target-cpu=native

Does that sound OK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clhain just want to confirm something -- for the performance tests in this repo, we're always using the docker images right? e.g. we're not building using cross-arch-build.sh directly (outside of docker) docker and then just running the resulting binary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context, I had no idea we already had default args in .cargo/config.toml!

That totally works for me, but what do you think about keeping the configuration in .cargo/config.toml so that folks will still get best performance by default when building/benchmarking locally and then still add -C target-cpu=native as the default for the docker build as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah good point -- I hadn't thought about folks doing local benchmarking.

OK, sounds good. Let's leave .cargo/config.toml as is for now, and just set -C target-cpu=native as the default for the docker build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust Pull requests that update Rust code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants