Skip to content

Conversation

@corasaurus-hex
Copy link
Contributor

@corasaurus-hex corasaurus-hex commented Nov 9, 2025

Another take on #18457 -- this time we're not changing the public interface.

Which issue does this PR close?

Rationale for this change

Currently Datafusion can only read Arrow files if the're in the File format, not the Stream format. I work with a bunch of Stream format files and wanted native support.

What changes are included in this PR?

To accomplish the above, this PR splits the Arrow datasource into two separate implementations (ArrowStreamFile* and ArrowFile*) with a facade on top to differentiate between the formats at query planning time. This maintains the same public interface to maintain backwards compatibility.

Are these changes tested?

Yes, there are end-to-end sqllogictests along with tests for the changes within datasource-arrow.

Are there any user-facing changes?

Technically yes, in that we support a new format now. I'm not sure which documentation would need to be updated?

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 9, 2025
@github-actions github-actions bot added the datasource Changes to the datasource crate label Nov 9, 2025
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Nov 9, 2025
Comment on lines 416 to 420
impl Default for ArrowSource {
fn default() -> Self {
Self::default_file_source()
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to have some conflicts with #18386 FYI. Happy to help resolve them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, I'm just eager to get this merged. I'm at five separate reviewers so far 😭, I'm just hoping this last pass is enough to clear the hurdle

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, sorry for all of the work you've had to go through, but it's all for a better outcome in the end! I will try to review this tomorrow but I have a flight so no promises. For what it's worth though I don't think this is likely to be rushed into the 51 release since it's mostly adding a feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better for it in the end, too, really. And from reading your PR I got the same impression that it wouldn't make it into 51.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and don't worry about the review! Flying in the US looks to be rough enough on its own and getting worse and I absolutely wouldn't want to add to your stress. Safe travels!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merge conflicts weren't bad at all, fwiw

@corasaurus-hex
Copy link
Contributor Author

I'm pulling these changes back into the original PR where there's the most history. Thanks for checking this out @adriangb!

@corasaurus-hex corasaurus-hex deleted the cs--register-arrow-ipc-stream-format-files-wrapped branch November 9, 2025 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for registering files in the Arrow IPC stream format as tables using register_arrow or similar

2 participants