Core, Data: File Format API interfaces #12774

Merged
rdblue merged 53 commits into apache:main from pvary:file_format_api_only
Feb 6, 2026

Conversation

Contributor

@pvary pvary commented Apr 11, 2025

The interface part of the changes from #12298

Interfaces which have to be implemented by the File Formats (a rough sketch follows this list):

  • ReadBuilder - Builder for reading data from data files
  • AppenderBuilder - Builder for writing data to data files
  • ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
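
The shapes below are only a rough illustration of how these could fit together; the exact method names and signatures are defined in the diff, everything here is an assumption:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Illustrative sketch only; the real interfaces are in the PR and may differ in detail.
interface ReadBuilder<D> {
  ReadBuilder<D> project(Schema projection);  // requested Iceberg projection
  CloseableIterable<D> build();               // reader producing engine objects
}

interface AppenderBuilder<D> {
  AppenderBuilder<D> schema(Schema schema);   // Iceberg schema of the written data
  FileAppender<D> build();                    // appender consuming engine objects
}

// Ties a file format (Parquet/ORC/Avro) to one engine object model.
interface ObjectModel<D> {
  String name();                              // object model name, e.g. "spark" (assumed)
  ReadBuilder<D> readBuilder(InputFile file);
  AppenderBuilder<D> appenderBuilder(OutputFile file);
}
```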

Interfaces which are used by the actual readers/writers (sketched after the list):

  • AppenderBuilder - Builder for writing a file
  • DataWriterBuilder - Builder for generating a data file
  • PositionDeleteWriterBuilder - Builder for generating a position delete file
  • EqualityDeleteWriterBuilder - Builder for generating an equality delete file
  • No ReadBuilder here - the file format reader builder is reused
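
Sketched roughly (again, the signatures here are assumptions; the builders in the PR carry more configuration), these builders produce the existing Iceberg writer types:

```java
import java.io.IOException;
import org.apache.iceberg.deletes.EqualityDeleteWriter;
import org.apache.iceberg.deletes.PositionDeleteWriter;
import org.apache.iceberg.io.DataWriter;

// Hypothetical shape of the engine-facing writer builders.
interface DataWriterBuilder<D> {
  DataWriter<D> build() throws IOException;                    // writes a data file
}

interface PositionDeleteWriterBuilder<D> {
  PositionDeleteWriter<D> build() throws IOException;          // writes a position delete file
}

interface EqualityDeleteWriterBuilder<D> {
  EqualityDeleteWriterBuilder<D> equalityFieldIds(int... ids); // columns used for equality
  EqualityDeleteWriter<D> build() throws IOException;          // writes an equality delete file
}
```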

Implementation classes tying them together (usage is sketched after the list):

  • WriterBuilder class which implements the AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder interfaces using the AppenderBuilder provided by the File Format itself
  • ObjectModelRegistry which stores the available ObjectModels and from which users can request readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
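
From an engine's point of view the flow is roughly the following; the ObjectModelRegistry method names and the SparkParquetObjectModel class below are made up for illustration, only the interface names come from the lists above:

```java
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.spark.sql.catalyst.InternalRow;

class RegistryUsageSketch {
  static void example(InputFile inputFile, OutputFile outputFile) {
    // Register the (file format, object model) pair once at startup (hypothetical class).
    ObjectModelRegistry.register(new SparkParquetObjectModel());

    // Request a reader for Parquet files producing Spark InternalRows (assumed signature).
    ReadBuilder<InternalRow> readBuilder =
        ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "spark", inputFile);

    // Request a data file writer for the same format/object model pair (assumed signature).
    DataWriterBuilder<InternalRow> dataWriterBuilder =
        ObjectModelRegistry.dataWriterBuilder(FileFormat.PARQUET, "spark", outputFile);
  }
}
```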

Contributor

@liurenjie1024 liurenjie1024 left a comment


Thanks @pvary for this PR, left some comments, generally looks great!

Contributor

@liurenjie1024 liurenjie1024 left a comment


Thanks @pvary for this PR, LGTM!

Contributor

@liurenjie1024 liurenjie1024 left a comment


LGTM, just some nits!

@amogh-jahagirdar amogh-jahagirdar self-requested a review May 7, 2025 14:25
Contributor

@liurenjie1024 liurenjie1024 left a comment


Thanks @pvary !

Contributor

@stevenzwu stevenzwu left a comment


Left some initial comments on the interfaces. Will still need to take a look at the other, bigger PR to understand more of the work as a whole.

@RussellSpitzer
Member

@huaxingao, @pvary Could you take a look from a Comet perspective? I know you have some custom code that would be using this as well.

Contributor Author

pvary commented May 21, 2025

> @huaxingao, @pvary Could you take a look from a Comet perspective? I know you have some custom code that would be using this as well.

Originally I thought that Comet could be just another FileAccessFactory to register, but based on @rdblue's suggestion I merged it back into the Spark vectorized/Parquet factory. If the current API does not work for the custom code, that direction could be worth exploring.

@pvary pvary force-pushed the file_format_api_only branch from 646d9f7 to d450ec6 Compare November 21, 2025 21:41
@pvary pvary added this to the Iceberg 1.11.0 milestone Jan 7, 2026
Contributor Author

pvary commented Feb 5, 2026

Merged the changes from #12298.

The remaining open questions are highlighted.

Based on @rdblue’s comment on the other PR, the change is “about ready to go in.”

However, as it has changed somewhat compared to earlier iterations, previous reviewers may want to take a final look.

Thanks to everyone for spending time on this!

* Creates a writer for the given schemas.
*
* @param icebergSchema the Iceberg schema defining the table structure
* @param fileSchema the file format specific target schema for the output files
Contributor


Not a blocker, but I want to note it somewhere: the file schema will be converted from the Iceberg Schema and will directly match in structure, names, and equivalent types. We should probably document this and make sure that all of the format builders follow that rule.

Similarly, we should also think about guarantees and/or requirements of the engine schema. I think this is dependent on the engine, though. If an engine builds its integration so that names must match or positions must match, that's up to the engine.
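
For Parquet, the derivation rule can be illustrated with the existing conversion utility (pre-existing Iceberg code, not something added by this PR):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.parquet.ParquetSchemaUtil;
import org.apache.iceberg.types.Types;
import org.apache.parquet.schema.MessageType;

Schema icebergSchema =
    new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()));

// The file schema is a direct translation of the Iceberg schema:
// structure and names match, and types map to their Parquet equivalents.
MessageType fileSchema = ParquetSchemaUtil.convert(icebergSchema, "table");
```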

Contributor Author


Do you think we should add it here too, or it is enough as we describe it at the writer and the reader?

  • FileWriterBuilder.engineSchema
   * <p>The engine schema must be aligned with the Iceberg schema, but may include representation
   * details that Iceberg considers equivalent.
  • ReadBuilder.engineProjection
   * <p>When provided, this schema should be consistent with the requested Iceberg projection, while
   * allowing representation differences. Examples include:

Contributor


I think this should state how the file schema is derived because it is always a direct translation from the Iceberg schema and the structure and names match.

For the engine schema, I think mentioning that it is engine-specific is the right thing to do.

* <p>Some data types require additional type information from the engine schema that cannot be
* fully expressed by the Iceberg schema or the data itself. For example, a variant type may use a
* shredded representation that relies on engine-specific metadata to map back to the Iceberg
* schema.
Contributor


I think a simple example (tinyint / int) would help as well.
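
For example (using Spark types purely for illustration), tinyint in the engine schema maps to Iceberg's 32-bit int, so the engine schema may carry a narrower representation that is considered equivalent:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Iceberg only has a 32-bit integer type.
Schema icebergSchema =
    new Schema(Types.NestedField.required(1, "flag", Types.IntegerType.get()));

// The engine schema may use a narrower representation (tinyint/ByteType)
// that the engine-to-Iceberg conversion treats as equivalent to int.
StructType engineSchema = new StructType().add("flag", DataTypes.ByteType, false);
```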

Contributor Author


Added a sentence about the ints.

* schema.
*
* <p>The engine schema must be aligned with the Iceberg schema, but may include representation
* details that Iceberg considers equivalent.
Contributor


This is a good way to state the requirement.

Contributor Author


Is it enough here, or shall we add it somewhere else too?

Contributor


I think it's fine here. Engine schema is really a contract between the engine and its registered object model.

*
* <p>This registry provides access to {@link ReadBuilder}s for data consumption and {@link
* FileWriterBuilder}s for writing various types of Iceberg content files. The appropriate builder
* is selected based on {@link FileFormat} and object model name.
Contributor


Nit: object model class.

* <p>{@link FormatModel} objects are registered through {@link #register(FormatModel)} and used for
* creating readers and writers. Read builders are returned directly from the factory. Write
* builders may be wrapped in specialized content file writer implementations depending on the
* requested builder type.
Contributor


Not sure that this part about the write builders being wrapped needs to be mentioned here. What about "Readers and writers are created using builders from the static factory methods of this class."

Contributor

rdblue commented Feb 6, 2026

I think this is ready. From our regular syncs, as well as the discussion on the dev list, I don't think that there are any remaining blockers, so I'll go ahead and merge it. That will unblock the file format and engine integrations, which can all happen in parallel after this core work is done.

Thanks @pvary!

@rdblue rdblue merged commit a8ece05 into apache:main Feb 6, 2026
32 checks passed
Contributor Author

pvary commented Feb 7, 2026

Thanks for the review and the merge @rdblue, and to everybody else too!
Opened the Parquet PR: #15253

Will open the others and the javadoc tweak soon too.
