Core, Data: File Format API interfaces #12774

Merged
rdblue merged 53 commits into apache:main from pvary:file_format_api_only
Feb 6, 2026

Conversation

Contributor

@pvary pvary commented Apr 11, 2025

The interface part of the changes from #12298

Interfaces which have to be implemented by the File Formats (a rough sketch follows this list):

  • ReadBuilder - Builder for reading data from data files
  • AppenderBuilder - Builder for writing data to data files
  • ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
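
The shapes below are only a rough illustration of how these could fit together; the exact method names and signatures are defined in the diff, everything here is an assumption:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Illustrative sketch only; the real interfaces are in the PR and may differ in detail.
interface ReadBuilder<D> {
  ReadBuilder<D> project(Schema projection);  // requested Iceberg projection
  CloseableIterable<D> build();               // reader producing engine objects
}

interface AppenderBuilder<D> {
  AppenderBuilder<D> schema(Schema schema);   // Iceberg schema of the written data
  FileAppender<D> build();                    // appender consuming engine objects
}

// Ties a file format (Parquet/ORC/Avro) to one engine object model.
interface ObjectModel<D> {
  String name();                              // object model name, e.g. "spark" (assumed)
  ReadBuilder<D> readBuilder(InputFile file);
  AppenderBuilder<D> appenderBuilder(OutputFile file);
}
```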

Interfaces which are used by the actual readers/writers (sketched after the list):

  • AppenderBuilder - Builder for writing a file
  • DataWriterBuilder - Builder for generating a data file
  • PositionDeleteWriterBuilder - Builder for generating a position delete file
  • EqualityDeleteWriterBuilder - Builder for generating an equality delete file
  • No ReadBuilder here - the file format reader builder is reused
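
Sketched roughly (again, the signatures here are assumptions; the builders in the PR carry more configuration), these builders produce the existing Iceberg writer types:

```java
import java.io.IOException;
import org.apache.iceberg.deletes.EqualityDeleteWriter;
import org.apache.iceberg.deletes.PositionDeleteWriter;
import org.apache.iceberg.io.DataWriter;

// Hypothetical shape of the engine-facing writer builders.
interface DataWriterBuilder<D> {
  DataWriter<D> build() throws IOException;                    // writes a data file
}

interface PositionDeleteWriterBuilder<D> {
  PositionDeleteWriter<D> build() throws IOException;          // writes a position delete file
}

interface EqualityDeleteWriterBuilder<D> {
  EqualityDeleteWriterBuilder<D> equalityFieldIds(int... ids); // columns used for equality
  EqualityDeleteWriter<D> build() throws IOException;          // writes an equality delete file
}
```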

Implementation classes tying them together (usage is sketched after the list):

  • WriterBuilder class which implements the AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder interfaces using the AppenderBuilder provided by the File Format itself
  • ObjectModelRegistry which stores the available ObjectModels and from which users can request readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
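
From an engine's point of view the flow is roughly the following; the ObjectModelRegistry method names and the SparkParquetObjectModel class below are made up for illustration, only the interface names come from the lists above:

```java
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.spark.sql.catalyst.InternalRow;

class RegistryUsageSketch {
  static void example(InputFile inputFile, OutputFile outputFile) {
    // Register the (file format, object model) pair once at startup (hypothetical class).
    ObjectModelRegistry.register(new SparkParquetObjectModel());

    // Request a reader for Parquet files producing Spark InternalRows (assumed signature).
    ReadBuilder<InternalRow> readBuilder =
        ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "spark", inputFile);

    // Request a data file writer for the same format/object model pair (assumed signature).
    DataWriterBuilder<InternalRow> dataWriterBuilder =
        ObjectModelRegistry.dataWriterBuilder(FileFormat.PARQUET, "spark", outputFile);
  }
}
```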

Contributor

@liurenjie1024 liurenjie1024 left a comment


Thanks @pvary for this PR, left some comments, generally looks great!

Contributor

@liurenjie1024 liurenjie1024 left a comment


Thanks @pvary for this PR, LGTM!

Contributor

@liurenjie1024 liurenjie1024 left a comment


LGTM, just some nits!

@amogh-jahagirdar amogh-jahagirdar self-requested a review May 7, 2025 14:25
Contributor

@liurenjie1024 liurenjie1024 left a comment


Thanks @pvary !

Contributor

@stevenzwu stevenzwu left a comment


Left some initial comments on the interfaces. Will still need to take a look at the other, bigger PR to understand more of the work as a whole.

@RussellSpitzer
Member

@huaxingao, @pvary Could you take a look from a Comet perspective? I know you have some custom code that would be using this as well.

Contributor Author

pvary commented May 21, 2025

> @huaxingao, @pvary Could you take a look from a Comet perspective? I know you have some custom code that would be using this as well.

Originally I thought that Comet could be just another FileAccessFactory to register, but based on @rdblue's suggestion I merged it back into the Spark vectorized/Parquet factory. If the current API does not work for the custom code, that direction could be worth exploring.

@pvary pvary force-pushed the file_format_api_only branch from 646d9f7 to d450ec6 Compare November 21, 2025 21:41
@pvary pvary added this to the Iceberg 1.11.0 milestone Jan 7, 2026
Contributor Author

pvary commented Feb 5, 2026

Merged the changes from #12298.

The remaining open questions are highlighted.

Based on @rdblue’s comment on the other PR, the change is “about ready to go in.”

However, as it has changed somewhat compared to earlier iterations, previous reviewers may want to take a final look.

Thanks to everyone for spending time on this!

* Creates a writer for the given schemas.
*
* @param icebergSchema the Iceberg schema defining the table structure
* @param fileSchema the file format specific target schema for the output files
Contributor


Not a blocker, but I want to note it somewhere: the file schema will be converted from the Iceberg Schema and will directly match in structure, names, and equivalent types. We should probably document this and make sure that all of the format builders follow that rule.

Similarly, we should also think about guarantees and/or requirements of the engine schema. I think this is dependent on the engine, though. If an engine builds its integration so that names must match or positions must match, that's up to the engine.
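
For Parquet, the derivation rule can be illustrated with the existing conversion utility (pre-existing Iceberg code, not something added by this PR):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.parquet.ParquetSchemaUtil;
import org.apache.iceberg.types.Types;
import org.apache.parquet.schema.MessageType;

Schema icebergSchema =
    new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()));

// The file schema is a direct translation of the Iceberg schema:
// structure and names match, and types map to their Parquet equivalents.
MessageType fileSchema = ParquetSchemaUtil.convert(icebergSchema, "table");
```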

Contributor Author


Do you think we should add it here too, or it is enough as we describe it at the writer and the reader?

  • FileWriterBuilder.engineSchema
   * <p>The engine schema must be aligned with the Iceberg schema, but may include representation
   * details that Iceberg considers equivalent.
  • ReadBuilder.engineProjection
   * <p>When provided, this schema should be consistent with the requested Iceberg projection, while
   * allowing representation differences. Examples include:

Contributor


I think this should state how the file schema is derived because it is always a direct translation from the Iceberg schema and the structure and names match.

For the engine schema, I think mentioning that it is engine-specific is the right thing to do.

* <p>Some data types require additional type information from the engine schema that cannot be
* fully expressed by the Iceberg schema or the data itself. For example, a variant type may use a
* shredded representation that relies on engine-specific metadata to map back to the Iceberg
* schema.
Contributor


I think a simple example (tinyint / int) would help as well.
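
For example (using Spark types purely for illustration), tinyint in the engine schema maps to Iceberg's 32-bit int, so the engine schema may carry a narrower representation that is considered equivalent:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Iceberg only has a 32-bit integer type.
Schema icebergSchema =
    new Schema(Types.NestedField.required(1, "flag", Types.IntegerType.get()));

// The engine schema may use a narrower representation (tinyint/ByteType)
// that the engine-to-Iceberg conversion treats as equivalent to int.
StructType engineSchema = new StructType().add("flag", DataTypes.ByteType, false);
```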

Contributor Author


Added a sentence about the ints.

* schema.
*
* <p>The engine schema must be aligned with the Iceberg schema, but may include representation
* details that Iceberg considers equivalent.
Contributor


This is a good way to state the requirement.

Contributor Author


Is it enough here, or shall we add it somewhere else too?

Contributor


I think it's fine here. Engine schema is really a contract between the engine and its registered object model.

*
* <p>This registry provides access to {@link ReadBuilder}s for data consumption and {@link
* FileWriterBuilder}s for writing various types of Iceberg content files. The appropriate builder
* is selected based on {@link FileFormat} and object model name.
Contributor


Nit: object model class.

* <p>{@link FormatModel} objects are registered through {@link #register(FormatModel)} and used for
* creating readers and writers. Read builders are returned directly from the factory. Write
* builders may be wrapped in specialized content file writer implementations depending on the
* requested builder type.
Contributor


Not sure that this part about the write builders being wrapped needs to be mentioned here. What about "Readers and writers are created using builders from the static factory methods of this class."

Contributor

rdblue commented Feb 6, 2026

I think this is ready. From our regular syncs, as well as the discussion on the dev list, I don't think that there are any remaining blockers, so I'll go ahead and merge it. That will unblock the file format and engine integrations, which can all happen in parallel after this core work is done.

Thanks @pvary!

@rdblue rdblue merged commit a8ece05 into apache:main Feb 6, 2026
32 checks passed
Contributor Author

pvary commented Feb 7, 2026

Thanks for the review and the merge @rdblue, and to everybody else too!
Opened the Parquet PR: #15253

Will open the others and the javadoc tweak soon too.
