API, Core: Introduce foundational types for V4 manifest support#15049
API, Core: Introduce foundational types for V4 manifest support#15049anoopj wants to merge 4 commits intoapache:mainfrom
Conversation
| DATA(0), | ||
| POSITION_DELETES(1), | ||
| EQUALITY_DELETES(2); | ||
| EQUALITY_DELETES(2), |
There was a problem hiding this comment.
I had to make this API change so that TrackingInfo can return the contentType. This is backward compatible since this is a new enum value. If we don't want to make any API changes, I will need to change TrackingInfo.contentType() to return an int instead.
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Did a cursory review, still need to review in more depth when I get a chance but thank you for publishing this @anoopj !
| * | ||
| * <p>Used for applying manifest deletion vectors. | ||
| */ | ||
| Long pos(); |
There was a problem hiding this comment.
Should this be a primitive long? as There would always be some non-null ordinal position
There was a problem hiding this comment.
Minor Preference: Would prefer to use the full name position()
There was a problem hiding this comment.
I was trying to be consistent with the existing ContentFile.pos() API that returns a Long.
There was a problem hiding this comment.
I'm not sure that we will always project this.
Also, this is in TrackingInfo so I don't think we need it here.
| * Returns the tracking information for this entry. | ||
| * | ||
| * <p>Contains status, snapshot ID, sequence numbers, and first-row-id. Optional - may be null if | ||
| * tracking info is inherited. | ||
| */ | ||
| TrackingInfo trackingInfo(); |
There was a problem hiding this comment.
Sorry if I missed the discussion on the original PR, so looks like TrackingInfo has a manifestLocation and ordinal as well. I think the intent there is so that TrackingInfo has enough information self contained on it's structure so it can stand alone?
There was a problem hiding this comment.
Ah just saw the top level comment in TrackingInfo, cool!
| * <p>Must be defined if location is defined. | ||
| */ | ||
| Long fileSizeInBytes(); |
There was a problem hiding this comment.
Maybe a miss on the proposal but this would be required always no? And as a result primitive long?
There was a problem hiding this comment.
That makes sense. I changed the proposal (left it as a suggestion). I think we had it optional when we had inline manifest DVs as a separate record, which is not the case anymore.
| * @return a DataFile representation | ||
| * @throws IllegalStateException if content_type is not DATA | ||
| */ | ||
| DataFile asDataFile(PartitionSpec spec); |
There was a problem hiding this comment.
according to #14533 (comment) this should probably not have any arguments
| "referenced_file", | ||
| Types.StringType.get(), | ||
| "Location of data file that a DV references if content_type is POSITION_DELETES. " | ||
| + "Location of affiliated data manifest if content_type is DELETE_MANIFEST, or null if unaffiliated"); |
There was a problem hiding this comment.
These descriptions should be shorter. This explains what the field is, but it doesn't need to document all idiosyncrasies.
| */ | ||
| interface TrackingInfo { | ||
| /** Status of an entry in a tracked file */ | ||
| enum Status { |
There was a problem hiding this comment.
We had to copy this because it was in ManifestEntry. It's probably better to make this top-level to avoid that problem in the future.
| * | ||
| * <p>Must be defined if content_type is POSITION_DELETES, must be null otherwise. | ||
| */ | ||
| ContentInfo contentInfo(); |
There was a problem hiding this comment.
Not a problem: I'm debating whether to have a separate struct for this. Primarily motivated by how awkward the name "content info" is. It's nice to have an optional struct with required fields, but that is a very minor benefit.
There was a problem hiding this comment.
I am neutral on this. Can change this if you feel strongly.
| * | ||
| * <p>TODO: Define ContentStats structure per V4 proposal. | ||
| */ | ||
| Object contentStats(); |
There was a problem hiding this comment.
I think this is a really important part of the implementation, so we should focus on getting this ready.
There was a problem hiding this comment.
ContentStats is already in core: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/stats/ContentStats.java#L25
But in order to use it here we would have to move it into the api module
There was a problem hiding this comment.
Thank you @nastra. Incorporated. We don't need to move it to api for now because this PR is currently against core.
There was a problem hiding this comment.
ah got it, great. Somehow I thought we're in the api module. All good then
|
Looking good overall, but I think we need to get |
These types follow the https://s.apache.org/iceberg-single-file-commit and will be used by subsequent PRs for manifest reading/writing. For now, we are adding these as package-private interfaces in core, and eventually we will move them into api. Changes API: - Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types Core: - TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4) - TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id - ContentInfo - Metadata for externally stored content (deletion vectors) - ManifestStats - Manifest-level statistics (file/row counts, min sequence number)
47542d3 to
67c20b6
Compare
These types will be used for the v4 adaptive metadata tree and will be used by subsequent PRs for manifest reading/writing.
For now, we are adding these as package-private interfaces in core, and eventually we will move them into api.
Prototype PR: #14533
Changes
API:
Core: