-
Notifications
You must be signed in to change notification settings - Fork 13
support prefetch read for orc #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds prefetch read support for ORC format files, enabling parallel reading and improved performance. The implementation introduces a new interface hierarchy to separate prefetch-capable readers from standard file batch readers.
Key changes include:
- Creation of a new
PrefetchFileBatchReaderinterface that extracts prefetch-specific methods fromFileBatchReader - Implementation of
OrcReaderWrapperandReadRangeGeneratorclasses to enable ORC prefetch functionality - Refactoring of the existing
PrefetchFileBatchReaderclass toPrefetchFileBatchReaderImplto avoid naming conflicts
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
include/paimon/reader/prefetch_file_batch_reader.h |
New interface defining prefetch-specific methods (SeekToRow, GetNextRowToRead, GenReadRanges, SetReadRanges) |
include/paimon/reader/file_batch_reader.h |
Removed prefetch-specific methods, delegating them to PrefetchFileBatchReader interface |
src/paimon/format/orc/orc_reader_wrapper.h |
New wrapper class providing prefetch support for ORC readers |
src/paimon/format/orc/orc_reader_wrapper.cpp |
Implementation of ORC reader wrapper methods |
src/paimon/format/orc/read_range_generator.h |
New class for generating optimal read ranges for ORC files |
src/paimon/format/orc/read_range_generator.cpp |
Implementation of read range generation logic based on stripe and row group metadata |
src/paimon/format/orc/orc_file_batch_reader.h |
Updated to inherit from PrefetchFileBatchReader and use OrcReaderWrapper |
src/paimon/format/orc/orc_file_batch_reader.cpp |
Refactored to delegate operations to OrcReaderWrapper |
src/paimon/format/orc/orc_format_defs.h |
Added constants for prefetch configuration thresholds |
src/paimon/format/parquet/parquet_file_batch_reader.h |
Updated to inherit from PrefetchFileBatchReader |
src/paimon/common/reader/prefetch_file_batch_reader_impl.h |
Renamed from PrefetchFileBatchReader to avoid naming conflict |
src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp |
Updated with new class name and modified to work with PrefetchFileBatchReader interface |
src/paimon/core/operation/abstract_split_read.cpp |
Removed ORC exclusion from prefetch logic, enabling ORC prefetch support |
src/paimon/format/blob/blob_file_batch_reader.h |
Removed now-optional prefetch methods (formats not supporting prefetch) |
src/paimon/format/lance/lance_file_batch_reader.h |
Removed now-optional prefetch methods (formats not supporting prefetch) |
src/paimon/format/avro/avro_file_batch_reader.h |
Removed now-optional prefetch methods (formats not supporting prefetch) |
test/inte/read_inte_test.cpp |
Added test cases for ORC prefetch functionality |
test/inte/read_inte_with_index_test.cpp |
Added test case for ORC with prefetch enabled |
src/paimon/format/orc/orc_reader_wrapper_test.cpp |
New test file for OrcReaderWrapper functionality |
src/paimon/format/orc/read_range_generator_test.cpp |
New comprehensive test suite for ReadRangeGenerator |
Comments suppressed due to low confidence (2)
src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp:524
- Using assert(false) in production code on line 522 is problematic. This will cause the program to terminate in debug builds when GetNextRowToRead is called on PrefetchFileBatchReaderImpl, which is inappropriate for library code. If this method should not be called, it should return a proper Status error like Status::NotImplemented or Status::Invalid with a descriptive message instead of using assert.
src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp:107 - The constructor parameter is declared as 'const std::vector<std::shared_ptr<PrefetchFileBatchReader>>& readers' but on line 107 it's moved with 'std::move(readers)'. Since the parameter is a const reference, moving from it will not actually move the contents. Either remove 'const' from the parameter declaration or copy instead of move. The move is likely ineffective here.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| }; | ||
|
|
||
| uint64_t current_row = begin_row_num; | ||
| int32_t next_range_size = range_size; |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable 'next_range_size' on line 113 is declared as int32_t but is assigned from calculate_range_size which returns uint64_t. This could lead to overflow or sign conversion issues if range_size is large. Consider changing next_range_size to uint64_t to match the return type and avoid potential data loss.
| int32_t next_range_size = range_size; | |
| uint64_t next_range_size = range_size; |
| void OrcFileBatchReader::GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids) { | ||
| col_ids->push_back(type->getColumnId()); | ||
| for (uint64_t i = 0; i < type->getSubtypeCount(); i++) { | ||
| GetSubColumnIds(type->getSubtype(i), col_ids); |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name 'col_ids' is not very descriptive. Consider renaming to 'target_column_ids' or 'column_id_list' to match the naming conventions used elsewhere in the codebase (see line 192 where 'target_column_ids' is used).
| void OrcFileBatchReader::GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids) { | |
| col_ids->push_back(type->getColumnId()); | |
| for (uint64_t i = 0; i < type->getSubtypeCount(); i++) { | |
| GetSubColumnIds(type->getSubtype(i), col_ids); | |
| void OrcFileBatchReader::GetSubColumnIds(const ::orc::Type* type, | |
| std::vector<uint64_t>* target_column_ids) { | |
| target_column_ids->push_back(type->getColumnId()); | |
| for (uint64_t i = 0; i < type->getSubtypeCount(); i++) { | |
| GetSubColumnIds(type->getSubtype(i), target_column_ids); |
| assert(false); | ||
| } | ||
| } else { | ||
| for (const auto& [_, len] : kind_and_length) { | ||
| length += len; | ||
| } | ||
| } | ||
| } else { | ||
| assert(false); |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assert statements on lines 205 and 213 will terminate the program in release builds when these conditions are hit. These should be proper error handling with Status returns or exceptions, especially since this is library code that could be called with invalid data. Consider replacing assertions with proper error returns.
| const std::shared_ptr<arrow::MemoryPool>& arrow_pool, | ||
| const std::shared_ptr<::orc::MemoryPool>& orc_pool); | ||
|
|
||
| static void GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids); |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function parameter 'col_ids' should be renamed to 'target_column_ids' or 'column_ids' for consistency with the naming pattern used throughout the codebase. The abbreviated form 'col_ids' doesn't follow the naming conventions seen in other parts of the code.
| static void GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids); | |
| static void GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* target_column_ids); |
Purpose
support orc prefetch reader
Tests
API and Format
Documentation