Skip to content

Conversation

@lucasfang
Copy link
Collaborator

Purpose

support orc prefetch reader

Tests

API and Format

Documentation

Copilot AI review requested due to automatic review settings December 22, 2025 08:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds prefetch read support for ORC format files, enabling parallel reading and improved performance. The implementation introduces a new interface hierarchy to separate prefetch-capable readers from standard file batch readers.

Key changes include:

  • Creation of a new PrefetchFileBatchReader interface that extracts prefetch-specific methods from FileBatchReader
  • Implementation of OrcReaderWrapper and ReadRangeGenerator classes to enable ORC prefetch functionality
  • Refactoring of the existing PrefetchFileBatchReader class to PrefetchFileBatchReaderImpl to avoid naming conflicts

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
include/paimon/reader/prefetch_file_batch_reader.h New interface defining prefetch-specific methods (SeekToRow, GetNextRowToRead, GenReadRanges, SetReadRanges)
include/paimon/reader/file_batch_reader.h Removed prefetch-specific methods, delegating them to PrefetchFileBatchReader interface
src/paimon/format/orc/orc_reader_wrapper.h New wrapper class providing prefetch support for ORC readers
src/paimon/format/orc/orc_reader_wrapper.cpp Implementation of ORC reader wrapper methods
src/paimon/format/orc/read_range_generator.h New class for generating optimal read ranges for ORC files
src/paimon/format/orc/read_range_generator.cpp Implementation of read range generation logic based on stripe and row group metadata
src/paimon/format/orc/orc_file_batch_reader.h Updated to inherit from PrefetchFileBatchReader and use OrcReaderWrapper
src/paimon/format/orc/orc_file_batch_reader.cpp Refactored to delegate operations to OrcReaderWrapper
src/paimon/format/orc/orc_format_defs.h Added constants for prefetch configuration thresholds
src/paimon/format/parquet/parquet_file_batch_reader.h Updated to inherit from PrefetchFileBatchReader
src/paimon/common/reader/prefetch_file_batch_reader_impl.h Renamed from PrefetchFileBatchReader to avoid naming conflict
src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp Updated with new class name and modified to work with PrefetchFileBatchReader interface
src/paimon/core/operation/abstract_split_read.cpp Removed ORC exclusion from prefetch logic, enabling ORC prefetch support
src/paimon/format/blob/blob_file_batch_reader.h Removed now-optional prefetch methods (formats not supporting prefetch)
src/paimon/format/lance/lance_file_batch_reader.h Removed now-optional prefetch methods (formats not supporting prefetch)
src/paimon/format/avro/avro_file_batch_reader.h Removed now-optional prefetch methods (formats not supporting prefetch)
test/inte/read_inte_test.cpp Added test cases for ORC prefetch functionality
test/inte/read_inte_with_index_test.cpp Added test case for ORC with prefetch enabled
src/paimon/format/orc/orc_reader_wrapper_test.cpp New test file for OrcReaderWrapper functionality
src/paimon/format/orc/read_range_generator_test.cpp New comprehensive test suite for ReadRangeGenerator
Comments suppressed due to low confidence (2)

src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp:524

  • Using assert(false) in production code on line 522 is problematic. This will cause the program to terminate in debug builds when GetNextRowToRead is called on PrefetchFileBatchReaderImpl, which is inappropriate for library code. If this method should not be called, it should return a proper Status error like Status::NotImplemented or Status::Invalid with a descriptive message instead of using assert.
    src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp:107
  • The constructor parameter is declared as 'const std::vector<std::shared_ptr<PrefetchFileBatchReader>>& readers' but on line 107 it's moved with 'std::move(readers)'. Since the parameter is a const reference, moving from it will not actually move the contents. Either remove 'const' from the parameter declaration or copy instead of move. The move is likely ineffective here.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

};

uint64_t current_row = begin_row_num;
int32_t next_range_size = range_size;
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable 'next_range_size' on line 113 is declared as int32_t but is assigned from calculate_range_size which returns uint64_t. This could lead to overflow or sign conversion issues if range_size is large. Consider changing next_range_size to uint64_t to match the return type and avoid potential data loss.

Suggested change
int32_t next_range_size = range_size;
uint64_t next_range_size = range_size;

Copilot uses AI. Check for mistakes.
Comment on lines +204 to +207
void OrcFileBatchReader::GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids) {
col_ids->push_back(type->getColumnId());
for (uint64_t i = 0; i < type->getSubtypeCount(); i++) {
GetSubColumnIds(type->getSubtype(i), col_ids);
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name 'col_ids' is not very descriptive. Consider renaming to 'target_column_ids' or 'column_id_list' to match the naming conventions used elsewhere in the codebase (see line 192 where 'target_column_ids' is used).

Suggested change
void OrcFileBatchReader::GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids) {
col_ids->push_back(type->getColumnId());
for (uint64_t i = 0; i < type->getSubtypeCount(); i++) {
GetSubColumnIds(type->getSubtype(i), col_ids);
void OrcFileBatchReader::GetSubColumnIds(const ::orc::Type* type,
std::vector<uint64_t>* target_column_ids) {
target_column_ids->push_back(type->getColumnId());
for (uint64_t i = 0; i < type->getSubtypeCount(); i++) {
GetSubColumnIds(type->getSubtype(i), target_column_ids);

Copilot uses AI. Check for mistakes.
Comment on lines +205 to +213
assert(false);
}
} else {
for (const auto& [_, len] : kind_and_length) {
length += len;
}
}
} else {
assert(false);
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert statements on lines 205 and 213 will terminate the program in release builds when these conditions are hit. These should be proper error handling with Status returns or exceptions, especially since this is library code that could be called with invalid data. Consider replacing assertions with proper error returns.

Copilot uses AI. Check for mistakes.
const std::shared_ptr<arrow::MemoryPool>& arrow_pool,
const std::shared_ptr<::orc::MemoryPool>& orc_pool);

static void GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids);
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function parameter 'col_ids' should be renamed to 'target_column_ids' or 'column_ids' for consistency with the naming pattern used throughout the codebase. The abbreviated form 'col_ids' doesn't follow the naming conventions seen in other parts of the code.

Suggested change
static void GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* col_ids);
static void GetSubColumnIds(const ::orc::Type* type, std::vector<uint64_t>* target_column_ids);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant