Skip to content

Conversation

@SGZW
Copy link
Contributor

@SGZW SGZW commented Dec 23, 2025

Purpose

Linked issue: close #xxx

background: compile and run under gcc8
1.

src/paimon/common/file_index/file_index_format_test.cpp

unordered_map iterate order is undefined, so make ut not depend on the order of return vector's element order
2.

src/paimon/format/orc/complex_predicate_test.cpp, src/paimon/format/orc/orc_file_batch_reader_test.cpp

fix ut expect data, hard coding results is mismatch, reason as follows:
(1) refer: https://github.com/eggert/tz/blob/main/asia#L653
When using the Asia/Shanghai timezone, timestamps prior to 1901 have an additional offset of 5 minutes and 43 seconds
(2) refer: https://github.com/HowardHinnant/date/blob/master/include/date/tz.h
arrow vendored tz lib which direct read /etc/localtime to get timezone and format

Tests

API and Format

Documentation

@SGZW
Copy link
Contributor Author

SGZW commented Dec 23, 2025

@lxy-9602 PTAL,Thanks

[10, 1, 0, "2008-12-28 00:00:00.000123456", null, "dad"],
[10, 1, 100, "2008-12-28 00:00:00.00012345", "-123.45000", "eat"],
[10, 1, null, "1899-01-01 00:59:20.001001001", "0.00000", "fat"],
[10, 1, null, "1899-01-01 01:05:03.001001001", "0.00000", "fat"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unit test runs normally on my local machine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will recheck it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems these tests are passed as expected in our local environments and CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i find it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that ChunkedArrayFromJSON uses arrow_vendored::date::current_zone()? I looked at the source code, and it seems to always use the UTC time zone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my tz data version as follows, could you share your environment details?

xx@n31-146-204:~$ apt show tzdata
Package: tzdata
Version: 2025c-3
Priority: required
Section: localization
Maintainer: GNU Libc Maintainers <[email protected]>
Installed-Size: 1,391 kB
Provides: tzdata-forky
Depends: debconf (>= 0.5) | debconf-2.0
Breaks: tzdata-legacy (= 2023c-8), tzdata-legacy (= 2023c-9)
Replaces: tzdata-legacy (= 2023c-8), tzdata-legacy (= 2023c-9)
Homepage: https://www.iana.org/time-zones
Tag: role::app-data, use::timekeeping
Download-Size: 263 kB
APT-Sources: http://mirrors.aliyun.com/debian testing/main amd64 Packages
Description: time zone and daylight-saving time data
 This package contains data required for the implementation of
 standard local time for many representative locations around the
 globe. It is updated periodically to reflect changes made by
 political bodies to time zone boundaries, UTC offsets, and
 daylight-saving rules.

N: There is 1 additional record. Please use the '-a' switch to see it
xx@n31-146-204:~$ zdump -v Asia/Shanghai | grep 1900
Asia/Shanghai  Mon Dec 31 15:54:16 1900 UT = Mon Dec 31 23:59:59 1900 LMT isdst=0 gmtoff=29143
Asia/Shanghai  Mon Dec 31 15:54:17 1900 UT = Mon Dec 31 23:54:17 1900 CST isdst=0 gmtoff=28800

Copy link
Collaborator

@lucasfang lucasfang Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
Package: tzdata
Version: 2025b-0ubuntu0.24.04.1
Priority: important
Section: libs
Origin: Ubuntu
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: GNU Libc Maintainers <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 1420 kB
Provides: tzdata-trixie
Depends: debconf (>= 0.5) | debconf-2.0
Breaks: tzdata-legacy (= 2023c-8ubuntu1)
Replaces: tzdata-legacy (= 2023c-8ubuntu1)
Homepage: https://www.iana.org/time-zones
Task: cloud-minimal, minimal, server-minimal
Download-Size: 276 kB
APT-Manual-Installed: no
APT-Sources: http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages
Description: time zone and daylight-saving time data
Asia/Shanghai  Mon Dec 31 15:54:16 1900 UT = Mon Dec 31 23:59:59 1900 LMT isdst=0 gmtoff=29143
Asia/Shanghai  Mon Dec 31 15:54:17 1900 UT = Mon Dec 31 23:54:17 1900 CST isdst=0 gmtoff=28800

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is probably at the ORC layer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L327

Regarding the timezone handling for ORC here: if the timezones of the reader and writer are different, the final display results will vary. I’ve checked the code and confirmed this is indeed the case.
In the code, for the writer timezone: if no metadata is recorded in the ORC file, the local timezone will be used directly. For the reader timezone, it follows the configuration in the row_reader options – the default is GMT. If the local timezones differ, the displayed results will appear inconsistent. Could you please verify the code? @ChaomingZhangCN @lucasfang

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know that ORC reader performs timezone conversion based on the writer's and reader's timezones. However, the writer's timezone is fixed, and all readers use Asia/Shanghai—so the timezone conversion logic should be consistent. It's puzzling to see such a discrepancy.

@lucasfang lucasfang requested a review from Copilot December 23, 2025 09:19
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@SGZW
Copy link
Contributor Author

SGZW commented Dec 23, 2025

@lxy-9602 @ChaomingZhangCN PTAL, thanks

@ChaomingZhangCN
Copy link
Contributor

In that case, the integration test case ReadInteTest#TestAppendReadWithComplexTypePredicate should also fail.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +667 to +678
for (const auto& reader : index_file_readers) {
if (auto* r = dynamic_cast<BitSliceIndexBitmapFileIndexReader*>(reader.get())) {
bsi_reader = r;
} else if (auto* r = dynamic_cast<BitmapFileIndexReader*>(reader.get())) {
bitmap_reader = r;
} else if (auto* r = dynamic_cast<BloomFilterFileIndexReader*>(reader.get())) {
bloom_filter_reader = r;
}
}
ASSERT_TRUE(bsi_reader);
ASSERT_TRUE(bitmap_reader);
ASSERT_TRUE(bloom_filter_reader);
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop iterating through index_file_readers to find the different reader types has a potential issue: if multiple readers of the same type exist, only the last one encountered will be assigned. While this may not be an issue in practice if the implementation guarantees at most one reader of each type, it would be safer to add assertions within the loop to verify that each reader type is encountered at most once. This would make the assumption explicit and catch potential bugs if the behavior changes in the future.

Copilot uses AI. Check for mistakes.
Comment on lines +191 to +202
for (const auto& reader : index_file_readers) {
if (auto* r = dynamic_cast<BitSliceIndexBitmapFileIndexReader*>(reader.get())) {
bsi_reader = r;
} else if (auto* r = dynamic_cast<BitmapFileIndexReader*>(reader.get())) {
bitmap_reader = r;
} else if (auto* r = dynamic_cast<BloomFilterFileIndexReader*>(reader.get())) {
bloom_filter_reader = r;
}
{
ASSERT_OK_AND_ASSIGN(
auto result, bitmap_reader->VisitIn({Literal(Timestamp(1745542802000l, 0)),
}
ASSERT_TRUE(bsi_reader);
ASSERT_TRUE(bitmap_reader);
ASSERT_TRUE(bloom_filter_reader);
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop iterating through index_file_readers to find the different reader types has a potential issue: if multiple readers of the same type exist, only the last one encountered will be assigned. While this may not be an issue in practice if the implementation guarantees at most one reader of each type, it would be safer to add assertions within the loop to verify that each reader type is encountered at most once. This would make the assumption explicit and catch potential bugs if the behavior changes in the future.

Copilot uses AI. Check for mistakes.
Comment on lines +349 to +360
for (const auto& reader : index_file_readers) {
if (auto* r = dynamic_cast<BitSliceIndexBitmapFileIndexReader*>(reader.get())) {
bsi_reader = r;
} else if (auto* r = dynamic_cast<BitmapFileIndexReader*>(reader.get())) {
bitmap_reader = r;
} else if (auto* r = dynamic_cast<BloomFilterFileIndexReader*>(reader.get())) {
bloom_filter_reader = r;
}
{
ASSERT_OK_AND_ASSIGN(
auto result, bitmap_reader->VisitIn({Literal(Timestamp(1745542802001l, 0)),
}
ASSERT_TRUE(bsi_reader);
ASSERT_TRUE(bitmap_reader);
ASSERT_TRUE(bloom_filter_reader);
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop iterating through index_file_readers to find the different reader types has a potential issue: if multiple readers of the same type exist, only the last one encountered will be assigned. While this may not be an issue in practice if the implementation guarantees at most one reader of each type, it would be safer to add assertions within the loop to verify that each reader type is encountered at most once. This would make the assumption explicit and catch potential bugs if the behavior changes in the future.

Copilot uses AI. Check for mistakes.
Comment on lines +506 to +517
for (const auto& reader : index_file_readers) {
if (auto* r = dynamic_cast<BitSliceIndexBitmapFileIndexReader*>(reader.get())) {
bsi_reader = r;
} else if (auto* r = dynamic_cast<BitmapFileIndexReader*>(reader.get())) {
bitmap_reader = r;
} else if (auto* r = dynamic_cast<BloomFilterFileIndexReader*>(reader.get())) {
bloom_filter_reader = r;
}
}
ASSERT_TRUE(bsi_reader);
ASSERT_TRUE(bitmap_reader);
ASSERT_TRUE(bloom_filter_reader);
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop iterating through index_file_readers to find the different reader types has a potential issue: if multiple readers of the same type exist, only the last one encountered will be assigned. While this may not be an issue in practice if the implementation guarantees at most one reader of each type, it would be safer to add assertions within the loop to verify that each reader type is encountered at most once. This would make the assumption explicit and catch potential bugs if the behavior changes in the future.

Copilot uses AI. Check for mistakes.
@SGZW
Copy link
Contributor Author

SGZW commented Dec 24, 2025

In that case, the integration test case ReadInteTest#TestAppendReadWithComplexTypePredicate should also fail.

I’m building the project with Blade, and I haven’t run the integration tests yet. I’ve already run into some issues with the unit tests, so I’ll look into them further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants