Merged
7 changes: 6 additions & 1 deletion sdks/python/apache_beam/io/gcp/gcsio_integration_test.py
@@ -242,7 +242,12 @@ def test_create_default_bucket(self, mock_default_gcs_bucket_name):
    # verify soft delete policy is disabled by default in the default bucket
    # after creation
    self.assertEqual(bucket.soft_delete_policy.retention_duration_seconds, 0)
    bucket.delete()
Contributor:
This change alters the underlying test: if the bucket doesn't exist, we should be failing the test (this just covers up potentially flaky behavior). Do you know why this is happening? Is it because tests are overwriting the same bucket name, a timing issue, or are we actually failing to create these buckets?

Contributor Author:
Ahh, you're absolutely right, this masks the issue; I wasn't aware the test should fail if the bucket doesn't exist. But while investigating the test class I found that:

- get_or_create_default_gcs_bucket() creates (or gets) the bucket via create_bucket(),
- then waits 60 seconds for propagation,
- then asserts the bucket exists using the object returned from creation,
- and only after that makes a fresh call through get_bucket(), which delegates to lookup_bucket() and may hit a different GCS endpoint or cache layer.

Even after the 60-second wait, lookup_bucket() might not immediately see the newly created bucket, possibly because create_bucket() and lookup_bucket() hit different API endpoints with different cache states, or because of GCS eventual consistency across regions and caches. It might also be name collisions, with multiple test runs using the same random bucket names. The sequence is sketched below.
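
A rough sketch of that sequence (illustrative only; the names follow the test class, not the actual Beam implementation):

    import time

    def get_or_create_default_gcs_bucket(gcsio, bucket_name):
      # 1. Create (or fetch) the bucket via create_bucket().
      bucket = gcsio.create_bucket(bucket_name)
      # 2. Wait 60 seconds for propagation.
      time.sleep(60)
      # 3. Assert existence using the object returned from creation, not a
      #    fresh server-side lookup.
      assert bucket is not None
      return bucket

    # 4. Later, the test does a fresh lookup through a separate call path,
    #    get_bucket() -> lookup_bucket(), which may be served by a different
    #    endpoint or cache and can still return None at this point.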

Contributor:
Before doing this, could we try retrying the call on failure several times (maybe 5)? Then, if the bucket is still not found, we can pass.

Also, how often is this flaking?
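
A minimal sketch of that retry suggestion, assuming a fixed poll interval; get_bucket() is the existing GcsIO call, while the helper name, attempt count, and delay are made up for illustration:

    import time

    def get_bucket_with_retries(gcsio, bucket_name, attempts=5, delay_s=10):
      # Poll get_bucket() a few times before concluding the bucket is gone;
      # this tolerates lookup_bucket() lagging behind create_bucket().
      for _ in range(attempts):
        bucket = gcsio.get_bucket(bucket_name)
        if bucket is not None:
          return bucket
        time.sleep(delay_s)
      return None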

Contributor Author:
Based on the last 9 runs, we have 2 failures out of 8 completed runs (excluding 1 cancelled), giving us a 25% failure rate.

    existing_bucket = self.gcsio.get_bucket(overridden_bucket_name)
    if existing_bucket:
      try:
        existing_bucket.delete()
      except NotFound:
        pass

    self.assertIsNone(self.gcsio.get_bucket(overridden_bucket_name))
