Fix code bundle upload hanging indefinitely on network issues #896
Merged
The httpx client used for uploading code bundles had no timeout configured, causing `flyte run` to hang forever at "Uploading code bundle..." when the network connection to object storage was blocked or dropped.

- Add 10-minute upload timeout (600s read, 30s connect) to httpx client
- Catch httpx.TimeoutException, ConnectError, and OSError in retry loop
- Fix last_error tracking that was always None in retry warning logs
- Add unit tests for timeout, retry, and error handling behavior

Signed-off-by: Kevin Su <pingsutw@apache.org>
AdilFayyaz
previously approved these changes
Apr 4, 2026
Collaborator
AdilFayyaz
left a comment
Are we sure we want to hardcode the value? What if an upload takes longer than 10 minutes? Your call, otherwise LGTM!
Member
Author
Add an env var to make it configurable
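A minimal sketch of what an environment-variable override for the upload timeout could look like. The variable name `EXAMPLE_UPLOAD_TIMEOUT_SECONDS` and the helper `upload_timeout_seconds` are hypothetical, chosen only for illustration; the thread does not name the actual variable. Only the 600-second default is taken from the PR description.

```python
import os
from typing import Mapping

# 600s is the default stated in the PR description.
DEFAULT_UPLOAD_TIMEOUT_S = 600.0

# Hypothetical variable name, used only for this sketch.
ENV_VAR = "EXAMPLE_UPLOAD_TIMEOUT_SECONDS"


def upload_timeout_seconds(environ: Mapping[str, str] = os.environ) -> float:
    """Return the upload timeout, falling back to the default on bad input."""
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_UPLOAD_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        # Unparseable values fall back to the default instead of crashing.
        return DEFAULT_UPLOAD_TIMEOUT_S
    # Zero, negative, and NaN values are rejected as well.
    return value if value > 0 else DEFAULT_UPLOAD_TIMEOUT_S
```

Falling back to the default on invalid input is one possible design choice; raising a clear error instead would surface misconfiguration earlier.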
AdilFayyaz
approved these changes
Apr 6, 2026
Summary
- Add a 10-minute upload timeout (600s read, 30s connect) to the `httpx.AsyncClient` used for uploading code bundles to object storage. Previously no timeout was configured, causing `flyte run` to hang forever at "Uploading code bundle..." when the network connection was blocked or dropped.
- Catch `httpx.TimeoutException`, `httpx.ConnectError`, and `OSError` in the retry loop so transport-level errors are retried with backoff (previously only HTTP status codes were retried).
- Fix `last_error` tracking that was always `None` in retry warning logs.

Context

Reported by a customer whose `flyte run` hung indefinitely at "Uploading code bundle..." from remote EC2 instances. The control plane connection worked fine, but the signed URL upload to object storage was silently blocked by network configuration. Without a timeout, the SDK gave no error: it just hung forever.

Slack thread: https://unionai.slack.com/archives/C05H8K48HPE/p1774990000333939
Test plan
- `pytest tests/flyte/remote/test_upload_retry.py`: 6 new tests covering timeout config, retry on timeout/connect errors, retry on 5xx then success, and no retry on 4xx
- `pytest tests/flyte/remote/test_data_errors.py`: existing upload error tests still pass
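The retry behavior these tests exercise can be sketched with the standard library alone. This is an illustrative approximation, not the SDK's implementation: `upload_with_retry`, `UploadFailed`, and the `attempt` callable (which returns an HTTP status code) are hypothetical. In the real code the retryable set would presumably also include `httpx.TimeoutException` and `httpx.ConnectError`, which do not derive from `OSError`.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, Tuple, Type

logger = logging.getLogger(__name__)


class UploadFailed(RuntimeError):
    """Hypothetical error type for this sketch."""


async def upload_with_retry(
    attempt: Callable[[], Awaitable[int]],
    retries: int = 3,
    base_delay: float = 0.5,
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, OSError),
) -> int:
    last_error: Optional[BaseException] = None
    for n in range(retries):
        try:
            status = await attempt()
            if status < 400:
                return status
            if status < 500:
                # Client errors will not succeed on retry, so fail fast.
                raise UploadFailed(f"upload rejected with HTTP {status}")
            last_error = UploadFailed(f"server error HTTP {status}")
        except retryable as exc:
            # Record the transport error so the warning below is never None.
            last_error = exc
        logger.warning("upload attempt %d failed: %r", n + 1, last_error)
        if n < retries - 1:
            # Exponential backoff between attempts.
            await asyncio.sleep(base_delay * 2**n)
    raise UploadFailed(f"upload failed after {retries} attempts") from last_error
```

Assigning `last_error` in both the 5xx branch and the exception handler is what keeps the warning log informative, which mirrors the `last_error` fix described in the summary.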