
Fix code bundle upload hanging indefinitely on network issues #896

Merged
pingsutw merged 7 commits into main from code-bundle-hang
Apr 7, 2026

pingsutw (Member) commented Apr 2, 2026

Summary

  • Add a 10-minute timeout (600s read, 30s connect) to the `httpx.AsyncClient` used for uploading code bundles to object storage. Previously no timeout was configured, so `flyte run` hung forever at "Uploading code bundle..." when the network connection was blocked or dropped.
  • Catch `httpx.TimeoutException`, `httpx.ConnectError`, and `OSError` in the retry loop so that transport-level errors are retried with backoff (previously only HTTP status codes were retried).
  • Fix `last_error` tracking, which was always `None` in retry warning logs.
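
For readers who want to see the shape of the change, here is a minimal sketch of a retry loop that handles both transport errors and 5xx responses while tracking `last_error`. It is not the SDK's actual code: `upload_with_retry`, `attempt_upload`, and the attempt and delay defaults are hypothetical names and values chosen for illustration. The sketch uses only the standard library, so it takes the retryable exception types as a parameter; with httpx one would presumably pass `(httpx.TimeoutException, httpx.ConnectError, OSError)` and configure the client with something like `httpx.Timeout(600.0, connect=30.0)`.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, Tuple, Type

logger = logging.getLogger(__name__)


async def upload_with_retry(
    attempt_upload: Callable[[], Awaitable[int]],  # hypothetical: performs one upload, returns the HTTP status
    retryable_errors: Tuple[Type[BaseException], ...] = (TimeoutError, OSError),
    max_attempts: int = 3,      # assumed value, not taken from the PR
    base_delay: float = 0.01,   # assumed value, kept tiny so the demo runs quickly
) -> int:
    """Retry on transport errors and 5xx statuses; return any other status as-is."""
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            status = await attempt_upload()
        except retryable_errors as exc:
            # Record the transport-level failure so the warning below can report it.
            last_error = exc
        else:
            if status < 500:
                return status  # success or a 4xx: not retried in this sketch
            last_error = RuntimeError(f"server returned HTTP {status}")
        if attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            logger.warning(
                "upload attempt %d failed (%r); retrying in %.2fs", attempt, last_error, delay
            )
            await asyncio.sleep(delay)
    raise RuntimeError(f"upload failed after {max_attempts} attempts") from last_error


async def _demo() -> int:
    # Simulated outcomes: a timeout, then a 503, then success.
    outcomes = iter([TimeoutError("read timed out"), 503, 200])

    async def flaky_upload() -> int:
        outcome = next(outcomes)
        if isinstance(outcome, BaseException):
            raise outcome
        return outcome

    return await upload_with_retry(flaky_upload)


final_status = asyncio.run(_demo())
```

Assigning `last_error` in both the exception branch and the 5xx branch is what keeps the warning log from printing `None`, which is the kind of bug the third bullet describes.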

Context

Reported by a customer whose `flyte run` hung indefinitely at "Uploading code bundle..." from remote EC2 instances. The control plane connection worked fine, but the signed URL upload to object storage was silently blocked by network configuration. Without a timeout, the SDK reported no error and simply hung forever.

Slack thread: https://unionai.slack.com/archives/C05H8K48HPE/p1774990000333939

Test plan

  • `pytest tests/flyte/remote/test_upload_retry.py`: 6 new tests covering timeout config, retry on timeout/connect errors, retry on 5xx then success, and no retry on 4xx
  • `pytest tests/flyte/remote/test_data_errors.py`: existing upload error tests still pass

pingsutw added 6 commits April 2, 2026 11:53
The httpx client used for uploading code bundles had no timeout configured,
causing `flyte run` to hang forever at "Uploading code bundle..." when the
network connection to object storage was blocked or dropped.

- Add 10-minute upload timeout (600s read, 30s connect) to httpx client
- Catch httpx.TimeoutException, ConnectError, and OSError in retry loop
- Fix last_error tracking that was always None in retry warning logs
- Add unit tests for timeout, retry, and error handling behavior

Signed-off-by: Kevin Su <pingsutw@apache.org>
AdilFayyaz (Collaborator) previously approved these changes Apr 4, 2026

Are we sure we want to hardcode the value? What if an upload takes longer than 10 minutes? Your call, otherwise LGTM!

Signed-off-by: Kevin Su <pingsutw@apache.org>
pingsutw (Member, Author) commented Apr 4, 2026

Add an env var to make it configurable
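
A minimal sketch of what an environment-variable override could look like, assuming a fallback to the 600-second default from the summary. The variable name `EXAMPLE_UPLOAD_TIMEOUT_SECONDS` and the helper `resolve_upload_timeout` are hypothetical; the thread does not state the name the SDK actually uses.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name; the PR thread does not state the real one.
UPLOAD_TIMEOUT_ENV = "EXAMPLE_UPLOAD_TIMEOUT_SECONDS"
DEFAULT_UPLOAD_TIMEOUT = 600.0  # the 10-minute default described in the summary


def resolve_upload_timeout(environ: Optional[Mapping[str, str]] = None) -> float:
    """Return the upload timeout in seconds, falling back to the default on bad input."""
    env = os.environ if environ is None else environ
    raw = env.get(UPLOAD_TIMEOUT_ENV)
    if raw is None or not raw.strip():
        return DEFAULT_UPLOAD_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        # Unparseable values are ignored rather than crashing the upload.
        return DEFAULT_UPLOAD_TIMEOUT
    # Non-positive values would disable or break the timeout, so ignore them too.
    return value if value > 0 else DEFAULT_UPLOAD_TIMEOUT


timeout_default = resolve_upload_timeout({})
timeout_override = resolve_upload_timeout({UPLOAD_TIMEOUT_ENV: "1800"})
```

With this shape, a user with very large bundles could raise the limit to 30 minutes by exporting the variable as `1800` before invoking `flyte run`, which addresses the hardcoding concern raised in the review.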

pingsutw merged commit 4692303 into main Apr 7, 2026
30 checks passed

2 participants