Skip to content

Conversation

@michaelfeil
Copy link
Contributor

@michaelfeil michaelfeil commented Dec 13, 2025

Overview:

Details:

This PR comes after my company had a memory leak in our dynamo version (circular reference). Profiling large json payloads (3MB or larger each - 500k tokens, lead to a lot of profiling efforts.

In the profiling efforts, it became clear that a lot of allocations are queued around async_openai/chat.rs Struct types and memory in pythonize, which seems to cause some fragmentation We are also seeing some chunky behaviour for large allocations around serde_json::Value. In the end - it seemed dynamo was fine, we upgraded to another master as base + rewrote some things to make them less likley.

As side effect, we created the Python Bridge, to debug when things got moved into python and back. On the way we discovered that the team used a lot of tokio::spawn_blocking and has had performance issues due to the sync lock nature of python gil.

The python bridge is a global python singleton that makes it simpler to serialize the gil from rust, by submitting Rust. The goal is to be easier on the tokio runtime by awaiting the result of Python::with_gil using async without tokio.

Enjoy reading though it. I am pretty happy with the design as a global. Once thing is that the bridge remains active until you close the python process (which is fine, you don't want to tie it to a single engine's runtime).

Please skip DCO, merged main to often - the commits are all under DCO.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@michaelfeil michaelfeil requested a review from a team as a code owner December 13, 2025 00:55
@copy-pr-bot
Copy link

copy-pr-bot bot commented Dec 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link

👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Dec 13, 2025
@michaelfeil michaelfeil changed the title add python bridge feat: add python bridge Dec 13, 2025
@github-actions github-actions bot added the feat label Dec 13, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 13, 2025

Walkthrough

This pull request introduces a global Python-Rust bridge singleton that manages async conversions and GIL access in a dedicated thread pool. The bridge replaces direct Python GIL handling throughout the Python bindings, with corresponding refactoring in engine, library, and protocol layers to use the new async conversion and GIL acquisition APIs.

Changes

Cohort / File(s) Summary
Python Bridge Infrastructure
lib/bindings/python/rust/bridge.rs
New module implementing Bridge struct as global singleton (PYTHON_BRIDGE) with single worker thread and bounded job queue. Provides async to_py/from_py conversions via pythonize/depythonize, async with_gil for GIL-locked closures, and sync_with_gil blocking variant with deadlock guards. Maps conversion and channel errors to PyErr variants (PyRuntimeError, PyValueError).
Python Bindings Refactoring
lib/bindings/python/rust/engine.rs, lib/bindings/python/rust/lib.rs
Replaced direct Python::with_gil calls with bridge::Bridge::global().with_gil. In engine.rs: switched streaming path from spawn_blocking to bridge-managed GIL, overhauled error handling for PyGeneratorExit detection. In lib.rs: moved request depythonization outside GIL, converted to JSON via Bridge::global().from_py().await; updated streaming response handling to use async to_py conversion. Added TODO for streaming timeout guard.
Async Protocol Mapper
lib/runtime/src/protocols/annotated.rs
Added pub async fn map_data_async on Annotated that applies async transformation to inner data, awaits result, and returns Annotated with preserved id/event/comment metadata. Handles error propagation and None data cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • bridge.rs: Async/threading logic with GIL management, error handling consistency, and thread-safe singleton initialization require careful verification
  • engine.rs & lib.rs: Extensive refactoring across multiple Python interop paths with shifted GIL boundaries and async/await conversion patterns
  • Deadlock prevention in sync_with_gil: Cross-thread synchronization logic with bridge thread detection needs careful review
  • Error mapping: Consistent translation of channel failures and conversion errors across async/sync boundaries
  • GIL drop semantics: Ensure temporary Python objects are dropped within worker thread context in all paths

Poem

🐰 A bridge to Python, shiny and new,
Async conversions through and through,
GIL in a thread, no locks to fray,
Rust talks Python the modern way! 🌉✨

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add python bridge' directly and clearly describes the main feature being introduced, matching the primary change across all modified files.
Description check ✅ Passed PR description covers Overview, Details, and Where to start, but is missing explicit Related Issues reference and has incomplete issue template placeholder.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
lib/bindings/python/rust/lib.rs (1)

1042-1044: Consider implementing the timeout to prevent resource leaks.

The TODO is valid - if the Python side fails to consume the stream, the sender will block indefinitely, potentially leaking the spawned task. Consider using tokio::time::timeout as noted.

Example implementation:

use tokio::time::{timeout, Duration};

if let Err(_) = timeout(Duration::from_secs(300), tx.send(annotated)).await {
    tracing::error!("Timeout sending response - Python consumer may be stalled");
    break;
}

Would you like me to open an issue to track this improvement?

lib/bindings/python/rust/bridge.rs (1)

26-41: Bridge initialization design looks reasonable.

The single-worker design is appropriate given Python's GIL is itself a global singleton - multiple threads would just contend on the GIL. The bounded channel (1024) provides backpressure. Consider documenting these design decisions.

Consider adding a brief doc comment explaining the design rationale:

/// Global Python bridge for GIL-managed operations.
///
/// Uses a single worker thread since Python's GIL serializes execution anyway.
/// Jobs are queued via a bounded channel (capacity 1024) to provide backpressure.
pub fn global() -> &'static Bridge {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45e881d and 7e3f92b.

📒 Files selected for processing (4)
  • lib/bindings/python/rust/bridge.rs (1 hunks)
  • lib/bindings/python/rust/engine.rs (4 hunks)
  • lib/bindings/python/rust/lib.rs (4 hunks)
  • lib/runtime/src/protocols/annotated.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/bindings/python/rust/engine.rs
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
Repo: ai-dynamo/dynamo PR: 2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.

Applied to files:

  • lib/bindings/python/rust/lib.rs
🧬 Code graph analysis (3)
lib/bindings/python/rust/bridge.rs (1)
lib/bindings/python/rust/engine.rs (4)
  • mpsc (165-165)
  • new (80-87)
  • new (110-126)
  • e (317-317)
lib/bindings/python/rust/engine.rs (1)
lib/bindings/python/rust/bridge.rs (1)
  • global (22-42)
lib/bindings/python/rust/lib.rs (2)
lib/bindings/python/rust/bridge.rs (1)
  • global (22-42)
lib/runtime/src/protocols/annotated.rs (2)
  • map_data_async (121-143)
  • is_error (145-147)
🪛 GitHub Actions: Copyright Checks
lib/bindings/python/rust/bridge.rs

[error] 1-1: Copyright check failed. Missing or invalid copyright header in lib/bindings/python/rust/bridge.rs.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
🔇 Additional comments (5)
lib/runtime/src/protocols/annotated.rs (1)

126-143: Implementation logic is correct.

The async transformation properly handles all cases: transforms data when present, propagates errors via from_error, and preserves metadata (id, event, comment) in both success and None branches. This mirrors the synchronous map_data behavior appropriately.

lib/bindings/python/rust/lib.rs (1)

942-952: Consistent pattern for bridge-based request conversion.

The request handling pattern using Bridge::global().from_py() is consistent across round_robin, random, and direct methods. This properly moves Python object conversion off the async task thread.

lib/bindings/python/rust/engine.rs (2)

182-203: Good migration to bridge-based GIL handling.

The change from spawn_blocking + direct Python::with_gil to bridge::Bridge::global().with_gil() centralizes GIL management and aligns with the PR's goal of reducing Tokio runtime pressure from GIL contention.


307-333: Improved error classification with bridge-based GIL handling.

The reworked process_item function properly:

  1. Extracts the error message before moving into the bridge closure
  2. Uses bridge for GIL-dependent is_instance_of check
  3. Correctly maps errors to appropriate ResponseProcessingError variants
  4. Uses bridge.from_py::<Resp> for deserialization
lib/bindings/python/rust/bridge.rs (1)

44-94: Serialization/deserialization helpers are well-structured.

The to_py and from_py methods properly:

  • Use oneshot channels for async response handling
  • Map channel errors to appropriate PyErr variants
  • Ensure Python objects are dropped while GIL is held (line 82 comment)

Signed-off-by: michaelfeil <[email protected]>
Signed-off-by: michaelfeil <[email protected]>
Signed-off-by: michaelfeil <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contribution Pull request is from an external contributor feat size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant