A collection of utility scripts and configuration files for developing and deploying the NVIDIA Dynamo distributed inference framework.
IMPORTANT: This is an experimental utilities repository and is NOT officially tied to or supported by the ai-dynamo project. These tools are provided as-is without any warranty or official support. Use at your own risk.
This collection of utilities is maintained independently for development convenience and is not part of the official Dynamo project.
This repository contains essential development tools, build scripts, and configuration management utilities for working with the Dynamo project. These tools streamline the development workflow, manage container environments, and facilitate testing of the Dynamo inference system.
- Docker with GPU support
- Python >= 3.10
- Rust toolchain (for building Dynamo components)
- Git
- jq (for JSON processing in scripts)
```
dynamo-utils/
├── compile.sh                      # Build and install Dynamo Python packages
├── curl.sh                         # Test models via chat completions API
├── inference.sh                    # Launch Dynamo inference services
├── devcontainer_sync.sh            # Sync dev configs across projects
├── devcontainer.json               # VS Code Dev Container configuration
├── common.py                       # Shared utilities module
├── show_dynamo_branches.py         # Branch status checker
├── show_commit_history.py          # Commit history with composite SHAs
├── git_stats.py                    # Git repository statistics analyzer
├── update_html_pages.sh            # HTML page update cron script
├── soak_fe.py                      # Frontend soak testing script
├── gpu_reset.sh                    # GPU reset utility
└── container/                      # Docker-related scripts
    ├── build_images.py             # Automated Docker build/test pipeline
    ├── build_images_report.html.j2 # HTML report template
    ├── retag_images.py             # Docker image retagging utility
    ├── restart_gpu_containers.sh   # GPU error monitoring/restart
    └── cleanup_old_images.sh       # Cleanup old Docker images
```
Builds and installs Python packages for the Dynamo distributed inference framework.
```
# Development mode (fast build, editable install)
./compile.sh --dev

# Release mode (optimized build)
./compile.sh --release

# Clean Python packages
./compile.sh --python-clean

# Clean Rust build artifacts
./compile.sh --cargo-clean
```

Packages built:

- `ai-dynamo-runtime`: Core Rust extensions + Python bindings
- `ai-dynamo`: Complete framework with all components
Convenient script to test models via the chat completions API with retry logic and response validation.
```
# Basic test
./curl.sh --port 8000 --prompt "Hello!"

# Streaming with retry
./curl.sh --port 8000 --stream --retry --prompt "Tell me a story"

# Loop testing with metrics
./curl.sh --port 8000 --loop --metrics --random
```

Options:

- `--port`: API server port (default: 8000)
- `--prompt`: User prompt
- `--stream`: Enable streaming responses
- `--retry`: Retry until success
- `--loop`: Run in an infinite loop
- `--metrics`: Show performance metrics
Launches Dynamo inference services (frontend and backend).
```
# Run with default framework
./inference.sh

# Run with specific backend
./inference.sh --framework vllm

# Dry run to see commands
./inference.sh --dry-run
```

Environment variables:

- `DYN_FRONTEND_PORT`: Frontend port (default: 8000)
- `DYN_BACKEND_PORT`: Backend port (default: 8081)
Automatically syncs development configuration files across multiple Dynamo project directories.
```
# Sync configurations
./devcontainer_sync.sh

# Dry run to preview changes
./devcontainer_sync.sh --dryrun

# Force sync regardless of changes
./devcontainer_sync.sh --force

# Silent mode for cron jobs
./devcontainer_sync.sh --silent
```

Automated Docker pipeline that builds and tests multiple inference frameworks (VLLM, SGLANG, TRTLLM) across different target environments (base, dev, local-dev), generates HTML reports, and sends email notifications.
Quick Test (Single Framework):

```
python3 container/build_images.py --repo-path ~/nvidia/dynamo_ci --sanity-check-only --framework sglang --force-run
```

Parallel Build with Skip:

```
python3 container/build_images.py --repo-path ~/nvidia/dynamo_ci --skip-action-if-already-passed --parallel --force-run
```

Full Build:

```
python3 container/build_images.py --repo-path ~/nvidia/dynamo_ci --parallel --force-run
```

Note: `--repo-path` is required and specifies the path to the Dynamo repository.
- Generates two report versions when email is requested:
  - File version: relative paths (just filenames)
  - Email version: absolute URLs with hostname
- Uses a helper function `get_log_url()` to abstract URL generation (see the sketch below)
- Passes a `use_absolute_urls` flag to control the URL format
- Default hostname: `keivenc-linux`
- Default HTML path: `/nvidia/dynamo_ci/logs`
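A minimal sketch of what such a helper might look like; the signature and parameter names here are illustrative assumptions, not the actual implementation in `build_images.py`:

```python
from pathlib import Path

# Hypothetical sketch of the URL helper described above; the real
# get_log_url() may differ in signature and details.
def get_log_url(log_file: Path, use_absolute_urls: bool,
                hostname: str = "keivenc-linux",
                html_path: str = "/nvidia/dynamo_ci/logs") -> str:
    if use_absolute_urls:
        # Email version: absolute URL including the hostname.
        return f"http://{hostname}{html_path}/{log_file.name}"
    # File version: relative path (just the filename).
    return log_file.name
```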
- Logs stored in: `~/nvidia/dynamo_ci/logs/YYYY-MM-DD/`
- Log filename format: `YYYY-MM-DD.{sha_short}.{framework}-{target}-{type}.log`
- HTML report: `YYYY-MM-DD.{sha_short}.report.html`
- Always uses date subdirectories
- Uses `log_dir.parent` to get the root logs directory
- Automatically extracts the version from `build.sh` output (not hardcoded)
- Version format depends on the git repository state:
  - If on a tagged commit: `v{tag}` (e.g., `v0.6.1`)
  - Otherwise: `v{latest_tag}.dev.{commit_id}` (e.g., `v0.6.1.dev.d9b674b86`)
- Version is extracted once per framework before creating task graphs
- All image tags use the dynamically extracted version (base, runtime, dev, local-dev)
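For illustration, here is one plausible way to derive the documented version format from git state. This is an assumption-laden sketch: the actual script parses `build.sh` output rather than computing the version itself.

```python
import subprocess

def guess_version(repo_path: str) -> str:
    # Illustrative only: build_images.py extracts the version from
    # build.sh output; this merely reproduces the documented format.
    def git(*args: str) -> str:
        return subprocess.check_output(
            ["git", "-C", repo_path, *args], text=True).strip()

    tag = git("describe", "--tags", "--abbrev=0")   # e.g. v0.6.1
    head = git("rev-parse", "HEAD")
    short = git("rev-parse", "--short=9", "HEAD")   # e.g. d9b674b86
    tagged_commit = git("rev-parse", f"{tag}^{{}}") # commit the tag points to
    if tagged_commit == head:
        return tag                                  # on a tagged commit
    return f"{tag}.dev.{short}"                     # e.g. v0.6.1.dev.d9b674b86
```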
- Uses lock files to prevent concurrent runs
- Lock file: `.dynamo_builder.lock` (in the repository root)
- Stores the PID in the lock file
- Checks whether the owning process is still running before acquiring the lock
- Cleans up stale locks automatically
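A minimal sketch of the documented PID-lock behavior (the real implementation may differ in details):

```python
import os
from pathlib import Path

LOCK_FILE = Path(".dynamo_builder.lock")  # lives in the repository root

def acquire_lock() -> bool:
    """Return True if the lock was acquired, False if another run owns it."""
    if LOCK_FILE.exists():
        try:
            pid = int(LOCK_FILE.read_text().strip())
            os.kill(pid, 0)          # signal 0: existence check only
            return False             # owner still running; refuse to start
        except (ValueError, ProcessLookupError):
            LOCK_FILE.unlink()       # stale or corrupt lock: clean it up
    LOCK_FILE.write_text(str(os.getpid()))
    return True
```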
- Uses SMTP server: `smtp.nvidia.com:25`
- Email subject format: `{SUCC|FAIL}: DynamoDockerBuilder - {sha_short} [{failed_tasks}]`
- HTML email body with clickable links (absolute URLs)
- Failed task names are included in the subject when any failures occurred
- Email notifications are NOT sent in dry-run mode; the script only notes that an email would have been sent
- Email failures are handled separately from HTML report generation, so their errors are logged independently
- Email sending continues even if HTML report generation fails
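The following sketch shows the documented notification shape using Python's standard library. The sender and recipient addresses are placeholders, and the real script's structure may differ:

```python
import smtplib
from email.mime.text import MIMEText

def send_report(html_body: str, sha_short: str, failed: list[str]) -> None:
    # Subject follows the documented format:
    # {SUCC|FAIL}: DynamoDockerBuilder - {sha_short} [{failed_tasks}]
    status = "FAIL" if failed else "SUCC"
    subject = f"{status}: DynamoDockerBuilder - {sha_short}"
    if failed:
        subject += f" [{', '.join(failed)}]"

    msg = MIMEText(html_body, "html")   # HTML body with absolute URLs
    msg["Subject"] = subject
    msg["From"] = "builder@example.com"  # placeholder address
    msg["To"] = "team@example.com"       # placeholder address
    with smtplib.SMTP("smtp.nvidia.com", 25) as smtp:
        smtp.send_message(msg)
```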
Overview: The show_commit_history.py script displays recent commits with their composite SHAs. Supports both terminal and HTML output modes with integrated caching for performance.
Caching System:

- Cache file: `.commit_history_cache.json` (in the repository root)
- Format: JSON mapping of full commit SHA → composite SHA
- Purpose: avoids expensive git checkout + composite SHA recalculation
- Performance: cached lookups are nearly instant vs. ~1-2 seconds per commit calculation
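Given the documented format (a flat JSON mapping of full SHA → composite SHA), the load/save half of the cache is essentially this:

```python
import json
from pathlib import Path

CACHE_FILE = Path(".commit_history_cache.json")  # repository root

def load_cache() -> dict[str, str]:
    # Maps full commit SHA -> composite SHA, per the format above.
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())
    return {}

def save_cache(cache: dict[str, str]) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
```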
Usage Examples:

Terminal output with caching:

```
python3 show_commit_history.py --repo-path ~/nvidia/dynamo_ci --max-commits 50
```

HTML output with caching:

```
python3 show_commit_history.py --repo-path ~/nvidia/dynamo_ci --html --max-commits 50 --output ~/nvidia/dynamo_ci/logs/commit-history.html
```

Verbose mode (shows cache hits/misses):

```
python3 show_commit_history.py --repo-path ~/nvidia/dynamo_ci --html --max-commits 50 --output ~/nvidia/dynamo_ci/logs/commit-history.html --verbose
```

HTML Output Features:
- Clickable commit SHAs linking to the GitHub commit page
- Clickable PR numbers extracted from commit messages
- Docker image detection showing an expandable list of images containing each commit SHA
- Expandable sections using the HTML `<details>` tag
- GitHub-style CSS
Implementation Details:

- Commit SHA format: full SHA for GitHub links, 9-character short SHA for display
- PR extraction regex: `r'\(#(\d+)\)'`
- Docker image query: runs `docker images --format "{{.Repository}}:{{.Tag}}"` once for all commits
- Cache management: the cache is loaded at start and saved after updates
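The regex and Docker command above translate directly into code; this sketch uses both verbatim (the surrounding function names are illustrative):

```python
import re
import subprocess

PR_RE = re.compile(r'\(#(\d+)\)')  # the documented extraction regex

def extract_pr_number(commit_message: str) -> str | None:
    m = PR_RE.search(commit_message)
    return m.group(1) if m else None

def list_docker_image_tags() -> list[str]:
    # Run the documented query once, up front, for all commits.
    out = subprocess.check_output(
        ["docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        text=True)
    return out.splitlines()

# e.g. extract_pr_number("Fix race in scheduler (#1234)") -> "1234"
```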
Overview: Docker image sizes are automatically populated in HTML reports for all BUILD tasks, regardless of whether images were just built or were skipped (using --skip-build-if-image-exists).
Implementation:

- Location: `container/build_images.py:1810-1851` and lines 2003-2004
- Method: `_populate_image_sizes(pipeline: BuildPipeline)`
- Called after task execution completes (but before HTML report generation)
- Only runs when NOT in dry-run mode
- Queries Docker for all BUILD tasks that have an `image_tag`
- Populates `task.image_size` in the format "XX.X GB"
Why This Approach:
- Centralized: All image size detection happens in one place after task execution
- Works for all scenarios: Handles both newly built images and existing images
- Non-blocking: Uses try-except to gracefully handle missing images
- Efficient: Only queries Docker once per image after all tasks complete
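A sketch of the per-image lookup that such a pass could use. This is an assumed shape, not the actual `_populate_image_sizes()` body; the try-except mirrors the documented non-blocking handling of missing images:

```python
import subprocess

def image_size_gb(image_tag: str) -> str | None:
    # Query Docker for the image size in bytes and render it as "XX.X GB".
    try:
        out = subprocess.check_output(
            ["docker", "image", "inspect", image_tag,
             "--format", "{{.Size}}"],
            text=True, stderr=subprocess.DEVNULL)
        return f"{int(out.strip()) / 1e9:.1f} GB"
    except subprocess.CalledProcessError:
        return None  # image missing: non-blocking, leave the size unset
```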
Automated cron script that runs every 30 minutes to update multiple HTML pages.
Schedule:

```
*/30 * * * * $HOME/nvidia/dynamo-utils/update_html_pages.sh
```

Tasks Performed:

1. Cleanup old logs (runs first)
   - Keeps only the last 10 non-empty dated directories in `$LOGS_DIR` (defaults to `$NVIDIA_HOME/logs`)
   - Deletes older directories to save disk space
   - Logs all cleanup actions with directory counts
2. Update branch status HTML (`$NVIDIA_HOME/index.html`, where `NVIDIA_HOME` defaults to the parent of the script directory)
   - Calls `show_dynamo_branches.py --html --output $BRANCHES_TEMP_FILE`
   - Shows the status of all dynamo branches
   - Uses atomic file replacement (temp file → final file)
3. Update commit history HTML (`$DYNAMO_REPO/index.html`, where `DYNAMO_REPO` defaults to `$NVIDIA_HOME/dynamo_latest`)
   - Calls `show_commit_history.py --repo-path . --html --max-commits 200 --output $COMMIT_HISTORY_HTML`
   - Leverages caching for fast updates (only calculates new commits)
   - Shows the last 200 commits with composite SHAs and Docker images
Log file: `$LOGS_DIR/cron.log` (where `LOGS_DIR` defaults to `$NVIDIA_HOME/logs`)
Performance Optimization:
- First run: ~50-100 seconds (calculates all 200 commits)
- Subsequent runs: ~5-10 seconds (only processes new commits, rest from cache)
- Cache file size: ~20-40KB for 200 commits
- No cache invalidation needed: Composite SHAs are deterministic based on file content
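The "no invalidation needed" property holds because a content hash is a pure function of file contents: identical inputs always produce the identical digest. A toy illustration of that idea; the real composite-SHA computation is not shown in this repository's docs and may differ:

```python
import hashlib
from pathlib import Path

def composite_sha(root: str) -> str:
    # Toy content hash: deterministic for identical file contents,
    # which is why cached values never go stale.
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and ".git" not in path.parts:
            h.update(path.relative_to(root).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()[:9]  # 9-char short form, as used for display
```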
Shared utilities module providing reusable components across all scripts.
Key Classes:

- `GitUtils`: GitPython API wrapper (no subprocess calls)
- `GitHubAPIClient`: GitHub API client with automatic token detection and rate-limit handling
- `DockerUtils`: Docker operations and image management
- `DynamoImageInfo`, `DockerImageInfo`: dataclasses for image metadata (used by the retag script)

Utility Functions:

- `get_terminal_width()`: terminal width detection with fallback
- `normalize_framework()`: framework name normalization
- `get_framework_display_name()`: pretty framework names for output
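A plausible shape for two of these helpers, to show the fallback and normalization behavior described above. The alias table and defaults are assumptions, not the actual `common.py` contents:

```python
import shutil

def get_terminal_width(default: int = 80) -> int:
    # shutil consults the attached TTY, then the COLUMNS env var,
    # then falls back to our default (e.g., when run from cron).
    return shutil.get_terminal_size(fallback=(default, 24)).columns

# Hypothetical alias table; the real mapping may differ.
_ALIASES = {"tensorrt-llm": "trtllm", "trt-llm": "trtllm"}

def normalize_framework(name: str) -> str:
    key = name.strip().lower()
    return _ALIASES.get(key, key)
```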
Branch status checker that displays information about dynamo* repository branches with GitHub PR integration.
Features:

- Scans all dynamo* directories for branch information
- Queries the GitHub API for PR status using `common.GitHubAPIClient`
- Supports both terminal and HTML output modes
- Gathers data in parallel for improved performance
- Detects the GitHub token automatically (env var, gh CLI config)
Git repository statistics analyzer that provides detailed contributor metrics and rankings.
Features:
- Analyzes git commit history for any time range
- Tracks unique contributors with commit counts
- Calculates lines added/deleted/changed per contributor
- Provides average statistics per person
- Dual ranking views: by commits and by lines changed
- Supports flexible time ranges (days, since/until dates, all-time)
Usage Examples:

```
# Statistics for last 30 days
python3 git_stats.py --days 30

# Statistics for last 7 days
python3 git_stats.py --days 7

# Statistics since a specific date
python3 git_stats.py --since "2025-01-01"

# Statistics for a date range
python3 git_stats.py --since "2025-01-01" --until "2025-01-31"

# All-time statistics
python3 git_stats.py
```

Output Includes:
- Number of unique contributors
- Total commits and line changes
- Average commits per person
- Average lines added/deleted/changed per person
- Contributor rankings by commits (with full details)
- Contributor rankings by lines changed
Sample Output:
================================================================================
Git Repository Statistics - Last 30 days
================================================================================
Number of unique contributors: 59
Total commits: 340
Total lines added: 148,764
Total lines deleted: 61,093
Total lines changed: 209,857
Average commits per person: 5.8
Average lines added per person: 2521.4
Average lines deleted per person: 1035.5
Average lines changed per person: 3556.9
================================================================================
Contributor Rankings (by commits)
================================================================================
Rank Name Email Commits Added Deleted Changed
--------------------------------------------------------------------------------------------------------------
1 John Doe [email protected] 31 4,455 3,512 7,967
...
The typical workflow for setting up a development environment:
1. Clone the Dynamo repository
2. Use `./compile.sh --dev` to build in development mode
3. Test with `./inference.sh` and `./curl.sh`

Best Practices:

- Development Mode: Use `./compile.sh --dev` for faster iteration during development
- Testing: Always test API endpoints with `./curl.sh` after starting services
- Configuration Sync: Run `devcontainer_sync.sh` periodically or via cron to keep configs updated
- Container Development: Use the Dev Container for a consistent development environment
- Port Conflicts: Check port availability before running inference services
If you encounter port conflicts when running inference.sh:
```
# Check what's using the port
lsof -i :8000

# Kill the process or use different ports
DYN_FRONTEND_PORT=8090 DYN_BACKEND_PORT=8091 ./inference.sh
```

For build issues with compile.sh:

```
# Clean and rebuild
./compile.sh --cargo-clean
./compile.sh --python-clean
./compile.sh --dev
```

If a Docker container fails to start:

```
# Check Docker daemon
docker ps

# Verify GPU support
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```

When contributing to this repository:
- Test scripts thoroughly before committing
- Update this README if adding new scripts or features
- Use meaningful commit messages with `--signoff`
- `CLAUDE.md`: Operational procedures, environment setup, and inference server documentation
- `.cursorrules`: Coding conventions, style guidelines, and development practices