CI: add batch runner with sim ChipWorker reuse #421

Closed
hw-native-sys-bot wants to merge 1 commit into hw-native-sys:main from hw-native-sys-bot:batch-ci-chipworker-reuse

@hw-native-sys-bot hw-native-sys-bot commented Mar 31, 2026

Summary

  • add a Python CI runner in tools/ci.py that reuses one ChipWorker per runtime group instead of spawning a fresh subprocess per task
  • make sim executor, kernel, and host_build_graph orchestration temp .so files unique so in-process sim reuse does not hit stale dlopen paths
  • close host orchestration handles after graph construction and add a --build-runtime flag for local validation after editing src/
  • cover sim worker reuse and the new CLI flag in unit tests
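
The reuse pattern in the first bullet could look roughly like the sketch below. `FakeWorker`, `run_grouped`, and the `other_runtime` name are hypothetical stand-ins, not the API of ChipWorker or tools/ci.py; the only point illustrated is that one worker is created per runtime group and closed when that group finishes.

```python
from collections import defaultdict


class FakeWorker:
    """Hypothetical stand-in for ChipWorker; counts how many instances are created."""

    created = 0

    def __init__(self, runtime: str) -> None:
        FakeWorker.created += 1
        self.runtime = runtime

    def run(self, task: str) -> bool:
        # A real worker would execute the task on the simulator or device here.
        return True

    def close(self) -> None:
        # A real worker would release its device or simulator resources here.
        pass


def run_grouped(tasks: list[tuple[str, str]]) -> dict[str, bool]:
    """Run (runtime, task_name) pairs, creating one worker per runtime group."""
    groups: dict[str, list[str]] = defaultdict(list)
    for runtime, name in tasks:
        groups[runtime].append(name)

    results: dict[str, bool] = {}
    for runtime, names in groups.items():
        worker = FakeWorker(runtime)  # one worker for the whole group
        try:
            for name in names:
                results[name] = worker.run(name)
        finally:
            worker.close()
    return results
```

With three tasks spread over two runtimes, this creates two workers rather than three processes.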

Root Cause

Sim was not blocked by ChipWorker itself. The real failure came from fixed temp .so paths such as /tmp/aicpu_sim_<pid>.so and /tmp/orch_so_<pid>.so. Once tools/ci.py reused ChipWorker in a single process, repeated dlopen and file recreation on the same paths caused stale loader resolution and the follow-on undefined symbol build_*_graph failures.
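
One way to avoid reusing a fixed path within a process is to ask the OS for a fresh file name on every load. The sketch below is an assumption about how that could be done in Python with `tempfile.mkstemp`; the helper name `unique_so_path` is hypothetical, and the conversation does not show how the PR itself generates the unique names.

```python
import os
import tempfile


def unique_so_path(prefix: str) -> str:
    """Return a fresh, unique path ending in .so (an empty file is created on disk)."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".so")
    os.close(fd)  # the caller writes the shared object to this path afterwards
    return path


if __name__ == "__main__":
    # Two calls in the same process never return the same path.
    print(unique_so_path("orch_so_"))
    print(unique_so_path("orch_so_"))
```

The caller remains responsible for deleting the file once the loaded handle has been closed.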

Testing

  • pytest tests/ut/test_ci_runner.py -q
  • CCACHE_DISABLE=1 python tools/ci.py -p a2a3sim -r host_build_graph -c 6622890 -t 600 --clone-protocol https
  • python tools/ci.py -p a2a3 -d 0 -c 6622890 --clone-protocol https
    Not completed in this pass. The onboard path still fails earlier, with the error "set_device failed with code 507899", and needs separate debugging.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces tools/ci.py, a new Python-based batch CI test runner that replaces ci.sh and optimizes device usage via ChipWorker. It features parallel task compilation, support for both simulation and hardware (including A5-specific orchestration), and a retry mechanism for failed tests. Feedback identifies a critical issue where sys.exit(1) in the watchdog handler may fail to terminate the process if threads are hung, suggesting os._exit(1) instead. Additionally, improvements are suggested for handling subprocess.TimeoutExpired in the A5 execution path and reporting quarantined devices to improve visibility into hardware stability.

print(f"\n{'=' * 40}")
print(f"[CI] TIMEOUT: exceeded {args.timeout}s ({args.timeout // 60}min) limit, aborting")
print(f"{'=' * 40}")
sys.exit(1)

high

sys.exit(1) in a signal handler only raises SystemExit in the main thread. Since the worker threads are not daemonized, the process will not terminate if a worker is hung in a C++ call (e.g., during kernel execution). Use os._exit(1) to ensure the entire process terminates immediately upon timeout.

Suggested change
- sys.exit(1)
+ os._exit(1)
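
The behavior can be reproduced in isolation. The sketch below is not taken from tools/ci.py: it launches a child Python process in which a non-daemon thread sleeps for an hour (a stand-in for a worker hung in native code) and a `SIGALRM` handler calls `os._exit(1)` after one second. It assumes a POSIX system, since `signal.alarm` is not available on Windows, and the names `CHILD` and `run_child` are hypothetical.

```python
import subprocess
import sys
import textwrap

# Child script: a non-daemon thread blocks for an hour while a watchdog
# signal fires after one second and hard-exits the process.
CHILD = textwrap.dedent(
    """
    import os, signal, threading, time

    def watchdog(signum, frame):
        # flush explicitly: os._exit() does not flush stdio buffers
        print("[CI] TIMEOUT", flush=True)
        os._exit(1)  # exits immediately, without waiting for other threads

    signal.signal(signal.SIGALRM, watchdog)
    signal.alarm(1)
    threading.Thread(target=time.sleep, args=(3600,)).start()
    time.sleep(3600)
    """
)


def run_child(timeout_s: float = 30.0) -> tuple[int, str]:
    """Run the child script and return (exit code, stripped stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    print(run_child())
```

Replacing `os._exit(1)` with `sys.exit(1)` in the child would leave the process alive until the sleeping thread returns, because interpreter shutdown waits for non-daemon threads.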

Comment on lines +646 to +662
proc = subprocess.run(full_cmd, capture_output=True, text=True, timeout=args.timeout)
# Parse results from stdout (simplified: rely on exit code)
passed = proc.returncode == 0
if not passed:
    logger.error(f"[a5:dev{dev_id}] Failed:\n{proc.stdout}\n{proc.stderr}")
with lock:
    results.append(
        TaskResult(
            name=f"a5-device-{dev_id}",
            platform=args.platform,
            passed=passed,
            device=str(dev_id),
            attempt=0,
            elapsed_s=0,
            error=proc.stderr if not passed else None,
        )
    )

medium

subprocess.run will raise subprocess.TimeoutExpired if the timeout is reached. This exception is currently unhandled in the _run_device thread, which will cause the thread to terminate without appending a result to the results list, leading to incomplete reporting. Additionally, the current implementation collapses all individual task results from the A5 subprocess into a single device-level entry, which is a regression in reporting granularity compared to the non-A5 path.

        try:
            proc = subprocess.run(full_cmd, capture_output=True, text=True, timeout=args.timeout)
            passed = proc.returncode == 0
            error_msg = proc.stderr if not passed else None
            if not passed:
                logger.error(f"[a5:dev{dev_id}] Failed:\n{proc.stdout}\n{proc.stderr}")
        except subprocess.TimeoutExpired:
            passed = False
            error_msg = f"Timed out after {args.timeout}s"
            logger.error(f"[a5:dev{dev_id}] {error_msg}")

        with lock:
            results.append(
                TaskResult(
                    name=f"a5-device-{dev_id}",
                    platform=args.platform,
                    passed=passed,
                    device=str(dev_id),
                    attempt=0,
                    elapsed_s=0,
                    error=error_msg,
                )
            )
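
A self-contained variant of the same idea is sketched below so the timeout path can be exercised without the rest of the runner. The `TaskResult` here is a reduced, hypothetical stand-in for the class in tools/ci.py, and `run_with_timeout` is an illustrative helper name. Unlike the snippet above, it also records wall-clock time instead of a fixed `elapsed_s=0`, which is an addition not proposed in the review.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    """Hypothetical, reduced stand-in for the TaskResult used by the runner."""

    name: str
    passed: bool
    elapsed_s: float
    error: Optional[str] = None


def run_with_timeout(name: str, cmd: list[str], timeout_s: float) -> TaskResult:
    """Run cmd and always return a result, even when the timeout fires."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        passed = proc.returncode == 0
        error = None if passed else proc.stderr
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        passed = False
        error = f"Timed out after {timeout_s}s"
    return TaskResult(name, passed, time.monotonic() - start, error)


if __name__ == "__main__":
    print(run_with_timeout("demo", [sys.executable, "-c", "pass"], timeout_s=30))
```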

Comment on lines +594 to +595
quarantined: set[int] = set()
quarantine_lock = Lock()

medium

The quarantined set and its associated lock are populated by worker threads when a device exhausts its retries, but this information is never used or reported by the orchestrator. Consider logging the list of quarantined devices after all threads have joined to provide better visibility into hardware stability issues.
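
A minimal sketch of that reporting step follows. The worker body is a placeholder, and `run_and_report`, `exhausted_retries`, and the logger name are hypothetical; only the `quarantined` set and `quarantine_lock` names come from the diff under review.

```python
import logging
from threading import Lock, Thread
from typing import Callable, Iterable

logger = logging.getLogger("ci")


def run_and_report(
    device_ids: Iterable[int],
    exhausted_retries: Callable[[int], bool],
) -> list[int]:
    """Run one thread per device, then log which devices were quarantined."""
    quarantined: set[int] = set()
    quarantine_lock = Lock()

    def worker(dev_id: int) -> None:
        # Placeholder for the real per-device task loop with retries.
        if exhausted_retries(dev_id):
            with quarantine_lock:
                quarantined.add(dev_id)

    threads = [Thread(target=worker, args=(dev_id,)) for dev_id in device_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have joined, so the set can be read without the lock.
    report = sorted(quarantined)
    if report:
        logger.warning("[CI] Quarantined devices: %s", report)
    return report
```

Returning the sorted list also lets the orchestrator include it in the final summary or exit status.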

- run simulation tasks through ChipWorker reuse, matching the HW path
- make sim and host_build_graph temp .so files unique and close host orchestration handles so in-process dlopen reuse does not resolve stale objects
- add --build-runtime for local src/ validation and cover the sim worker reuse path in unit tests
@hw-native-sys-bot force-pushed the batch-ci-chipworker-reuse branch from 907e794 to 66cbfa1 on April 1, 2026 14:19
@hw-native-sys-bot changed the title from "Feat: add batch CI runner with ChipWorker reuse" to "CI: add batch runner with sim ChipWorker reuse" on Apr 1, 2026
@hw-native-sys-bot deleted the batch-ci-chipworker-reuse branch on April 2, 2026 09:21