CI: add batch runner with sim ChipWorker reuse by hw-native-sys-bot · Pull Request #421 · hw-native-sys/simpler

hw-native-sys-bot · 2026-03-31T13:04:52Z

Summary

- add a Python CI runner in tools/ci.py that reuses one ChipWorker per runtime group instead of spawning a fresh subprocess per task
- make the temp .so files for the sim executor, kernel, and host_build_graph orchestration unique so in-process sim reuse does not hit stale dlopen paths
- close host orchestration handles after graph construction and add a --build-runtime flag for local validation after editing src/
- cover sim worker reuse and the new CLI flag in unit tests
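
The reuse idea in the first item can be illustrated with a minimal sketch. The ChipWorker interface is not shown in this description, so `StubWorker`, `Task`, and `run_grouped` below are hypothetical stand-ins; only the grouping itself (one worker per runtime group, shared by every task in that group) comes from the summary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    runtime: str  # e.g. "host_build_graph"


class StubWorker:
    """Hypothetical stand-in for ChipWorker; counts how often it is built."""

    created = 0

    def __init__(self, runtime: str):
        StubWorker.created += 1
        self.runtime = runtime

    def run(self, task: Task) -> bool:
        # A real worker would launch the task on the simulator or device.
        return True


def run_grouped(tasks):
    """Run tasks with one worker per runtime group instead of one per task."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task.runtime].append(task)

    results = {}
    for runtime, group in groups.items():
        worker = StubWorker(runtime)  # constructed once per group
        for task in group:
            results[task.name] = worker.run(task)
    return results
```

With three tasks spread over two runtimes, this builds two workers rather than three; the saving grows with the number of tasks per group.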

Root Cause

Sim was not blocked by ChipWorker itself. The real failure came from fixed temp .so paths such as /tmp/aicpu_sim_<pid>.so and /tmp/orch_so_<pid>.so. Once tools/ci.py reused ChipWorker in a single process, repeated dlopen and file recreation on the same paths caused stale loader resolution and the follow-on undefined symbol build_*_graph failures.
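
One way to avoid the fixed-path collision described above is to give every generated shared object a fresh file name. The sketch below uses Python's `tempfile.mkstemp`; the helper name `unique_so_path` and its prefix argument are illustrative assumptions, and the actual change in this PR may be implemented differently (for example on the C++ side).

```python
import os
import tempfile
from typing import Optional


def unique_so_path(prefix: str, directory: Optional[str] = None) -> str:
    """Create an empty, uniquely named .so placeholder and return its path.

    mkstemp creates the file atomically, so two calls in the same process
    (same pid) can never hand back the same path.
    """
    fd, path = tempfile.mkstemp(
        prefix=f"{prefix}_{os.getpid()}_", suffix=".so", dir=directory
    )
    os.close(fd)  # the caller reopens the path to write the compiled object
    return path
```

A caller would write the compiled bytes to the returned path, load it, and delete the file once the handle is closed, so that no later load can resolve to an old object at the same path.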

Testing

pytest tests/ut/test_ci_runner.py -q
CCACHE_DISABLE=1 python tools/ci.py -p a2a3sim -r host_build_graph -c 6622890 -t 600 --clone-protocol https
python tools/ci.py -p a2a3 -d 0 -c 6622890 --clone-protocol https
The a2a3 onboard run above was not completed in this pass: it still fails earlier with "set_device failed with code 507899" and needs separate debugging.

gemini-code-assist

Code Review

This pull request introduces tools/ci.py, a new Python-based batch CI test runner that replaces ci.sh and optimizes device usage via ChipWorker. It features parallel task compilation, support for both simulation and hardware (including A5-specific orchestration), and a retry mechanism for failed tests. Feedback identifies a critical issue where sys.exit(1) in the watchdog handler may fail to terminate the process if threads are hung, suggesting os._exit(1) instead. Additionally, improvements are suggested for handling subprocess.TimeoutExpired in the A5 execution path and reporting quarantined devices to improve visibility into hardware stability.

gemini-code-assist · 2026-03-31T13:07:08Z

tools/ci.py

+        print(f"\n{'=' * 40}")
+        print(f"[CI] TIMEOUT: exceeded {args.timeout}s ({args.timeout // 60}min) limit, aborting")
+        print(f"{'=' * 40}")
+        sys.exit(1)


sys.exit(1) in a signal handler only raises SystemExit in the main thread. Since the worker threads are not daemonized, the process will not terminate if a worker is hung in a C++ call (e.g., during kernel execution). Use os._exit(1) to ensure the entire process terminates immediately upon timeout.

Suggested change

-        sys.exit(1)

+        os._exit(1)
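
A minimal sketch of a watchdog built around this suggestion is shown below. It assumes the timeout is driven by `SIGALRM` (Unix only), which the excerpt does not confirm; `make_timeout_handler`, `install_watchdog`, and the injectable `exit_fn` parameter are hypothetical names added so the handler can be exercised without killing the process.

```python
import os
import signal
import sys


def make_timeout_handler(timeout_s: int, exit_fn=os._exit):
    """Build a signal handler that reports the timeout and hard-exits."""

    def _handler(signum, frame):
        print(f"\n{'=' * 40}")
        print(f"[CI] TIMEOUT: exceeded {timeout_s}s ({timeout_s // 60}min) limit, aborting")
        print(f"{'=' * 40}")
        # os._exit skips normal interpreter shutdown, including stdio
        # flushing, so flush explicitly before exiting.
        sys.stdout.flush()
        sys.stderr.flush()
        exit_fn(1)

    return _handler


def install_watchdog(timeout_s: int) -> None:
    """Arm a process-wide timeout (assumption: Unix, SIGALRM available)."""
    signal.signal(signal.SIGALRM, make_timeout_handler(timeout_s))
    signal.alarm(timeout_s)
```

Because `os._exit` terminates the process without waiting for non-daemon threads, a worker stuck inside a native call cannot keep the runner alive past the limit. The trade-off is that `atexit` hooks and `finally` blocks do not run.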

gemini-code-assist · 2026-03-31T13:07:08Z

tools/ci.py

+        proc = subprocess.run(full_cmd, capture_output=True, text=True, timeout=args.timeout)
+        # Parse results from stdout (simplified — rely on exit code)
+        passed = proc.returncode == 0
+        if not passed:
+            logger.error(f"[a5:dev{dev_id}] Failed:\n{proc.stdout}\n{proc.stderr}")
+        with lock:
+            results.append(
+                TaskResult(
+                    name=f"a5-device-{dev_id}",
+                    platform=args.platform,
+                    passed=passed,
+                    device=str(dev_id),
+                    attempt=0,
+                    elapsed_s=0,
+                    error=proc.stderr if not passed else None,
+                )
+            )


subprocess.run will raise subprocess.TimeoutExpired if the timeout is reached. This exception is currently unhandled in the _run_device thread, which will cause the thread to terminate without appending a result to the results list, leading to incomplete reporting. Additionally, the current implementation collapses all individual task results from the A5 subprocess into a single device-level entry, which is a regression in reporting granularity compared to the non-A5 path.

        try:
            proc = subprocess.run(full_cmd, capture_output=True, text=True, timeout=args.timeout)
            passed = proc.returncode == 0
            error_msg = proc.stderr if not passed else None
            if not passed:
                logger.error(f"[a5:dev{dev_id}] Failed:\n{proc.stdout}\n{proc.stderr}")
        except subprocess.TimeoutExpired:
            passed = False
            error_msg = f"Timed out after {args.timeout}s"
            logger.error(f"[a5:dev{dev_id}] {error_msg}")
        with lock:
            results.append(
                TaskResult(
                    name=f"a5-device-{dev_id}",
                    platform=args.platform,
                    passed=passed,
                    device=str(dev_id),
                    attempt=0,
                    elapsed_s=0,
                    error=error_msg,
                )
            )

gemini-code-assist · 2026-03-31T13:07:08Z

tools/ci.py

+    quarantined: set[int] = set()
+    quarantine_lock = Lock()


The quarantined set and its associated lock are populated by worker threads when a device exhausts its retries, but this information is never used or reported by the orchestrator. Consider logging the list of quarantined devices after all threads have joined to provide better visibility into hardware stability issues.
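
The reporting step suggested here could look roughly like the following. This is a sketch only: `run_devices` and the `is_healthy` callback are hypothetical, standing in for the real per-device loop and its retry logic, which are not shown in the excerpt.

```python
import logging
from threading import Lock, Thread

logger = logging.getLogger("ci")


def run_devices(device_ids, is_healthy):
    """Run one thread per device and report quarantined devices after join.

    `is_healthy` is a hypothetical callback meaning "this device finished
    its tasks without exhausting its retries".
    """
    quarantined: set[int] = set()
    quarantine_lock = Lock()

    def _run_device(dev_id: int) -> None:
        if not is_healthy(dev_id):
            with quarantine_lock:
                quarantined.add(dev_id)

    threads = [Thread(target=_run_device, args=(d,)) for d in device_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Every worker has joined, so the set can be read without the lock.
    bad = sorted(quarantined)
    if bad:
        logger.warning("[CI] Quarantined devices: %s", bad)
    return bad
```

Returning the sorted list as well as logging it lets the caller include the quarantined devices in the final summary or in the exit status.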

- run simulation tasks through ChipWorker reuse, matching the HW path
- make sim and host_build_graph temp .so files unique and close host orchestration handles so in-process dlopen reuse does not resolve stale objects
- add --build-runtime for local src/ validation and cover the sim worker reuse path in unit tests

gemini-code-assist bot reviewed Mar 31, 2026


hw-native-sys-bot force-pushed the batch-ci-chipworker-reuse branch from 907e794 to 66cbfa1 · April 1, 2026 14:19

hw-native-sys-bot changed the title from "Feat: add batch CI runner with ChipWorker reuse" to "CI: add batch runner with sim ChipWorker reuse" · Apr 1, 2026

hw-native-sys-bot closed this Apr 2, 2026

hw-native-sys-bot deleted the batch-ci-chipworker-reuse branch April 2, 2026 09:21

hw-native-sys-bot wants to merge 1 commit into hw-native-sys:main from hw-native-sys-bot:batch-ci-chipworker-reuse
