CI: add batch runner with sim ChipWorker reuse#421
CI: add batch runner with sim ChipWorker reuse#421hw-native-sys-bot wants to merge 1 commit intohw-native-sys:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces tools/ci.py, a new Python-based batch CI test runner that replaces ci.sh and optimizes device usage via ChipWorker. It features parallel task compilation, support for both simulation and hardware (including A5-specific orchestration), and a retry mechanism for failed tests. Feedback identifies a critical issue where sys.exit(1) in the watchdog handler may fail to terminate the process if threads are hung, suggesting os._exit(1) instead. Additionally, improvements are suggested for handling subprocess.TimeoutExpired in the A5 execution path and reporting quarantined devices to improve visibility into hardware stability.
| print(f"\n{'=' * 40}") | ||
| print(f"[CI] TIMEOUT: exceeded {args.timeout}s ({args.timeout // 60}min) limit, aborting") | ||
| print(f"{'=' * 40}") | ||
| sys.exit(1) |
There was a problem hiding this comment.
sys.exit(1) in a signal handler only raises SystemExit in the main thread. Since the worker threads are not daemonized, the process will not terminate if a worker is hung in a C++ call (e.g., during kernel execution). Use os._exit(1) to ensure the entire process terminates immediately upon timeout.
| sys.exit(1) | |
| os._exit(1) |
| proc = subprocess.run(full_cmd, capture_output=True, text=True, timeout=args.timeout) | ||
| # Parse results from stdout (simplified — rely on exit code) | ||
| passed = proc.returncode == 0 | ||
| if not passed: | ||
| logger.error(f"[a5:dev{dev_id}] Failed:\n{proc.stdout}\n{proc.stderr}") | ||
| with lock: | ||
| results.append( | ||
| TaskResult( | ||
| name=f"a5-device-{dev_id}", | ||
| platform=args.platform, | ||
| passed=passed, | ||
| device=str(dev_id), | ||
| attempt=0, | ||
| elapsed_s=0, | ||
| error=proc.stderr if not passed else None, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
subprocess.run will raise subprocess.TimeoutExpired if the timeout is reached. This exception is currently unhandled in the _run_device thread, which will cause the thread to terminate without appending a result to the results list, leading to incomplete reporting. Additionally, the current implementation collapses all individual task results from the A5 subprocess into a single device-level entry, which is a regression in reporting granularity compared to the non-A5 path.
try:
proc = subprocess.run(full_cmd, capture_output=True, text=True, timeout=args.timeout)
passed = proc.returncode == 0
error_msg = proc.stderr if not passed else None
if not passed:
logger.error(f"[a5:dev{dev_id}] Failed:\n{proc.stdout}\n{proc.stderr}")
except subprocess.TimeoutExpired:
passed = False
error_msg = f"Timed out after {args.timeout}s"
logger.error(f"[a5:dev{dev_id}] {error_msg}")
with lock:
results.append(
TaskResult(
name=f"a5-device-{dev_id}",
platform=args.platform,
passed=passed,
device=str(dev_id),
attempt=0,
elapsed_s=0,
error=error_msg,
)
)| quarantined: set[int] = set() | ||
| quarantine_lock = Lock() |
There was a problem hiding this comment.
The quarantined set and its associated lock are populated by worker threads when a device exhausts its retries, but this information is never used or reported by the orchestrator. Consider logging the list of quarantined devices after all threads have joined to provide better visibility into hardware stability issues.
- run simulation tasks through ChipWorker reuse, matching the HW path - make sim and host_build_graph temp .so files unique and close host orchestration handles so in-process dlopen reuse does not resolve stale objects - add --build-runtime for local src/ validation and cover the sim worker reuse path in unit tests
907e794 to
66cbfa1
Compare
Summary
tools/ci.pythat reuses one ChipWorker per runtime group instead of spawning a fresh subprocess per task.sofiles unique so in-process sim reuse does not hit staledlopenpaths--build-runtimeflag for local validation after editingsrc/Root Cause
Sim was not blocked by ChipWorker itself. The real failure came from fixed temp
.sopaths such as/tmp/aicpu_sim_<pid>.soand/tmp/orch_so_<pid>.so. Oncetools/ci.pyreused ChipWorker in a single process, repeateddlopenand file recreation on the same paths caused stale loader resolution and the follow-onundefined symbol build_*_graphfailures.Testing
pytest tests/ut/test_ci_runner.py -qCCACHE_DISABLE=1 python tools/ci.py -p a2a3sim -r host_build_graph -c 6622890 -t 600 --clone-protocol httpspython tools/ci.py -p a2a3 -d 0 -c 6622890 --clone-protocol httpsNot completed in this pass. The onboard path still fails earlier with
set_device failed with code 507899and needs separate debugging.