Description
Background
Kaun is our neural network training library that currently lacks real-time monitoring capabilities. Training progress, loss curves, and metrics are only visible through print statements or post-training analysis. We need a TensorBoard-like experience in the terminal for monitoring training runs.
Objective
Create a decoupled monitoring system with:
- Logging layer: Training runs write metrics to filesystem in a structured format
- TUI Dashboard: Separate process that reads logs and visualizes them using Mosaic
This separation allows:
- Multiple monitoring frontends (TUI, web, notebooks)
- Monitoring remote training runs
- Post-hoc analysis of completed runs
Design Considerations
Logging Format:
- JSON Lines, or a binary format for efficiency
- Directory structure: runs/<run_id>/metrics.jsonl
- Atomic writes to handle concurrent access
- Consider rotation/compression for long runs
- Consider logging in the same format as TensorBoard if appropriate
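To make the logging format concrete, here is a minimal sketch of an append-only JSON Lines writer. It is written in Python purely to illustrate the on-disk layout and is not Kaun's API. The class name `MetricsWriter` and the record fields (`step`, `tag`, `value`, `wall_time`) are assumptions for illustration; only the `runs/<run_id>/metrics.jsonl` path comes from the proposal above.

```python
import json
import os
import time


class MetricsWriter:
    """Hypothetical append-only writer for <root>/<run_id>/metrics.jsonl."""

    def __init__(self, root, run_id):
        run_dir = os.path.join(root, run_id)
        os.makedirs(run_dir, exist_ok=True)
        self.path = os.path.join(run_dir, "metrics.jsonl")
        # O_APPEND makes every write land at the current end of the file.
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        self._fd = os.open(self.path, flags, 0o644)

    def log(self, step, tag, value):
        # Field names are assumptions, not a settled schema.
        record = {
            "step": step,
            "tag": tag,
            "value": value,
            "wall_time": time.time(),
        }
        # Emit the record and its newline in a single write call, so a
        # concurrent reader is unlikely to see a record split in two.
        line = json.dumps(record, separators=(",", ":")) + "\n"
        os.write(self._fd, line.encode("utf-8"))

    def close(self):
        os.close(self._fd)
```

One record per line keeps the file readable with ordinary text tools and lets a reader resume from any line boundary. Whether a single small append is sufficient for the "atomic writes" requirement depends on the platform and filesystem, so this should be treated as a starting point rather than a guarantee.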
Dashboard Features:
- Training/validation loss curves
- Learning rate schedules
- Metrics (accuracy, perplexity, etc.)
- Training speed (steps/sec, tokens/sec)
- Multiple run comparison
- Real-time file watching
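The real-time watching item can be approximated with simple polling. The sketch below (Python, illustration only; `MetricsTail` and `poll` are hypothetical names) remembers a byte offset and returns only complete records, leaving any half-written trailing line for the next call.

```python
import json
import os


class MetricsTail:
    """Incrementally read complete JSON Lines records from a growing file."""

    def __init__(self, path):
        self.path = path
        self._offset = 0  # byte offset of the first unread byte

    def poll(self):
        """Return the records appended since the previous call."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            f.seek(self._offset)
            chunk = f.read()
        # Consume only up to the last newline; a trailing partial line
        # stays unread until the writer has finished it.
        end = chunk.rfind(b"\n")
        if end == -1:
            return []
        self._offset += end + 1
        return [json.loads(line) for line in chunk[:end].split(b"\n") if line]
```

A dashboard could call `poll()` on a timer and redraw only when it returns new records, which also helps with the flicker concern. An OS-level file-watching API could replace the timer without changing the offset logic. This sketch does not handle log rotation or truncation.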
Success Criteria
- Logging has minimal impact on training performance
- Dashboard can monitor live or completed runs
- Smooth UI updates without terminal flickering
- Works with existing Kaun examples (MNIST)
- Clear separation between logging and visualization