Is your feature request related to a problem?
We do not have a standardized way to evaluate agent outputs against ground-truth datasets. Teams resort to manual spot checks or custom scripts, which makes it impossible to track quality regressions or compare models objectively. This blocks our 1.0 goal of shipping enterprise-grade evaluation tooling.
Describe the solution you want to see
- Build an evaluation harness that can ingest CSV files or Hugging Face datasets containing inputs and expected outputs, execute Flock workflows, and compute a configurable set of metrics (BLEU, ROUGE, accuracy, hallucination flags, cost, latency, etc.); a rough harness sketch follows this list.
- Provide a declarative configuration file (YAML/JSON) to describe datasets, orchestrator setup, and metric suites so evaluations are reproducible; see the example config after this list.
- Output results as structured artifacts (saved to DuckDB/Azure storage) and human-friendly reports (Markdown/HTML) that can be shared with stakeholders.
- Integrate the harness with CI and the dashboard so product teams can run ad-hoc evaluations and track trends over time.
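
As a rough illustration of the harness described above, here is a minimal sketch of the evaluation loop over a CSV dataset. The `run_workflow` callable, the `input`/`expected` column names, and the metric set are all hypothetical placeholders, not existing Flock APIs; the real harness would plug in the actual orchestrator entry point and the configured metric suite.

```python
import csv
import time
from dataclasses import dataclass


@dataclass
class EvalRecord:
    input_text: str
    expected: str
    output: str
    latency_s: float


def exact_match_accuracy(records: list[EvalRecord]) -> float:
    """Fraction of outputs that exactly match the expected answer."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r.output.strip() == r.expected.strip())
    return hits / len(records)


def run_eval(dataset_path: str, run_workflow) -> dict:
    """Run a workflow over a CSV of (input, expected) rows and compute metrics.

    `run_workflow` is a hypothetical callable standing in for the Flock
    orchestrator entry point; its real signature is to be defined.
    """
    records = []
    with open(dataset_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            start = time.perf_counter()
            output = run_workflow(row["input"])  # hypothetical call
            latency = time.perf_counter() - start
            records.append(EvalRecord(row["input"], row["expected"], output, latency))
    return {
        "accuracy": exact_match_accuracy(records),
        "mean_latency_s": sum(r.latency_s for r in records) / max(len(records), 1),
        "num_examples": len(records),
    }
```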
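
And a sketch of what the declarative configuration file could look like. Every key, path, and metric name here is illustrative only; the actual schema would be designed as part of this issue.

```yaml
# Hypothetical evaluation config; keys and values are illustrative only.
dataset:
  source: huggingface            # or "csv"
  name: acme/support-tickets     # placeholder dataset identifier
  input_column: question
  expected_column: reference_answer
workflow:
  entrypoint: flows/support_agent.yaml   # placeholder Flock workflow definition
  max_concurrency: 4
metrics:
  - rouge_l
  - accuracy
  - hallucination_flag
  - cost_usd
  - latency_ms
report:
  formats: [markdown, html]
  output_dir: eval_runs/support-baseline
storage:
  backend: duckdb
  path: eval_runs/results.duckdb
```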
Describe alternatives you have considered
Current practice is to run workflows manually and eyeball the outputs, or to write ad-hoc Python scripts tied to specific experiments. Neither approach scales, produces reports, or can be reused across customers.
Additional context
Leverage the same storage abstraction we plan to use for persistence so offline runs and hosted evaluations share code. Pair this work with the benchmarking initiative to reuse dataset loaders and metric implementations where possible.