Is your feature request related to a problem?
We do not have a standardized way to evaluate agent outputs against ground-truth datasets. Teams resort to manual spot checks or custom scripts, which makes it impossible to track quality regressions or compare models objectively. This blocks our 1.0 goal of shipping enterprise-grade evaluation tooling.
Describe the solution you want to see
- Build an evaluation harness that can ingest CSV files or Hugging Face datasets containing inputs and expected outputs, execute Flock workflows, and compute a configurable set of metrics (BLEU, ROUGE, accuracy, hallucination flags, cost, latency, etc.); a rough harness sketch follows this list.
- Provide a declarative configuration file (YAML/JSON) to describe datasets, orchestrator setup, and metric suites so evaluations are reproducible; see the example config after this list.
- Output results as structured artifacts (saved to DuckDB/Azure storage) and human-friendly reports (Markdown/HTML) that can be shared with stakeholders.
- Integrate the harness with CI and the dashboard so product teams can run ad-hoc evaluations and track trends over time.
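
As a rough illustration of the harness described above, here is a minimal sketch of the evaluation loop over a CSV dataset. The `run_workflow` callable, the `input`/`expected` column names, and the metric set are all hypothetical placeholders, not existing Flock APIs; the real harness would plug in the actual orchestrator entry point and the configured metric suite.

```python
import csv
import time
from dataclasses import dataclass


@dataclass
class EvalRecord:
    input_text: str
    expected: str
    output: str
    latency_s: float


def exact_match_accuracy(records: list[EvalRecord]) -> float:
    """Fraction of outputs that exactly match the expected answer."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r.output.strip() == r.expected.strip())
    return hits / len(records)


def run_eval(dataset_path: str, run_workflow) -> dict:
    """Run a workflow over a CSV of (input, expected) rows and compute metrics.

    `run_workflow` is a hypothetical callable standing in for the Flock
    orchestrator entry point; its real signature is to be defined.
    """
    records = []
    with open(dataset_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            start = time.perf_counter()
            output = run_workflow(row["input"])  # hypothetical call
            latency = time.perf_counter() - start
            records.append(EvalRecord(row["input"], row["expected"], output, latency))
    return {
        "accuracy": exact_match_accuracy(records),
        "mean_latency_s": sum(r.latency_s for r in records) / max(len(records), 1),
        "num_examples": len(records),
    }
```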
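
And a sketch of what the declarative configuration file could look like. Every key, path, and metric name here is illustrative only; the actual schema would be designed as part of this issue.

```yaml
# Hypothetical evaluation config; keys and values are illustrative only.
dataset:
  source: huggingface            # or "csv"
  name: acme/support-tickets     # placeholder dataset identifier
  input_column: question
  expected_column: reference_answer
workflow:
  entrypoint: flows/support_agent.yaml   # placeholder Flock workflow definition
  max_concurrency: 4
metrics:
  - rouge_l
  - accuracy
  - hallucination_flag
  - cost_usd
  - latency_ms
report:
  formats: [markdown, html]
  output_dir: eval_runs/support-baseline
storage:
  backend: duckdb
  path: eval_runs/results.duckdb
```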
Describe alternatives you have considered
Current practice is to run workflows manually and eyeball the outputs, or to write ad-hoc Python scripts tied to specific experiments. Neither approach scales, produces reports, or can be reused across customers.
Additional context
Leverage the same storage abstraction we plan to use for persistence so offline runs and hosted evaluations share code. Pair this work with the benchmarking initiative to reuse dataset loaders and metric implementations where possible.