🧮 [FEATURE] [1.0] Evaluation Harness #276

@AndreRatzenberger

Description

Is your feature request related to a problem?

We do not have a standardized way to evaluate agent outputs against ground-truth datasets. Teams resort to manual spot checks or custom scripts, which makes it impossible to track quality regressions or compare models objectively. This blocks our 1.0 goal of shipping enterprise-grade evaluation tooling.

Describe the solution you want to see

  • Build an evaluation harness that can ingest CSVs or Hugging Face datasets containing inputs and expected outputs, execute Flock workflows, and compute a configurable set of metrics (BLEU, ROUGE, accuracy, hallucination flags, cost, latency, etc.).
  • Provide a declarative configuration file (YAML/JSON) to describe datasets, orchestrator setup, and metric suites so evaluations are reproducible.
  • Output results as structured artifacts (saved to DuckDB/Azure storage) and human-friendly reports (Markdown/HTML) that can be shared with stakeholders.
  • Integrate the harness with CI and the dashboard so product teams can run ad-hoc evaluations and track trends over time.
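The core loop above can be sketched in a few lines. This is a minimal illustration, not the proposed implementation: `run_workflow` is a hypothetical stand-in for executing a Flock workflow, the dataset is an in-memory CSV, and only exact-match accuracy and latency are computed (BLEU/ROUGE, cost, and DuckDB persistence would layer on top):

```python
import csv
import io
import statistics
import time


def evaluate(rows, run_workflow):
    """Run each input through the workflow and score it against the expected output."""
    records = []
    for row in rows:
        start = time.perf_counter()
        output = run_workflow(row["input"])
        latency = time.perf_counter() - start
        records.append({
            "input": row["input"],
            "expected": row["expected"],
            "output": output,
            "exact_match": output.strip() == row["expected"].strip(),
            "latency_s": latency,
        })
    return {
        "accuracy": sum(r["exact_match"] for r in records) / len(records),
        "mean_latency_s": statistics.mean(r["latency_s"] for r in records),
        "records": records,  # structured per-row artifacts for reports/storage
    }


# Example: an in-memory CSV dataset and a stub workflow (both illustrative).
DATASET = "input,expected\n2+2,4\ncapital of France,Paris\n"
rows = list(csv.DictReader(io.StringIO(DATASET)))
report = evaluate(rows, run_workflow=lambda x: "4" if x == "2+2" else "Paris")
print(report["accuracy"])  # 1.0
```

The per-row `records` list is what would be written to DuckDB/Azure storage, while the aggregate metrics feed the Markdown/HTML reports and dashboard trends.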

Describe alternatives you have considered

Current practice is to run workflows manually and eyeball the outputs or write ad-hoc Python scripts tied to specific experiments. Those approaches don’t scale, have no reporting, and can’t be reused across customers.

Additional context

Leverage the same storage abstraction we plan for persistence so offline runs and hosted evaluations share code. Pair this work with the benchmarking initiative to reuse dataset loaders and metric implementations where possible.

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Type: No type
Projects: Status: Backlog
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests