Releases · parameterlab/MASEval

18 Jan 21:55

github-actions

v0.3.0

2f70bcc

v0.3.0 Latest

Latest

[0.3.0] - 2025-01-18

Added

Parallel Execution

Added parallel task execution with num_workers parameter in Benchmark.run() using ThreadPoolExecutor (PR: #14)
Added ComponentRegistry class for thread-safe component registration with thread-local storage (PR: #14)
Added TaskContext for cooperative timeout checking with check_timeout(), elapsed, remaining, and is_expired properties (PR: #14)
Added TaskProtocol dataclass with timeout_seconds, timeout_action, max_retries, priority, and tags fields for task-level execution control (PR: #14)
Added TimeoutAction enum (SKIP, RETRY, RAISE) for configurable timeout behavior (PR: #14)
Added TaskTimeoutError exception with elapsed, timeout, and partial_traces attributes (PR: #14)
Added TASK_TIMEOUT to TaskExecutionStatus enum for timeout classification (PR: #14)

Task Queue Abstraction

Added TaskQueue abstract base class with iterator interface for flexible task scheduling (PR: #14)
Added SequentialQueue for simple FIFO task ordering (PR: #14)
Added PriorityQueue for priority-based task scheduling using TaskProtocol.priority (PR: #14)
Added AdaptiveTaskQueue abstract base class for feedback-based adaptive scheduling with initial_state(), select_next_task(remaining, state), and update_state(task, report, state) methods (PR: #14)

ModelAdapter Chat Interface

Added chat() method to ModelAdapter as the primary interface for LLM inference, accepting a list of messages in OpenAI format and returning a ChatResponse object and accepting tools
Added ChatResponse dataclass containing content, tool_calls, role, usage, model, and stop_reason fields for structured response handling

AnthropicModelAdapter

New AnthropicModelAdapter for direct integration with Anthropic Claude models via the official Anthropic SDK
Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
Added anthropic optional dependency: pip install maseval[anthropic]

Benchmarks

Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
Tau2Benchmark, Tau2Environment, Tau2User, Tau2Evaluator components for framework-agnostic evaluation (PR: #16)
DefaultAgentTau2Benchmark using an agent setup closely resembeling to the original tau2-bench implementation (PR: #16)
Data loading utilities: load_tasks(), ensure_data_exists(), configure_model_ids() (PR: #16)
Metrics: compute_benchmark_metrics(), compute_pass_at_k(), compute_pass_hat_k() for tau2-style scoring (PR: #16)
Domain implementations with tool kits: AirlineTools, RetailTools, TelecomTools with full database simulation (PR: #16)

User

AgenticUser class for users that can use tools during conversations (PR: #16)
Multiple stop token support: User now accepts stop_tokens (list) instead of single stop_token, enabling different termination reasons (PR: #16)
Stop reason tracking: User traces now include stop_reason, max_turns, turns_used, and stopped_by_user for detailed termination analysis (PR: #16)

Simulator

AgenticUserLLMSimulator for LLM-based user simulation with tool use capabilities (PR: #16)

Examples

Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)

Changed

Benchmark

Benchmark.agent_data parameter is now optional (defaults to empty dict) (PR: #16)
Refactored Benchmark to delegate registry operations to ComponentRegistry class (PR: #)
Benchmark.run() now accepts optional queue parameter (BaseTaskQueue) for custom task scheduling (PR: #14)

Task

Task.id is now str type instead of UUID. Benchmarks can provide human-readable IDs directly (e.g., Task(id="retail_001", ...)). Auto-generates UUID string if not provided. (PR: #16)

Fixed

Task reports now use task.id directly instead of metadata["task_id"] (PR: #16)

Removed

Assets 4

05 Dec 16:29

github-actions

v0.2.0

c25fcdc

v0.2.0

[0.2.0] - 2025-12-05

Added

Exceptions and Error Classification

Added AgentError, EnvironmentError, UserError exception hierarchy in maseval.core.exceptions for classifying execution failures by responsibility (PR: #13)
Added TaskExecutionStatus.AGENT_ERROR, ENVIRONMENT_ERROR, USER_ERROR, UNKNOWN_EXECUTION_ERROR for fine-grained error classification enabling fair scoring (PR: #13)
Added validation helpers: validate_argument_type(), validate_required_arguments(), validate_no_extra_arguments(), validate_arguments_from_schema() for tool implementers (PR: #13)
Added ToolSimulatorError and UserSimulatorError exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)

Documentation

Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)

Benchmarks

MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)

Benchmark

Added execution_loop() method to Benchmark base class enabling iterative agent-user interaction (PR: #13)
Added max_invocations constructor parameter to Benchmark (default: 1 for backwards compatibility) (PR: #13)
Added abstract get_model_adapter(model_id, **kwargs) method to Benchmark base class as universal model factory to be used throughout the benchmarks. (PR: #13)

User

Added max_turns and stop_token parameters to User base class for multi-turn support with early stopping. Same applied to UserLLMSimulator. (PR: #13)
Added is_done(), _check_stop_token(), and increment_turn() methods to User base class (PR: #13)
Added get_initial_query() method to User base class for LLM-generated initial messages (PR: #13)
Added initial_query parameter in User base class to trigger the agentic system. (PR: #13)

Environment

Added Environment.get_tool(name) method for single-tool lookup (PR: #13)

Interface

LlamaIndex integration: LlamaIndexAgentAdapter and LlamaIndexUser for evaluating LlamaIndex workflow-based agents (PR: #7)
The logs property inside SmolAgentAdapter and LanggraphAgentAdapter are now properly filled. (PR: #3)

Examples

Added a new example: The 5_a_day_benchmark (PR: #10)

Changed

Exception Handling

Benchmark now classifies execution errors into AGENT_ERROR (agent's fault), ENVIRONMENT_ERROR (tool/infra failure), USER_ERROR (user simulator failure), or UNKNOWN_EXECUTION_ERROR (unclassified) instead of generic TASK_EXECUTION_FAILED (PR: #13)
ToolLLMSimulator now raises ToolSimulatorError (classified as ENVIRONMENT_ERROR) on failure (PR: #13)
UserLLMSimulator now raises UserSimulatorError (classified as USER_ERROR) on failure (PR: #13)

Environment

Environment.create_tools() now returns Dict[str, Any] instead of list (PR: #13)

Benchmark

Benchmark.run_agents() signature changed: added query: str parameter (PR: #13)
Benchmark.run() now uses execution_loop() internally to handle agent-user interaction cycles (PR: #13)
Benchmark class now has a fail_on_setup_error flag that raises errors observed during setup of task (PR: #10)

Callback

FileResultLogger now accepts pathlib.Path for argument output_dir and has an overwrite argument to prevent overwriting of existing logs files.

Evaluator

The Evaluator class now has a filter_traces base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).

Simulator

The LLMSimulator now throws an exception when json cannot be decoded instead of returning the error message as text to the agent (PR: #13).

Other

Documentation formatting improved. Added darkmode and links to Github (PR: #11).
Improved Quick Start Guide in docs/getting-started/quickstart.md. (PR: #10)
maseval.interface.agents structure changed. Tools requiring framework imports (beyond just typing) now in <framework>_optional.py and imported dynamically from <framework>.py. (PR: #12)
Various formatting improvements in the documentation (PR: #12)
Added documentation for View Source Code pattern in CONTRIBUTING.md and _optional.py pattern in interface README (PR: #12)

Fixed

Interface

LlamaIndexAgentAdapter now supports multiple LlamaIndex agent types including ReActAgent (workflow-based), FunctionAgent, and legacy agents by checking for .chat(), .query(), and .run() methods in priority order (PR: #10)

Other

Consistent naming of agent adapter over wrapper (PR: #3)
Fixed an issue that LiteLLM interface and Mixins were not shown in documentation properly (#PR: 12)