SPICE Tool and SPICEBench🌶️ (A SWE-bench Verified-style labeling tool and benchmark for evaluating the software engineering capabilities of foundation models)
This is the official repository accompanying our paper: SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation.
@inproceedings{oliva2025spice,
title={SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation},
author={Oliva, Gustavo A and Rajbahadur, Gopi Krishnan and Bhatia, Aaditya and Zhang, Haoxiang and Chen, Yihao and Chen, Zhilong and Leung, Arthur and Lin, Dayi and Chen, Boyuan and Hassan, Ahmed E},
booktitle={Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
year={2025}
}
A curated dataset of 6,802 instances drawn from 291 real-world open-source projects in SWE-Gym and automatically labeled with our SPICE tool. To our knowledge, this is the largest collection of solvable SWE-bench-like tasks, over 13 times larger than the human-labeled SWE-bench Verified subset. SPICE-bench provides a rich resource for fine-tuning, benchmarking, and training SE-focused foundation models.
The dataset is available under the `data` directory.
The labels for issue quality, test quality, and difficulty level are stored per instance under the keys `task_score`, `evaluation_score`, and `difficulty_score`. The data is provided as a `.jsonl` file for ease of use.
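As a minimal sketch of how to load the labeled dataset, the snippet below reads the `.jsonl` file and accesses the three score keys. The filename `data/spice_bench.jsonl` is an assumption; adjust it to the actual file shipped under `data`.

```python
import json
from pathlib import Path

# Hypothetical path: substitute the actual .jsonl file found under data/.
DATASET_PATH = Path("data/spice_bench.jsonl")

instances = []
with DATASET_PATH.open(encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines, one JSON object per non-empty line
            instances.append(json.loads(line))

# Each instance carries the three SPICE labels alongside the task fields.
example = instances[0]
print(example["instance_id"])
print(example["task_score"], example["evaluation_score"], example["difficulty_score"])
```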
The input dataset must contain the following columns:
- `instance_id`: A unique identifier for each instance.
- `repo`: The repository name in the format `<owner>/<repo>`.
- `base_commit`: The commit hash used to set the repository's state.
- `problem_statement`: A textual description of the issue, comprising a title and body separated by a newline.
- `patch`: The code patch addressing the issue.
- `test_patch`: The test cases associated with the patch.
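For illustration, the sketch below builds one input record with these columns and writes it in the same one-object-per-line `.jsonl` convention. All field values and the output filename are invented placeholders, not entries from the dataset.

```python
import json

# Hypothetical example row; values are placeholders, not real dataset content.
instance = {
    "instance_id": "example__repo-1234",
    "repo": "example-org/example-repo",
    "base_commit": "0123456789abcdef0123456789abcdef01234567",
    "problem_statement": "Crash when parsing empty config\nSteps to reproduce: ...",
    "patch": "diff --git a/parser.py b/parser.py\n...",
    "test_patch": "diff --git a/tests/test_parser.py b/tests/test_parser.py\n...",
}

# One JSON object per line, matching the dataset's .jsonl format.
with open("my_input.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(instance) + "\n")
```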
For each instance in the dataset, this project will:
- Assign a score to:
- Issue quality (i.e., how clear the issue is)
- Test quality (i.e., how faithful the tests are to the issue and how tightly they are coupled to the gold patch)
- Difficulty level (i.e., how hard it is to solve the issue)
- Provide a detailed rationale for each score.
The scoring process follows the same guidelines used to create SWE-bench Verified, ensuring consistency with established benchmarking standards.
Issue scoring is done with a local model running on top of Ollama on a Huawei internal server. Test and difficulty scoring is done with a thin wrapper around Aider, using OpenAI gpt-4o as the strong model and gpt-4o-mini as the weak model.
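As a rough illustration of this split (not the repository's actual code), the sketch below sends an issue-scoring prompt to a local Ollama server and a test/difficulty prompt to the OpenAI API. The local model name, prompts, and function names are assumptions made for the example.

```python
import requests
from openai import OpenAI


def score_issue_locally(problem_statement: str) -> str:
    # Ollama's local HTTP API; the model name is a placeholder.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": f"Rate the clarity of this issue:\n{problem_statement}",
            "stream": False,
        },
        timeout=300,
    )
    return resp.json()["response"]


def score_with_openai(prompt: str, strong: bool = True) -> str:
    # gpt-4o as the strong model, gpt-4o-mini as the weak model.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o" if strong else "gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```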
- Install Poetry and the project's dependencies with `poetry install`
- Add your `OPENAI_API_KEY` to `env-vars.sh`
- Source the environment variables: `source ./env-vars.sh`
- Run `python -m swebench_qa.app --help` for the CLI instructions.