This GitHub repository hosts the evaluation results and codebase for the paper Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
BASSE is a multilingual (Basque and Spanish) dataset designed primarily for the meta-evaluation of automatic summarization metrics and LLM-as-a-Judge models. We generated automatic summaries for 90 news documents in these two languages (45 each) using Anthropic's Claude, OpenAI's GPT-4o, Reka AI's Reka, Meta's Llama 3.1 Instruct, and Cohere's Command R+. For each of these models, we use four different prompts (base, core, 5W1H, tldr; see the paper for more details), with the goal of generating summaries that are diverse in both quality and style. We also include human-written reference summaries for each news document.
After generating these summaries, we annotated them for Coherence, Consistency, Fluency, Relevance, and 5W1H on a 5-point Likert scale, largely following the annotation protocol from SummEval.
BASSE is available on 🤗 HuggingFace: https://huggingface.co/datasets/HiTZ/BASSE. Load it with the datasets library:
from datasets import load_dataset
ds_basse_eu = load_dataset("HiTZ/BASSE", "eu")
ds_basse_es = load_dataset("HiTZ/BASSE", "es")

or with the pandas library:
import pandas as pd
df_basse_eu = pd.read_parquet("hf://datasets/HiTZ/BASSE/eu/test-00000-of-00001.parquet")
df_basse_es = pd.read_parquet("hf://datasets/HiTZ/BASSE/es/test-00000-of-00001.parquet")

We base our experimental setup on the codebases of SummEval for automatic metrics and Prometheus-Eval for LLM-as-a-Judge evaluation.
To use this repo, first clone it and install the necessary dependencies:
git clone --recursive https://github.com/hitz-zentroa/summarization.git
cd summarization
pip install -r requirements.txt

The --recursive flag ensures that SummEval is cloned into the lib directory. If you prefer, you can instead attempt to install it directly via pip:

pip install summ-eval

If you're planning to run automatic metrics, you need to set the following environment variable (required by ROUGE):

export ROUGE_HOME=$PWD/lib/SummEval/evaluation/summ_eval/ROUGE-1.5.5

You will also need to download a couple of assets.
NLTK's punkt_tab resources are necessary to preprocess BASSE for automatic metric
scoring. Download them by doing:
import nltk
nltk.download('punkt_tab')

FastText embeddings are used to compute ROUGE-WE scores. Download them into assets:
wget -P assets https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.eu.300.vec.gz
wget -P assets https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz

We measured agreement between human annotators for each evaluation criterion in terms of Cohen's quadratic kappa (κ), Krippendorff's ordinal alpha (α), and agreement percentage.
Use the notebook agreement.ipynb to reproduce the tables and plots included in our article, which are located under results/agreement.
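For reference, here is a minimal sketch of how these three statistics can be computed with scikit-learn and the krippendorff package (the ratings below are hypothetical; the notebook contains the actual computation):

import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 Likert ratings from two annotators over the same summaries
ann_a = [5, 4, 3, 4, 2]
ann_b = [5, 3, 3, 4, 1]

kappa = cohen_kappa_score(ann_a, ann_b, weights="quadratic")       # quadratic Cohen's kappa
alpha = krippendorff.alpha(reliability_data=np.array([ann_a, ann_b], dtype=float),
                           level_of_measurement="ordinal")         # ordinal Krippendorff's alpha
agreement = float(np.mean(np.array(ann_a) == np.array(ann_b)))     # raw agreement percentage
print(f"kappa={kappa:.3f}  alpha={alpha:.3f}  agreement={agreement:.1%}")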
Note that agreement was computed over three annotation rounds: rounds 0 and 1 involve
annotations on the same set of summaries, before and after agreement discussions between
annotators, respectively. Round 0 is provided separately from the BASSE corpus for
reproducibility (HiTZ/BASSE configurations eu-round-0
and es-round-0). Round 2
covers a new set of summaries. In the paper, we refer to rounds 0, 1, and 2 as 1,
2, and 3, respectively, as a matter of style.
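The round-0 annotations can be loaded like the main configurations:

from datasets import load_dataset
ds_round0_eu = load_dataset("HiTZ/BASSE", "eu-round-0")
ds_round0_es = load_dataset("HiTZ/BASSE", "es-round-0")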
The notebook corpus.ipynb contains code for both quantitative and qualitative analysis of the BASSE corpus.
The quantitative analysis includes computing the number of documents, tokens, and sentences per document, along with vocabulary size and the number of summaries generated in the target language. These metrics are aggregated by LLM, prompt, and language. Additionally, we use SummEval’s DataStats metric to assess summary compression and novel bigram usage.
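As an illustration of the document-level statistics, a rough sketch that assumes a document text column named "document" (the actual analysis is in corpus.ipynb):

import nltk
import pandas as pd

df = pd.read_parquet("hf://datasets/HiTZ/BASSE/es/test-00000-of-00001.parquet")
text = df.iloc[0]["document"]  # column name assumed for illustration
n_tokens = len(nltk.word_tokenize(text, language="spanish"))
n_sents = len(nltk.sent_tokenize(text, language="spanish"))
print(n_tokens, n_sents)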
The qualitative analysis reports human ratings on the criteria Coherence, Consistency, Fluency, Relevance, and 5W1H broken down by LLM and prompt pair.
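A similarly rough pandas sketch of the ratings breakdown, again assuming column names such as "model" and the criterion names (see corpus.ipynb for the real code):

import pandas as pd

df = pd.read_parquet("hf://datasets/HiTZ/BASSE/es/test-00000-of-00001.parquet")
criteria = ["Coherence", "Consistency", "Fluency", "Relevance", "5W1H"]  # assumed column names
print(df.groupby("model")[criteria].mean().round(2))  # mean human rating per model-prompt pair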
All plots and tables generated by the notebook, as included in our paper, are available under results/corpus.
Following the SummEval framework, we apply a range of standard automatic metrics (some with minor adaptations or fixes provided in src/metrics), to later meta-evaluate their correlation with human judgments in Spanish and Basque.
To reproduce the metric-based evaluations, run the script:
python -m src.01_apply_automatic_metrics --language LANGUAGE --metrics METRIC [METRIC ...]

where LANGUAGE is one of {eu,es} and METRIC is one of {rouge,m_rouge_we,m_bert_score,bleu,chrf,m_meteor,cider,stats}, or all to compute all implemented metrics.
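For example, to score the Basque summaries with ROUGE, BLEU, and chrF:

python -m src.01_apply_automatic_metrics --language eu --metrics rouge bleu chrf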
The output of this script is located under pred/metrics. It consists of one CSV file per language and metric, with the following columns:
model (str): the model whose summaries have been evaluated. As with the BASSE corpus, it is a combination of the actual model name and prompt, e.g., "claude-base".
metric (str): the name of the specific metric; m_bert_score, for instance, produces BERTScore precision, recall, and F1-score.
score (float): the score obtained by the model's summaries (usually the mean of the individual summaries' scores, but the aggregation depends on the metric).
Each CSV contains one row per model name and prompt pair.
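A quick way to inspect these results with pandas (the file name below is hypothetical; the script decides the actual naming under pred/metrics):

import pandas as pd

scores = pd.read_csv("pred/metrics/eu_rouge.csv")  # hypothetical file name
# One row per model-prompt pair and metric variant; pivot for a model x metric view
print(scores.pivot_table(index="model", columns="metric", values="score"))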
Using the Prometheus-Eval codebase, we run a range of open and proprietary LLMs to evaluate summaries across five criteria, to later meta-evaluate their correlation with human judgments in Spanish and Basque.
To reproduce the judge-based evaluations, run the script:
python -m src.02_apply_llm_as_a_judge --model MODEL --language LANGUAGE --criterion CRITERION [--num_gpus NUM_GPUS] [--use_async_lite]

where LANGUAGE is one of {eu,es} and CRITERION is one of {Coherence,Consistency,Fluency,Relevance,5W1H}. NUM_GPUS controls the number of GPUs used by vLLM to load the judge locally. Use the flag --use_async_lite instead when using a commercial LLM API for judging (see supported providers here: https://docs.litellm.ai/docs/providers).
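For example, to judge the Coherence of the Spanish summaries with a locally loaded open judge (the judge name below is only an example):

python -m src.02_apply_llm_as_a_judge --model prometheus-eval/prometheus-7b-v2.0 --language es --criterion Coherence --num_gpus 1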
The output of this script is located under pred/judges. It consists of one CSV file per language, judge, and criterion, with the following columns:
model (str): the model whose summaries have been evaluated. As with the BASSE corpus, it is a combination of the actual model name and prompt, e.g., "claude-base".
feedback (str): the feedback given by the judge to explain, support, or justify the produced score.
score (float): the score obtained by the summary.
Each CSV contains one row per summary judged.
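To aggregate the per-summary judgments into system-level scores with pandas (the file name below is hypothetical; the script decides the actual naming under pred/judges):

import pandas as pd

judged = pd.read_csv("pred/judges/eu_prometheus_Coherence.csv")  # hypothetical file name
# Mean judge score per model-prompt pair, sorted from best to worst
print(judged.groupby("model")["score"].mean().sort_values(ascending=False))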
To reproduce the correlations between the automatic and human evaluations, run the script:
python -m src.03_compute_correlations --language LANGUAGE [--metrics METRIC [METRIC ...]] [--judges JUDGE [JUDGE ...]]

where LANGUAGE is one of {eu,es}, METRIC is the name of a metric in pred/metrics or all, and JUDGE is the name of a judge in pred/judges or all.
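For example, to compute the correlations for Basque over all metric and judge predictions:

python -m src.03_compute_correlations --language eu --metrics all --judges all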
The output of this script is located under results/correlation. It consists of one LaTeX table per language, evaluator type (metric or judge LLM), and correlation statistic (Spearman's rank correlation coefficient, ρ, and Kendall's rank correlation coefficient, τ).
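For intuition, the two statistics can be computed with SciPy as follows (hypothetical system-level scores; src/03_compute_correlations.py implements the actual procedure):

from scipy.stats import spearmanr, kendalltau

human = [4.2, 3.1, 4.8, 2.5]       # hypothetical mean human ratings per model-prompt pair
metric = [0.41, 0.28, 0.52, 0.30]  # hypothetical metric scores for the same systems

rho, _ = spearmanr(human, metric)
tau, _ = kendalltau(human, metric)
print(f"Spearman rho={rho:.2f}  Kendall tau={tau:.2f}")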
Use the notebook correlations.ipynb to reproduce the correlations plot included in our article.
BasqueSumm is the first collection of news documents and their corresponding subheads (often
used as a proxy for summaries) for Basque. It was automatically compiled from www.berria.eus.
It is available on 🤗 HuggingFace: https://huggingface.co/datasets/HiTZ/BasqueSumm. Load it with the datasets library:
from datasets import load_dataset
ds_basque_summ = load_dataset("HiTZ/BasqueSumm")

or with the pandas library:
import pandas as pd
df_basque_summ = pd.read_json("hf://datasets/HiTZ/BasqueSumm/data/berria_summ.jsonl", lines=True)

We release BASSE and BasqueSumm under a CC BY-NC-SA 4.0 license.
Please cite the following paper if you use the BASSE corpus or its associated codebase:
@misc{barnes2025summarizationmetricsspanishbasque,
title={Summarization Metrics for {S}panish and {B}asque: Do Automatic Scores and {LLM}-Judges Correlate with Humans?},
author={Jeremy Barnes and Naiara Perez and Alba Bonet-Jover and Begoña Altuna},
year={2025},
eprint={2503.17039},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.17039},
}