
BASSE: BAsque and Spanish Summarization Evaluation


This GitHub repository hosts the evaluation results and codebase for the paper Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?

Table of contents

  1. BASSE
  2. Codebase
  3. BasqueSumm
  4. Licensing
  5. Citation

BASSE

BASSE is a multilingual (Basque and Spanish) dataset designed primarily for the meta-evaluation of automatic summarization metrics and LLM-as-a-Judge models. We generated automatic summaries for 90 news documents in these two languages (45 each) using Anthropic's Claude, OpenAI's GPT-4o, Reka AI's Reka, Meta's Llama 3.1 Instruct, and Cohere's Command R+. For each of these models, we used four different prompts (base, core, 5W1H, tldr; see the paper for more details), with the goal of generating summaries that are diverse in both quality and style. We also include human-generated reference summaries for each news document.

After generating these summaries, we annotated them for Coherence, Consistency, Fluency, Relevance, and 5W1H on a 5-point Likert scale, largely following the annotation protocol from SummEval.

BASSE is available in 🤗 HuggingFace: https://huggingface.co/datasets/HiTZ/BASSE. Use it from the datasets library:

from datasets import load_dataset

ds_basse_eu = load_dataset("HiTZ/BASSE", "eu")
ds_basse_es = load_dataset("HiTZ/BASSE", "es")

or the pandas library:

import pandas as pd

df_basse_eu = pd.read_parquet("hf://datasets/HiTZ/BASSE/eu/test-00000-of-00001.parquet")
df_basse_es = pd.read_parquet("hf://datasets/HiTZ/BASSE/es/test-00000-of-00001.parquet")

Codebase

Setup

We base our experimental setup on the codebases of SummEval for automatic metrics and Prometheus-Eval for LLM-as-a-Judge evaluation.

To use this repo, first clone it and install the necessary dependencies:

git clone --recursive https://github.com/hitz-zentroa/summarization.git
cd summarization
pip install -r requirements.txt

The --recursive flag ensures that SummEval is cloned into the lib directory. If you prefer, you can instead attempt to install it directly via pip:

pip install summ-eval

Environment variables

If you're planning to run automatic metrics, you need to set the following environment variable (required by ROUGE):

export ROUGE_HOME=$PWD/lib/SummEval/evaluation/summ_eval/ROUGE-1.5.5

Required assets

You will also need to download a couple of assets.

NLTK's punkt_tab resources are necessary to preprocess BASSE for automatic metric scoring. Download them by doing:

import nltk
nltk.download('punkt_tab')

FastText embeddings are used to compute ROUGE-WE scores. Download them into assets:

wget -P assets https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.eu.300.vec.gz
wget -P assets https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz

0. Analysis of the BASSE corpus

Inter-annotator agreement

We measured agreement between human annotators for each evaluation criterion in terms of Cohen's quadratically weighted kappa (κ), Krippendorff's ordinal alpha (α), and percentage agreement.
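
The notebook computes these statistics from the raw annotations; the snippet below is only a minimal sketch of the statistics themselves, assuming scikit-learn and the krippendorff package, with illustrative ratings in place of the real annotation data:

import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

# Illustrative 5-point Likert ratings from two annotators (not real BASSE data).
annotator_1 = np.array([5, 4, 4, 3, 5, 2, 4])
annotator_2 = np.array([5, 4, 3, 3, 4, 2, 4])

kappa = cohen_kappa_score(annotator_1, annotator_2, weights="quadratic")
alpha = krippendorff.alpha(
    reliability_data=np.vstack([annotator_1, annotator_2]),
    level_of_measurement="ordinal",
)
agreement = (annotator_1 == annotator_2).mean() * 100

print(f"quadratic kappa = {kappa:.3f}, ordinal alpha = {alpha:.3f}, agreement = {agreement:.1f}%")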

Use the notebook agreement.ipynb to reproduce the tables and plots included in our article, which are located under results/agreement.

Note that agreement was computed over three annotation rounds: rounds 0 and 1 involve annotations on the same set of summaries, before and after agreement discussions between annotators, respectively. Round 0 is provided separately from the BASSE corpus for reproducibility (HiTZ/BASSE configurations eu-round-0 and es-round-0). Round 2 covers a new set of summaries. In the paper, we refer to rounds 0, 1, and 2 as 1, 2, and 3, respectively, as a matter of style.
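
For reference, the round-0 annotations can be loaded directly from those configurations with the datasets library:

from datasets import load_dataset

# Pre-discussion (round 0) annotations, released separately from the main corpus.
ds_round0_eu = load_dataset("HiTZ/BASSE", "eu-round-0")
ds_round0_es = load_dataset("HiTZ/BASSE", "es-round-0")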

Quantitative and qualitative analysis

The notebook corpus.ipynb contains code for both quantitative and qualitative analysis of the BASSE corpus.

The quantitative analysis includes computing the number of documents, tokens, and sentences per document, along with vocabulary size and the number of summaries generated in the target language. These metrics are aggregated by LLM, prompt, and language. Additionally, we use SummEval’s DataStats metric to assess summary compression and novel bigram usage.
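
As a rough illustration of the DataStats step, the snippet below scores a single document-summary pair. This is a sketch assuming SummEval's DataStatsMetric API; the notebook may differ in configuration and aggregation, and the metric may additionally require a spaCy tokenizer model:

from summ_eval.data_stats_metric import DataStatsMetric

metric = DataStatsMetric()

document = "Full news article text ..."       # illustrative placeholder
summary = "Model-generated summary text ..."  # illustrative placeholder

# Returns a dict that includes, among others, the compression ratio and the
# fraction of novel n-grams in the summary relative to the source document.
stats = metric.evaluate_example(summary, document)
print(stats)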

The qualitative analysis reports human ratings on the criteria Coherence, Consistency, Fluency, Relevance, and 5W1H broken down by LLM and prompt pair.

All plots and tables generated by the notebook, as included in our paper, are available under results/corpus.

1. Score summaries with automatic metrics

Following the SummEval framework, we apply a range of standard automatic metrics (some with minor adaptations or fixes provided in src/metrics), to later meta-evaluate their correlation with human judgments in Spanish and Basque.

To reproduce the metric-based evaluations, run the script:

python -m src.01_apply_automatic_metrics --language LANGUAGE --metrics METRIC [METRIC ...]

where LANGUAGE is one of {eu,es} and METRIC is one of {rouge,m_rouge_we,m_bert_score,bleu,chrf,m_meteor,cider,stats} or all to compute all implemented metrics.

The output of this script is located under pred/metrics. It consists of one CSV file per language and metric, with the following columns:

  • model (str): the model whose summaries have been evaluated. As in the BASSE corpus, it is a combination of the actual model name and the prompt, e.g. "claude-base".
  • metric (str): the name of the specific metric; m_bert_score, for instance, produces BERTScore precision, recall, and F1 scores.
  • score (float): the score obtained by the model's summaries (usually the mean of the individual summaries' scores, but the aggregation depends on the metric).

Each CSV contains one row per model name and prompt pair.
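
These per-metric CSVs can then be consumed with pandas, for example as sketched below (the file name is hypothetical; adapt it to the files actually produced under pred/metrics):

import pandas as pd

df = pd.read_csv("pred/metrics/eu_rouge.csv")  # hypothetical file name
# One row per model-prompt pair, with `model`, `metric`, and `score` columns.
print(df.pivot(index="model", columns="metric", values="score"))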

2. Score summaries with judge LLMs

Using the Prometheus-Eval codebase, we run a range of open and proprietary LLMs to evaluate summaries across five criteria, to later meta-evaluate their correlation with human judgments in Spanish and Basque.

To reproduce the judge-based evaluations, run the script:

python -m src.02_apply_llm_as_a_judge --model MODEL --language LANGUAGE --criterion CRITERION [--num_gpus NUM_GPUS] [--use_async_lite]

where LANGUAGE is one of {eu,es} and CRITERION is one of {Coherence,Consistency,Fluency,Relevance,5W1H}. NUM_GPUS controls the number of GPUs used by vLLM to load the judge locally. Use the flag --use_async_lite instead when using a commercial LLM API for judging (see supported providers here: https://docs.litellm.ai/docs/providers).

The output of this script is located under pred/judges. It consists of one CSV file per language, judge, and criterion, with the following columns:

  • model (str): the model whose summaries have been evaluated. As in the BASSE corpus, it is a combination of the actual model name and the prompt, e.g. "claude-base".
  • feedback (str): the feedback given by the judge to explain, support, or justify the produced score.
  • score (float): the score obtained by the summary.

Each CSV contains one row per summary judged.
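
The per-summary judge scores can be aggregated to system level with pandas, for example as sketched below (the file name is hypothetical; adapt it to the files actually produced under pred/judges):

import pandas as pd

df = pd.read_csv("pred/judges/eu_prometheus_Coherence.csv")  # hypothetical file name
# One row per judged summary, with `model`, `feedback`, and `score` columns.
print(df.groupby("model")["score"].mean().sort_values(ascending=False))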

3. Compute correlations between metric/judge scores and human ratings

To reproduce the correlations between the automatic and human evaluations, run the script:

python -m src.03_compute_correlations --language LANGUAGE [--metrics METRIC [METRIC ...]] [--judges JUDGE [JUDGE ...]]

where LANGUAGE is one of {eu,es}, METRIC is the name of a metric in pred/metrics or all, and JUDGE is the name of a judge in pred/judges or all.

The output of this script is located under results/correlation. It consists of one LaTeX table per language, evaluator type (metric or judge LLM), and correlation statistic (Spearman's rank correlation coefficient, ρ, and Kendall's rank correlation coefficient, τ).
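
For reference, the correlation statistics themselves can be computed with SciPy; the snippet below is a minimal sketch with illustrative system-level scores (the script takes care of loading and pairing the actual metric/judge scores with the human ratings):

from scipy.stats import spearmanr, kendalltau

# Illustrative system-level scores: one value per model-prompt pair.
metric_scores = [0.31, 0.28, 0.35, 0.22, 0.30]
human_ratings = [3.8, 3.5, 4.1, 2.9, 3.6]

rho, _ = spearmanr(metric_scores, human_ratings)
tau, _ = kendalltau(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")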

Use the notebook correlations.ipynb to reproduce the correlations plot included in our article.

BasqueSumm

BasqueSumm is the first collection of news documents and their corresponding subheads (often used as a proxy for summaries) for Basque. It was automatically compiled from www.berria.eus. It is available in 🤗 HuggingFace: https://huggingface.co/datasets/HiTZ/BasqueSumm. Use it from the datasets library:

from datasets import load_dataset

ds_basque_summ = load_dataset("HiTZ/BasqueSumm")

or the pandas library:

import pandas as pd

df_basque_summ = pd.read_json("hf://datasets/HiTZ/BasqueSumm/data/berria_summ.jsonl", lines=True)

Licensing

We release BASSE and BasqueSumm under a CC BY-NC-SA 4.0 license.

Citation

Please cite the following paper if you use the BASSE corpus or its associated codebase:

@misc{barnes2025summarizationmetricsspanishbasque,
      title={Summarization Metrics for {S}panish and {B}asque: Do Automatic Scores and {LLM}-Judges Correlate with Humans?}, 
      author={Jeremy Barnes and Naiara Perez and Alba Bonet-Jover and Begoña Altuna},
      year={2025},
      eprint={2503.17039},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.17039}, 
}
