A Meta-evaluation of Automatic Metrics for Elaborative Simplification

May 11, 2026 · Abdullah Alshatti, Steven Schockaert, Fernando Alva-Manchego
Abstract
Elaborative simplification aims to improve the readability of texts by adding content that helps the readers. However, evaluating these elaborations remains challenging due to their subjective nature and the lack of suitable annotated datasets. To support the evaluation of elaborative simplification models, we introduce a new dataset with human ratings of elaborations generated by Large Language Models (LLMs), focusing on two quality criteria: cohesion and informativeness. Using these human judgments as a reference, we conduct a meta-evaluation of existing automatic evaluation approaches, with a focus on LLM-as-a-judge strategies. Our experiments suggest that evaluations made by smaller LLMs correlate poorly with human judgments, while larger models with structured prompting exhibit higher agreement. Informativeness evaluation proved to be challenging due to its subjectivity, as evidenced by the low inter-annotator agreement compared to cohesion.
Type
Publication
READIxTSAR @ LREC 2026
Authors
PhD Researcher
Fernando Alva-Manchego
Researcher in Natural Language Processing
My research interests include text simplification, readability assessment, multilingual NLP, Welsh language technology, and NLP for education and social care.