A Meta-evaluation of Automatic Metrics for Elaborative Simplification

May 11, 2026

Abdullah Alshatti

Steven Schockaert

Fernando Alva-Manchego

Abstract

Elaborative simplification aims to improve the readability of texts by adding content that helps readers. However, evaluating these elaborations remains challenging because of their subjective nature and the lack of suitable annotated datasets. To support the evaluation of elaborative simplification models, we introduce a new dataset with human ratings of elaborations generated by Large Language Models (LLMs), focusing on two quality criteria: cohesion and informativeness. Using these human judgments as a reference, we conduct a meta-evaluation of existing automatic evaluation approaches, with a focus on LLM-as-a-judge strategies. Our experiments suggest that evaluations made by smaller LLMs correlate poorly with human judgments, while larger models with structured prompting exhibit higher agreement. Evaluating informativeness proved challenging because of its subjectivity, as evidenced by its lower inter-annotator agreement compared with cohesion.
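As a rough illustration of what such a meta-evaluation involves, the sketch below correlates the scores of an automatic judge with human ratings for the same elaborations. The abstract does not say which correlation statistic the paper uses; Spearman rank correlation is assumed here, and the function names and ratings are hypothetical, made up for illustration only.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human cohesion ratings and an automatic judge's scores
# for the same five elaborations (not taken from the paper's dataset).
human_cohesion = [4, 2, 5, 1, 3]
judge_scores = [3.5, 2.0, 4.8, 1.2, 3.0]

rho = spearman(human_cohesion, judge_scores)
print(f"Spearman correlation with human ratings: {rho:.3f}")
```

In this toy example the judge orders the elaborations exactly as the human raters do, so the correlation is at its maximum; a judge that agreed poorly with the raters would score closer to zero.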

Type

Conference paper

Publication

READIxTSAR @ LREC 2026

Last updated on May 11, 2026

Text Simplification Evaluation

Authors

Abdullah Alshatti

PhD Researcher

Fernando Alva-Manchego

Researcher in Natural Language Processing

My research interests include text simplification, readability assessment, multilingual NLP, Welsh language technology, and NLP for education and social care.
