An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework
May 20, 2024
Matthew Shardlow

Fernando Alva-Manchego
Riza Batista-Navarro
Stefan Bott
Saul Calderon Ramirez
Rémi Cardon
Thomas François
Akio Hayakawa
Andrea Horbach
Anna Hülsing
Yusuke Ide
Joseph Marvin Imperial
Adam Nohejl
Kai North
Laura Occhipinti
Nelson Peréz Rojas
Nishat Raihan
Tharindu Ranasinghe
Martin Solis Salazar
Marcos Zampieri
Horacio Saggion
Abstract
We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. The dataset currently comprises 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol, to support the contribution of future datasets, and (2) present summary statistics on the data gathered so far. Multilingual lexical simplification can help low-ability readers engage with otherwise difficult texts in their native, often low-resourced, languages.
Type
Publication
LREC-COLING 2024