An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework

May 20, 2024

Matthew Shardlow

Fernando Alva-Manchego

Riza Batista-Navarro

Stefan Bott

Saul Calderon Ramirez

Rémi Cardon

Thomas François

Akio Hayakawa

Andrea Horbach

Anna Hülsing

Yusuke Ide

Joseph Marvin Imperial

Adam Nohejl

Kai North

Laura Occhipinti

Nelson Peréz Rojas

Nishat Raihan

Tharindu Ranasinghe

Martin Solis Salazar

Marcos Zampieri

Horacio Saggion

Abstract

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol, to support the contribution of future datasets, and (2) present summary statistics on the data gathered so far. Multilingual lexical simplification can help low-ability readers engage with otherwise difficult texts in their native, often low-resourced, languages.

Type

Conference paper

Publication

LREC-COLING 2024