Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs

November 1, 2017

Fernando Alva-Manchego

Equal Contribution

Joachim Bingel

Equal Contribution

Gustavo Paetzold

Carolina Scarton

Lucia Specia



Abstract

Current research in text simplification has been hampered by two central problems: (i) the small amount of high-quality parallel simplification data available, and (ii) the lack of explicit annotations of simplification operations, such as deletions or substitutions, on existing data. While the recently introduced Newsela corpus has alleviated the first problem, simplifications still need to be learned directly from parallel text using black-box, end-to-end approaches rather than from explicit annotations. These complex-simple parallel sentence pairs often differ to such a high degree that generalization becomes difficult. End-to-end models also make it hard to interpret what is actually learned from data. We propose a method that decomposes the task of text simplification (TS) into its sub-problems. We devise a way to automatically identify operations in a parallel corpus and introduce a sequence-labeling approach based on these annotations. Finally, we provide insights on the types of transformations that different approaches can model.
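To make the idea of operation labels concrete, the following is a minimal sketch of how per-token labels could be derived from a complex-simple sentence pair. It is not the method described in the paper: it substitutes a plain token diff from Python's standard `difflib` module for a proper word alignment, and the function name `label_operations` and the label names `KEEP`, `DELETE`, and `REPLACE` are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def label_operations(complex_tokens, simple_tokens):
    """Assign one illustrative operation label to each complex-side token.

    Assumption: a token-level diff stands in for word alignment here,
    and the label set (KEEP / DELETE / REPLACE) is hypothetical.
    """
    # Start from KEEP and overwrite tokens that the diff marks as changed.
    labels = ["KEEP"] * len(complex_tokens)
    matcher = SequenceMatcher(a=complex_tokens, b=simple_tokens, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            # Tokens present only on the complex side.
            for i in range(i1, i2):
                labels[i] = "DELETE"
        elif tag == "replace":
            # Tokens rewritten as something else on the simple side.
            for i in range(i1, i2):
                labels[i] = "REPLACE"
        # "equal" spans stay KEEP; "insert" spans have no complex-side
        # token to label, so this sketch ignores them.
    return labels


if __name__ == "__main__":
    complex_sentence = "he quickly ran home".split()
    simple_sentence = "he ran home".split()
    for token, label in zip(complex_sentence, label_operations(complex_sentence, simple_sentence)):
        print(f"{token}\t{label}")
```

Labels of this kind turn each complex sentence into a tagged token sequence, which is the input format a sequence labeler expects. A diff cannot detect reordering and does not attach insertions to any source token, so a real pipeline would need an alignment step that handles those cases.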

Type

Conference paper

Publication

IJCNLP 2017

Last updated on November 1, 2017

Authors

Fernando Alva-Manchego

Researcher in Natural Language Processing

My research interests include text simplification, readability assessment, multilingual NLP, Welsh language technology, and NLP for education and social care.
