Welsh NLP | Fernando Alva-Manchego

CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh

Mon, 11 May 2026 00:00:00 +0000

Proffiliadur: Welsh Language Text Profiling Toolkit

Mon, 11 May 2026 00:00:00 +0000

Translation is Not Enough (TINE): Plain Language Adaptation of Multilingual Science

Fri, 01 May 2026 00:00:00 +0000

CHIST-ERA project developing multilingual NLP methods to make scientific documents genuinely accessible, combining translation, simplification, and terminology clarification across multiple languages.

Funder: CHIST-ERA / UKRI
Period: 2026 – 2029
Role: Principal Investigator (Cardiff University)
Partners: Cardiff University, Manchester Metropolitan University (UK); Universitat Pompeu Fabra (Spain); Institute of Computer Science, Polish Academy of Sciences (Poland); University of Zurich (Switzerland)
Research theme:

Scientific knowledge is publicly available, but is it accessible? Two barriers stand in the way. First, most research is published in English, excluding communities who speak other languages even when the research is about their own lives. Second, translation alone is not enough: even a translated text remains full of jargon and technical language that non-expert readers cannot understand.

TINE addresses both barriers through a three-step pipeline applied to scientific documents in any language:

Understand — extract document structure, text, headings, tables, and figures from complex PDFs
Translate — produce accurate whole-document translation preserving context and terminology
Adapt — simplify style, explain jargon, and fit the language to the reader

The result is an accurate, accessible document in the target language that real people can read and act on.
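The three steps above can be pictured as a chain of functions that each take a document and return a new one. The sketch below is illustrative only and is not TINE's implementation: the `Document` class, the stage functions, the `translate_text` callable, and the glossary are all hypothetical names introduced here, and each stage is a toy stand-in for the real component (structure extraction, machine translation, plain language adaptation).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container for a document moving through the pipeline."""
    language: str
    sections: List[str]
    log: List[str] = field(default_factory=list)  # names of stages applied so far


Stage = Callable[[Document], Document]


def understand(doc: Document) -> Document:
    # Stand-in for structure extraction: trim whitespace and drop empty sections.
    sections = [s.strip() for s in doc.sections if s.strip()]
    return Document(doc.language, sections, doc.log + ["understand"])


def make_translate(target: str, translate_text: Callable[[str], str]) -> Stage:
    # translate_text is an assumed callable wrapping whatever MT system is used.
    def translate(doc: Document) -> Document:
        sections = [translate_text(s) for s in doc.sections]
        return Document(target, sections, doc.log + ["translate"])
    return translate


def make_adapt(glossary: Dict[str, str]) -> Stage:
    # Stand-in for adaptation: append a plain explanation after each jargon term.
    # Toy behaviour: plain substring replacement, applied to every occurrence.
    def adapt(doc: Document) -> Document:
        sections = []
        for s in doc.sections:
            for term, plain in glossary.items():
                s = s.replace(term, f"{term} ({plain})")
            sections.append(s)
        return Document(doc.language, sections, doc.log + ["adapt"])
    return adapt


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Apply the stages in order: understand, then translate, then adapt.
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    source = Document("en", ["  The cohort was followed for two years. ", ""])
    stages = [
        understand,
        make_translate("cy", lambda s: s),  # identity placeholder, no real translation
        make_adapt({"cohort": "a group of people studied together"}),
    ]
    result = run_pipeline(source, stages)
    print(result.language, result.log)
    print(result.sections)
```

Keeping each step as a separate function with the same signature means a stage can be swapped or evaluated on its own, which matches the way the project describes tools for each step as separate deliverables.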

Welsh-speaking service users routinely receive research consent forms, information sheets, and questionnaires in English, full of technical language. As a result, they cannot meaningfully engage with research that directly concerns them or give fully informed consent. Cardiff’s work focuses on producing plain Welsh versions of these materials: not just translated, but genuinely understandable.

This work is carried out in collaboration with the Centre for Social Care and Artificial Intelligence Learning, and feeds directly into : plain Welsh research materials produced by TINE make evidence accessible to Welsh-speaking social workers and service users.

What TINE will deliver

Open-source tools for document structure extraction, whole-document translation, and plain language adaptation
Multilingual corpora of scientific documents and annotated plain language examples for training
Language resources for Welsh, Polish, Catalan, and Chinese
Open benchmarks for evaluation
All outputs open access and freely reusable

NLP Tools for Welsh Language Assessment and Learning

Sun, 01 Jan 2023 00:00:00 +0000

Welsh Government-funded project developing computational tools for Welsh text complexity analysis, CEFR proficiency assessment, and morphological analysis to support Welsh-language education.

Funder: Welsh Government
Period: 2025 – 2026
Role: Principal Investigator
Research theme:

Welsh is spoken by approximately 900,000 people and has distinctive linguistic features, including initial consonant mutation, that pose significant challenges for standard NLP pipelines. This project develops the foundational NLP infrastructure for Welsh language assessment and learning, in partnership with Welsh-language educational institutions and the Welsh Government.
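To see why initial consonant mutation is awkward for standard pipelines, consider dictionary lookup: "cath" (cat) can surface as "gath", "nghath", or "chath" depending on the grammatical context, so a lookup on the raw token fails. The sketch below is a simplified illustration and not the project's method: it reverses the standard soft, nasal, and aspirate mutation patterns to propose candidate base forms, then filters them against a lexicon. The function names and the tiny lexicon are assumptions made for this example, and the approach over-generates, so a real analyser would also need context to choose between candidates.

```python
# (mutated prefix, radical prefix) pairs for the three Welsh mutation types.
REVERSE_MUTATIONS = [
    # nasal mutation: c->ngh, p->mh, t->nh, g->ng, b->m, d->n
    ("ngh", "c"), ("mh", "p"), ("nh", "t"), ("ng", "g"), ("m", "b"), ("n", "d"),
    # aspirate mutation: c->ch, p->ph, t->th
    ("ch", "c"), ("ph", "p"), ("th", "t"),
    # soft mutation: d->dd, c->g, p->b, t->d, b->f, m->f, ll->l, rh->r
    ("dd", "d"), ("g", "c"), ("b", "p"), ("d", "t"),
    ("f", "b"), ("f", "m"), ("l", "ll"), ("r", "rh"),
]


def radical_candidates(word):
    """Return possible unmutated (radical) forms of a surface word."""
    word = word.lower()
    candidates = [word]  # the word may not be mutated at all
    for mutated, radical in REVERSE_MUTATIONS:
        if word.startswith(mutated):
            candidates.append(radical + word[len(mutated):])
    # Soft mutation deletes an initial g (gardd -> ardd), so restoring
    # a g is always a candidate.
    candidates.append("g" + word)
    return list(dict.fromkeys(candidates))  # de-duplicate, keep order


def lookup(word, lexicon):
    """Keep only the candidates that are attested in the lexicon."""
    return [c for c in radical_candidates(word) if c in lexicon]


if __name__ == "__main__":
    lexicon = {"cath", "pen", "gardd"}  # toy lexicon: cat, head, garden
    for token in ["gath", "nghath", "chath", "mhen", "ardd"]:
        print(token, "->", lookup(token, lexicon))
```

Generating candidates and filtering against a lexicon keeps the rules small, at the cost of ambiguity: an initial "f", for instance, may come from either "b" or "m".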

Key outputs include:

Proffiliadur: an open-source toolkit computing 141 linguistic complexity indices for Welsh texts, supporting CEFR-level classification and accessibility analysis
CEFR-Cymraeg: the first CEFR-annotated proficiency dataset for Welsh (A1–B2), enabling automated language proficiency assessment for Welsh learners
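As a rough illustration of what a linguistic complexity index is, the sketch below computes four generic surface measures for a Welsh text. It does not use Proffiliadur's API and does not reproduce any of its 141 indices: the function name, the regular-expression tokeniser, and the choice of measures are assumptions made for this example. Measures of this kind are commonly used as input features for proficiency-level classifiers.

```python
import re

# Unicode-aware word pattern: runs of letters, optionally joined by an
# apostrophe or hyphen (so "Mae'r" is one token and letters such as ŵ and ŷ
# are kept).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")
# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENT_RE = re.compile(r"(?<=[.!?])\s+")


def surface_indices(text):
    """Compute a few illustrative surface complexity measures."""
    sentences = [s for s in SENT_RE.split(text.strip()) if s]
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    n_tokens = len(tokens)
    n_sentences = len(sentences)
    return {
        "n_tokens": n_tokens,
        "n_sentences": n_sentences,
        # average number of word tokens per sentence
        "mean_sentence_length": n_tokens / n_sentences if n_sentences else 0.0,
        # type-token ratio: distinct word forms divided by total tokens
        "type_token_ratio": len(set(tokens)) / n_tokens if n_tokens else 0.0,
        # average token length in characters (apostrophes and hyphens included)
        "mean_word_length": sum(map(len, tokens)) / n_tokens if n_tokens else 0.0,
    }


if __name__ == "__main__":
    sample = "Mae'r gath yn cysgu. Mae'r ci yn rhedeg yn y parc."
    for name, value in surface_indices(sample).items():
        print(f"{name}: {value}")
```

Note that a type-token ratio computed on raw word forms counts mutated variants of the same word as different types, which is one reason Welsh-specific processing matters for measures like this.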