Integrating Specialized Bilingual Lexicons of Multiword Expressions for Domain Adaptation in Statistical Machine Translation

Abstract: Domain adaptation consists of adapting Machine Translation (MT) systems designed for one domain to work in another. Multiword expressions generally characterize domain-specific vocabularies. Translating multiword expressions is a challenge for current Statistical Machine Translation (SMT) systems because corpus-based approaches are effective only when large amounts of parallel corpora are available. However, parallel corpora are available only for a limited number of language pairs and domains, and the process of building corpora for several language pairs and domains is time-consuming and expensive. This paper describes an experimental evaluation of the impact of using a specialized bilingual lexicon of multiword expressions to obtain better domain adaptation for the state-of-the-art statistical machine translation system Moses. Our study concerns the English-French language pair and two kinds of texts: in-domain texts from Europarl (European Parliament Proceedings) and out-of-domain texts from Emea (European Medicines Agency Documents). We introduce three methods to integrate extracted bilingual multiword expressions into Moses. We show experimentally that integrating specialized bilingual lexicons of multiword expressions improves the translation quality of Moses for both in-domain and out-of-domain texts.
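One common way to inject a bilingual lexicon into a phrase-based SMT system such as Moses is to convert the lexicon entries into additional phrase-table lines. The sketch below illustrates this general strategy only; it is not necessarily one of the paper's three methods, and the example data, function name, and uniform feature scores are illustrative assumptions.

```python
def lexicon_to_phrase_table_entries(lexicon_pairs, score=1.0):
    """Format (source MWE, target MWE) pairs as Moses phrase-table lines.

    Moses phrase tables use the line format:
        source phrase ||| target phrase ||| feature scores
    Here a uniform value is assigned to each of the four standard
    translation-model features, a simple heuristic for injected entries.
    """
    scores = " ".join(f"{score:g}" for _ in range(4))
    return [f"{src} ||| {tgt} ||| {scores}" for src, tgt in lexicon_pairs]

# Illustrative English-French medical multiword expressions.
pairs = [
    ("adverse reaction", "effet indésirable"),
    ("marketing authorisation", "autorisation de mise sur le marché"),
]
for line in lexicon_to_phrase_table_entries(pairs):
    print(line)
```

The resulting lines can be appended to an existing phrase table (or kept as a separate table combined via Moses's multiple-decoding-path support), after which the decoder can use the lexicon entries like any other phrase pair.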
Document type: Conference papers

https://hal-cea.archives-ouvertes.fr/cea-01772655
Contributor: Léna Le Roy
Submitted on: Friday, April 20, 2018 - 2:47:12 PM
Last modification on: Wednesday, January 23, 2019 - 2:39:24 PM

Citation

Nasredine Semmar, Meriama Laib. Integrating Specialized Bilingual Lexicons of Multiword Expressions for Domain Adaptation in Statistical Machine Translation. PACLING 2017: International Conference of the Pacific Association for Computational Linguistics, Aug 2017, Yangon, Myanmar (Burma). ⟨10.1007/978-981-10-8438-6_9⟩. ⟨cea-01772655⟩
