Statistical Machine Translation

 

On the basis of a complete, high-quality Chinese backfile dating back to 1985, 4 million aligned bilingual sentence pairs have been created from human-translated patents. We train and evaluate our Chinese-to-English SMT system with a phrase-based approach that uses jargon and phrases derived from some 20 million English patents. This allows search strategies and English-driven retrieval technologies to be applied to other languages while still achieving useful recall and precision.

Facts

  • Technology: Phrase-based SMT technology (Moses), with customized pre- and post-processing modules
  • Language: Chinese to English automatic translation, customized for patents
  • Training data: 4 million aligned bilingual sentences obtained from human-translated patents, and 1 million English sentences extracted from WO, EP and US patents
  • Production environment: a scalable grid environment with a current capacity of more than 100,000 documents per day
  • Quality:

– Automatic quality assessment mechanism integrated in the workflow, including BLEU score comparison with Google and CNPat (Chinese Patent Information Centre). Currently, our system achieves an average BLEU score improvement of 170% compared with Google.

– Human quality assessment by a team of Chinese native speakers and language specialists, who feed their findings back to improve the translation engine.
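
The BLEU comparison figure quoted above can be read as a relative gain over a baseline score. The page does not state the exact formula, so the following is only a minimal sketch under that assumption; the function name `relative_improvement` and the example scores are hypothetical and are not measured results.

```python
def relative_improvement(baseline_bleu, system_bleu):
    """Percentage gain of system_bleu over baseline_bleu.

    Assumption: "improvement" means (system - baseline) / baseline.
    """
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (system_bleu - baseline_bleu) / baseline_bleu * 100.0


# Hypothetical scores, chosen only to illustrate the arithmetic.
baseline = 0.10
system = 0.27
print(f"{relative_improvement(baseline, system):.0f}% relative improvement")
```

Under this reading, a 170% improvement means the system's score is 2.7 times the baseline score, not 1.7 times.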

 

Glossary

  • Source language – the language of the text to translate
  • Target language – the language into which the source language text is to be translated
  • Monolingual corpus – a corpus of documents in one single language. Used to create the target language model to optimize jargon and phrases for a specific domain
  • Bilingual corpus – a corpus of paired documents in two languages
  • Sentence alignment – the process which takes a bilingual corpus and produces bilingual paired sentences containing equivalent concepts and information
  • RBMT - Rule-based machine translation – a system where translation involves “hand-crafted” linguistic rules and dictionaries
  • TM - Translation Memory – a database of paired sentences in source and target languages, and the mechanism that uses this data to translate sentences
  • SMT - Statistical machine translation – a system where translation involves bilingual and monolingual data automatically acquired from corpora
  • Phrase-based SMT – an SMT system where bilingual data includes not only words but also phrases (sequences of words) of arbitrary length
  • Language model – data and statistics which “describe” the word ordering of a given language. SMT uses a target language model during the actual translation.
  • Disambiguation – for a translation task, disambiguation is the process of selecting the most suitable translation of a given source language word in a given sentence context.
  • Pre- and post-processing – all the workflow steps that prepare the source document before the actual sentence-by-sentence translation, and that afterwards generate a translated document with preserved formatting.
  • BLEU score – an automatic measure of translation quality, computed by counting the words and phrases that the automatically translated text shares with a given human-translated reference.
  • Fluency – used in human assessment to measure the readability of a machine translated text
  • Adequacy – used in human assessment to measure how much of the original information is conveyed in the machine translated text
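
To make the BLEU score entry above more concrete, here is a minimal sketch of a simplified BLEU calculation. It assumes a single reference, whitespace tokenization and no smoothing, and it works at sentence level; production evaluations normally use an established tool and corpus-level statistics. The function names `ngrams` and `simple_bleu` are illustrative and not part of any system described here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in the range [0, 1]."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the device comprises a housing"
print(simple_bleu("the device comprises a metal housing", reference))
```

An identical candidate and reference score 1.0, while a candidate that shares no 4-gram with the reference scores 0.0 in this unsmoothed version, which is one reason sentence-level scores are usually smoothed or aggregated over a whole test set.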