Machine translation – 3 buzzwords, Mr. Spock and the Babel fish
Published Mar 4, 2010 by Andreas Tuerk
Machine translation is one of many technologies that are heavily used in Sci-Fi stories. If First Officer Spock wants to communicate with an unknown species, he can use a hand-held device called the “Universal Translator” that allows both parties to understand each other. Those familiar with “The Hitchhiker's Guide to the Galaxy” might also know a little creature called the Babel fish which, inserted into one's ear, provides translation between any two languages.
While we are still waiting for some of the other Sci-Fi technologies to come to fruition, like wormhole travel or transporter beams, machine translation has made steady progress over the last decades. In many cases the performance of today's machine translation might still pale in comparison to the Babel fish or the Universal Translator in Star Trek, but it can still be of great value in many situations. An English-speaking patent searcher interested, for instance, in the content of a Chinese patent might be satisfied with an almost correct translation of some key sentences, even if the rest of the translation is somewhat garbled. There are many other usage scenarios for today's not quite perfect machine translation technology. But rather than trying to convince you that machine translation is useful by drawing up a long list of use cases, I will explain three buzzwords in machine translation research: phrase translation, statistical system and rule-based system. Finally, I will explain how these buzzwords fit into our current machine translation framework.
At a very basic level, translation is the substitution of words in the source language by words in the target language. Consider, for instance, the English sentence “This sentence is short”. Translating this into German results in “Dieser Satz ist kurz”. Translation in this case is therefore a simple word-by-word substitution, which computers can do fast and reliably. Unfortunately, the case in which word-by-word substitution is sufficient to derive the correct translation is the exception rather than the rule. Consider, for example, the sentence “I went home.” This translates into the German sentence “Ich bin nachhause gegangen.” The only word that can be correctly translated just by replacing it with its translation from a dictionary is “I”; the rest requires some additional knowledge about German grammar.
Instead of implementing grammatical rules, however, one can also derive the correct translation by extending the concept of a dictionary. Such a generalized dictionary contains not only word-by-word translations but also phrase-by-phrase translations and is therefore called a phrase translation table (PTT). In the present example the PTT might contain the entry “went home → bin nachhause gegangen”. With such a PTT the sentence “I went home” can now be correctly translated. Unfortunately, translating the sentence “He went home” results in “Er bin nachhause gegangen”, which is grammatically wrong, because the German form of the verb “to be” for the 3rd person singular is “ist”. One possible solution is to add the entry “went home → ist nachhause gegangen” to the PTT. This gives the correct translation, provided it is known which translation of “went home” has to be chosen. The above example illustrates that a machine translation system has to have two components:
- A dictionary or more generally a phrase translation table (PTT).
- A method for choosing between the different translation options that are provided by the PTT.
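The first of these components can be sketched in a few lines of Python. This is a minimal illustration, not the way a real system stores its phrase table: the table contains only the entries from the example above, the function name `candidate_translations` is made up for this sketch, and the greedy longest-match segmentation and lowercasing are simplifying assumptions.

```python
# Toy phrase translation table (PTT) with the entries from the example.
# Keys are source phrases as word tuples, values list all known translations.
PTT = {
    ("i",): ["ich"],
    ("he",): ["er"],
    ("went", "home"): ["bin nachhause gegangen", "ist nachhause gegangen"],
}


def candidate_translations(sentence, ptt=PTT):
    """Return every translation the PTT allows for the sentence.

    Assumption: the sentence is segmented greedily, always taking the
    longest source phrase that has an entry in the table.
    """
    words = sentence.lower().rstrip(".").split()
    max_len = max(len(key) for key in ptt)
    candidates = [""]
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in ptt:
                # Extend every partial translation with every option.
                candidates = [
                    (partial + " " + option).strip()
                    for partial in candidates
                    for option in ptt[key]
                ]
                i += n
                break
        else:
            raise KeyError(f"no PTT entry covers {words[i]!r}")
    return candidates


print(candidate_translations("He went home"))
```

For “He went home” the table yields two candidates, “er bin nachhause gegangen” and “er ist nachhause gegangen”. The PTT alone cannot tell which one is right, which is why the second component is needed.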
The methods for choosing between different translation options fall into two broad categories: rule-based and statistical. A rule-based translation system is a little bit like the grammar of a language, which defines different word forms and determines which of these forms can be combined to obtain a correct sentence. A rule-based system might therefore reject the sentence “Er bin ...” on the basis that it is grammatically incorrect. A statistical system, on the other hand, has no notion of a correct sentence; it only states that certain sentences are more likely than others. Such a system will probably not reject the sentence “Er bin ...” outright but will instead assign it a very small probability. In the above example both approaches lead to the same result: since “Er bin ...” has a much smaller probability than “Er ist ...”, the statistical system also picks the correct translation.
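The difference between the two approaches can be illustrated with a toy example. This is only a sketch under strong assumptions: the single agreement rule, the bigram probabilities and the default probability are invented for illustration and are not taken from any trained model, and the function names are hypothetical.

```python
# Rule-based side: one hand-written agreement rule (subject -> form of "sein").
AGREEMENT = {"ich": "bin", "er": "ist"}

# Statistical side: made-up bigram probabilities standing in for a language
# model that would normally be estimated from large amounts of text.
BIGRAM_PROB = {
    ("ich", "bin"): 0.2,
    ("er", "ist"): 0.2,
    ("ich", "ist"): 0.0001,
    ("er", "bin"): 0.0001,
}
DEFAULT_PROB = 0.01  # assumed probability for word pairs not listed above


def rule_based_accepts(sentence):
    """Accept or reject: False if a subject is followed by the wrong verb form."""
    words = sentence.lower().split()
    for subject, verb in zip(words, words[1:]):
        if subject in AGREEMENT and verb in AGREEMENT.values():
            if verb != AGREEMENT[subject]:
                return False
    return True


def statistical_score(sentence):
    """Product of bigram probabilities: small for unlikely sentences, never zero."""
    words = sentence.lower().split()
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= BIGRAM_PROB.get(pair, DEFAULT_PROB)
    return score


candidates = ["er bin nachhause gegangen", "er ist nachhause gegangen"]

# The rule-based system filters out the ungrammatical candidate ...
print([c for c in candidates if rule_based_accepts(c)])
# ... while the statistical system ranks both and keeps the most likely one.
print(max(candidates, key=statistical_score))
```

Both selections end up with “er ist nachhause gegangen”, but for different reasons: the rule rejects the alternative outright, whereas the statistical score merely ranks it lower.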
One major disadvantage of rule-based systems is that they often have to be tweaked by hand. Statistical systems, by contrast, can be trained automatically on large data sets and therefore typically yield much better translation performance. This fact, together with the flexibility of phrase translation tables, is the reason why at Matrixware we currently use a statistical phrase-based translation system. This system builds on the open source software Moses, which we have trained on data from the patent domain consisting of ~3.5 million bilingual Chinese/English sentence segments and monolingual English data containing ~872 000 sentences.
So far the results have been fairly encouraging. On patent data the quality of our translations is considerably better than that of Google or other competitors. To determine the quality of our system we use the BLEU (bilingual evaluation understudy) metric. This is the geometric mean of the uni-, bi-, tri- and four-gram precisions of our translations, multiplied by a penalty which reflects the mismatch in length between the reference and the output of the statistical machine translation (SMT) system. In terms of the BLEU metric we obtain an improvement of between 40% and 140% over Google on our various test sets. In addition to the high quality of our translations we also achieve a high throughput on our grid computing environment, which currently churns out 100 000 to 130 000 translated documents a day. This corresponds to an input of about 500 000 Chinese characters a minute or an output of about 200 000 English words a minute.
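To make the BLEU definition more concrete, here is a minimal sketch for a single sentence and a single reference translation. It is not the evaluation script used for the figures above: standard BLEU is computed over a whole test corpus and may use several references, and this sketch assumes whitespace tokenization and the standard brevity penalty, which only penalizes output that is shorter than the reference.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate sentence and one reference sentence."""
    cand = candidate.split()
    ref = reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    if len(cand) >= len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))  # identical output
print(bleu("the cat sat on", reference))          # correct but too short
```

An output identical to the reference scores 1. The truncated output has perfect n-gram precision but is shorter than the reference, so only the length penalty lowers its score.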
We are currently extending our efforts to the translation of Japanese patents. For this purpose, we are training an SMT system on ~18.5 million bilingual Japanese/English sentence segments. This training set is substantially larger than the one we used for Chinese-to-English translation, and we therefore hope to achieve a similar or maybe even higher translation quality for Japanese-to-English translation.
Of course, our translation engine has been optimized on patent data and we don't know how well it will fare in other domains. But if you plan to communicate with an unknown species in the alien world of Chinese or Japanese patents, then you could do worse than use the Matrixware translation system, at least as long as nobody has inserted a living specimen of the Babel fish into your ear.