Learning-to-Translate Based on the S-SSTC

## Abstract

We present the S-SSTC framework for machine translation (MT), introduced in 2002 and developed since then into a set of working MT systems (SiSTeC-ebmt for English-Malay and English-Chinese). Our approach is example-based, but differs from other Example-Based Machine Translation (EBMT) approaches in that it uses alignments of string-tree alignments, and in that supervised learning is an integral part of the approach.

In this presentation, we would like to stress one particular aspect: this approach is better able to model the translation knowledge of human translators than other example-based approaches. Because the translation knowledge is represented as alignments (synchronizations) between string-tree alignments (SSTCs, or structured string-tree correspondences), it is more natural to translators (and post-editors) than the direct word-word, string-string or chunk-chunk correspondences used in classical Statistical Machine Translation (SMT) and EBMT models. It is also entirely static, hence more understandable than the procedural knowledge embedded in almost all Rule-based Machine Translation (RBMT) approaches.

The learning process, which is an integral part of the development of SiSTeC-ebmt MT systems, is like any other machine learning task: it is concerned with modeling and understanding learning phenomena with respect to the ‘world’, a central aspect of cognition. Traditional theories of machine translation systems, however, have assumed that such cognition can be studied separately from learning. They assume that the knowledge is given to the system, stored in some representation language with a well-defined meaning, and that there is some mechanism which can be used to determine what source language text can be translated with respect to the given knowledge. The question of how this knowledge might be acquired, and whether this should influence how the performance of the machine translation system is measured, is not considered. We demonstrate the usefulness of the ‘learning-to-translate’ approach by showing that, through interaction with the world, the developed EBMT system gains translating power beyond what is possible in more traditional settings.

Bilingual parallel texts, which encode the correspondences between source and target sentences, have been used extensively in implementing so-called example-based machine translation systems. To enhance the quality of example-based systems, the sentences of a parallel corpus are normally annotated with their constituent or dependency structures, which in turn allows correspondences between source and target sentences to be established at the structural level. Here, we annotate parallel texts based on the Structured String-Tree Correspondence (SSTC). The SSTC is a general structure that can associate with strings in a language arbitrary tree structures, chosen by the annotator as the interpretation structures of the strings. More importantly, it provides a facility to specify the correspondence between the string and the associated tree, which can be interpreted for both analysis and synthesis in the machine translation process. These features are highly desirable in the design of an annotation scheme, in particular for the treatment of certain non-standard linguistic phenomena, such as non-projectivity or inversion of dominance.

In this presentation, we will demonstrate how to use the good properties of the SSTC annotation scheme for S-SSTC-based MT, using the example of the SiSTeC-ebmt English-Malay machine translation system. We have chosen dependency structures as linguistic representations in the SSTCs, since they provide a natural way of annotating both the tree associated with a string and the mapping between the two. We also give a simple means to denote the translation elements between the corresponding source (English) and target (Malay) SSTCs. The dependency structure used here is in fact quite analogous to the abstract syntax trees used in most compiler implementations. However, we note that the SSTCs can easily be extended to keep multiple levels of linguistic representation (e.g. syntagmatic, functional and logical structures) if that is considered important to enhance the results of the machine translation system. Naturally, the more information is annotated in an SSTC, the more difficult the annotation work becomes; that is why one should try to keep only the annotations contributing most to the task at hand.

In the general case, let $S$ be a string (usually a sentence) and $T$ a tree (its linguistic representation). Instead of simply writing $(S,T)$, we want to decompose that ‘large’ correspondence into smaller ones $(S_1, T_1), \ldots, (S_n, T_n)$ in a hierarchical fashion; hence the adjective ‘structured’ in ‘SSTC’. If $T$ is an abstract representation of $S$, some nodes may represent discontinuous words or constituents (e.g. He gives the money back to her), some words may not be directly represented (e.g. auxiliaries, articles), and some words omitted (elided) in $S$ may have been restored in $T$. In the SSTC diagrams presented here, any tree node $N$ bears a pair $X/Y$ where $X = \text{SNODE}(N)$ and $Y = \text{STREE}(N)$. $X$ and $Y$ are generalized (not necessarily contiguous) substrings of the string $S$, and are written as minimal left-to-right lists of usual intervals, such as 1_3+4_5. $\text{SNODE}(N)$ denotes the substring that corresponds to the lexical information contained in node $N$, while $\text{STREE}(N)$ denotes the (again possibly discontinuous) substring that corresponds to the whole subtree rooted at node $N$.
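
As a minimal sketch of this interval notation (not part of SiSTeC-ebmt itself), the following Python functions parse a generalized substring such as 1_3+4_5 and rewrite it in minimal form. The function names are hypothetical, and the sketch assumes that ‘minimal’ means sorted left to right with touching or overlapping intervals merged.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) positions between words


def parse_intervals(spec: str) -> List[Interval]:
    """Parse a generalized substring such as '1_3+4_5' into [(1, 3), (4, 5)]."""
    intervals = []
    for part in spec.split("+"):
        start, end = part.split("_")
        intervals.append((int(start), int(end)))
    return intervals


def minimize(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals left to right and merge those that touch or overlap.

    Assumption: this is what 'minimal' means, so 1_3+3_5 would be written 1_5.
    """
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def format_intervals(intervals: List[Interval]) -> str:
    """Write intervals back in the start_end+start_end notation."""
    return "+".join(f"{start}_{end}" for start, end in intervals)
```

Under that assumption, 1_3+4_5 is already minimal because the two intervals are separated by a gap, whereas 3_5+1_3 would be rewritten as 1_5.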

As for the correspondences between the source (English) and target (Malay) SSTCs, the translation elements between phrases and words are coded in terms of STREE pairs and SNODE pairs, respectively. To illustrate this, we show in Figure 1 a pair of source (English) and target (Malay) SSTCs and the corresponding translation elements. In the example SSTCs given, an interval is assigned to each word in the sentence, i.e. 0_1 to “if”, 1_2 to “the”, etc. The node “not” has $\text{SNODE} =$ 5_6, meaning that its lexeme corresponds to the word “not” in the sentence. Similarly, the node bearing “is” has $\text{STREE} =$ 1_5+6_10, meaning that the subtree it dominates corresponds to the discontinuous substring “the oil level is” + “at the ADD mark”.
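
The intervals quoted above can be checked with a short Python sketch. It assumes that the sentence consists of the ten words implied by those intervals and that an interval i_j covers the words from position i up to, but not including, position j. The helper name `substring` is hypothetical.

```python
from typing import List

# Sentence reconstructed from the intervals quoted in the text (an assumption:
# the full sentence in Figure 1 may differ beyond these ten words).
words = "if the oil level is not at the ADD mark".split()


def substring(words: List[str], spec: str) -> List[str]:
    """Return the pieces of a generalized substring such as '1_5+6_10'."""
    pieces = []
    for part in spec.split("+"):
        start, end = (int(x) for x in part.split("_"))
        pieces.append(" ".join(words[start:end]))
    return pieces


snode_not = substring(words, "5_6")      # ['not']
stree_is = substring(words, "1_5+6_10")  # ['the oil level is', 'at the ADD mark']
```

The gap between 1_5 and 6_10 is exactly the interval 5_6 of “not”, which is why the subtree dominated by “is” maps to a discontinuous substring.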

## Implementation notes

The main purpose of the project described here is to build a general software package that provides an integrated environment for the construction of S-SSTC-based EBMT systems. In this project, we put emphasis on the development of an English->Malay MT system. However, the same methodology can be adapted to develop MT systems for any other language pair. The current SiSTeC-ebmt platform consists of four major subcomponents, namely

1. the preparation of annotated bilingual parallel texts to be used for the initial learning process,
2. a set of acquisition tools used to construct the initial bilingual knowledge bank (BKB),
3. a general MT system to translate new input sentences (using the BKB) into the target language, together with all the related annotations,
4. the post-editing process to make corrections (if any) to the translation as well as to the annotations, which in turn will be used by the learning tools to confirm the well-translated parts and adjust the translation elements of the BKB corresponding to the corrected parts.
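
The feedback loop formed by these four subcomponents can be illustrated with a toy Python sketch. This is not the SiSTeC-ebmt implementation: the class, its methods and the integer scoring scheme are all hypothetical, and real translation elements are STREE and SNODE pairs rather than bare words. The word pair is borrowed from the sample outputs later in this section, where “package” is rendered as “bungkusan” with the smallest knowledge bank and as “pakej” with the largest.

```python
from collections import defaultdict
from typing import Optional


class BilingualKnowledgeBank:
    """Toy BKB (hypothetical): source element -> {target element: score}."""

    def __init__(self) -> None:
        self.scores = defaultdict(lambda: defaultdict(int))

    def learn(self, source: str, target: str) -> None:
        """Acquisition step: record one more example of source -> target."""
        self.scores[source][target] += 1

    def translate(self, source: str) -> Optional[str]:
        """Translation step: pick the highest-scoring target, if any."""
        candidates = self.scores.get(source)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

    def post_edit(self, source: str, proposed: Optional[str], corrected: str) -> None:
        """Post-editing step: confirm a good proposal or adjust a corrected one."""
        if proposed is not None and proposed != corrected:
            self.scores[source][proposed] -= 1  # penalize the rejected element
        self.learn(source, corrected)           # reinforce the accepted element


bkb = BilingualKnowledgeBank()
bkb.learn("package", "bungkusan")          # initial annotated example
first = bkb.translate("package")           # proposal before any feedback
bkb.post_edit("package", first, "pakej")   # post-editor corrects the proposal
second = bkb.translate("package")          # proposal after learning
```

In this sketch a single correction is enough to change the preferred translation; a real system would presumably weigh feedback against the evidence already stored in the BKB.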

An English->Malay MT system with 100,000 translation examples annotated in the S-SSTC has been constructed on the implementation framework described above. To give an overview of its performance, the table below offers a quick comparison of the MT output produced by Google Translate and by our SiSTeC-ebmt system.

| Sample English Text | Translation to Malay by Google Translate | Translation to Malay by SiSTeC-ebmt (100,000 S-SSTCs) |
| --- | --- | --- |
| The main purpose of the project described in this paper is to build a general software package that provides an integrated environment for the construction of S-SSTC based EBMT systems. In this project, we put emphasis on the development of an English->Malay MT system in the domain of computer science texts. However, the same methodology can be adapted to develop MT systems for any other typology of texts, and naturally also for any other language pairs. | Tujuan utama projek yang dihuraikan dalam kertas kerja ini adalah untuk membina satu pakej perisian umum yang menyediakan persekitaran bersepadu bagi pembinaan sistem S-SSTC EBMT berasaskan. Dalam projek ini, kami meletakkan penekanan kepada pembangunan bahasa Inggeris> MT sistem bahasa Melayu dalam domain teks sains komputer. Walau bagaimanapun, kaedah yang sama boleh disesuaikan untuk membangunkan sistem MT bagi mana-mana tipologi teks lain, dan secara semulajadi juga untuk mana-mana pasangan bahasa lain. | Tujuan utama daripada projek itu digambarkan di dalam kertas ini untuk membina perisian umumnya pakej yang menyediakan mengintegrasikan S-SSTC persekitaran bagi pembinaan berdasarkan EBMT sistem. Dalam projek ini, kami meletakkan teks sains sistem menekankan pembangunan English->Malay MT di domain komputer. Walau bagaimanapun, metodologi sama boleh disesuaikan mengikut merangka sistem Tm untuk tipologi lain teks, dan secara semula jadi juga untuk sebarang pasang bahasa-bahasa lain. |

The following table compares the results produced by our SiSTeC-ebmt system for the same sample text with different sizes of its bilingual knowledge bank.

| Translation to Malay by SiSTeC-ebmt (1,500 S-SSTCs) | Translation to Malay by SiSTeC-ebmt (25,000 S-SSTCs) | Translation to Malay by SiSTeC-ebmt (100,000 S-SSTCs) |
| --- | --- | --- |
| Tujuan utama projek itu memerikan dengan kertas ini untuk membina bungkusan perisian jeneral yang memberikan mengintegrasikan persekitaran untuk pembinaan S-SSTC menempatkan sistem EBMT. Dalam projek ini, kami menyimpan penekanan terhadap perkembangan-perkembangan English->Malay MT sistem dalam kawasan kekuasaan komputer teks sains. Walau bagaimanapun, metodologi sama boleh menjadi disadur untuk berkembang MT sistem untuk sebarang typology yang lain (-lain) teks, dan semula jadinya juga untuk sebarang pasangan bahasa yang lain (-lain). | Tujuan sesalur projek itu dikatakan dengan kertas ini untuk membina perisian jeneral bungkusan memberikan yang mengintegrasikan persekitaran untuk senibina S-SSTC berasaskan sistem EBMT. Dalam projek ini, kami meletakkan penekanan terhadap perkembangan English->Malay MT sistem dalam domain komputer teks sains. Walau bagaimanapun, perkaedahan yang sama boleh menjadi disesuaikan memajukan sistem MT untuk typology yang lain (-lain) teks, dan semula jadinya juga untuk bahasa yang lain (-lain) pasang. | Tujuan utama daripada projek itu digambarkan di dalam kertas ini untuk membina perisian umumnya pakej yang menyediakan mengintegrasikan S-SSTC persekitaran bagi pembinaan berdasarkan EBMT sistem. Dalam projek ini, kami meletakkan teks sains sistem menekankan pembangunan English->Malay MT di domain komputer. Walau bagaimanapun, metodologi sama boleh disesuaikan mengikut merangka sistem Tm untuk tipologi lain teks, dan secara semula jadi juga untuk sebarang pasang bahasa-bahasa lain. |