SiSTeC-ebmt

The S-SSTC framework for machine translation (MT) was introduced in 2002 [1], and has since been developed into a set of working MT systems, namely SiSTeC-ebmt for English-Malay and English-Chinese.

Our approach is example-based, but differs from other Example-based Machine Translation (EBMT) approaches in that

  • it uses alignments of string-tree alignments,
  • it treats supervised learning as an integral part of the approach.

We believe our approach models the translation knowledge of human translators more naturally than other example-based approaches.

Features and Advantages

In the SiSTeC-ebmt framework, translation knowledge is represented as alignments (synchronizations) between string-tree alignments (SSTCs). Specifically, we annotate parallel texts as S-SSTC structures, which have highly desirable features for the treatment of certain non-standard linguistic phenomena.

Such structures come more naturally to translators and post-editors than the direct word-word, string-string or chunk-chunk correspondences used in classical Statistical Machine Translation (SMT) and EBMT models. The synchronised SSTC structures are completely static, and hence more understandable than the procedural knowledge embedded in almost all Rule-based Machine Translation (RBMT) approaches.

The learning process, which is integral to the development of SiSTeC-ebmt MT systems and similar to other machine learning tasks, is concerned with modelling and understanding learning phenomena with respect to the ‘world’: a central aspect of cognition. Traditional theories of MT systems, however, have assumed that such cognition can be studied separately from learning. It is assumed that the knowledge is given to the system, stored in some representation language with a well-defined meaning, and that there is some mechanism which can be used to determine what source language text can be translated with the given knowledge. The question of how this knowledge might be acquired, and whether this should influence how the MT system's performance is measured, is not considered. We demonstrate the usefulness of the ‘learning-to-translate’ approach by showing that, through interaction with the world, the developed SiSTeC-ebmt systems truly gain additional translating power over what is possible in more traditional settings.

Annotation of Translation Elements

In the SiSTeC-ebmt framework, bilingual parallel texts are annotated as S-SSTCs. We have chosen dependency structures as the linguistic tree representations in the SSTCs. We note that the SSTCs can easily be extended to record multiple levels of linguistic representation (e.g. syntagmatic, functional and logical structures), if these are deemed important to enhance the MT outputs. Naturally, the more information annotated in an SSTC, the more difficult the annotation work is: that is why we usually endeavor to record only the annotations contributing most to the task at hand.

The following shows an English-Malay translation example, annotated as an S-SSTC. The translation elements between phrases and words are coded as SNODE and STREE pairs. The tree node “not” has SNODE = 5_6, meaning its lexeme corresponds to the word “not” in the English sentence. Similarly, the tree node “is” has STREE = 1_5+6_10, meaning that the subtree it dominates corresponds to the discontinuous substring “the oil level is … at the ADD mark”.

[Figure: ssstc-adv.png]

SNODE correspondences (word):
(0_1+10_11, 0_1+8_9) (2_3, 2_3) (3_4, 1_2) (4_5, 4_5) (5_6, 3_4) (6_7, 5_6) (8_9, 7_8) (9_10, 6_7) (11_12, 9_10) (13_14, 12_13) (14_15, 10_11) (15_16, 13_14) (17_18, 14_16)

STREE correspondences (phrase):
(0_18, 0_16) (1_4, 1_3) (1_5+6_10, 1_3+4_8) (1_10, 1_8) (6_10, 5_8) (7_10, 6_8) (11_18, 9_16) (12_15, 10_13) (15_18, 13_16) (16_18, 14_16)
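
The interval notation used in these correspondences can be read mechanically. The following is a minimal Python sketch, not part of SiSTeC-ebmt itself: it assumes that a span such as 1_5 denotes the words from position 1 up to (but not including) position 5, and that "+" joins the parts of a discontinuous span. This reading is inferred from the two examples in the text. The helper names parse_span and covered_text are hypothetical, and the word at position 0 of the English sentence is not given in the text, so a placeholder stands in for it.

```python
def parse_span(span):
    """Parse an interval string such as '1_5+6_10' into [(1, 5), (6, 10)]."""
    intervals = []
    for part in span.split("+"):
        start, end = part.split("_")
        intervals.append((int(start), int(end)))
    return intervals


def covered_text(tokens, span, gap=" … "):
    """Return the words covered by a span; discontinuous parts are joined by `gap`."""
    return gap.join(" ".join(tokens[start:end]) for start, end in parse_span(span))


# Words 1 to 9 are inferred from the spans quoted in the text; the word at
# position 0 is not given there, so "<?>" is a placeholder (assumption).
tokens = ["<?>", "the", "oil", "level", "is", "not", "at", "the", "ADD", "mark"]

# A few of the (English, Malay) correspondences from the lists above.
snode_pairs = {"5_6": "3_4", "3_4": "1_2"}
stree_pairs = {"1_5+6_10": "1_3+4_8", "6_10": "5_8"}

print(covered_text(tokens, "5_6"))          # not
print(covered_text(tokens, "1_5+6_10"))     # the oil level is … at the ADD mark
print(parse_span(stree_pairs["1_5+6_10"]))  # [(1, 3), (4, 8)]
```

Under this reading, the English span 1_5+6_10 and its Malay counterpart 1_3+4_8 are both discontinuous, which is the kind of correspondence that plain word-word or chunk-chunk alignments cannot express directly.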

Subcomponents

The SiSTeC-ebmt platform consists of four major subcomponents:

  • Preparation of annotated bilingual parallel texts;
  • A set of acquisition tools to construct the initial bilingual knowledge bank (BKB) from the annotated parallel texts;
  • A general MT system to translate input sentences into the target language using the BKB, together with all the related annotations;
  • A post-editing process to correct the translations and/or the annotations where needed. The result is used by the learning tools to confirm correctly translated parts, and to adjust the translation elements in the BKB based only on the corrections.
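
The way these subcomponents feed into each other can be illustrated with a deliberately simplified Python sketch. It is not the SiSTeC-ebmt implementation: the real BKB stores S-SSTC translation elements, whereas this toy version stores plain phrase pairs with counts. The class ToyBKB and its methods are hypothetical names, and the word pairs are only illustrative.

```python
from collections import Counter, defaultdict


class ToyBKB:
    """Toy bilingual knowledge bank: source phrase -> counts of target phrases."""

    def __init__(self):
        self.entries = defaultdict(Counter)

    def acquire(self, examples):
        """Acquisition step: load (source, target) pairs from annotated texts."""
        for source, target in examples:
            self.entries[source][target] += 1

    def translate(self, source):
        """Translation step: return the best-supported target, or None."""
        candidates = self.entries.get(source)
        if not candidates:
            return None
        return candidates.most_common(1)[0][0]

    def learn_from_post_edit(self, source, proposed, corrected):
        """Learning step: confirm a correct output, or adjust after a correction."""
        if corrected == proposed:
            self.entries[source][proposed] += 1      # confirmed by the post-editor
        else:
            self.entries[source][corrected] += 1     # record the correction
            if self.entries[source][proposed] > 0:
                self.entries[source][proposed] -= 1  # demote the rejected output


bkb = ToyBKB()
bkb.acquire([("package", "bungkusan")])
first = bkb.translate("package")                     # proposed by the system
bkb.learn_from_post_edit("package", first, "pakej")  # post-editor corrects it
print(bkb.translate("package"))
```

After the correction, the system prefers the post-edited translation. Only the entry touched by the correction changes, which mirrors the idea that the BKB is adjusted based only on the corrections.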

Example Output

An English-Malay MT system with 100,000 translation examples (annotated as S-SSTCs) has been constructed. The following is a quick comparison of the MT results produced by Google Translate and SiSTeC-ebmt.

Sample English Text:
The main purpose of the project described in this paper is to build a general software package that provides an integrated environment for the construction of S-SSTC based EBMT systems. In this project, we put emphasis on the development of an English->Malay MT system in the domain of computer science texts. However, the same methodology can be adapted to develop MT systems for any other typology of texts, and naturally also for any other language pairs.

Translation to Malay by Google Translate:
Tujuan utama projek yang dihuraikan dalam kertas kerja ini adalah untuk membina satu pakej perisian umum yang menyediakan persekitaran bersepadu bagi pembinaan sistem S-SSTC EBMT berasaskan. Dalam projek ini, kami meletakkan penekanan kepada pembangunan bahasa Inggeris> MT sistem bahasa Melayu dalam domain teks sains komputer. Walau bagaimanapun, kaedah yang sama boleh disesuaikan untuk membangunkan sistem MT bagi mana-mana tipologi teks lain, dan secara semulajadi juga untuk mana-mana pasangan bahasa lain.

Translation to Malay by SiSTeC-ebmt (100k S-SSTCs):
Tujuan utama daripada projek itu digambarkan di dalam kertas ini untuk membina perisian umumnya pakej yang menyediakan mengintegrasikan S-SSTC persekitaran bagi pembinaan berdasarkan EBMT sistem. Dalam projek ini, kami meletakkan teks sains sistem menekankan pembangunan English->Malay MT di domain komputer. Walau bagaimanapun, metodologi sama boleh disesuaikan mengikut merangka sistem Tm untuk tipologi lain teks, dan secara semula jadi juga untuk sebarang pasang bahasa-bahasa lain.

The following is a comparison of the results produced by our SiSTeC-ebmt system using different BKB sizes.

Translation to Malay by SiSTeC-ebmt (1.5k S-SSTCs):
Tujuan utama projek itu memerikan dengan kertas ini untuk membina bungkusan perisian jeneral yang memberikan mengintegrasikan persekitaran untuk pembinaan S-SSTC menempatkan sistem EBMT. Dalam projek ini, kami menyimpan penekanan terhadap perkembangan-perkembangan English->Malay MT sistem dalam kawasan kekuasaan komputer teks sains. Walau bagaimanapun, metodologi sama boleh menjadi disadur untuk berkembang MT sistem untuk sebarang typology yang lain (-lain) teks, dan semula jadinya juga untuk sebarang pasangan bahasa yang lain (-lain).

Translation to Malay by SiSTeC-ebmt (25k S-SSTCs):
Tujuan sesalur projek itu dikatakan dengan kertas ini untuk membina perisian jeneral bungkusan memberikan yang mengintegrasikan persekitaran untuk senibina S-SSTC berasaskan sistem EBMT. Dalam projek ini, kami meletakkan penekanan terhadap perkembangan English->Malay MT sistem dalam domain komputer teks sains. Walau bagaimanapun, perkaedahan yang sama boleh menjadi disesuaikan memajukan sistem MT untuk typology yang lain (-lain) teks, dan semula jadinya juga untuk bahasa yang lain (-lain) pasang.

Translation to Malay by SiSTeC-ebmt (100k S-SSTCs):
Tujuan utama daripada projek itu digambarkan di dalam kertas ini untuk membina perisian umumnya pakej yang menyediakan mengintegrasikan S-SSTC persekitaran bagi pembinaan berdasarkan EBMT sistem. Dalam projek ini, kami meletakkan teks sains sistem menekankan pembangunan English->Malay MT di domain komputer. Walau bagaimanapun, metodologi sama boleh disesuaikan mengikut merangka sistem Tm untuk tipologi lain teks, dan secara semula jadi juga untuk sebarang pasang bahasa-bahasa lain.

Bibliography

1. Al-Adhaileh, M., Tang, E. K. & Zaharin, Y. (2002). A synchronization structure of SSTC and its applications in machine translation. In COLING 2002 Post-Conference Workshop “Workshop on Machine Translation in Asia”. Taipei, Taiwan.
2. Boitet, C., Zaharin, Y. & Tang, E. K. (2011). Learning-to-translate based on the S-SSTC annotation schema. In Proceedings of PACLIC 2011. Singapore.