Synchronous Structured String-Tree Correspondence (S-SSTC)

The Synchronous Structured String-Tree Correspondence (S-SSTC) [2] is a flexible annotation schema that declaratively describes (possibly irregular and non-standard) correspondences between a pair of SSTCs, i.e. between a pair of strings and their respective arbitrary tree representations. The S-SSTC has applications in various NLP tasks, such as example-based machine translation and question answering.

An S-SSTC is a general structure, comprising a pair of SSTCs. The ‘synchronised’ correspondences between the two SSTCs are specified on two levels:

• synchronous correspondences between the tree nodes (i.e. lexical alignments) are described by SNODE interval correspondences;
• synchronous correspondences between the subtrees (i.e. structural alignments) are described by STREE interval correspondences.

## Formal Definition

Let $S$ and $T$ be two SSTCs. An S-SSTC is a triple $(S,T,\varphi_{S,T})$ where $\varphi_{S,T}$ is a set of links defining the synchronous correspondences between $S$ and $T$ at different internal levels of the two SSTC structures.

A synchronous correspondence link $\ell \in \varphi_{S,T}$ can be of type $\mathop{\ell}\limits_\text{sn}$ or $\mathop{\ell}\limits_\text{st}$.
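The triple and its two link types can be sketched as a small data structure. This is a minimal illustration, not an implementation from [2]: the names `Link`, `SynchronousSSTC` and `links_of_kind` are hypothetical, the two SSTCs are left as opaque values, and interval expressions are kept as plain strings.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class Link:
    """One synchronous correspondence link in phi_{S,T} (hypothetical name)."""
    kind: str    # "sn" (SNODE, lexical) or "st" (STREE, structural)
    source: str  # interval expression over S, e.g. "1_2+4_5"
    target: str  # interval expression over T, e.g. "1_2"

    def __post_init__(self):
        # Only the two link types named in the definition are accepted.
        if self.kind not in ("sn", "st"):
            raise ValueError(f"unknown link type: {self.kind!r}")


@dataclass
class SynchronousSSTC:
    """The triple (S, T, phi_{S,T}); S and T are treated as opaque here."""
    S: Any
    T: Any
    links: List[Link] = field(default_factory=list)

    def links_of_kind(self, kind: str) -> List[Link]:
        """Return all links of the given type, in insertion order."""
        return [link for link in self.links if link.kind == kind]


# Usage: two placeholder SSTCs joined by one lexical and one structural link.
pair = SynchronousSSTC(S="source SSTC", T="target SSTC")
pair.links.append(Link("sn", "0_1", "0_1"))
pair.links.append(Link("st", "2_4", "2_4"))
```

Keeping the links in a single list mirrors the definition of $\varphi_{S,T}$ as one set whose members are distinguished only by their type.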

### SNODE Correspondences

$\mathop{\ell}\limits_\text{sn}$ records the synchronous correspondences at the level of nodes in $S$ and $T$ (i.e. lexical correspondences between specified nodes) and normally $\mathop{\ell}\limits_\text{sn} = (X_1, X_2)$ where $X_1$ and $X_2$ are sequences of SNODE correspondences in $\text{co}$, which may be empty.

• More specifically, $\mathop{\ell}\limits_\text{sn}$ is a pair $( \mathop{\ell}\limits_{\text{sn}_S}, \mathop{\ell}\limits_{\text{sn}_T} )$ where $\mathop{\ell}\limits_{\text{sn}_S}$ is from the first SSTC ($S$) and $\mathop{\ell}\limits_{\text{sn}_T}$ is from the second SSTC ($T$).
• $\mathop{\ell}\limits_\text{sn}$ is represented by sets of intervals such that:
  • $\mathop{\ell}\limits_{\text{sn}_S} = \lbrace i_1\_j_1 + \ldots + i_k\_j_k + \ldots + i_p\_j_p \rbrace$ where $i_k\_j_k \in X:\text{SNODE}$ correspondence in $\text{co}$ of $S$
  • $\mathop{\ell}\limits_{\text{sn}_T} = \lbrace i_1\_j_1 + \ldots + i_k\_j_k + \ldots + i_p\_j_p \rbrace$ where $i_k\_j_k \in X:\text{SNODE}$ correspondence in $\text{co}$ of $T$
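The interval notation above can be made concrete with a short sketch. It assumes that an expression such as `1_2+4_5` is written as a plain string, that an empty string stands for an empty side of the link, and that the SNODE intervals recorded in $\text{co}$ of each SSTC are available as sets of `(i, j)` pairs. The function names are hypothetical.

```python
from typing import FrozenSet, Tuple

Interval = Tuple[int, int]


def parse_intervals(expr: str) -> FrozenSet[Interval]:
    """Parse 'i1_j1+...+ip_jp' into a set of (i, j) pairs.

    An empty string yields the empty set, matching the remark that
    either side of a link may be empty.
    """
    intervals = set()
    for part in expr.split("+"):
        part = part.strip()
        if not part:
            continue
        i, j = (int(x) for x in part.split("_"))
        intervals.add((i, j))
    return frozenset(intervals)


def make_snode_link(source_expr: str, target_expr: str,
                    snodes_S: FrozenSet[Interval],
                    snodes_T: FrozenSet[Interval]):
    """Build the pair (l_sn_S, l_sn_T), checking that every interval
    is an SNODE interval of the corresponding SSTC."""
    src = parse_intervals(source_expr)
    tgt = parse_intervals(target_expr)
    if not src <= snodes_S:
        raise ValueError(f"not SNODE intervals of S: {sorted(src - snodes_S)}")
    if not tgt <= snodes_T:
        raise ValueError(f"not SNODE intervals of T: {sorted(tgt - snodes_T)}")
    return src, tgt


# Usage with made-up SNODE sets for a five-word S and a four-word T.
snodes_S = frozenset({(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)})
snodes_T = frozenset({(0, 1), (1, 2), (2, 3), (3, 4)})
link = make_snode_link("1_2+4_5", "1_2", snodes_S, snodes_T)
```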

### STREE Correspondences

Similarly, $\mathop{\ell}\limits_\text{st}$ records the synchronous correspondences at the level of subtrees in $S$ and $T$ (i.e. structural correspondences between specified subtrees) and normally $\mathop{\ell}\limits_\text{st} = (Y_1, Y_2)$ where $Y_1$ and $Y_2$ are sequences of STREE correspondences in $\text{co}$, which may be empty.

More specifically, $\mathop{\ell}\limits_\text{st}$ is a pair $( \mathop{\ell}\limits_{\text{st}_S}, \mathop{\ell}\limits_{\text{st}_T} )$ where $\mathop{\ell}\limits_{\text{st}_S}$ is from the first SSTC ($S$) and $\mathop{\ell}\limits_{\text{st}_T}$ is from the second SSTC ($T$) as defined below:

• $\mathop{\ell}\limits_{\text{st}_S} = \lbrace i_1\_j_1 + \ldots + i_k\_j_k + \ldots + i_p\_j_p \rbrace$ where $i_k\_j_k \in Y:\text{STREE}$ correspondence in $\text{co}$ of $S$, or $(i_k\_j_k) = (i_k\_j_k) - (i_u\_j_v) \quad|\quad i_u \geq i_k , j_v \leq j_k$; i.e. $(i_u\_j_v) \subseteq (i_k\_j_k)$, which corresponds to an incomplete subtree.
• $\mathop{\ell}\limits_{\text{st}_T} = \lbrace i_1\_j_1 + \ldots + i_k\_j_k + \ldots + i_p\_j_p \rbrace$ where $i_k\_j_k \in Y:\text{STREE}$ correspondence in $\text{co}$ of $T$, or $(i_k\_j_k) = (i_k\_j_k) - (i_u\_j_v) \quad|\quad i_u \geq i_k , j_v \leq j_k$; i.e. $(i_u\_j_v) \subseteq (i_k\_j_k)$, which corresponds to an incomplete subtree.
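The incomplete-subtree case can be illustrated with interval arithmetic. The sketch below reads $(i_k\_j_k) - (i_u\_j_v)$ as the part of the outer span left over once a contained span is removed. This reading and the function names are assumptions made for illustration. Containment is checked as $(i_u\_j_v) \subseteq (i_k\_j_k)$.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def is_contained(inner: Interval, outer: Interval) -> bool:
    """True if inner = (i_u, j_v) lies within outer = (i_k, j_k)."""
    (i_u, j_v), (i_k, j_k) = inner, outer
    return i_k <= i_u and j_v <= j_k


def incomplete_subtree_span(outer: Interval, inner: Interval) -> List[Interval]:
    """Return the pieces of `outer` that remain after removing `inner`.

    Raises ValueError when `inner` is not contained in `outer`, since the
    subtraction is only defined for a contained interval.
    """
    if not is_contained(inner, outer):
        raise ValueError(f"{inner} is not contained in {outer}")
    (i_k, j_k), (i_u, j_v) = outer, inner
    remaining = []
    if i_k < i_u:          # material to the left of the removed span
        remaining.append((i_k, i_u))
    if j_v < j_k:          # material to the right of the removed span
        remaining.append((j_v, j_k))
    return remaining


# Usage: removing 2_4 from 0_5 leaves the two outer pieces 0_2 and 4_5.
pieces = incomplete_subtree_span((0, 5), (2, 4))
```

Removing a span from the middle of a subtree leaves two pieces, so an incomplete subtree may itself be discontiguous and is naturally written with the same `+` notation.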

## Bilingual Translation Example Annotation

The figure below shows an S-SSTC representing an annotated English–Malay translation example.

SNODE correspondences: (0_1, 0_1), (1_2+4_5, 1_2), (3_4, 2_3), (2_3, 3_4)

STREE correspondences: (0_5, 0_5), (0_1, 0_1), (2_4, 2_4), (3_4, 3_4)

Figure 1: Example S-SSTC

The SNODE correspondence (0_1, 0_1) states that English ‘He’ corresponds to Malay ‘Dia’, while the SNODE correspondence (1_2+4_5, 1_2) captures the correspondence between the discontiguous ‘picked…up’ and ‘kutip’. The STREE correspondence (2_4, 2_4) captures the correspondence between the English noun phrase ‘the ball’ and Malay noun phrase ‘bola itu’, as well as between the subtrees containing these phrases.
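The correspondences discussed above can be read off mechanically. In the following sketch the two token lists are an assumption: the figure itself is not reproduced here, so the sentences are reconstructed from the quoted words and their intervals. `covered_words` is a hypothetical helper that treats an interval `i_j` as covering tokens `i` to `j-1`.

```python
# Token lists reconstructed from the words quoted in the text (an assumption).
english = ["He", "picked", "the", "ball", "up"]
malay = ["Dia", "kutip", "bola", "itu"]

# SNODE links as (interval expression over S, interval expression over T).
snode_links = [("0_1", "0_1"), ("1_2+4_5", "1_2"), ("3_4", "2_3"), ("2_3", "3_4")]
# The STREE link discussed in the text.
stree_link = ("2_4", "2_4")


def covered_words(tokens, expr):
    """Return the words covered by an expression such as '1_2+4_5'.

    Discontiguous pieces are joined with ' ... ' to show the gap.
    """
    pieces = []
    for part in expr.split("+"):
        i, j = (int(x) for x in part.split("_"))
        pieces.append(" ".join(tokens[i:j]))
    return " ... ".join(pieces)


aligned = [(covered_words(english, s), covered_words(malay, t))
           for s, t in snode_links]
phrase_pair = (covered_words(english, stree_link[0]),
               covered_words(malay, stree_link[1]))

for source_words, target_words in aligned:
    print(f"{source_words!r} <-> {target_words!r}")
print("subtree:", phrase_pair)
```

Under these assumptions the last two SNODE links pair ‘ball’ with ‘bola’ and ‘the’ with ‘itu’, so the word order inside the noun phrase is reversed between the two languages.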

## Handling Non-Standard Phenomena

The S-SSTC annotation framework is flexible and robust in handling non-standard translation phenomena. We describe some example cases, drawn from the problem of using synchronous formalisms to define translations between languages (see [1]).

### Many-to-One Mapping

Figure 1 illustrates a case where the English sentence has non-standard cases of featurisation, crossed dependency and a many-to-one synchronous correspondence in “picks up”. Another case is the reordering of words within phrases, as seen in the phrase “the heavy box” and its corresponding phrase “kotak berat itu” in the target.

[todo]

[todo]

### Inversion of Dominance

[todo]

## Bibliography
1. Shieber, S. (1994). Restricting the Weak Generative Capacity of Synchronous Tree Adjoining Grammar. Computational Intelligence, 10(4): 371-385.
2. Al-Adhaileh, M., Tang, E. K. & Zaharin, Y. (2002). A synchronization structure of SSTC and its applications in machine translation. In COLING 2002 Post-Conference Workshop “Workshop on Machine Translation in Asia”. Taipei, Taiwan.
page revision: 13, last edited: 14 Dec 2012 02:32