Structured String-Tree Correspondence (SSTC)

The Structured String-Tree Correspondence (SSTC) [1] is a flexible annotation schema, suitable for declaratively describing non-standard linguistic constructions, including non-projective correspondence, scrambling, cross-dependencies, etc.

An SSTC is a general structure that associates strings in a language with arbitrary tree structures, chosen by the annotator as the interpretation structures of the strings. It records the correspondences between a string of terms and its chosen tree representation structure on two levels:

  • between tree nodes and (possibly non-contiguous) substrings; and
  • between (possibly incomplete) subtrees and (possibly non-contiguous) substrings.

These correspondences are annotated at each tree node $N$ as the SNODE interval $\text{SNODE}(N)$ and STREE interval $\text{STREE}(N)$.

Formal Definition

An SSTC is a triple $(\mathit{st}, \mathit{tr}, \mathit{co})$ where

  • $\mathit{st}$ is a string in one language,
  • $\mathit{tr}$ is its associated tree structure,
  • $\mathit{co}$ is the correspondence between $\mathit{st}$ and $\mathit{tr}$.

$\mathit{co}$ can be encoded on the tree by attaching to each node $N$ in $\mathit{tr}$ two sequences of intervals:

  • $\mathit{SNODE}(N)$: the sequence of intervals of the (possibly non-contiguous) substring in $\mathit{st}$ that corresponds to the node $N$ in $\mathit{tr}$.
  • $\mathit{STREE}(N)$: the sequence of intervals of the (possibly non-contiguous) substring in $\mathit{st}$ that corresponds to the subtree in $\mathit{tr}$ having the node $N$ as root.

Intervals are written as a minimal list, from left to right: any occurrence of $n_1\_n_2+n_2\_n_3$ is replaced by $n_1\_n_3$. Each $n_i$ is a position between two typographical words (word-based) or, more generally, between two characters (character-based); the latter handles writing systems without word delimiters, such as Chinese, Japanese, Korean, Vietnamese, Thai, Lao, or Khmer.
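The minimal-list rule can be sketched in Python as follows. This is only an illustration: the function names (parse_intervals, minimize, format_intervals) are hypothetical, and merging overlapping intervals as well as adjacent ones is an assumption that goes slightly beyond the rule stated above.

```python
def parse_intervals(text):
    """Parse a string such as '1_2+4_5' into [(1, 2), (4, 5)]."""
    pairs = []
    for part in text.split("+"):
        start, end = part.split("_")
        pairs.append((int(start), int(end)))
    return pairs


def minimize(intervals):
    """Order intervals left to right and merge those that touch.

    Assumption: overlapping intervals are merged too, not only
    adjacent ones of the form n1_n2+n2_n3.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def format_intervals(intervals):
    """Write [(1, 2), (4, 5)] back as '1_2+4_5'."""
    return "+".join(f"{start}_{end}" for start, end in intervals)


print(format_intervals(minimize(parse_intervals("1_2+2_4"))))  # 1_4
print(format_intervals(minimize(parse_intervals("4_5+1_2"))))  # 1_2+4_5
```

Intervals that do not touch are left separate, which is what allows a single node to refer to a non-contiguous substring.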

For example, an English text might use a word-based interval scheme:

0 We 1 went 2 to 3 school 4 . 5

«school» then has the interval 3_4, and «went to school» has the interval 1_4. The interval of the entire string is 0_5.
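Because positions sit between words, a word-based interval behaves like a half-open slice over the token list. The short Python sketch below illustrates this with the sentence above; the helper name words is hypothetical and whitespace tokenisation is assumed.

```python
# Tokens of the example sentence; position i lies just before tokens[i].
tokens = "We went to school .".split()


def words(tokens, start, end):
    """Return the words lying between positions start and end."""
    return " ".join(tokens[start:end])


print(words(tokens, 3, 4))            # school
print(words(tokens, 1, 4))            # went to school
print(words(tokens, 0, len(tokens)))  # We went to school .
```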

On the other hand, a character-based interval scheme would be more suitable for Chinese text:

0 我 1 们 2 上 3 学 4 校 5 。 6

(Compare this to the segmented text “我们 上 学校 。”.)

«我们» then has the interval 0_2, and «上学校» has interval 2_5. The interval of the entire string is 0_6.
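As a rough illustration, character-based intervals can be derived from a segmented string by accumulating segment lengths. The function name segment_intervals is hypothetical, and the sketch assumes that segments are separated by spaces and that each character counts as one position.

```python
def segment_intervals(segmented):
    """Map each space-separated segment to its character-based interval."""
    result = []
    position = 0
    for segment in segmented.split():
        result.append((segment, position, position + len(segment)))
        position += len(segment)
    return result


for segment, start, end in segment_intervals("我们 上 学校 。"):
    print(f"{segment}: {start}_{end}")
# 我们: 0_2
# 上: 2_3
# 学校: 3_5
# 。: 5_6
```

The interval 2_5 for «上学校» is then the merge of 2_3 and 3_5.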

In addition, each tree node $N$ may be annotated with further information local to the node (or to the subtree rooted at $N$), such as morphological information, category labels, etc.
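One possible in-memory encoding of the triple, with the correspondence stored on the tree nodes, is sketched below in Python. The class and field names (Node, SSTC, features) are illustrative assumptions, and the small dependency analysis of "We went to school ." is a hypothetical example rather than an annotation taken from [1].

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


@dataclass
class Node:
    label: str
    snode: List[Interval]   # substring corresponding to this node
    stree: List[Interval]   # substring corresponding to the subtree rooted here
    features: Dict[str, str] = field(default_factory=dict)  # node-local information
    children: List["Node"] = field(default_factory=list)


@dataclass
class SSTC:
    st: List[str]   # the string, as word-based tokens
    tr: Node        # root of the tree; co is carried by snode/stree on each node

    def text(self, intervals):
        """Return the words of st covered by a list of intervals."""
        return " ".join(w for start, end in intervals for w in self.st[start:end])


# Hypothetical dependency analysis, for illustration only.
school = Node("school", [(3, 4)], [(3, 4)], {"cat": "N"})
to = Node("to", [(2, 3)], [(2, 4)], {"cat": "P"}, [school])
we = Node("We", [(0, 1)], [(0, 1)], {"cat": "PRON"})
stop = Node(".", [(4, 5)], [(4, 5)], {"cat": "PUNCT"})
went = Node("went", [(1, 2)], [(0, 5)], {"cat": "V"}, [we, to, stop])

sstc = SSTC("We went to school .".split(), went)
print(sstc.text(to.snode))  # to
print(sstc.text(to.stree))  # to school
```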

Freedom of Choice of Tree Structure

SSTC allows the annotator to associate a string with any tree representation model, e.g. phrase structure trees, dependency trees, or other syntagmatic, functional and logical structures, to suit the needs of the task at hand.

sstc-phrase.png
Syntagmatic structure: phrase structure tree

sstc-dep.png
Functional structure: dependency tree

Handling Non-Standard Linguistic Phenomena

One important feature of the SSTC is the facility to declaratively specify the correspondence between the string and the associated tree. This is particularly useful for the treatment of certain non-standard linguistic phenomena, such as non-projectivity or inversion of dominance.

Non-projectivity

sstc-nonprojective.png

The sentence “He picked the ball up” contains a discontiguous verb-particle construction «picked … up». In the SSTC framework, this non-projective construction can be associated with a single tree node with the SNODE interval 1_2+4_5.
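The following Python sketch shows how the SNODE interval 1_2+4_5 picks out both parts of the construction. The helper names covered and is_contiguous are hypothetical, and whitespace tokenisation is assumed.

```python
# 0 He 1 picked 2 the 3 ball 4 up 5
tokens = "He picked the ball up".split()
snode = [(1, 2), (4, 5)]  # SNODE interval 1_2+4_5


def covered(tokens, intervals):
    """Collect the words covered by a list of word-based intervals."""
    return [word for start, end in intervals for word in tokens[start:end]]


def is_contiguous(intervals):
    """A minimal interval list is contiguous only if it has a single interval."""
    return len(intervals) == 1


print(covered(tokens, snode))  # ['picked', 'up']
print(is_contiguous(snode))    # False
```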

Ellipses

[todo]

Bibliography
1. Boitet, C. & Zaharin, Y. (1988). Representation Trees and String-Tree Correspondences. In Proceedings of the 12th International Conference on Computational Linguistics, vol. 1, pp. 59-64.