The Structured String-Tree Correspondence (SSTC) [1] is a flexible annotation schema, suitable for declaratively describing non-standard linguistic constructions, including non-projective correspondence, scrambling, cross-dependencies, etc.
An SSTC is a general structure that associates strings in a language with arbitrary tree structures, chosen by the annotator as the interpretation structures of the strings. It records the correspondences between a string of terms and its chosen tree representation structure on two levels:
- between tree nodes and (possibly non-contiguous) substrings; and
- between (possibly incomplete) subtrees and (possibly non-contiguous) substrings.
These correspondences are annotated at each tree node $N$ as the SNODE interval $\text{SNODE}(N)$ and STREE interval $\text{STREE}(N)$.
Formal Definition
An SSTC is a triple $(\mathit{st}, \mathit{tr}, \mathit{co})$ where
- $\mathit{st}$ is a string in one language,
- $\mathit{tr}$ is its associated tree structure,
- $\mathit{co}$ is the correspondence between $\mathit{st}$ and $\mathit{tr}$.
$\mathit{co}$ can be encoded on the tree by attaching to each node $N$ in $\mathit{tr}$ two sequences of intervals:
- $\mathit{SNODE}(N)$: the interval(s) of the substring in $\mathit{st}$ that corresponds to the node $N$ in $\mathit{tr}$.
- $\mathit{STREE}(N)$: the interval(s) of the substring in $\mathit{st}$ that corresponds to the subtree in $\mathit{tr}$ having the node $N$ as root.
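The triple and its per-node annotations can be sketched as a small data structure. This is only an illustration: the class names `Node` and `SSTC`, the helper `fmt`, and the tiny dependency-style tree built for "We went to school ." (word-based positions) are assumptions made for the example, not part of the SSTC definition in [1].

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An interval is a pair of positions (start, end) in the string.
Interval = Tuple[int, int]


@dataclass
class Node:
    label: str
    snode: List[Interval]  # substring(s) corresponding to this node
    stree: List[Interval]  # substring(s) corresponding to the subtree rooted here
    children: List["Node"] = field(default_factory=list)


@dataclass
class SSTC:
    st: str   # the string
    tr: Node  # root of the associated tree; co is stored on the nodes


def fmt(intervals: List[Interval]) -> str:
    """Write intervals in the n1_n2+n3_n4 notation."""
    return "+".join(f"{start}_{end}" for start, end in intervals)


# Hypothetical dependency-style analysis, word-based positions:
# 0 We 1 went 2 to 3 school 4 . 5
school = Node("school", snode=[(3, 4)], stree=[(3, 4)])
to = Node("to", snode=[(2, 3)], stree=[(2, 4)], children=[school])
we = Node("We", snode=[(0, 1)], stree=[(0, 1)])
stop = Node(".", snode=[(4, 5)], stree=[(4, 5)])
went = Node("went", snode=[(1, 2)], stree=[(0, 5)], children=[we, to, stop])

sstc = SSTC(st="We went to school .", tr=went)
print(fmt(sstc.tr.snode), fmt(sstc.tr.stree))
```

Storing both interval lists on each node keeps the correspondence declarative: the tree shape and the string positions can be read independently of each other.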
Intervals are written as a minimal list, from left to right: any occurrence of $n_1\_n_2+n_2\_n_3$ is replaced by $n_1\_n_3$. Each $n_i$ is a position between two typographical words (word-based) or, more generally, between two characters (character-based), which handles writing systems without word delimiters such as Chinese, Japanese, Korean, Vietnamese, Thai, Lao, or Khmer.
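The minimal-list convention can be sketched as a small normalisation step. The function name `minimal` is hypothetical, and merging overlapping intervals (not only adjacent ones) is an extra assumption beyond the rule stated above.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def minimal(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals left to right and merge any that touch,
    so that n1_n2+n2_n3 becomes n1_n3."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Touches (or, by assumption, overlaps) the previous interval.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


print(minimal([(2, 3), (1, 2)]))  # adjacent intervals collapse into one
print(minimal([(1, 2), (4, 5)]))  # a gap keeps the intervals separate
```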
For example, an English text might use a word-based interval scheme:
0 We 1 went 2 to 3 school 4 . 5
«school» then has the interval 3_4, and «went to school» has the interval 1_4. The interval of the entire string is 0_5.
On the other hand, a character-based interval scheme would be more suitable for Chinese text:
0 我 1 们 2 上 3 学 4 校 5 。 6
(Compare this to the segmented text “我们 上 学校 。”.)
«我们» then has the interval 0_2, and «上学校» has the interval 2_5. The interval of the entire string is 0_6.
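Both interval schemes from the examples above can be checked with a short sketch. The helper names `word_span` and `char_span` are hypothetical; the word-based version assumes tokens are separated by whitespace, and the character-based version assumes each character is a single Unicode code point.

```python
def word_span(text: str, start: int, end: int) -> str:
    """Substring covered by a word-based interval start_end."""
    return " ".join(text.split()[start:end])


def char_span(text: str, start: int, end: int) -> str:
    """Substring covered by a character-based interval start_end."""
    return text[start:end]


english = "We went to school ."
chinese = "我们上学校。"

print(word_span(english, 3, 4))  # interval 3_4
print(word_span(english, 1, 4))  # interval 1_4
print(char_span(chinese, 0, 2))  # interval 0_2
print(char_span(chinese, 2, 5))  # interval 2_5
```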
In addition, each tree node $N$ may be annotated with further information local to the node (or the subtree rooted at $N$), such as morphological information, category labels, etc.
Freedom of Choice of Tree Structure
SSTC allows the annotator to choose any arbitrary tree representation model to be associated with a string, e.g. phrase structure trees or dependency trees, or other syntagmatic, functional and logical structures, to suit the needs of the task at hand.
Syntagmatic structure: phrase structure tree
Functional structure: dependency tree
Handling Non-Standard Linguistic Phenomena
One important feature of the SSTC is the facility to declaratively specify the correspondence between the string and the associated tree. This is particularly useful for the treatment of certain non-standard linguistic phenomena, such as non-projectivity or inversion of dominance.
Non-projectivity
The sentence “He picked the ball up” contains a discontinuous verb-particle construction «picked … up». With word-based positions (0 He 1 picked 2 the 3 ball 4 up 5), this non-projective construction can be associated with a single tree node with the SNODE interval 1_2+4_5 in the SSTC framework.
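As a rough illustration, the discontinuous SNODE above can be resolved back to the words it covers. The helper names `covered_words` and `is_contiguous` are hypothetical, and the sketch assumes word-based positions over whitespace-separated tokens and an interval list already in minimal form.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def covered_words(text: str, intervals: List[Interval]) -> List[str]:
    """Return the substring covered by each word-based interval."""
    tokens = text.split()
    return [" ".join(tokens[start:end]) for start, end in intervals]


def is_contiguous(intervals: List[Interval]) -> bool:
    """A minimal interval list is contiguous if it has at most one interval."""
    return len(intervals) <= 1


sentence = "He picked the ball up"
snode_pick_up = [(1, 2), (4, 5)]  # SNODE interval 1_2+4_5

print(covered_words(sentence, snode_pick_up))
print(is_contiguous(snode_pick_up))
```

A single node thus carries both parts of «picked … up» without requiring the tree to mirror the word order of the string.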
Ellipses
[todo]