Universal Dependencies Syntactic Graphs
The syntactic graphs that form the first layer of annotation in the dataset come from gold UD dependency parses provided in the UD-EWT treebank, which contains sentences from the Linguistic Data Consortium’s constituency-parsed EWT. UD-EWT has predefined training (`train`), development (`dev`), and test (`test`) data in corresponding files in CoNLL-U format: `en_ewt-ud-train.conllu`, `en_ewt-ud-dev.conllu`, and `en_ewt-ud-test.conllu`. Henceforth, `SPLIT` ranges over `train`, `dev`, and `test`.
In UDS, each dependency-parsed sentence in UD-EWT is represented as a rooted directed graph (digraph). Each graph’s identifier takes the form `ewt-SPLIT-SENTNUM`, where `SENTNUM` is the ordinal position (1-indexed) of the sentence within `en_ewt-ud-SPLIT.conllu`.
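The naming scheme above can be sketched in a few lines of Python. This is an illustrative sketch rather than the toolkit's own code: the helper names `conllu_filename` and `graph_ids` and the `SAMPLE` text are hypothetical, and the sketch assumes that sentences in a CoNLL-U file are separated by blank lines.

```python
import re

SPLITS = ("train", "dev", "test")

# Hypothetical two-sentence CoNLL-U fragment (columns are tab-separated).
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind\t0\troot\t_\t_\n"
    "\n"
    "# text = Hi\n"
    "1\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\t_\n"
)


def conllu_filename(split: str) -> str:
    """Return the UD-EWT file name for a split (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"en_ewt-ud-{split}.conllu"


def graph_ids(conllu_text: str, split: str) -> list:
    """Return one graph identifier per sentence in a CoNLL-U string."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Assumption: sentence blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", conllu_text.strip()) if b.strip()]
    # SENTNUM is the 1-indexed position of the sentence within the file.
    return [f"ewt-{split}-{sentnum}" for sentnum in range(1, len(blocks) + 1)]


if __name__ == "__main__":
    print(conllu_filename("dev"))
    print(graph_ids(SAMPLE, "dev"))
```

Because `SENTNUM` is positional, the identifiers are only stable as long as the sentence order in the underlying file does not change.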
Each token in a sentence is associated with a node with identifier `ewt-SPLIT-SENTNUM-syntax-TOKNUM`, where `TOKNUM` is the token’s ordinal position within the sentence (1-indexed, following the convention in UD-EWT). At minimum, each node has the following attributes.
- `position` (`int`): the ordinal position (`TOKNUM`) of that node as an integer (again, 1-indexed)
- `domain` (`str`): the subgraph this node is part of (always `syntax`)
- `type` (`str`): the type of the object in the particular domain (always `token`)
- `form` (`str`): the actual token
- `lemma` (`str`): the lemma corresponding to the actual token
- `upos` (`str`): the UD part-of-speech tag
- `xpos` (`str`): the Penn TreeBank part-of-speech tag
- any attribute found in the features column of the CoNLL-U file
For information about the values that `upos`, `xpos`, and the attributes contained in the features column can take on, see the UD Guidelines.
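As a rough illustration of the node attributes listed above, the following sketch maps the token lines of one CoNLL-U sentence to a dictionary of node attributes. The function name `token_nodes` and the sample sentence are hypothetical, and the plain-dictionary layout is an assumption made for illustration. Skipping multiword-token ranges and empty nodes is also an assumption, since the text does not say how they are handled.

```python
# Hypothetical CoNLL-U token lines for one sentence (tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
SAMPLE_SENTENCE = [
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind|Tense=Pres\t0\troot\t_\t_",
]


def token_nodes(sentence_lines, split, sentnum):
    """Map node identifiers to attribute dicts for one sentence."""
    nodes = {}
    for line in sentence_lines:
        # Comment lines and blank lines carry no token information.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id = cols[0]
        # Assumption: skip multiword ranges ("1-2") and empty nodes ("1.1").
        if not tok_id.isdigit():
            continue
        attrs = {
            "position": int(tok_id),  # TOKNUM, 1-indexed
            "domain": "syntax",
            "type": "token",
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "xpos": cols[4],
        }
        # The features column holds "Name=Value" pairs joined by "|",
        # or "_" when no features are present.
        if cols[5] != "_":
            for pair in cols[5].split("|"):
                name, value = pair.split("=", 1)
                attrs[name] = value
        nodes[f"ewt-{split}-{sentnum}-syntax-{tok_id}"] = attrs
    return nodes


if __name__ == "__main__":
    for node_id, attrs in token_nodes(SAMPLE_SENTENCE, "train", 1).items():
        print(node_id, attrs)
```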
Each graph also has a special root node with identifier `ewt-SPLIT-SENTNUM-root-0`. This node always has a `position` attribute set to `0` and `domain` and `type` attributes set to `root`.
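A minimal sketch of the root node described above, using a plain dictionary for the attributes. The helper name `root_node` is hypothetical and the dictionary representation is an assumption made for illustration.

```python
def root_node(split, sentnum):
    """Return the identifier and attributes of a graph's special root node."""
    node_id = f"ewt-{split}-{sentnum}-root-0"
    # The root sits at position 0, before the first (1-indexed) token.
    attrs = {"position": 0, "domain": "root", "type": "root"}
    return node_id, attrs


if __name__ == "__main__":
    print(root_node("test", 12))
```

Since tokens are numbered from 1, position 0 is never taken by a token node, so the root's identifier cannot collide with a token's.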
Edges within the graph represent the grammatical relations (dependencies) annotated in UD-EWT. These dependencies are always represented as directed edges pointing from the head to the dependent. At minimum, each edge has the following attributes.
- `domain` (`str`): the subgraph this edge is part of (always `syntax`)
- `type` (`str`): the type of the object in the particular domain (always `dependency`)
- `deprel` (`str`): the UD dependency relation tag
For information about the values `deprel` can take on, see the UD Guidelines.
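The head-to-dependent edges can be sketched as follows, reading the HEAD and DEPREL columns of each CoNLL-U token line. The function name `dependency_edges` and the sample lines are hypothetical. Mapping a HEAD value of 0 to the special root node is an assumption that the text does not state explicitly, as is skipping multiword-token ranges and empty nodes.

```python
# Hypothetical CoNLL-U token lines for one sentence (tab-separated columns).
SAMPLE_LINES = [
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind|Tense=Pres\t0\troot\t_\t_",
]


def dependency_edges(sentence_lines, split, sentnum):
    """Map (head_id, dependent_id) pairs to edge attribute dicts."""
    prefix = f"ewt-{split}-{sentnum}"
    edges = {}
    for line in sentence_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, head, deprel = cols[0], cols[6], cols[7]
        # Assumption: skip multiword ranges ("1-2") and empty nodes ("1.1").
        if not tok_id.isdigit():
            continue
        dependent_id = f"{prefix}-syntax-{tok_id}"
        # Assumption: HEAD 0 points at the special root node.
        if head == "0":
            head_id = f"{prefix}-root-0"
        else:
            head_id = f"{prefix}-syntax-{head}"
        # Edges are directed from the head to the dependent.
        edges[(head_id, dependent_id)] = {
            "domain": "syntax",
            "type": "dependency",
            "deprel": deprel,
        }
    return edges


if __name__ == "__main__":
    for (head_id, dep_id), attrs in dependency_edges(SAMPLE_LINES, "train", 1).items():
        print(head_id, "->", dep_id, attrs)
```

Keying the edges by `(head_id, dependent_id)` mirrors how a directed graph library would store them, so the result could be passed to one with little change.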