Quick Start

To read the Universal Decompositional Semantics (UDS) dataset, use:

from decomp import UDSCorpus

uds = UDSCorpus()

This creates a UDSCorpus object, uds, which contains all graphs across all splits in the data. If you would like a different corpus, for example one containing only a particular split, see the other loading options in Reading the UDS dataset.

The first time you read UDS, it will take several minutes while the dataset is built from two sources: the Universal Dependencies English Web Treebank, which is not shipped with the package (but is downloaded automatically on import in the background), and the UDS annotations, which are shipped with the package. Subsequent uses will be faster, since the dataset is cached on build.

UDSSentenceGraph objects in the corpus can be accessed using standard dictionary getters or iteration. For instance, to get the UDS graph corresponding to the 12th sentence in en-ud-train.conllu, you can use:

uds["ewt-train-12"]

To access documents (UDSDocument objects, each of which has an associated UDSDocumentGraph), you can use:

uds.documents["reviews-112579"]

To get the associated document graph, use:

uds.documents["reviews-112579"].document_graph

More generally, UDSCorpus objects behave like dictionaries. For example, to print all the sentence-level graph identifiers in the corpus (e.g. "ewt-train-12"), you can use:

for graphid in uds:
    print(graphid)

To print all the document identifiers in the corpus, which correspond directly to English Web Treebank file IDs (e.g. "reviews-112579"), you can use:

for documentid in uds.documents:
    print(documentid)

Similarly, to print all the sentence-level graph identifiers in the corpus (e.g. "ewt-train-12") along with the corresponding sentence, you can use:

for graphid, graph in uds.items():
    print(graphid)
    print(graph.sentence)

Likewise, the following will print all document identifiers, along with each document’s entire text:

for documentid, document in uds.documents.items():
    print(documentid)
    print(document.text)
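Because the corpus behaves like a dictionary, ordinary dictionary idioms should carry over. The sketch below is illustrative only: sentences_with_prefix is a hypothetical helper, not part of decomp, and it assumes that graph identifiers begin with a split prefix such as "ewt-train-", as the examples above suggest. It runs against invented stand-in objects rather than the real corpus, so it needs no download.

```python
from types import SimpleNamespace


def sentences_with_prefix(corpus, prefix):
    """Map graph identifier -> sentence text for identifiers starting with prefix.

    `corpus` can be any mapping whose values expose a `sentence` attribute.
    """
    return {
        graphid: graph.sentence
        for graphid, graph in corpus.items()
        if graphid.startswith(prefix)
    }


# Invented stand-in data; the identifiers and sentences are made up.
# "ewt-dev-3" assumes other splits follow the same naming pattern.
toy_corpus = {
    "ewt-train-12": SimpleNamespace(sentence="A made-up training sentence."),
    "ewt-dev-3": SimpleNamespace(sentence="A made-up development sentence."),
}

train_sentences = sentences_with_prefix(toy_corpus, "ewt-train-")
```

If the prefix assumption holds, passing uds in place of toy_corpus would select the training sentences. The loading options in Reading the UDS dataset are the supported way to restrict a corpus to one split.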

A list of sentence-level graph identifiers can also be accessed via the graphids attribute of the UDSCorpus. A mapping from these identifiers to the corresponding graphs can be accessed via the graphs attribute.

# a list of the sentence-level graph identifiers in the corpus
uds.graphids

# a dictionary mapping the sentence-level
# graph identifiers to the corresponding graph
uds.graphs

A list of document identifiers can also be accessed via the document_ids attribute of the UDSCorpus:

uds.document_ids

Sentence-level graphs provide various instance attributes and methods for accessing their nodes, edges, and attributes. For example, to get a dictionary mapping identifiers for syntax nodes in a sentence-level graph to their attributes, you can use:

uds["ewt-train-12"].syntax_nodes

To get a dictionary mapping identifiers for semantics nodes in the UDS graph to their attributes, you can use:

uds["ewt-train-12"].semantics_nodes

To get a dictionary mapping identifiers for semantics edges (tuples of node identifiers) in the UDS graph to their attributes, you can use:

uds["ewt-train-12"].semantics_edges()

To get a dictionary mapping identifiers for semantics edges (tuples of node identifiers) in the UDS graph involving the predicate headed by the 7th token to their attributes, you can use:

uds["ewt-train-12"].semantics_edges('ewt-train-12-semantics-pred-7')

To get a dictionary mapping identifiers for syntax edges (tuples of node identifiers) in the UDS graph to their attributes, you can use:

uds["ewt-train-12"].syntax_edges()

And to get a dictionary mapping identifiers for syntax edges (tuples of node identifiers) in the UDS graph involving the node for the 7th token to their attributes, you can use:

uds["ewt-train-12"].syntax_edges('ewt-train-12-syntax-7')
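The node identifiers in the calls above appear to combine the graph identifier, the domain, and the token position. As a minimal sketch, the two hypothetical helpers below (they are not part of decomp) build identifiers of that shape. The pattern is inferred only from the two examples shown here, so treat it as an assumption rather than a documented scheme.

```python
def syntax_node_id(graphid, position):
    """Build a syntax node identifier such as 'ewt-train-12-syntax-7'."""
    return f"{graphid}-syntax-{position}"


def predicate_node_id(graphid, position):
    """Build a predicate node identifier such as 'ewt-train-12-semantics-pred-7'."""
    return f"{graphid}-semantics-pred-{position}"


graphid = "ewt-train-12"
syntax_id = syntax_node_id(graphid, 7)
predicate_id = predicate_node_id(graphid, 7)
```

These strings could then be passed to syntax_edges and semantics_edges in place of the hard-coded identifiers.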

There are also methods for accessing relationships between semantics and syntax nodes. For example, to get a tuple containing the ordinal position of the head syntax node for the predicate headed by the 7th token in the corresponding sentence, together with a list of the form and lemma attributes for that token, you can use:

uds["ewt-train-12"].head('ewt-train-12-semantics-pred-7', ['form', 'lemma'])

And if you want the same information for every token in the span, you can use:

uds["ewt-train-12"].span('ewt-train-12-semantics-pred-7', ['form', 'lemma'])

This will return a dictionary mapping the ordinal position of each syntax node that makes up the predicate headed by the 7th token in the corresponding sentence to a list of the form and lemma attributes for the corresponding token.
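A dictionary of this shape can be turned back into surface text by sorting on the ordinal positions. The sketch below assumes the return shape described above (position mapped to a [form, lemma] list). The sample values are invented for illustration and are not taken from the corpus, and span_text is a hypothetical helper.

```python
def span_text(span, attribute_index=0):
    """Join one attribute of each token in a span, ordered by position.

    attribute_index 0 selects the form and 1 selects the lemma, matching
    the order ['form', 'lemma'] requested in the call to span.
    """
    return " ".join(attrs[attribute_index] for _, attrs in sorted(span.items()))


# Invented example of the shape described above: position -> [form, lemma]
example_span = {8: ["running", "run"], 6: ["was", "be"], 7: ["not", "not"]}

forms = span_text(example_span)
lemmas = span_text(example_span, attribute_index=1)
```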

More complicated queries of a sentence-level UDS graph can be performed using the query method, which accepts arbitrary SPARQL 1.1 queries. See Querying UDS Graphs for details.

Queries on document-level graphs are not currently supported. However, each UDSDocument does contain a number of useful attributes, including its genre (corresponding to the English Web Treebank subcorpus); its text (as demonstrated above); its timestamp; the sentence_ids of its constituent sentences; and the sentence-level graphs (sentence_graphs) associated with those sentences. One can also look up the semantics node associated with a particular node in the document graph via the semantics_node instance method.
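Since document-level queries are not supported, simple aggregation over document attributes has to be done in plain Python. As one possible sketch, the hypothetical helper below groups document identifiers by the genre attribute. It is demonstrated on invented stand-in objects, and the genre strings and the second identifier are made up.

```python
from collections import defaultdict
from types import SimpleNamespace


def document_ids_by_genre(documents):
    """Group document identifiers by each document's `genre` attribute."""
    grouped = defaultdict(list)
    for documentid, document in documents.items():
        grouped[document.genre].append(documentid)
    return dict(grouped)


# Invented stand-ins for UDSDocument objects; only `genre` is modeled.
toy_documents = {
    "reviews-112579": SimpleNamespace(genre="reviews"),
    "example-000001": SimpleNamespace(genre="example"),
}

by_genre = document_ids_by_genre(toy_documents)
```

Under the dictionary-like behavior described earlier, passing uds.documents instead of toy_documents should work the same way.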

Lastly, iterables for the nodes and edges of a document-level graph may be accessed as follows:

uds.documents["reviews-112579"].document_graph.nodes
uds.documents["reviews-112579"].document_graph.edges

Unlike the nodes and edges in a sentence-level graph, the ones in a document-level graph all share a common (document) domain. By default, document graphs are initialized without edges and with one node for each semantics node in the sentence-level graphs associated with the constituent sentences. Edges may be added by supplying annotations (see Reading the UDS dataset).