Reading the UDS dataset¶

The most straightforward way to read the Universal Decompositional Semantics (UDS) dataset is to import it.

from decomp import UDSCorpus

uds = UDSCorpus()

This loads a UDSCorpus object uds, which contains all graphs across all splits in the data.

As noted in Quick Start, the first time you do read UDS, it will take several minutes to complete while the dataset is built from the Universal Dependencies English Web Treebank (UD-EWT), which is not shipped with the package (but is downloaded automatically when first creating a corpus instance), and the UDS annotations, which are shipped with the package as package data. Normalized annotations are loaded by default. To load raw annotations, specify "raw" as the argument to the UDSCorpus annotation_format keyword arugment as follows:

from decomp import UDSCorpus

uds = UDSCorpus(annotation_format="raw")

(See Adding annotations below for more detail on annotation types.) Subsequent uses of the corpus will be faster after the initial build, since the built dataset is cached.

Standard splits¶

If you would rather read only the graphs in the training, development, or test split, you can do that by specifying the split parameter of UDSCorpus.

from decomp import UDSCorpus

# read the train split of the UDS corpus
uds_train = UDSCorpus(split='train')

Adding annotations¶

Additional annotations beyond the standard UDS annotations can be added using this method by passing a list of UDSAnnotation objects. These annotations can be added at two levels: the sentence level and the document level. Sentence-level annotations contain attributes of UDSSentenceGraph nodes or edges. Document-level annotations contain attributes for UDSDocumentGraph nodes or edges. Document-level edge annotations may relate nodes associated with different sentences in a document, although they are added as annotations only to the the appropriate UDSDocumentGraph.

Sentence-level and document-level annotations share the same two in-memory representations: RawUDSDataset and NormalizedUDSDataset. The former may have multiple annotations for the same node or edge attribute, while the latter must have only a single annotation. Both are loaded from JSON-formatted files, but differ in the expected format (see the from_json() methods of each class for formatting guidelines). For example, if you have some additional normalized sentence-level annotations in a file new_annotations.json, those can be added to the existing UDS annotations using:

from decomp import NormalizedUDSDataset

# read annotations
new_annotations = [NormalizedUDSDataset.from_json("new_annotations.json")]

# read the train split of the UDS corpus and append new annotations
uds_train_plus = UDSCorpus(split='train', sentence_annotations=new_annotations)

If instead you wished to add raw annotations (and supposing those annotations were still in “new_annotations.json”), you would do the following:

from decomp import RawUDSDataset

# read annotations
new_annotations = [RawUDSDataset.from_json("new_annotations.json")]

# read the train split of the UDS corpus and append new annotations
uds_train_plus = UDSCorpus(split='train', sentence_annotations=new_annotations,
                           annotation_format="raw")

If new_annotations.json contained document-level annotations you would pass new_annotations.json to the constructor keyword argument document_annotations instead of to sentence_annotations. Importantly, these annotations are added in addition to the existing UDS annotations that ship with the toolkit. You do not need to add these manually.

Finally, it should be noted that querying is currently not supported for document-level graphs or for sentence-level graphs containing raw annotations.

Reading from an alternative location¶

If you would like to read the dataset from an alternative location—e.g. if you have serialized the dataset to JSON, using the to_json() instance method—this can be accomplished using UDSCorpus class methods (see Serializing the UDS dataset for more information on serialization). For example, if you serialize uds_train to the files uds-ewt-sentences-train.json (for sentences) and uds-ewt-documents-train.json (for the documents), you can read it back into memory using:

# serialize uds_train to JSON
uds_train.to_json("uds-ewt-sentences-train.json", "uds-ewt-documents-train.json")

# read JSON serialized uds_train
uds_train = UDSCorpus.from_json("uds-ewt-sentences-train.json", "uds-ewt-documents-train.json")

Rebuilding the corpus¶

If you would like to rebuild the corpus from the UD-EWT CoNLL files and some set of JSON-formatted annotation files, you can use the analogous from_conll() class method. Importantly, unlike the standard instance initialization described above, the UDS annotations are not automatically added. For example, if en-ud-train.conllu is in the current working directory and you have already loaded new_annotations as above, a corpus containing only those annotations (without the UDS annotations) can be loaded using:

# read the train split of the UD corpus and append new annotations
uds_train_annotated = UDSCorpus.from_conll("en-ud-train.conllu", sentence_annotations=new_annotations)

This also means that if you only want the semantic graphs as implied by PredPatt (without annotations), you can use the from_conll class method to load them.

# read the train split of the UD corpus
ud_train = UDSCorpus.from_conll("en-ud-train.conllu")

Note that, because PredPatt is used for predicate-argument extraction, only versions of UD-EWT that are compatible with PredPatt can be used here. Version 1.2 is suggested.

Though other serialization formats are available (see Serializing the UDS dataset), these formats are not yet supported for reading.