decomp.semantics.uds.corpus¶

Module for representing UDS corpora with sentence and document collections.

This module provides the UDSCorpus class for managing collections of Universal Decompositional Semantics (UDS) graphs at both sentence and document levels. It includes:

Loading corpora from various formats (CoNLL, JSON)
Managing sentence-level and document-level graphs
Adding annotations to existing graphs
Querying graphs using SPARQL
Serialization and deserialization functionality

The UDSCorpus extends PredPattCorpus to support UDS-specific annotations and document-level semantic relationships.

class UDSCorpus[source]¶

Bases: PredPattCorpus

A collection of Universal Decompositional Semantics graphs.

Parameters:

sentences (PredPattCorpus | dict[str, UDSSentenceGraph] | None, default: None) – the PredPatt sentence graphs to associate the annotations with
documents (dict[str, UDSDocument] | None, default: None) – the documents associated with the PredPatt sentence graphs
sentence_annotations (list[UDSAnnotation] | None, default: None) – additional annotations to associate with PredPatt nodes on sentence-level graphs; in most cases, no such annotations need to be passed, since the standard UDS annotations are loaded automatically
document_annotations (list[UDSAnnotation] | None, default: None) – additional annotations to associate with PredPatt nodes on document-level graphs
version (str, default: '2.0') – the version of UDS datasets to use
split (str | None, default: None) – the split to load: “train”, “dev”, or “test”
annotation_format (str, default: 'normalized') – which annotation type to load (“raw” or “normalized”)
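
As a minimal sketch of how these constructor arguments fit together, the helper below checks split and annotation_format against the values listed above before building the corpus. The names check_corpus_args and load_split are hypothetical, and the import path from decomp import UDSCorpus is an assumption about the package layout. Construction is deferred to call time because it may read or fetch the packaged annotations.

```python
VALID_SPLITS = {None, "train", "dev", "test"}
VALID_FORMATS = {"raw", "normalized"}


def check_corpus_args(split=None, annotation_format="normalized"):
    """Raise ValueError if an argument is outside the documented values."""
    if split not in VALID_SPLITS:
        raise ValueError(
            f"split must be 'train', 'dev', 'test', or None; got {split!r}"
        )
    if annotation_format not in VALID_FORMATS:
        raise ValueError(
            f"annotation_format must be 'raw' or 'normalized'; "
            f"got {annotation_format!r}"
        )
    return {"split": split, "annotation_format": annotation_format}


def load_split(split=None, annotation_format="normalized", version="2.0"):
    """Build a UDSCorpus for one split (hypothetical convenience wrapper)."""
    kwargs = check_corpus_args(split, annotation_format)
    # Imported lazily so the validation above works without decomp installed.
    from decomp import UDSCorpus  # assumed import path

    return UDSCorpus(version=version, **kwargs)
```

Calling load_split("dev") would then be equivalent to UDSCorpus(version="2.0", split="dev", annotation_format="normalized").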

UD_URL = 'https://github.com/UniversalDependencies/UD_English-EWT/archive/r1.2.zip'¶

ANN_DIR = '/home/docs/checkouts/readthedocs.org/user_builds/decomp/checkouts/stable/decomp/data/'¶

CACHE_DIR = '/home/docs/checkouts/readthedocs.org/user_builds/decomp/checkouts/stable/decomp/data/'¶

__init__(sentences=None, documents=None, sentence_annotations=None, document_annotations=None, version='2.0', split=None, annotation_format='normalized')[source]¶

classmethod from_conll_and_annotations(corpus, sentence_annotations=[], document_annotations=[], annotation_format='normalized', version='2.0', name='ewt')[source]¶

Load UDS graph corpus from CoNLL (dependencies) and JSON (annotations).

This method should only be used if the UDS corpus is being (re)built. Otherwise, loading the corpus from the JSON shipped with this package using UDSCorpus.__init__ or UDSCorpus.from_json is suggested.

Parameters:

corpus (TypeAliasType) – (path to) Universal Dependencies corpus in CoNLL-U format
sentence_annotations (Sequence[TypeAliasType], default: []) – a list of paths to JSON files or open JSON files containing sentence-level annotations
document_annotations (Sequence[TypeAliasType], default: []) – a list of paths to JSON files or open JSON files containing document-level annotations
annotation_format (str, default: 'normalized') – whether the annotations are “raw” or “normalized”
version (str, default: '2.0') – the version of UDS datasets to use
name (str, default: 'ewt') – corpus name to be prepended to graph ids

Return type:

UDSCorpus

classmethod from_json(sentences_jsonfile, documents_jsonfile)[source]¶

Load an annotated UDS graph corpus from JSON.

This is the suggested method for loading the UDS corpus.

Parameters:

sentences_jsonfile (TypeAliasType) – file containing Universal Decompositional Semantics corpus sentence-level graphs in JSON format
documents_jsonfile (TypeAliasType) – file containing Universal Decompositional Semantics corpus document-level graphs in JSON format

Return type:

UDSCorpus
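
A small sketch of calling from_json with a pre-flight check on both files. The helper name load_from_json is hypothetical, and it assumes that from_json accepts path strings (the parameter types above are rendered only as TypeAliasType, so this is not confirmed here).

```python
from pathlib import Path


def load_from_json(sentences_jsonfile, documents_jsonfile):
    """Load a corpus from two JSON files, failing early if either is missing."""
    paths = [Path(sentences_jsonfile), Path(documents_jsonfile)]
    for path in paths:
        if not path.is_file():
            raise FileNotFoundError(f"no such JSON file: {path}")
    # Imported lazily so the path check works without decomp installed.
    from decomp import UDSCorpus  # assumed import path

    # Assumption: from_json accepts file paths given as strings.
    return UDSCorpus.from_json(str(paths[0]), str(paths[1]))
```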

add_corpus_metadata(metadata)[source]¶

Add metadata to the corpus.

Parameters: metadata (UDSCorpusMetadata) – Metadata to merge with existing corpus metadata
Return type: None

add_annotation(sentence_annotation=None, document_annotation=None)[source]¶

Add annotations to UDS sentence and document graphs.

Parameters:

sentence_annotation (list[UDSAnnotation] | None, default: None) – the annotations to add to the sentence graphs in the corpus
document_annotation (list[UDSAnnotation] | None, default: None) – the annotations to add to the document graphs in the corpus

Return type:

None

add_sentence_annotation(annotation)[source]¶

Add annotations to UDS sentence graphs.

Parameters: annotation (UDSAnnotation) – the annotations to add to the graphs in the corpus
Return type: None

add_document_annotation(annotation)[source]¶

Add annotations to UDS documents.

Parameters: annotation (UDSAnnotation) – the annotations to add to the documents in the corpus
Return type: None

to_json(sentences_outfile=None, documents_outfile=None)[source]¶

Serialize the corpus to JSON.

Parameters:

sentences_outfile (TypeAliasType | None, default: None) – file to serialize sentence-level graphs to
documents_outfile (TypeAliasType | None, default: None) – file to serialize document-level graphs to

Return type:

str | None
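
A minimal sketch of writing both graph levels to disk with to_json. The helper save_corpus and its file naming scheme are hypothetical, and passing path strings for the two outfile arguments is an assumption, since their type is rendered above only as TypeAliasType.

```python
from pathlib import Path


def save_corpus(corpus, directory, stem="uds"):
    """Write sentence- and document-level graphs to two JSON files."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    sentences_path = directory / f"{stem}-sentences.json"
    documents_path = directory / f"{stem}-documents.json"
    # Assumption: to_json accepts file paths given as strings.
    corpus.to_json(
        sentences_outfile=str(sentences_path),
        documents_outfile=str(documents_path),
    )
    return sentences_path, documents_path
```

The two returned paths can later be handed to UDSCorpus.from_json to reload the corpus.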

query(query, query_type=None, cache_query=True, cache_rdf=True)[source]¶

Query all graphs in the corpus using SPARQL 1.1.

Parameters:

query (str | Query) – a SPARQL 1.1 query
query_type (str | None, default: None) – whether this is a ‘node’ query or an ‘edge’ query. If set to None (default), a Results object will be returned. The main reason to use this option is to automatically format the output of a custom query, since Results objects require additional postprocessing.
cache_query (bool, default: True) – whether to cache the query. This should usually be set to True. It should generally only be False when querying particular nodes or edges, e.g. in precompiled queries.
cache_rdf (bool, default: True) – whether to cache the RDF constructed for querying against. Setting this to False saves a lot of memory but slows down future queries.

Return type:

dict[str, Result | dict[str, TypeAliasType] | dict[TypeAliasType, TypeAliasType]]
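
A hedged sketch of a corpus-wide node query. The SPARQL predicate names (domain, type, factual) are assumptions about how UDS attributes are exposed in the RDF and may need adjusting; the helper graphs_with_matches is hypothetical. It relies only on the query signature documented above and on the result being a mapping from graph ids to per-graph results.

```python
# Illustrative SPARQL 1.1 query; the predicate names are assumptions and
# may not match the RDF produced for a given UDS version.
FACTUAL_PREDICATES = """
SELECT ?pred
WHERE { ?pred <domain> <semantics> ;
              <type> <predicate> ;
              <factual> ?factual
        FILTER ( ?factual > 0 )
      }
"""


def graphs_with_matches(corpus, querystr, query_type="node"):
    """Return only the graphs whose formatted query result is non-empty."""
    if query_type not in ("node", "edge"):
        # With query_type=None the values are raw result objects, which
        # this helper does not try to interpret.
        raise ValueError("query_type must be 'node' or 'edge'")
    # cache_rdf=False trades query speed for lower memory use.
    results = corpus.query(querystr, query_type=query_type, cache_rdf=False)
    return {graph_id: hits for graph_id, hits in results.items() if hits}
```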

property documents: dict[str, UDSDocument]¶

The documents in the corpus.

Returns: Mapping from document IDs to UDSDocument objects
Return type: dict[str, UDSDocument]

property documentids: list[str]¶

The document IDs in the corpus.

Returns: List of all document identifiers
Return type: list[str]

property ndocuments: int¶

The number of documents in the corpus.

Returns: Total document count
Return type: int

sample_documents(k)[source]¶

Sample k documents without replacement.

Parameters: k (int) – the number of documents to sample
Return type: dict[str, UDSDocument]
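
To make "without replacement" concrete, here is a standard-library sketch of the same idea over any ID-to-document mapping. It is an illustration under that assumption, not the library's implementation; the name sample_mapping and the seed parameter are hypothetical.

```python
import random


def sample_mapping(documents, k, seed=None):
    """Return k distinct entries of a mapping, chosen uniformly at random."""
    rng = random.Random(seed)
    # random.sample never repeats an element and raises ValueError
    # when k exceeds the number of available keys.
    chosen = rng.sample(list(documents), k)
    return {doc_id: documents[doc_id] for doc_id in chosen}
```

Because each key can be drawn at most once, the result always has exactly k entries, and asking for more documents than exist is an error rather than a silent truncation.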

property metadata: UDSCorpusMetadata¶

The corpus metadata.

Returns: Metadata for sentence and document annotations
Return type: UDSCorpusMetadata

property sentence_node_subspaces: set[str]¶

The UDS sentence node subspaces in the corpus.

Returns: Set of subspace names for sentence nodes
Return type: set[str]
Raises: NotImplementedError – This property is not yet implemented

property sentence_edge_subspaces: set[str]¶

The UDS sentence edge subspaces in the corpus.

Returns: Set of subspace names for sentence edges
Return type: set[str]
Raises: NotImplementedError – This property is not yet implemented

property sentence_subspaces: set[str]¶

All UDS sentence subspaces (node and edge) in the corpus.

Returns: Union of sentence node and edge subspaces
Return type: set[str]

property document_node_subspaces: set[str]¶

The UDS document node subspaces in the corpus.

Returns: Set of subspace names for document nodes
Return type: set[str]
Raises: NotImplementedError – This property is not yet implemented

property document_edge_subspaces: set[str]¶

The UDS document edge subspaces in the corpus.

Returns: Set of subspace names for document edges
Return type: set[str]

property document_subspaces: set[str]¶

All UDS document subspaces (node and edge) in the corpus.

Returns: Union of document node and edge subspaces
Return type: set[str]
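
Since several of the node and edge subspace properties are documented as raising NotImplementedError, a caller may want to read them defensively. The sketch below does that for either level; the helper name available_subspaces is hypothetical, and it relies only on the property names listed in this section.

```python
def available_subspaces(corpus, level="sentence"):
    """Collect node and edge subspaces, tolerating unimplemented properties."""
    if level not in ("sentence", "document"):
        raise ValueError("level must be 'sentence' or 'document'")
    found = {}
    for kind in ("node", "edge"):
        name = f"{level}_{kind}_subspaces"
        try:
            found[kind] = set(getattr(corpus, name))
        except NotImplementedError:
            # Documented as possibly unimplemented; record that explicitly.
            found[kind] = None
    return found
```

A value of None marks a property that is not implemented, which keeps it distinct from an implemented property that returns an empty set.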

sentence_properties(subspace=None)[source]¶

Return the properties in a sentence subspace.

Parameters: subspace (str | None, default: None) – Subspace to query, or None for all properties
Returns: Property names in the subspace
Return type: set[str]
Raises: NotImplementedError – This method is not yet implemented

sentence_property_metadata(subspace, prop)[source]¶

Return the metadata for a property in a sentence subspace.

Parameters:

subspace (str) – The subspace the property is in
prop (str) – The property in the subspace

Return type:

UDSPropertyMetadata

document_properties(subspace=None)[source]¶

Return the properties in a document subspace.

Parameters: subspace (str | None, default: None) – Subspace to query, or None for all properties
Returns: Property names in the subspace
Return type: set[str]
Raises: NotImplementedError – This method is not yet implemented

document_property_metadata(subspace, prop)[source]¶

Return the metadata for a property in a document subspace.

Parameters:

subspace (str) – The subspace the property is in
prop (str) – The property in the subspace

Return type:

UDSPropertyMetadata
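
The two metadata lookups share a signature, so a thin dispatcher can select between them by level. This is a sketch only; lookup_property_metadata is a hypothetical helper built on the two methods documented above.

```python
def lookup_property_metadata(corpus, subspace, prop, level="sentence"):
    """Fetch property metadata from the sentence- or document-level method."""
    if level == "sentence":
        return corpus.sentence_property_metadata(subspace, prop)
    if level == "document":
        return corpus.document_property_metadata(subspace, prop)
    raise ValueError("level must be 'sentence' or 'document'")
```

For example, lookup_property_metadata(corpus, "factuality", "factual") would request the sentence-level metadata for that property, assuming a subspace and property with those names exist in the loaded corpus.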