decomp.semantics.uds.corpus

Module for representing UDS corpora with sentence and document collections.

This module provides the UDSCorpus class for managing collections of Universal Decompositional Semantics (UDS) graphs at both sentence and document levels. It includes:

  • Loading corpora from various formats (CoNLL, JSON)

  • Managing sentence-level and document-level graphs

  • Adding annotations to existing graphs

  • Querying graphs using SPARQL

  • Serialization and deserialization functionality

The UDSCorpus extends PredPattCorpus to support UDS-specific annotations and document-level semantic relationships.

class UDSCorpus[source]

Bases: PredPattCorpus

A collection of Universal Decompositional Semantics graphs.

Parameters:
  • sentences (PredPattCorpus | dict[str, UDSSentenceGraph] | None, default: None) – the predpatt sentence graphs to associate the annotations with

  • documents (dict[str, UDSDocument] | None, default: None) – the documents associated with the predpatt sentence graphs

  • sentence_annotations (list[UDSAnnotation] | None, default: None) – additional annotations to associate with predpatt nodes on sentence-level graphs; in most cases, no such annotations will be passed, since the standard UDS annotations are automatically loaded

  • document_annotations (list[UDSAnnotation] | None, default: None) – additional annotations to associate with predpatt nodes on document-level graphs

  • version (str, default: '2.0') – the version of UDS datasets to use

  • split (str | None, default: None) – the split to load: “train”, “dev”, or “test”

  • annotation_format (str, default: 'normalized') – which annotation type to load (“raw” or “normalized”)
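As a minimal sketch of how these parameters combine, the helpers below validate the documented `split` and `annotation_format` values before passing them to the constructor. The names `corpus_kwargs` and `load_corpus` are hypothetical, and the import path `from decomp import UDSCorpus` is an assumption; the import is deferred so that defining the helpers does not load any data.

```python
# Values taken from the parameter descriptions above.
VALID_SPLITS = {None, "train", "dev", "test"}
VALID_FORMATS = {"raw", "normalized"}


def corpus_kwargs(split=None, annotation_format="normalized", version="2.0"):
    """Build keyword arguments for UDSCorpus, rejecting undocumented values."""
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if annotation_format not in VALID_FORMATS:
        raise ValueError(f"unknown annotation_format: {annotation_format!r}")
    return {
        "split": split,
        "annotation_format": annotation_format,
        "version": version,
    }


def load_corpus(**overrides):
    """Instantiate the corpus (hypothetical wrapper around the constructor)."""
    from decomp import UDSCorpus  # assumed import path; deferred on purpose

    return UDSCorpus(**corpus_kwargs(**overrides))
```

Calling `load_corpus(split="dev")` would then request only the dev split with normalized annotations, while a misspelled split fails early with a `ValueError`.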

UD_URL = 'https://github.com/UniversalDependencies/UD_English-EWT/archive/r1.2.zip'
ANN_DIR = '/home/docs/checkouts/readthedocs.org/user_builds/decomp/checkouts/stable/decomp/data/'
CACHE_DIR = '/home/docs/checkouts/readthedocs.org/user_builds/decomp/checkouts/stable/decomp/data/'
__init__(sentences=None, documents=None, sentence_annotations=None, document_annotations=None, version='2.0', split=None, annotation_format='normalized')[source]
classmethod from_conll_and_annotations(corpus, sentence_annotations=[], document_annotations=[], annotation_format='normalized', version='2.0', name='ewt')[source]

Load UDS graph corpus from CoNLL (dependencies) and JSON (annotations).

This method should only be used if the UDS corpus is being (re)built. Otherwise, loading the corpus from the JSON shipped with this package using UDSCorpus.__init__ or UDSCorpus.from_json is suggested.

Parameters:
  • corpus (TypeAliasType) – (path to) Universal Dependencies corpus in conllu format

  • sentence_annotations (Sequence[TypeAliasType], default: []) – a list of paths to JSON files or open JSON files containing sentence-level annotations

  • document_annotations (Sequence[TypeAliasType], default: []) – a list of paths to JSON files or open JSON files containing document-level annotations

  • annotation_format (str, default: 'normalized') – Whether the annotation is raw or normalized

  • version (str, default: '2.0') – the version of UDS datasets to use

  • name (str, default: 'ewt') – corpus name to be prepended to graph ids

Return type:

UDSCorpus

classmethod from_json(sentences_jsonfile, documents_jsonfile)[source]

Load annotated UDS graph corpus (including annotations) from JSON.

This is the suggested method for loading the UDS corpus.

Parameters:
  • sentences_jsonfile (TypeAliasType) – file containing Universal Decompositional Semantics corpus sentence-level graphs in JSON format

  • documents_jsonfile (TypeAliasType) – file containing Universal Decompositional Semantics corpus document-level graphs in JSON format

Return type:

UDSCorpus
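A hedged sketch of calling `from_json` follows. It assumes both arguments are given as file paths rather than open file objects, that `UDSCorpus` is importable from `decomp`, and it uses a hypothetical wrapper name, `load_from_json`, which checks that the files exist before delegating.

```python
from pathlib import Path


def load_from_json(sentences_jsonfile, documents_jsonfile):
    """Check that both files exist, then delegate to UDSCorpus.from_json."""
    for path in (sentences_jsonfile, documents_jsonfile):
        if not Path(path).is_file():
            raise FileNotFoundError(f"no such JSON file: {path}")

    from decomp import UDSCorpus  # assumed import path; deferred until needed

    return UDSCorpus.from_json(sentences_jsonfile, documents_jsonfile)
```

Checking the paths up front gives a clear error message before any graph loading starts.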

add_corpus_metadata(metadata)[source]

Add metadata to the corpus.

Parameters:

metadata (UDSCorpusMetadata) – Metadata to merge with existing corpus metadata

Return type:

None

add_annotation(sentence_annotation=None, document_annotation=None)[source]

Add annotations to UDS sentence and document graphs.

Parameters:
  • sentence_annotation (list[UDSAnnotation] | None, default: None) – the annotations to add to the sentence graphs in the corpus

  • document_annotation (list[UDSAnnotation] | None, default: None) – the annotations to add to the document graphs in the corpus

Return type:

None

add_sentence_annotation(annotation)[source]

Add annotations to UDS sentence graphs.

Parameters:

annotation (UDSAnnotation) – the annotations to add to the graphs in the corpus

Return type:

None

add_document_annotation(annotation)[source]

Add annotations to UDS documents.

Parameters:

annotation (UDSAnnotation) – the annotations to add to the documents in the corpus

Return type:

None

to_json(sentences_outfile=None, documents_outfile=None)[source]

Serialize corpus to JSON.

Parameters:
  • sentences_outfile (TypeAliasType | None, default: None) – file to serialize sentence-level graphs to

  • documents_outfile (TypeAliasType | None, default: None) – file to serialize document-level graphs to

Return type:

str | None
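The sketch below shows one way to write both graph levels to a directory. It assumes `to_json` accepts path strings for its two arguments; `dump_corpus` and the file names are hypothetical, and `_StubCorpus` is a stand-in that only mimics the documented signature so the example runs without the real corpus.

```python
import json
import tempfile
from pathlib import Path


def dump_corpus(corpus, out_dir):
    """Serialize sentence- and document-level graphs into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    sentences = out / "sentences.json"  # hypothetical file names
    documents = out / "documents.json"
    corpus.to_json(sentences_outfile=str(sentences),
                   documents_outfile=str(documents))
    return sentences, documents


class _StubCorpus:
    """Hypothetical stand-in; writes empty JSON objects to both targets."""

    def to_json(self, sentences_outfile=None, documents_outfile=None):
        for target in (sentences_outfile, documents_outfile):
            Path(target).write_text(json.dumps({}))


with tempfile.TemporaryDirectory() as tmp:
    written = [p.name for p in dump_corpus(_StubCorpus(), tmp) if p.is_file()]
```

With a real corpus, the two files produced this way could later be passed back to `from_json`.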

query(query, query_type=None, cache_query=True, cache_rdf=True)[source]

Query all graphs in the corpus using SPARQL 1.1.

Parameters:
  • query (str | Query) – a SPARQL 1.1 query

  • query_type (str | None, default: None) – whether this is a ‘node’ query or an ‘edge’ query. If set to None (default), a Result object is returned for each graph. The main reason to set this option is to have the output of a custom query formatted automatically, since Result objects require additional postprocessing.

  • cache_query (bool, default: True) – whether to cache the query. This should usually be True; set it to False mainly when querying particular nodes or edges, e.g. in precompiled queries.

  • cache_rdf (bool, default: True) – whether to keep the RDF constructed for querying against. Setting this to False slows down future queries but saves a lot of memory.

Return type:

dict[str, Result | dict[str, TypeAliasType] | dict[TypeAliasType, TypeAliasType]]
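To illustrate the shape of a corpus-wide query, the sketch below runs a node query and keeps only the graphs that returned matches. The SPARQL property names (`domain`, `type`, `factual`) and the graph ids are illustrative assumptions rather than a guaranteed schema, `graphs_with_hits` is a hypothetical helper, and `_StubCorpus` stands in for a loaded corpus so the example runs on its own.

```python
# Property names in this query are assumptions made for illustration.
FACTUAL_PREDICATES = """
SELECT ?pred
WHERE { ?pred <domain> <semantics> ;
              <type> <predicate> ;
              <factual> ?factual
        FILTER ( ?factual > 0 )
      }
"""


def graphs_with_hits(corpus, query=FACTUAL_PREDICATES):
    """Run a node query over the corpus; keep graphs with at least one match."""
    results = corpus.query(
        query,
        query_type="node",   # ask for formatted node output
        cache_query=True,
        cache_rdf=False,     # trade query speed for lower memory use
    )
    return {graph_id: nodes for graph_id, nodes in results.items() if nodes}


class _StubCorpus:
    """Hypothetical stand-in that mimics the documented query signature."""

    def query(self, query, query_type=None, cache_query=True, cache_rdf=True):
        return {"ewt-train-1": {"node-a": {"factual": 1.2}}, "ewt-train-2": {}}


hits = graphs_with_hits(_StubCorpus())
```

Because the return value is keyed by graph id, filtering out empty entries is a plain dictionary comprehension.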

property documents: dict[str, UDSDocument]

The documents in the corpus.

Returns:

Mapping from document IDs to UDSDocument objects

Return type:

dict[str, UDSDocument]

property documentids: list[str]

The document IDs in the corpus.

Returns:

List of all document identifiers

Return type:

list[str]

property ndocuments: int

The number of documents in the corpus.

Returns:

Total document count

Return type:

int

sample_documents(k)[source]

Sample k documents without replacement.

Parameters:

k (int) – the number of documents to sample

Return type:

dict[str, UDSDocument]
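The document accessors above can be combined as in the following sketch. `preview_documents` is a hypothetical helper; it caps `k` at `ndocuments` on the assumption that requesting more documents than exist would fail, since sampling is without replacement. `_StubCorpus` is a stand-in that mimics the documented attributes.

```python
import random


def preview_documents(corpus, k=3):
    """Return the corpus size and the sorted IDs of up to k sampled documents."""
    k = min(k, corpus.ndocuments)  # cannot sample more documents than exist
    sampled = corpus.sample_documents(k)
    return corpus.ndocuments, sorted(sampled)


class _StubCorpus:
    """Hypothetical stand-in exposing the documented document accessors."""

    def __init__(self, documents):
        self.documents = documents

    @property
    def documentids(self):
        return list(self.documents)

    @property
    def ndocuments(self):
        return len(self.documents)

    def sample_documents(self, k):
        chosen = random.sample(self.documentids, k=k)
        return {doc_id: self.documents[doc_id] for doc_id in chosen}


total, ids = preview_documents(
    _StubCorpus({"doc-a": object(), "doc-b": object()}), k=5
)
```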

property metadata: UDSCorpusMetadata

The corpus metadata.

Returns:

Metadata for sentence and document annotations

Return type:

UDSCorpusMetadata

property sentence_node_subspaces: set[str]

The UDS sentence node subspaces in the corpus.

Returns:

Set of subspace names for sentence nodes

Return type:

set[str]

Raises:

NotImplementedError – This property is not yet implemented

property sentence_edge_subspaces: set[str]

The UDS sentence edge subspaces in the corpus.

Returns:

Set of subspace names for sentence edges

Return type:

set[str]

Raises:

NotImplementedError – This property is not yet implemented

property sentence_subspaces: set[str]

All UDS sentence subspaces (node and edge) in the corpus.

Returns:

Union of sentence node and edge subspaces

Return type:

set[str]

property document_node_subspaces: set[str]

The UDS document node subspaces in the corpus.

Returns:

Set of subspace names for document nodes

Return type:

set[str]

Raises:

NotImplementedError – This property is not yet implemented

property document_edge_subspaces: set[str]

The UDS document edge subspaces in the corpus.

Returns:

Set of subspace names for document edges

Return type:

set[str]

property document_subspaces: set[str]

All UDS document subspaces (node and edge) in the corpus.

Returns:

Union of document node and edge subspaces

Return type:

set[str]
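Since several of the subspace properties are documented as raising NotImplementedError, calling code may want to read them defensively. The sketch below shows one such pattern; `safe_subspaces` is a hypothetical helper, the subspace name `"time"` is only an example value, and `_StubCorpus` is a stand-in that reproduces the documented behaviour.

```python
def safe_subspaces(corpus, attribute):
    """Read a subspace property; return None if it is not implemented."""
    try:
        return set(getattr(corpus, attribute))
    except NotImplementedError:
        return None


class _StubCorpus:
    """Hypothetical stand-in: one property raises, the other returns a set."""

    @property
    def sentence_node_subspaces(self):
        raise NotImplementedError

    @property
    def document_edge_subspaces(self):
        return {"time"}  # example value only


stub = _StubCorpus()
missing = safe_subspaces(stub, "sentence_node_subspaces")
present = safe_subspaces(stub, "document_edge_subspaces")
```

Returning `None` rather than an empty set keeps "not implemented" distinguishable from "implemented but empty".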

sentence_properties(subspace=None)[source]

Return the properties in a sentence subspace.

Parameters:

subspace (str | None, optional) – Subspace to query, or None for all properties

Returns:

Property names in the subspace

Return type:

set[str]

Raises:

NotImplementedError – This method is not yet implemented

sentence_property_metadata(subspace, prop)[source]

Return the metadata for a property in a sentence subspace.

Parameters:
  • subspace (str) – The subspace the property is in

  • prop (str) – The property in the subspace

Return type:

UDSPropertyMetadata

document_properties(subspace=None)[source]

Return the properties in a document subspace.

Parameters:

subspace (str | None, optional) – Subspace to query, or None for all properties

Returns:

Property names in the subspace

Return type:

set[str]

Raises:

NotImplementedError – This method is not yet implemented

document_property_metadata(subspace, prop)[source]

Return the metadata for a property in a document subspace.

Parameters:
  • subspace (str) – The subspace the property is in

  • prop (str) – The property in the subspace

Return type:

UDSPropertyMetadata