decomp.semantics.uds.corpus
Module for representing UDS corpora with sentence and document collections.
This module provides the UDSCorpus class for managing collections of Universal Decompositional Semantics (UDS) graphs at both sentence and document levels. It supports:
- Loading corpora from various formats (CoNLL, JSON)
- Managing sentence-level and document-level graphs
- Adding annotations to existing graphs
- Querying graphs using SPARQL
- Serialization and deserialization functionality
The UDSCorpus extends PredPattCorpus to support UDS-specific annotations and document-level semantic relationships.
- class UDSCorpus
Bases: PredPattCorpus
A collection of Universal Decompositional Semantics graphs.
- Parameters:
sentences (PredPattCorpus | dict[str, UDSSentenceGraph] | None, default: None) – the predpatt sentence graphs to associate the annotations with
documents (dict[str, UDSDocument] | None, default: None) – the documents associated with the predpatt sentence graphs
sentence_annotations (list[UDSAnnotation] | None, default: None) – additional annotations to associate with predpatt nodes on sentence-level graphs; in most cases, no such annotations will be passed, since the standard UDS annotations are automatically loaded
document_annotations (list[UDSAnnotation] | None, default: None) – additional annotations to associate with predpatt nodes on document-level graphs
version (str, default: '2.0') – the version of UDS datasets to use
split (str | None, default: None) – the split to load: “train”, “dev”, or “test”
annotation_format (str, default: 'normalized') – which annotation type to load (“raw” or “normalized”)
- UD_URL = 'https://github.com/UniversalDependencies/UD_English-EWT/archive/r1.2.zip'
- ANN_DIR = '/home/docs/checkouts/readthedocs.org/user_builds/decomp/checkouts/stable/decomp/data/'
- CACHE_DIR = '/home/docs/checkouts/readthedocs.org/user_builds/decomp/checkouts/stable/decomp/data/'
- __init__(sentences=None, documents=None, sentence_annotations=None, document_annotations=None, version='2.0', split=None, annotation_format='normalized')
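The constructor arguments can be checked before a corpus is built. The following is a minimal sketch, not part of decomp: `corpus_kwargs` is a hypothetical helper that validates `split` and `annotation_format` against the values listed in the parameter descriptions and returns keyword arguments for `UDSCorpus`. The import in the closing comment assumes the class is importable from the module named in this page's title.

```python
# Hypothetical helper (not part of decomp): validate constructor arguments
# against the values documented for UDSCorpus.__init__.
VALID_SPLITS = {"train", "dev", "test"}
VALID_FORMATS = {"raw", "normalized"}


def corpus_kwargs(split=None, annotation_format="normalized", version="2.0"):
    """Return keyword arguments suitable for UDSCorpus(**kwargs)."""
    if split is not None and split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {sorted(VALID_SPLITS)} or None")
    if annotation_format not in VALID_FORMATS:
        raise ValueError(
            f"annotation_format must be one of {sorted(VALID_FORMATS)}"
        )
    return {
        "split": split,
        "annotation_format": annotation_format,
        "version": version,
    }


kwargs = corpus_kwargs(split="dev")

# With decomp installed, the corpus could then be built like this
# (import path assumed from the module name above):
#     from decomp.semantics.uds.corpus import UDSCorpus
#     corpus = UDSCorpus(**kwargs)
```

Rejecting an unknown split name up front gives a clearer error than whatever the loader would report later.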
- classmethod from_conll_and_annotations(corpus, sentence_annotations=[], document_annotations=[], annotation_format='normalized', version='2.0', name='ewt')
Load UDS graph corpus from CoNLL (dependencies) and JSON (annotations).
This method should only be used if the UDS corpus is being (re)built. Otherwise, loading the corpus from the JSON shipped with this package using UDSCorpus.__init__ or UDSCorpus.from_json is suggested.
- Parameters:
corpus (TypeAliasType) – (path to) Universal Dependencies corpus in CoNLL-U format
sentence_annotations (Sequence[TypeAliasType], default: []) – a list of paths to JSON files or open JSON files containing sentence-level annotations
document_annotations (Sequence[TypeAliasType], default: []) – a list of paths to JSON files or open JSON files containing document-level annotations
annotation_format (str, default: 'normalized') – whether the annotation is raw or normalized
version (str, default: '2.0') – the version of UDS datasets to use
name (str, default: 'ewt') – corpus name to be prepended to graph ids
- Return type:
UDSCorpus
- classmethod from_json(sentences_jsonfile, documents_jsonfile)
Load annotated UDS graph corpus (including annotations) from JSON.
This is the suggested method for loading the UDS corpus.
- Parameters:
sentences_jsonfile (TypeAliasType) – file containing Universal Decompositional Semantics corpus sentence-level graphs in JSON format
documents_jsonfile (TypeAliasType) – file containing Universal Decompositional Semantics corpus document-level graphs in JSON format
- Return type:
UDSCorpus
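Loading from JSON can be wrapped in a small guard that fails early when a file is missing. This is a minimal sketch under stated assumptions: `load_uds_from_json` is a hypothetical helper, it assumes both arguments are filesystem paths (not open files), and it assumes the class is importable from the module named in this page's title.

```python
from pathlib import Path


def load_uds_from_json(sentences_jsonfile, documents_jsonfile):
    """Check that both JSON files exist, then defer to UDSCorpus.from_json.

    Hypothetical helper: both arguments are assumed to be paths.
    """
    for path in (sentences_jsonfile, documents_jsonfile):
        if not Path(path).is_file():
            raise FileNotFoundError(f"UDS JSON file not found: {path}")
    # Imported lazily so the path check works even without decomp installed.
    from decomp.semantics.uds.corpus import UDSCorpus

    return UDSCorpus.from_json(sentences_jsonfile, documents_jsonfile)
```

The file names passed to the helper are up to the caller; none are implied by the documentation above.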
- add_corpus_metadata(metadata)
Add metadata to the corpus.
- Parameters:
metadata (UDSCorpusMetadata) – Metadata to merge with existing corpus metadata
- Return type:
- add_annotation(sentence_annotation=None, document_annotation=None)
Add annotations to UDS sentence and document graphs.
- Parameters:
sentence_annotation (list[UDSAnnotation] | None, default: None) – the annotations to add to the sentence graphs in the corpus
document_annotation (list[UDSAnnotation] | None, default: None) – the annotations to add to the document graphs in the corpus
- Return type:
- add_sentence_annotation(annotation)
Add annotations to UDS sentence graphs.
- Parameters:
annotation (UDSAnnotation) – the annotations to add to the graphs in the corpus
- Return type:
- add_document_annotation(annotation)
Add annotations to UDS documents.
- Parameters:
annotation (UDSAnnotation) – the annotations to add to the documents in the corpus
- Return type:
- query(query, query_type=None, cache_query=True, cache_rdf=True)
Query all graphs in the corpus using SPARQL 1.1.
- Parameters:
query – the SPARQL 1.1 query to run against each graph
query_type (str | None, default: None) – whether this is a ‘node’ query or ‘edge’ query. If set to None (default), a Results object will be returned. The main reason to use this option is to automatically format the output of a custom query, since Results objects require additional postprocessing.
cache_query (bool, default: True) – whether to cache the query. This should usually be set to True. It should generally only be False when querying particular nodes or edges, e.g. as in precompiled queries.
cache_rdf (default: True) – whether to keep the RDF constructed for querying against. Setting this to False deletes that RDF, which will slow down future queries but saves a lot of memory.
- Return type:
dict[str, Result | dict[str, TypeAliasType] | dict[TypeAliasType, TypeAliasType]]
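Because query returns one entry per graph, a common follow-up step is to keep only the graphs that matched. The sketch below is hedged: `matching_graphs` is a hypothetical helper, the outer dictionary is assumed to be keyed by graph id, the query text is a generic placeholder (not UDS vocabulary), and `_StubCorpus` stands in for a real UDSCorpus so the example runs without decomp.

```python
# Placeholder SPARQL 1.1 query; a real query would use UDS properties.
QUERY = "SELECT ?s WHERE { ?s ?p ?o }"


def matching_graphs(corpus, sparql, query_type="node"):
    """Run one query over every graph and keep the non-empty results.

    `corpus` is any object exposing the documented
    query(query, query_type=..., cache_query=..., cache_rdf=...) method.
    Assumes each per-graph result is a dict that is empty when nothing
    matched, which fits the 'node' and 'edge' return shapes.
    """
    results = corpus.query(
        sparql, query_type=query_type, cache_query=True, cache_rdf=True
    )
    return {gid: res for gid, res in results.items() if res}


class _StubCorpus:
    """Stand-in for UDSCorpus so the sketch runs without decomp."""

    def query(self, query, query_type=None, cache_query=True, cache_rdf=True):
        # Two hypothetical graphs: one with a matching node, one without.
        return {"graph-1": {"node-a": {}}, "graph-2": {}}


hits = matching_graphs(_StubCorpus(), QUERY)
```

With query_type left as None the per-graph values are Result objects, so the emptiness test in this helper would need to be adapted.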
- property documents: dict[str, UDSDocument]
The documents in the corpus.
- Returns:
Mapping from document IDs to UDSDocument objects
- Return type:
dict[str, UDSDocument]
- property ndocuments: int
The number of documents in the corpus.
- Returns:
Total document count
- Return type:
int
- sample_documents(k)
Sample k documents without replacement.
- Parameters:
k (int) – the number of documents to sample
- Return type:
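Sampling without replacement means no document can be drawn twice, so k cannot exceed the number of documents. The following sketch illustrates that behaviour with the standard library only. It is not decomp's implementation: `sample_document_ids` is a hypothetical helper that works on any mapping shaped like the `documents` property (document id to document), and the ids are made up.

```python
import random


def sample_document_ids(documents, k, seed=None):
    """Pick k distinct document ids from a mapping of documents."""
    rng = random.Random(seed)
    # random.sample draws without replacement and raises ValueError
    # when k is larger than the population.
    return rng.sample(sorted(documents), k)


# Hypothetical stand-in for corpus.documents.
docs = {"doc-a": object(), "doc-b": object(), "doc-c": object()}
picked = sample_document_ids(docs, 2, seed=0)
```

Sorting the ids before sampling makes the draw reproducible for a fixed seed, independent of dictionary insertion order.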
- property metadata: UDSCorpusMetadata
The corpus metadata.
- Returns:
Metadata for sentence and document annotations
- Return type:
UDSCorpusMetadata
- property sentence_node_subspaces: set[str]
The UDS sentence node subspaces in the corpus.
- Returns:
Set of subspace names for sentence nodes
- Return type:
set[str]
- Raises:
NotImplementedError – This property is not yet implemented
- property sentence_edge_subspaces: set[str]
The UDS sentence edge subspaces in the corpus.
- Returns:
Set of subspace names for sentence edges
- Return type:
set[str]
- Raises:
NotImplementedError – This property is not yet implemented
- property document_node_subspaces: set[str]
The UDS document node subspaces in the corpus.
- Returns:
Set of subspace names for document nodes
- Return type:
set[str]
- Raises:
NotImplementedError – This property is not yet implemented
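Since the subspace properties are documented as raising NotImplementedError, calling code may want to read them defensively. A minimal sketch, assuming nothing beyond the property names and the exception listed above: `subspaces_or_none` is a hypothetical helper, and `_StubCorpus` (with its made-up subspace name) stands in for a real corpus so the example runs without decomp.

```python
def subspaces_or_none(corpus, attribute):
    """Return the named subspace property, or None if it is unimplemented."""
    try:
        return getattr(corpus, attribute)
    except NotImplementedError:
        return None


class _StubCorpus:
    """Stand-in corpus: one property unimplemented, one implemented."""

    @property
    def sentence_node_subspaces(self):
        raise NotImplementedError

    @property
    def sentence_edge_subspaces(self):
        return {"example-subspace"}  # hypothetical subspace name


stub = _StubCorpus()
node_subspaces = subspaces_or_none(stub, "sentence_node_subspaces")
edge_subspaces = subspaces_or_none(stub, "sentence_edge_subspaces")
```

Returning None keeps "not implemented" distinguishable from an implemented property that yields an empty set.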
- sentence_properties(subspace=None)
Return the properties in a sentence subspace.
- Parameters:
subspace (str | None, optional) – Subspace to query, or None for all properties
- Returns:
Property names in the subspace
- Return type:
- Raises:
NotImplementedError – This method is not yet implemented
- sentence_property_metadata(subspace, prop)
Return the metadata for a property in a sentence subspace.
- Parameters:
- Return type:
- document_properties(subspace=None)
Return the properties in a document subspace.
- Parameters:
subspace (str | None, optional) – Subspace to query, or None for all properties
- Returns:
Property names in the subspace
- Return type:
- Raises:
NotImplementedError – This method is not yet implemented