Reading the UDS dataset
=======================

The most straightforward way to read the Universal Decompositional
Semantics (UDS) dataset is to import it.

.. code-block:: python

   from decomp import UDSCorpus

   uds = UDSCorpus()

This loads a :py:class:`~decomp.semantics.uds.UDSCorpus` object ``uds``, which contains all
graphs across all splits in the data.

As noted in :doc:`quick-start`, the first time you do read UDS, it
will take several minutes to complete while the dataset is built from
the `Universal Dependencies English Web Treebank`_ (UD-EWT), which is not
shipped with the package (but is downloaded automatically when first
creating a corpus instance), and the `UDS annotations`_, which are shipped with
the package as package data. Normalized annotations are loaded by default.
To load raw annotations, specify ``"raw"`` as the argument to the UDSCorpus
``annotation_format`` keyword arugment as follows:

.. code-block:: python

   from decomp import UDSCorpus

   uds = UDSCorpus(annotation_format="raw")

(See `Adding annotations`_ below for more detail on annotation types.)
Subsequent uses of the corpus will be faster after the initial build,
since the built dataset is cached.

.. _Universal Dependencies English Web Treebank: https://github.com/UniversalDependencies/UD_English-EWT
.. _UDS annotations: http://decomp.io/data/

Standard splits
---------------

If you would rather read only the graphs in the training, development,
or test split, you can do that by specifying the ``split`` parameter
of ``UDSCorpus``.

.. code-block:: python

   from decomp import UDSCorpus

   # read the train split of the UDS corpus
   uds_train = UDSCorpus(split='train')

Adding annotations
------------------
   
Additional annotations beyond the standard UDS annotations can be
added using this method by passing a list of :py:class:`~decomp.semantics.uds.UDSAnnotation`
objects. These annotations can be added at two levels: the sentence level
and the document level. Sentence-level annotations contain attributes of
:py:class:`~decomp.semantics.uds.UDSSentenceGraph` nodes or edges. Document-level annotations contain
attributes for :py:class:`~decomp.semantics.uds.UDSDocumentGraph` nodes or edges. Document-level
edge annotations may relate nodes associated with different sentences 
in a document, although they are added as annotations only to the
the appropriate :py:class:`~decomp.semantics.uds.UDSDocumentGraph`.

Sentence-level and document-level annotations share the same two in-memory
representations: ``RawUDSDataset`` and ``NormalizedUDSDataset``. The former 
may have multiple annotations for the same node or edge attribute, while the
latter must have only a single annotation. Both are loaded from 
JSON-formatted files, but differ in the expected format (see the 
:py:meth:`~decomp.semantics.uds.NormalizedUDSDataset.from_json` methods of each class for formatting guidelines). For example,
if you have some additional *normalized* sentence-level annotations in a file
``new_annotations.json``, those can be added to the existing UDS annotations 
using:

.. code-block:: python

   from decomp import NormalizedUDSDataset
		
   # read annotations
   new_annotations = [NormalizedUDSDataset.from_json("new_annotations.json")]

   # read the train split of the UDS corpus and append new annotations
   uds_train_plus = UDSCorpus(split='train', sentence_annotations=new_annotations)

If instead you wished to add *raw* annotations (and supposing those
annotations were still in "new_annotations.json"), you would do the following:

.. code-block:: python

   from decomp import RawUDSDataset
		
   # read annotations
   new_annotations = [RawUDSDataset.from_json("new_annotations.json")]

   # read the train split of the UDS corpus and append new annotations
   uds_train_plus = UDSCorpus(split='train', sentence_annotations=new_annotations,
                              annotation_format="raw")

If ``new_annotations.json`` contained document-level annotations
you would pass ``new_annotations.json`` to the constructor keyword 
argument ``document_annotations`` instead of to ``sentence_annotations``.
Importantly, these annotations are added *in addition* to the existing
UDS annotations that ship with the toolkit. You do not need to add these
manually.

Finally, it should be noted that querying is currently **not** supported 
for document-level graphs or for sentence-level graphs containing raw
annotations.

Reading from an alternative location
------------------------------------

If you would like to read the dataset from an alternative
location—e.g. if you have serialized the dataset to JSON, using the
:py:meth:`~decomp.semantics.uds.UDSCorpus.to_json` instance method—this can be accomplished using
``UDSCorpus`` class methods (see :doc:`serializing` for more
information on serialization). For example, if you serialize
``uds_train`` to the files ``uds-ewt-sentences-train.json`` (for
sentences) and ``uds-ewt-documents-train.json`` (for the documents),
you can read it back into memory using:

.. code-block:: python

   # serialize uds_train to JSON
   uds_train.to_json("uds-ewt-sentences-train.json", "uds-ewt-documents-train.json")

   # read JSON serialized uds_train
   uds_train = UDSCorpus.from_json("uds-ewt-sentences-train.json", "uds-ewt-documents-train.json")   

Rebuilding the corpus
---------------------
   
If you would like to rebuild the corpus from the UD-EWT CoNLL files
and some set of JSON-formatted annotation files, you can use the
analogous :py:meth:`~decomp.semantics.uds.UDSCorpus.from_conll` class method. Importantly, unlike the
standard instance initialization described above, the UDS annotations
are *not* automatically added. For example, if ``en-ud-train.conllu``
is in the current working directory and you have already loaded
``new_annotations`` as above, a corpus containing only those
annotations (without the UDS annotations) can be loaded using:

.. code-block:: python

   # read the train split of the UD corpus and append new annotations
   uds_train_annotated = UDSCorpus.from_conll("en-ud-train.conllu", sentence_annotations=new_annotations)   

This also means that if you only want the semantic graphs as implied
by PredPatt (without annotations), you can use the ``from_conll``
class method to load them.

.. code-block:: python

   # read the train split of the UD corpus
   ud_train = UDSCorpus.from_conll("en-ud-train.conllu")   

Note that, because PredPatt is used for predicate-argument extraction,
only versions of UD-EWT that are compatible with PredPatt can be used
here. Version 1.2 is suggested.
   
Though other serialization formats are available (see
:doc:`serializing`), these formats are not yet supported for reading.