Teanga Core and DB
Teanga is a database and system designed for NLP with pretrained language models.
Teanga Data Model
The core idea of Teanga is the data model, which describes how data is represented, processed by services and stored in Teanga backends.
Layers
The Teanga data model consists of a set of layers that provide annotations. Layers are typed into the following kinds:
- Character Layers: These layers represent text. A character layer consists of Unicode characters, and each index corresponds to one Unicode character. As such, while these layers are most frequently encoded as UTF-8, other encodings can be handled as well.
- Span Layers: Span layers consist of annotations with a start position, end position and a data value. The indexes refer to the position in the sublayer. Span layers are the most flexible form of annotation and are typically used to represent tokenisation and annotations such as named entities and terms.
- Division Layers: Division layers have a start position and a data value. The end position of a div layer is assumed to be the start position of the next annotation in this layer or the largest index in the sublayer. Div layers are typically used to divide the text into sections such as sentences, paragraphs and chapters.
- Element Layers: Element layers have a start position and a data value. The end position is assumed to be the start position plus one. Element layers are most typically used for indicating metadata properties and a few annotations.
- Sequence Layers: A sequence layer has only a data value on each annotation. Sequence layers are assumed to be in one-to-one correspondence with the indexes of the sublayer. These are typically used when there is a value for every word (or sentence or paragraph) such as in part-of-speech tagging.
An example of each layer type is given in the above image and can be represented in YAML as follows:
_meta:
    text:
        type: characters
    tokens:
        type: span
        base: text
    upos:
        type: seq
        base: tokens
        data: ["ADJ", ... "X"]
    document:
        type: div
        base: text
        default: [[0]]
    author:
        type: element
        base: document
        data: string
VC90:
    text: "Teanga2 data model"
    tokens: [[0,7], [8,12], [13,18]]
    upos: ["PROPN", "NOUN", "NOUN"]
    author: [[0, "John P. McCrae"], [0, "Somebody Else"]]
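The end-position rules for the layer types can be sketched in Python. This is a minimal illustration and not the Teanga library's API: the function name `resolve_extents` and the tuple representation are assumptions made for this example, and "the largest index in the sublayer" is taken to mean the length of the base layer.

```python
def resolve_extents(layer_type, annotations, base_length):
    """Return (start, end) index pairs into the base layer.

    Hypothetical helper. Annotations follow the YAML shapes used above:
    spans are [start, end, ...], divs and elements are [start, ...], and
    seq layers hold one value per index of the base layer.
    """
    if layer_type == "span":
        # Spans carry both positions explicitly.
        return [(a[0], a[1]) for a in annotations]
    if layer_type == "div":
        # A div ends where the next div starts; the last runs to the end.
        starts = [a[0] for a in annotations]
        ends = starts[1:] + [base_length]
        return list(zip(starts, ends))
    if layer_type == "element":
        # An element covers exactly one index of the base layer.
        return [(a[0], a[0] + 1) for a in annotations]
    if layer_type == "seq":
        # One annotation per index of the base layer, in order.
        if len(annotations) != base_length:
            raise ValueError("seq layer must match the base layer length")
        return [(i, i + 1) for i in range(base_length)]
    raise ValueError(f"unknown layer type: {layer_type}")


text = "Teanga2 data model"
tokens = [[0, 7], [8, 12], [13, 18]]
print(resolve_extents("span", tokens, len(text)))   # [(0, 7), (8, 12), (13, 18)]
print(resolve_extents("div", [[0]], len(text)))     # [(0, 18)]
print(resolve_extents("seq", ["PROPN", "NOUN", "NOUN"], len(tokens)))
```

Character layers are not handled here because they carry the text itself and have no base layer to index into.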
Data
Each annotation in a Teanga layer can have data. The following types of data are available:
- None: No data is associated with the annotation. This is frequently used for layers that only divide the text, such as tokens, sentences or chapters.
- String: A string value, such as the lemma for an entry.
- Enumeration: A string value, but limited to a list of possible values.
- Link: A reference to another annotation. If not specified, this link is assumed to refer to an annotation in the same layer by its index; however, you may specify another layer by means of the target property.
- Typed Link: A link with a type, which combines the enumeration and link data types.
As an example, consider this (simplified) encoding of Universal Dependencies data:
_meta:
    text:
        type: characters
    words:
        type: span
        base: text
        data: none
    upos:
        type: seq
        base: words
        data: ["DET","NOUN","VERB"]
    dep:
        type: seq
        base: words
        data: link
        link_types: ["root","nsubj","det","dobj"]
        target: dep
kOJl:
    text: "this is an example"
    words: [[0,4], [5,7], [8,10], [11,18]]
    upos: ["DET", "VERB", "DET", "NOUN"]
    dep: [[1, "nsubj"], [1, "root"], [3, "det"], [1, "dobj"]]
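A typed-link layer can be checked against its metadata with a few lines of Python. The following is only a sketch: `check_typed_links` is a hypothetical helper, not part of Teanga, and the values are modelled on the dependency example rather than taken from any real corpus.

```python
def check_typed_links(links, link_types, target_length):
    """Return a list of problems found in typed-link annotations.

    Hypothetical helper: each link is [target_index, link_type], where
    target_index must point at an annotation in the target layer and
    link_type must be one of the declared link types.
    """
    problems = []
    for i, (target, link_type) in enumerate(links):
        if not 0 <= target < target_length:
            problems.append(f"annotation {i}: target {target} out of range")
        if link_type not in link_types:
            problems.append(f"annotation {i}: unknown link type {link_type!r}")
    return problems


words = [[0, 4], [5, 7], [8, 10], [11, 18]]
link_types = ["root", "nsubj", "det", "dobj"]
dep = [[1, "nsubj"], [1, "root"], [3, "det"], [1, "dobj"]]
print(check_typed_links(dep, link_types, len(words)))              # []
print(check_typed_links([[5, "nsubj"]], link_types, len(words)))
```

Because the link layer is a seq layer over the words, the target length used here is the number of words.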
In addition, the metadata may define a default value for the layer. In this case, the layer does not need to be specified in the document and will be assumed to have the default value. The primary use for this is in defining document layers, as in the first example above.
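The effect of a default can be sketched as a small fill-in step. The helper name `with_defaults` is hypothetical, and the `default` key is the one used for the document layer in the first YAML example; this shows the idea rather than how Teanga implements it.

```python
def with_defaults(doc, meta):
    """Return a copy of doc with missing layers filled from the metadata.

    Hypothetical helper: a layer absent from the document takes the
    value of the 'default' key in its layer description, if there is one.
    """
    filled = dict(doc)
    for name, description in meta.items():
        if name not in filled and "default" in description:
            filled[name] = description["default"]
    return filled


meta = {
    "text": {"type": "characters"},
    "document": {"type": "div", "base": "text", "default": [[0]]},
}
doc = {"text": "Teanga2 data model"}
print(with_defaults(doc, meta))
# {'text': 'Teanga2 data model', 'document': [[0]]}
```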
Corpus Model
The corpus model of Teanga consists of an (ordered) sequence of documents, each of which consists of an (unordered) set of layers. In addition, there are two meta properties, _meta and _order, which give the layer descriptions and the order of the documents in the corpus.
Each document is indexed by the initial characters of the Base64 encoding of the SHA-256 hash of the UTF-8 representation of its text. The text representation consists of all character layers ordered by their key, with each key placed before its text. Each key and each text is terminated by a zero byte (\u0000).
For example, take the following document:
en: Hello!
de: Guten Tag!
The string to encode is as follows:
from base64 import b64encode
from hashlib import sha256
rep = "de\x00Guten Tag!\x00en\x00Hello!\x00"
b64encode(sha256(rep.encode("utf-8")).digest()).decode("ascii")
'SpKHmfUJ1IkFXito5Me/ssLZ0Xx+ma5jjXTDb2qXs88='
By default, only the first 4 characters of the key are used, so the representation of this document would be:
SpKH:
    en: Hello!
    de: Guten Tag!
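The key computation can be wrapped in a small function. This is a sketch under stated assumptions: `teanga_key` is a hypothetical name, the caller says which layers are character layers, and "ordered by their key" is taken to mean Python's default string ordering.

```python
from base64 import b64encode
from hashlib import sha256


def teanga_key(doc, character_layers, length=4):
    """Compute a document key from its character layers.

    Hypothetical helper: each character layer contributes its name and
    its text, each followed by a zero byte, in sorted order of the names.
    """
    rep = ""
    for name in sorted(character_layers):
        if name in doc:
            rep += name + "\x00" + doc[name] + "\x00"
    digest = sha256(rep.encode("utf-8")).digest()
    # Only the first `length` characters of the Base64 string are kept.
    return b64encode(digest).decode("ascii")[:length]


doc = {"en": "Hello!", "de": "Guten Tag!"}
print(teanga_key(doc, ["en", "de"]))
```

Passing a larger `length` gives a longer key, which may be useful if two documents in a corpus happen to share the same first 4 characters.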
All keys in the document should be unique and are used to check the validity of the input.
These keys are used by the _order meta property to give the order of documents. In many serializations this may be omitted, and the order of the keys in the serialized file is used instead of an explicit order.
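Reading documents in corpus order might look like the following sketch. The helper `documents_in_order` is hypothetical, the corpus is assumed to have been loaded into a Python dict (which preserves insertion order), and the document keys in the example are placeholders rather than real hashes.

```python
META_PROPERTIES = ("_meta", "_order")


def documents_in_order(corpus):
    """Yield (key, document) pairs in corpus order.

    Hypothetical helper: uses the explicit _order list when present and
    otherwise falls back to the order of the keys in the loaded mapping.
    """
    keys = corpus.get("_order")
    if keys is None:
        keys = [k for k in corpus if k not in META_PROPERTIES]
    for key in keys:
        yield key, corpus[key]


corpus = {
    "_meta": {"text": {"type": "characters"}},
    "_order": ["bbbb", "aaaa"],
    "aaaa": {"text": "first in the file"},
    "bbbb": {"text": "second in the file"},
}
print([key for key, _ in documents_in_order(corpus)])  # ['bbbb', 'aaaa']
```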
Documentation and RDF
Teanga is linked-data-aware, and this can be used to provide documentation to the user. This is done with the special _uri property, which can appear at several points in the document:
_meta:
    _uri: https://jmccrae.github.io/teanga2/meta/basic.yaml
    author:
        base: document
        data: string
        _uri: https://jmccrae.github.io/teanga2/props/author.html
jjVi:
    _uri: corpus/doc1.yaml
As a property directly under _meta, this indicates that this corpus builds on another model and includes all the layers of that model in this corpus.
As a property of a layer, it indicates a description of the property. This should ideally refer to an HTML page with embedded Turtle or RDFa annotation.
If given directly in a document, this indicates that the document is stored in another file, and the YAML document in that file is effectively copied in as this document.