webanno_tsv

A Python library to parse TSV files as produced by the WebAnno software and as described in its documentation.

The following features are supported:

WebAnno's text offsets, which are counted in UTF-16 code units
WebAnno's escape sequences
Multiple span annotation layers with multiple fields
Span annotations over multiple tokens and sentences
Multiple annotations per field (stacked annotations)
Disambiguation IDs (here called label_id)
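
To illustrate the UTF-16 point in the list above: WebAnno offsets count UTF-16 code units, so a character outside the Basic Multilingual Plane takes up two positions. The following is a minimal sketch using only the standard library. The helper names `utf16_len` and `utf16_offsets` are hypothetical and not part of this library, and the sketch assumes tokens are separated by exactly one space.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode('utf-16-le')) // 2


def utf16_offsets(tokens, separator=' '):
    """Return (start, end) offsets in UTF-16 code units for each token.

    Assumes tokens are joined by a single separator (illustration only).
    """
    offsets = []
    position = 0
    for token in tokens:
        start = position
        end = start + utf16_len(token)
        offsets.append((start, end))
        position = end + utf16_len(separator)
    return offsets


print(utf16_offsets(['First', 'sentence']))  # [(0, 5), (6, 14)]
print(utf16_offsets(['😀', 'ok']))            # [(0, 2), (3, 5)]
```

The emoji is a single Python character but occupies two UTF-16 code units, which is why the second token starts at offset 3 rather than 2.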

The following is not supported:

Relations
Chain annotations
Sub-token annotations (ignored on reading)

Installation

pip install git+https://github.com/neuged/webanno_tsv

Examples

To construct a Document with annotations you could do:

from webanno_tsv import Document, Annotation
from dataclasses import replace

sentences = [
    ['First', 'sentence'],
    ['Second', 'sentence']
]
doc = Document.from_token_lists(sentences)

layer_defs = [('Layer1', ['Field1']), ('Layer2', ['Field2', 'Field3'])]
annotations = [
    Annotation(tokens=doc.tokens[1:2], layer='Layer1', field='Field1', label='ABC'),
    Annotation(tokens=doc.tokens[1:3], layer='Layer2', field='Field3', label='XYZ', label_id=1)
]
doc = replace(doc, annotations=annotations, layer_defs=layer_defs)
doc.tsv()

The call to doc.tsv() then returns a string:

#FORMAT=WebAnno TSV 3.3
#T_SP=Layer1|Field1
#T_SP=Layer2|Field2|Field3


#Text=First sentence
1-1	0-5	First	_	_	_
1-2	6-14	sentence	ABC	*[1]	XYZ[1]

#Text=Second sentence
2-1	15-21	Second	_	*[1]	XYZ[1]
2-2	22-30	sentence	_	_	_
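
To make the cell notation in this output easier to read, here is a rough sketch of how a single token line could be taken apart by hand. It is not how the library parses files: `split_cell` is a hypothetical helper, and the sketch assumes that `_` marks a column without annotation and `*` marks an annotation whose field has no value, as the output above suggests. Stacked values and escape sequences are not handled.

```python
import re

# A cell such as 'XYZ[1]': label text followed by an optional [id] suffix.
CELL_RE = re.compile(r'^(?P<label>.*?)(?:\[(?P<label_id>\d+)\])?$')


def split_cell(cell: str):
    """Split one annotation cell into (label, label_id).

    Returns None for '_' (no annotation in this column).
    A '*' label (annotation present, field unset) becomes None.
    """
    if cell == '_':
        return None
    match = CELL_RE.match(cell)
    label = match.group('label')
    label_id = match.group('label_id')
    return (
        None if label == '*' else label,
        int(label_id) if label_id is not None else None,
    )


line = '1-2\t6-14\tsentence\tABC\t*[1]\tXYZ[1]'
token_id, span, text, *cells = line.split('\t')
print(token_id, span, text)            # 1-2 6-14 sentence
print([split_cell(c) for c in cells])  # [('ABC', None), (None, 1), ('XYZ', 1)]
```

The shared ID `1` on the last two cells, repeated on token 2-1, is what ties the cells of a multi-token annotation together.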

Supposing you have a file containing the output above, you could read and query it like this:

from webanno_tsv import webanno_tsv_read_file

doc = webanno_tsv_read_file('/tmp/input.tsv')

for token in doc.tokens:
    if token.text == 'sentence':
        print(token.sentence_idx, token.idx)

# Prints:
# 1 2
# 2 2

for annotation in doc.match_annotations(layer='Layer2'):
    print(annotation.layer, annotation.field, annotation.label)

# Prints:
# Layer2 Field3 XYZ

for annotation in doc.match_annotations(sentence=doc.sentences[0]):
    print(annotation.layer, annotation.field, annotation.label)

# Prints:
# Layer1 Field1 ABC
# Layer2 Field3 XYZ

# Some convenience lookup functions are available on the Document instance
doc.token_sentence(doc.tokens[0])
doc.sentence_tokens(doc.sentences[0])
doc.annotation_sentences(doc.annotations[0])

Possible Gotcha: The classes in this library are read-only dataclasses (dataclasses with frozen=True).

This means that their fields cannot be assigned to. You can, however, create modified copies with dataclasses.replace():

from dataclasses import replace
from webanno_tsv import Token

t1 = Token(sentence_idx=1, idx=0, start=0, end=3, text='Foo')
t2 = replace(t1, text='Bar')
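
To see what "read-only" means in practice, the following sketch uses a small stand-in dataclass rather than the library's own classes. `ExampleToken` is hypothetical and exists only for this illustration; the behaviour shown is that of any dataclass declared with frozen=True.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ExampleToken:
    # Hypothetical stand-in, not the library's Token class.
    idx: int
    text: str


t1 = ExampleToken(idx=1, text='Foo')

try:
    t1.text = 'Bar'  # assigning to a field of a frozen dataclass raises
except FrozenInstanceError:
    print('cannot assign to a frozen instance')

t2 = replace(t1, text='Bar')  # returns a modified copy, t1 is unchanged
print(t1.text, t2.text)  # Foo Bar
```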

Development

Run the tests with:

python -m unittest test/*.py

PRs always welcome!
