A Python library for parsing TSV files as produced by the WebAnno software and described in its documentation.
The following features are supported:
- WebAnno's UTF-16 based text indices
- WebAnno's escape sequences
- Multiple span annotation layers with multiple fields
- Span annotations over multiple tokens and sentences
- Multiple annotations per field (stacked annotations)
- Disambiguation IDs (here called label_id)
The following is not supported:
- Relations
- Chain annotations
- Sub-token annotations (ignored on reading)
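To illustrate why UTF-16 indices matter: WebAnno counts offsets in UTF-16 code units, while Python strings are indexed by code point, so the two diverge as soon as the text contains characters outside the Basic Multilingual Plane. The helpers below (`utf16_len` and `utf16_span`) are hypothetical names for this sketch only and are not part of this library's API.

```python
from typing import Tuple


def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode `text`."""
    # Each UTF-16 code unit is two bytes; 'utf-16-le' adds no byte order mark.
    return len(text.encode('utf-16-le')) // 2


def utf16_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Convert Python code point indices [start, end) to UTF-16 offsets."""
    return utf16_len(text[:start]), utf16_len(text[:end])


text = 'I \U0001F60A TSV'  # contains one emoji outside the BMP
# Python counts the emoji as one code point, UTF-16 as two code units.
print(len(text), utf16_len(text))  # 7 8
# The token 'TSV' sits at Python indices 4..7 but at UTF-16 offsets 5..8.
print(utf16_span(text, 4, 7))      # (5, 8)
```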
pip install git+https://github.com/neuged/webanno_tsv
To construct a Document with annotations you could do:
from webanno_tsv import Document, Annotation
from dataclasses import replace
sentences = [
    ['First', 'sentence'],
    ['Second', 'sentence']
]
doc = Document.from_token_lists(sentences)
layer_defs = [('Layer1', ['Field1']), ('Layer2', ['Field2', 'Field3'])]
annotations = [
    Annotation(tokens=doc.tokens[1:2], layer='Layer1', field='Field1', label='ABC'),
    Annotation(tokens=doc.tokens[1:3], layer='Layer2', field='Field3', label='XYZ', label_id=1)
]
doc = replace(doc, annotations=annotations, layer_defs=layer_defs)
doc.tsv()
The call to doc.tsv() then returns a string:
#FORMAT=WebAnno TSV 3.3
#T_SP=Layer1|Field1
#T_SP=Layer2|Field2|Field3
#Text=First sentence
1-1 0-5 First _ _ _
1-2 6-14 sentence ABC *[1] XYZ[1]
#Text=Second sentence
2-1 15-21 Second _ *[1] XYZ[1]
2-2 22-30 sentence _ _ _
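For readers unfamiliar with the format, here is a rough sketch of how one token line from such output can be taken apart by hand. It assumes the columns are tab-separated, that `_` marks a field without annotation and that stacked values are joined with `|`. It deliberately ignores escape sequences, and `parse_field_value` is a hypothetical helper, not a function of this library.

```python
import re
from typing import List, Optional, Tuple

# Matches an optional trailing disambiguation ID such as the "[1]" in "XYZ[1]".
LABEL_RE = re.compile(r'^(?P<label>.*?)(?:\[(?P<label_id>\d+)\])?$')


def parse_field_value(value: str) -> List[Tuple[str, Optional[int]]]:
    """Split one field column into (label, label_id) pairs."""
    if value == '_':  # no annotation on this token for this field
        return []
    pairs = []
    for part in value.split('|'):  # stacked annotations (escapes not handled)
        match = LABEL_RE.match(part)
        label_id = match.group('label_id')
        pairs.append((match.group('label'), int(label_id) if label_id else None))
    return pairs


line = '1-2\t6-14\tsentence\tABC\t*[1]\tXYZ[1]'
token_id, offsets, text, *fields = line.split('\t')
sentence_idx, token_idx = (int(n) for n in token_id.split('-'))
start, end = (int(n) for n in offsets.split('-'))

print(sentence_idx, token_idx, start, end, text)
# 1 2 6 14 sentence
print([parse_field_value(f) for f in fields])
# [[('ABC', None)], [('*', 1)], [('XYZ', 1)]]
```

The shared ID `1` on `*[1]` and `XYZ[1]` is what ties the fields of one multi-token annotation together across lines.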
Given a file containing the output above, you could read and query it like this:
from webanno_tsv import webanno_tsv_read_file, Document
doc = webanno_tsv_read_file('/tmp/input.tsv')
for token in doc.tokens:
    if token.text == 'sentence':
        print(token.sentence_idx, token.idx)
# Prints:
# 1 2
# 2 2
for annotation in doc.match_annotations(layer='Layer2'):
    print(annotation.layer, annotation.field, annotation.label)
# Prints:
# Layer2 Field3 XYZ
for annotation in doc.match_annotations(sentence=doc.sentences[0]):
    print(annotation.layer, annotation.field, annotation.label)
# Prints:
# Layer1 Field1 ABC
# Layer2 Field3 XYZ
# Some convenience lookup functions are available on the Document instance
doc.token_sentence(doc.tokens[0])
doc.sentence_tokens(doc.sentences[0])
doc.annotation_sentences(doc.annotations[0])
Possible gotcha: The classes in this library are read-only dataclasses (dataclasses with frozen=True).
This means that their fields cannot be assigned to. You can, however, create modified copies with dataclasses.replace().
from dataclasses import replace
from webanno_tsv import Token
t1 = Token(sentence_idx=1, idx=0, start=0, end=3, text='Foo')
t2 = replace(t1, text='Bar')  # t1 itself is left unchanged
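To show what the gotcha looks like in practice, the sketch below uses a small stand-in dataclass. `Tok` is hypothetical and only mimics the frozen behaviour; it is not the library's Token class. Assigning to a field raises `dataclasses.FrozenInstanceError`, whereas `replace()` returns a new object.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Tok:  # hypothetical stand-in, not the library's Token class
    idx: int
    text: str


t1 = Tok(idx=1, text='Foo')
try:
    t1.text = 'Bar'  # assignment on a frozen dataclass raises
except FrozenInstanceError:
    print('cannot assign to a frozen instance')

t2 = replace(t1, text='Bar')  # builds a new instance, t1 is untouched
print(t1.text, t2.text)  # Foo Bar
```

One consequence worth keeping in mind: objects that refer to an instance keep pointing at the old version after a `replace()`, so any containing structure has to be rebuilt with the new instance as well.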
Run the tests with:
python -m unittest test/*.py
PRs always welcome!