This repository has been archived by the owner on May 2, 2023. It is now read-only.
Official releases of the ISWOC treebank

iswoc/iswoc-treebank

As of April 2023, releases of the ISWOC Treebank have moved to https://github.com/syntacticus/syntacticus-treebank-data.

The ISWOC Treebank

The ISWOC Treebank is a dependency treebank with morphosyntactic and information-structure annotation. It includes texts in several older Indo-European languages and is freely available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Please cite as

Bech, Kristin and Kristine Eide. 2014. The ISWOC corpus. Department of Literature, Area Studies and European Languages, University of Oslo. http://iswoc.github.com.

Releases of the ISWOC Treebank are hosted on GitHub.

Contents

The following texts are included in this release of the treebank:

| Text | Language | Filename | Size |
|------|----------|----------|------|
| Ælfric's Lives of Saints | Old English | æls | 3137 tokens |
| Apollonius of Tyre | Old English | apt | 5541 tokens |
| Anglo-Saxon Chronicles | Old English | chrona | 5939 tokens |
| Orosius | Old English | or | 1728 tokens |
| West-Saxon Gospels | Old English | wscp | 13061 tokens |
| La Vie Saint Eustace | Old French | eustace | 2340 tokens |
| Crónica Geral de Espanha 2-12 | Portuguese | cge1 | 12074 tokens |
| Crónica Geral de Espanha 155-167 | Portuguese | cge2 | 10547 tokens |
| Décadas Livro 5, VIII, 9-14 | Portuguese | coutdec-v-8 | 13794 tokens |
| Crónica de Alfonso XI | Spanish | alfonso-xi | 7942 tokens |
| Crónica de España | Spanish | ce | 4627 tokens |
| El Conde Lucanor | Spanish | cdeluc | 17551 tokens |
| Estoria de Espanna I | Spanish | ee1 | 9488 tokens |
| General Estoria parte IV Daniel | Spanish | ge4 | 9233 tokens |
| Libro delos claros varones | Spanish | varones | 5820 tokens |

(The 'size' column in the table above shows the number of annotated tokens in a text. The number of tokens will be slightly larger than the number of words in the original printed edition as some words have been split into multiple tokens and some tokens have been inserted during annotation.)

Please see the XML files for detailed metadata and a full list of contributors.

Data formats

The texts are available in two formats:

  1. PROIEL XML: These files are the authoritative source files and the only ones that contain all available annotation. They contain the complete morphological, syntactic and information-structure annotation, as well as the complete text, including punctuation, section headers etc. The schema is defined in proiel.xsd.

  2. CoNLL-X format
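
For readers who want to work with the CoNLL-X files programmatically, the following is a minimal sketch of a reader in Python. It assumes the files follow the standard CoNLL-X layout (ten tab-separated columns per token, with a blank line between sentences); the files in this release may deviate from that in details. The function name `read_conllx` and the sample rows are made up for illustration and are not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the standard CoNLL-X format (ten tab-separated fields).
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, values)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Made-up sample rows for illustration only (not treebank data).
SAMPLE = (
    "1\tword1\tlemma1\tN\tNb\t_\t2\tsub\t_\t_\n"
    "2\tword2\tlemma2\tV\tV-\t_\t0\tpred\t_\t_\n"
    "\n"
    "1\tword3\tlemma3\tV\tV-\t_\t0\tpred\t_\t_\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
```

To read a real file, pass an open file handle instead of the sample, for example `read_conllx(open(path, encoding="utf-8"))`, where `path` points to one of the CoNLL-X files. All values are kept as strings, so `head` must be converted with `int()` before it is used as an index. Since the PROIEL XML files are the only ones that contain all available annotation, this reader sees only the subset exported to CoNLL-X.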