Releases: flammie/omorfi

20160515

19 May 12:55

Omorfi version 20160515 has been released. This is the first release in the series to follow the Universal Dependencies schedule and standard; future releases are intended to match Universal Dependencies when needed and possible. The important updates:

  • Universal Dependencies for Finnish is the new standard format we now follow:
    • POS is now UPOS and classes were changed accordingly (new classes: AUX,
      PROPN, DET, CONJ, SCONJ, PUNCT, SYM, and VERB, NOUN, ADP, ADV as before)
    • other features mostly match the feature field in UD documentation
    • the release cycle aims to follow the same six-month cycle as UD
    • the automatic tests verify compatibility with UD; 92 % of lemmas, primary
      POS tags and morphological features are the same as in the Finnish UD
      corpus, and 75 % are the same as in the Finnish FTB UD corpus
    • analyser for reading and writing the CoNLL-U format
  • tokenisation is available as a script, with more hacks for token stripping
    in corner cases
  • continuous integration with Travis CI, currently only testing basic script
    programming conventions
  • added a lot of high coverage words and forms by hand
  • by popular request, some words can now be blacklisted, for when you don't
    want that guy named Mutta to ambiguate your conjunction analyses or the odd
    New Guinean bird to clash with some common verb
  • the "database" is now only keyed on lemma + homonym number; paradigm is extra
    information like anything else
  • a lot of work on morphological segmentation towards statistical machine
    translation; check proceedings of WMT shared tasks 2015 and 2016 to see why
  • started refactoring some python code into classes
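
To give a rough idea of what the CoNLL-U format mentioned above looks like, here is a minimal Python sketch that splits one token line into its ten tab-separated columns. This is not omorfi's own reader or API; the function names and the example line are made up for illustration, and only the column layout comes from the CoNLL-U format itself.

```python
# Illustrative sketch only: not omorfi's code. Function names and the
# example token line are assumptions made for this example.

# The ten columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_feats(feats):
    """Turn 'Case=Ine|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 tab-separated columns, got %d" % len(columns))
    token = dict(zip(CONLLU_FIELDS, columns))
    token["feats"] = parse_feats(token["feats"])
    return token


# Hand-written example line (not taken from any corpus).
example = "1\tTalossa\ttalo\tNOUN\tN\tCase=Ine|Number=Sing\t0\troot\t_\t_"
token = parse_conllu_line(example)
print(token["lemma"], token["upos"], token["feats"])
```

The lemma, UPOS and feature columns are the ones the UD compatibility tests above compare.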

Omorfi 20150904 released

04 Sep 16:32

I’ve decided to release a new stable version of omorfi before some major changes in the lexical data caused by the transition towards the Universal part-of-speech and Universal Dependencies schemes. The changes since the previous release are not very substantial. From the NEWS file:

Significant changes in 20150904

  • allomorphy can be tagged again to distinguish e.g. -iden and -itten when
    generating
  • FinnTreeBank-1 format provided by Miikka Silfverberg is available but not
    built by default since it lacks a test set
  • lexicalised inflections can have a separate tag, e.g. kännissä can be
    tagged as a lexical inessive, distinguished from the regular inessive
  • preliminary VISL CG-3 support, with original grammar by Fred Karlsson;
    convenience bash scripts available for disambiguated parsing
  • preliminary support for CoNLL-U and CoNLL-X analysis formats
  • paradigm categorisation is now verified by regular expressions
  • lots of paradigm fixes and some added words
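
As a rough illustration of what verifying paradigm categorisation by regular expressions can mean, here is a small Python sketch. The paradigm names and patterns are hypothetical and far simpler than anything in omorfi's lexical data; the idea is only that each paradigm carries a pattern the lemma must match, and that mismatches get reported.

```python
import re

# Hypothetical paradigm names and patterns, for illustration only;
# these are not omorfi's real paradigms.
PARADIGM_PATTERNS = {
    "NOUN_ENDS_IN_O": re.compile(r".*o"),
    "VERB_ENDS_IN_AA": re.compile(r".*aa"),
}


def check_paradigm(lemma, paradigm):
    """Return True if the whole lemma matches the pattern of its paradigm."""
    pattern = PARADIGM_PATTERNS.get(paradigm)
    if pattern is None:
        raise KeyError("unknown paradigm: %s" % paradigm)
    return pattern.fullmatch(lemma) is not None


def find_mismatches(entries):
    """Collect (lemma, paradigm) pairs whose lemma fails the paradigm pattern."""
    return [
        (lemma, paradigm)
        for lemma, paradigm in entries
        if not check_paradigm(lemma, paradigm)
    ]


# Made-up lexicon entries; "kala" is deliberately filed under the wrong paradigm.
entries = [
    ("talo", "NOUN_ENDS_IN_O"),
    ("kala", "NOUN_ENDS_IN_O"),
    ("ostaa", "VERB_ENDS_IN_AA"),
]
mismatches = find_mismatches(entries)
print(mismatches)
```

A check of this kind catches words filed under a paradigm whose stem shape they cannot have, before they produce wrong inflected forms.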