Releases: flammie/omorfi

Omorfi-0.9.10

23 Jan 22:19
f5e7467

Significant changes in 0.9.10

  • new words from wiktionaries and open name databases: 77,000 new lexemes,
    mostly proper nouns from two government open-access name databases:
    • all first names and surnames used in Finland from the dvv.fi name registry
    • all place names from the GML data by Maanmittauslaitos
    • see the statistics page for details
  • minor fixes to tags and formats, disambiguation and words:
    • apertium format has more subcategories
    • more words from wiktionaries have better paradigms (mainly consonant-final nouns)
    • a few minor tweaks to prevent odd plurals and singulars for personal pronouns that do not have them in normal use
  • test results show same compatibility as always, except:
    • FTB-3.1 is down to 88 % from 90 % and
    • UD vs. Finnish TDT is down to 92 % from 94 %
  • python stuff should only use hfst package and not (legacy?) libhfst
    • newest hfst-python should again be installable from pip and other packaging sources
  • big thanks to Patreons and GitHub Sponsors for continued support

omorfi-0.9.9 release

12 Jun 17:05

This is the first release of omorfi moving towards semantic versioning; some functionality that depends on the version number (automatic downloads) might be slightly broken. New features since the previous release (from NEWS):

Significant changes in 0.9.9

  • slight updates to convenience bash scripts
    • bash scripts default to large coverage analyser now, use -Z for old
      behaviour
  • Unimorph 4 compatible
  • added the name database from the Finnish government's open data repository:
    approx. 20,000 new names and 20,000 existing names verified
  • Changed to semver and not so bi-yearly schedule, and to main branch instead
    of outdated git flow model
  • nearly 10,000 words moved from main lexicon to MWE; added MWE fragments that
    were previously not in main lexicon (e.g. "Records", "Air", "Las", "Agia",
    "Group", ...)
  • a few thousand words from fiwikt, enwikt and joukahainen including new
    paradigms for cool and chic (i.e. consonant-final loan adjectives), galanga
    root, cisgender, genetic scissors, gay drumming, hybrid influencing,
    spike protein and a lot of birds, mice and compounds
  • preliminary support for conda
  • homonym codes dropped across parts of speech; only lemmas within the same
    POS get a homonym code and analysis now
  • Removed multi-words from main lexicon; if a lexeme has a space in it, all
    parts are analysed separately
  • canonic sort order for TSV files based on python sort (since bash sort is
    not portable across OSes or stable)
  • minor fixes for c++ demo and api
  • updated words from wiktionaries, joukahainen...
  • basic NER parser (~90 % of finer covered)
  • improvements in documentation based on feedback
  • big thanks to Patreons and GitHub sponsors for continued support
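
The canonical TSV ordering mentioned in the list above can be illustrated with a minimal sketch. It assumes rows are compared as whole lines by Unicode code point, which is how Python's built-in sort orders strings regardless of locale. The function name `sort_tsv_lines`, the header convention and the sample rows are made up for illustration; this is not omorfi's actual script.

```python
def sort_tsv_lines(lines, has_header=True):
    """Return TSV lines in a locale-independent order.

    Hypothetical helper: keeps the first line in place when it is a header
    and sorts the remaining lines by Unicode code point.
    """
    # Drop trailing newlines and skip empty lines.
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if not lines:
        return []
    if has_header:
        header, body = lines[:1], lines[1:]
    else:
        header, body = [], lines
    # sorted() compares str values by code point and does not consult
    # LC_COLLATE, so the result does not depend on the OS or locale.
    return header + sorted(body)


# Made-up sample rows (column and paradigm names are illustrative only).
rows = [
    "lemma\tparadigm",
    "äiti\tN_EXAMPLE_A",
    "talo\tN_EXAMPLE_B",
    "Air\tN_EXAMPLE_C",
]
sorted_rows = sort_tsv_lines(rows)
```

With code-point ordering, uppercase ASCII sorts before lowercase ASCII, and non-ASCII letters such as "ä" sort after both, so the order differs from Finnish dictionary order but is reproducible everywhere.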

Please note that the git branches have changed: the old develop is now main, and the old master or stable is oldstable and will remain unused.

20200511 release

28 Jul 03:32

Significant changes in 20200511

  • Universal dependencies version 2.6 compatible
  • 3021 new words, some related to 2020
  • preliminary support for pip / pypi / venv
  • new logo
  • next version will use the semantic versioning scheme, and a few breaks in the API are to be expected

20191111

20 Dec 17:18

Significant changes in 20191111

  • Universal dependencies version 2.5 is a reference for recall tests
  • 11,343 new words
  • Fixed ordinals as adjectives
  • Minor overhaul of documentation
  • Fixed injection vuln. in python OOV handling
  • Fixed tokeniser regression related to initial punctuations
  • No other big changes and no API changes

(Changed the model download name to match download helper's filename 2020-03-20)

Omorfi-20190511

16 May 16:22
  • Universal dependencies version 2.4 is a reference for recall tests
  • 2879 new words
  • No other big changes and no API changes

Release version for 20181111 / UD v. 2.3

18 Nov 18:23

This is a scheduled update release following the UD release. It also has lots of new lexical data, an API for python, support for parsing a sentence at a time instead of a word at a time (necessary for most future stuff), and a downloader (beta) for limited systems and for people who don't want to compile the language models.

Significant changes in 20181111

  • Universal dependencies version 2.3 is a reference for recall tests
  • At least 18,380 new words: 340,931 insertions(+), 322,551 deletions(-)
    • Imported enwikt data on top of re-importing fiwikt, joukahainen
  • New CG based on UD tags
  • Some universal dependencies guessed (analysers using dep guessing are slower
    and process sentences instead of words)
  • Default processing mode for many analysers is now sentence-based
  • Slightly extended python API (somewhat modeled like SpaCy but not quite)
  • Ability to download compiled FST models from release instead of
    self-compiling (beta)
  • Unimorph used as new Recall / Precision reference gold test set
  • Probably some fixes to recasing
  • Gradle support for java stuff
  • renamed origin unihu → finer
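
The "at least 18,380 new words" figure in the list above matches the net line change of the quoted diffstat (insertions minus deletions). That reading is an inference from the numbers, not something the notes state. A small sketch of the arithmetic, with a hypothetical function name:

```python
def net_line_change(insertions, deletions):
    """Net number of added lines in a diffstat (hypothetical helper)."""
    return insertions - deletions


# Figures quoted in the 20181111 notes: 340,931 insertions, 322,551 deletions.
net_20181111 = net_line_change(340_931, 322_551)
print(net_20181111)  # 18380
```

The net change is only a lower bound on new lexemes when edited lines count as one insertion plus one deletion, which is presumably why the notes say "at least".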

20181111-alpha test for download functionalities

23 Oct 16:18

This is not an actual release; we just uploaded a release beforehand to test an upcoming "download language models" functionality. Please expect a new release no earlier than 11.11.2018; it may also be significantly later.

Omorfi 2018-05-11 release (UD 2.2)

26 Jun 15:14

Significant changes in 20180511

  • Universal dependencies version 2.2 is now used as target
  • At least 226 new words: 239 additions and 13 deletions in lexeme database
  • Most changes are in development infra, so not visible to end users...
  • Started rewriting CG from scratch
  • The APIs for programming language deprecate load(filename) and load(dir) forms
    of filename guessing functions in favour of forthcoming loadAnalyser(file),
    loadLemmatiser(file), loadUDPipe(file) etc. etc. functions
  • Working towards more general tokenise-analyse-disambiguate pipelines maybe,
    or just refactoring
  • lots more automated tests -> far fewer human errors
  • By popular request: there are two analysers now, one with small dictionary
    and one with full, use the smaller one when you do not want to see birds or
    languages or tribes analysed. The smaller one replaces the old default, but
    the new tools will require you to select one explicitly anyways
  • fixes and workarounds: java and c++ can now be disabled partially or totally
  • adopted SG0 as possible verb form analysis from UD data
  • The end users are now provided with bash script wrappers for all
    functionalities, whereas the python versions typically allow more control
    of parameters

You can use the attached HFST language model if you cannot build the automata yourself; it mostly does case-insensitive matching. There's also an XML file modeled after kotus-sanalista.xml for reference.

Omorfi 20170515

13 Oct 10:42

A new version of omorfi with updated UD2 UPOSes has been added. Included in the release this time is the source code packaging, the kotus-sanalista XML format dictionary and also a large-coverage, case-ignorant analyser automaton (but you should really just compile and build your own).

  • Universal Dependencies version 2 is now used, still mainly lemma, UPOS,
    features fields are analysed
  • At least 2,336 new words (based on diffstat: 38886 additions, 3655 deletions)
  • Preliminary support for various guessing models: python-based, finite-state
    and UDPipe. This means that it is possible to get analyses for all tokens,
    albeit the quality of guesses varies.
  • A minimal C++ library version has been made to match java and python bindings.
    C++11 and libhfst are required.
  • The dix version can now be compiled with lttoolbox with a lot of memory
  • A restricted "gold" dictionary mode has been added. This is good for both end
    users with limited memory and end users who require higher quality lexemes
    (i.e., only research institute approved, no wiktionary words or other weird
    stuff)
  • Documentation and automatic testing have been much reworked with the new
    modern toys from github: travis-ci, jekyll
  • Started weeding the ADP/ADV jungle...
  • Fixed a horrible bug in the corpus coverage testing that terribly
    under-estimated our coverage for corpora where hapax legomena etc. were
    ignored
  • A lot of documentation has been semi-automated, therefore many changes can be
    viewed at the new gh-pages site: https://flammie.github.io/omorfi/

20161115 or UD 1.4

22 Nov 17:57

An omorfi version following the new UD release has been tagged; stay tuned for updates in statistics etc. when I have more time.

Significant changes in 20161115

  • Started drafting more blacklists and known good lexemes subsets for people
    who struggle with rare words and productive compounding, derivation
  • Updated to Universal Dependencies version 1.4
  • A lot of new derivations by the way
  • Preliminary guessers
  • More loopy guessery things for punctuation and digit combos
  • Minor fixes to UD feature sorting
  • Homonym numbers used in some applications
  • Added timeouts where downstream tools support them, so tools don't seem like
    they are freezing at random
  • moved old documentation to github-pages
  • added preliminary hfst-pmatch-based tokeniser