Skip to content

DEPRECATED: now part of vislcg3. Splits constraint grammar-formatted multiwords created by hfst-tokenise

License

Notifications You must be signed in to change notification settings

unhammer/cg-mwesplit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cg-mwesplit

https://travis-ci.org/unhammer/cg-mwesplit.svg

Description

This program reads input in Constraint Grammar format, and splits special “multiword cohorts” into separate cohorts, leaving other cohorts and intervening blanks as they were.

For examples of input/output, see the files in test/, e.g. test/input.simple.cg.

Prerequisites

A C++ compiler that goes all the way to 11.

Tested with gcc-5.2.0, gcc-5.3.1 and clang-703.0.29.

(Should work all the way down to gcc-4.9, but will fail with e.g. gcc-4.8.4 or clang-3.5.0.)

Building

./autogen.sh
./configure  # optionally with argument --prefix=$HOME/my/prefix
make
make install # with sudo if you didn't specify a prefix

Usage

Takes no options, just stdin and stdout:

cg-mwesplit < infile > outfile

More typically, it’ll be in a pipeline after hfst-tokenise and some step that disambiguates multiwords using vislcg3:

echo words go here | hfst-tokenise --gtd tokeniser.pmhfst | vislcg3 -g mwe-dis.cg3 | cg-mwesplit

Troubleshooting

If you get

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error

then your C++ compiler is too old. See Prerequisites.

About

DEPRECATED: now part of vislcg3. Splits constraint grammar-formatted multiwords created by hfst-tokenise

Topics

Resources

License

Stars

Watchers

Forks

Packages