A Japanese parser (including support for historical Japanese)
First, download and install the appropriate UniDic dictionary:
monaka download wabun
Available dictionaries:
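name | description |
---|---|
gendai | Contemporary written Japanese (現代書き言葉) |
spoken | Contemporary spoken Japanese (現代話し言葉) |
novel | Modern and contemporary colloquial novels (近現代口語小説) |
qkana | Colloquial Japanese in historical kana orthography (旧仮名口語) |
kindai | Modern literary Japanese (近代文語) |
kinsei | Early modern Edo colloquial Japanese (近世江戸口語) |
kyogen | Medieval colloquial Japanese (中世口語) |
wakan | Medieval literary Japanese (中世文語) |
wabun | Early Middle Japanese prose (中古和文) |
manyo | Old Japanese (上代語) |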
Then, call the parse command:
monaka parse {model} 今日はいい天気ですね
Output:
{
  "tokens": [
    "今日",
    "は",
    "いい",
    "天気",
    "です",
    "ね"
  ],
  "pos": [
    "名詞-普通名詞-副詞可能",
    "助詞-係助詞",
    "形容詞-非自立可能-形容詞",
    "名詞-普通名詞-一般",
    "助動詞-助動詞-デス",
    "助詞-終助詞"
  ],
  "luw": [
    "名詞-普通名詞-一般",
    "助詞-係助詞",
    "形容詞-一般-形容詞",
    "名詞-普通名詞-一般",
    "助動詞-助動詞-デス",
    "助詞-終助詞"
  ],
  "chunk": [
    "B",
    "I",
    "B",
    "B",
    "I",
    "I"
  ],
  "sentence": "今日はいい天気ですね"
}
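The JSON output consists of parallel lists with one entry per token, so it can be read with Python's standard `json` module. The following is a minimal sketch, assuming the output above has been captured as a string; the `rows` helper is hypothetical and not part of monaka.

```python
import json

# The example output shown above, written compactly.
raw = """
{
  "tokens": ["今日", "は", "いい", "天気", "です", "ね"],
  "pos": ["名詞-普通名詞-副詞可能", "助詞-係助詞", "形容詞-非自立可能-形容詞",
          "名詞-普通名詞-一般", "助動詞-助動詞-デス", "助詞-終助詞"],
  "luw": ["名詞-普通名詞-一般", "助詞-係助詞", "形容詞-一般-形容詞",
          "名詞-普通名詞-一般", "助動詞-助動詞-デス", "助詞-終助詞"],
  "chunk": ["B", "I", "B", "B", "I", "I"],
  "sentence": "今日はいい天気ですね"
}
"""


def rows(result):
    """Zip the parallel lists into one (token, pos, luw, chunk) tuple per token.

    Hypothetical helper: it only assumes the four list fields shown above.
    """
    keys = ("tokens", "pos", "luw", "chunk")
    lengths = {len(result[key]) for key in keys}
    if len(lengths) != 1:
        raise ValueError("parallel lists differ in length")
    return list(zip(*(result[key] for key in keys)))


result = json.loads(raw)
for token, pos, luw, chunk in rows(result):
    # One tab-separated line per token.
    print(f"{token}\t{pos}\t{luw}\t{chunk}")
```

Checking the list lengths before zipping avoids silently dropping trailing tokens if a field is ever shorter than the others.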
You can specify the output format ("bunsetsu-split" or "luw-split"):
monaka parse {model} 今日はいい天気ですね --output-format bunsetu-split
今日は いい 天気ですね
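The same segmentation can be reproduced from the JSON output. The sketch below assumes that a "B" in the `chunk` field marks the first token of a bunsetsu and an "I" marks a continuation; this reading is inferred from the example outputs, and `split_bunsetsu` is a hypothetical helper, not a monaka function.

```python
def split_bunsetsu(tokens, chunk):
    """Group tokens into bunsetsu, starting a new group at every "B" label.

    Assumption: "B" opens a bunsetsu and "I" continues the current one.
    """
    groups = []
    for token, label in zip(tokens, chunk):
        if label == "B" or not groups:
            groups.append([token])    # open a new bunsetsu
        else:
            groups[-1].append(token)  # extend the current bunsetsu
    return ["".join(group) for group in groups]


tokens = ["今日", "は", "いい", "天気", "です", "ね"]
chunk = ["B", "I", "B", "B", "I", "I"]
print(" ".join(split_bunsetsu(tokens, chunk)))  # 今日は いい 天気ですね
```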
The author will provide a trained model upon request. Please contact the author.
A dataset should be in JSON Lines (JSONL) format, and each line should contain the following fields:
{
  "sentence": "str",
  "tokens": ["a list of SUW"],
  "pos": ["POS-tag labels for each SUW"],
  "labels": ["Target labels for each SUW"]
}
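If you build a dataset by hand instead of converting it, a record can be checked before it is written. This is a minimal sketch using only the standard library; `validate_record` and `dumps_jsonl` are hypothetical helpers, and the "B"/"I" values under `labels` are placeholders, since the actual label set depends on your task.

```python
import json

REQUIRED_FIELDS = ("sentence", "tokens", "pos", "labels")


def validate_record(record):
    """Check that all fields exist and the per-SUW lists are the same length."""
    missing = [key for key in REQUIRED_FIELDS if key not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    n_tokens = len(record["tokens"])
    for key in ("pos", "labels"):
        if len(record[key]) != n_tokens:
            raise ValueError(f"'{key}' has {len(record[key])} items, expected {n_tokens}")
    return record


def dumps_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(
        # ensure_ascii=False keeps Japanese text readable in the file.
        json.dumps(validate_record(record), ensure_ascii=False) + "\n"
        for record in records
    )


record = {
    "sentence": "今日は",
    "tokens": ["今日", "は"],
    "pos": ["名詞-普通名詞-副詞可能", "助詞-係助詞"],
    "labels": ["B", "I"],  # placeholder labels, for illustration only
}
print(dumps_jsonl([record]), end="")
```

The resulting string can be written to a `.jsonl` file opened with `encoding="utf-8"`.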
We provide a data conversion script for UD-Japanese data. Here is an example command that converts the UD-Japanese-GSD training data:
monaka_train ud2jsonl ja_gsd-ud-train.conllu ja_gsd-ud-train.jsonl
After creating the dataset files, create the label and POS-tag dictionaries:
monaka_train create-vocab [output_dir] ja_gsd-ud-train.jsonl ja_gsd-ud-dev.jsonl ja_gsd-ud-test.jsonl