komiya-lab/monaka

Monaka

A Japanese parser (including support for historical Japanese)

Installation

Parse

First, download and install the appropriate UniDic dictionary:

monaka download wabun

Available dictionaries:

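name description
gendai 現代書き言葉 (contemporary written Japanese)
spoken 現代話し言葉 (contemporary spoken Japanese)
novel 近現代口語小説 (modern and contemporary colloquial novels)
qkana 旧仮名口語 (colloquial Japanese in historical kana orthography)
kindai 近代文語 (modern literary Japanese)
kinsei 近世江戸口語 (early modern Edo colloquial Japanese)
kyogen 中世口語 (medieval colloquial Japanese)
wakan 中世文語 (medieval literary Japanese)
wabun 中古和文 (Early Middle Japanese prose)
manyo 上代語 (Old Japanese)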

Then, call the parse command:

monaka parse {model} 今日はいい天気ですね

output:

{
  "tokens": [
    "今日",
    "は",
    "いい",
    "天気",
    "です",
    "ね"
  ],
  "pos": [
    "名詞-普通名詞-副詞可能",
    "助詞-係助詞",
    "形容詞-非自立可能-形容詞",
    "名詞-普通名詞-一般",
    "助動詞-助動詞-デス",
    "助詞-終助詞"
  ],
  "luw": [
    "名詞-普通名詞-一般",
    "助詞-係助詞",
    "形容詞-一般-形容詞",
    "名詞-普通名詞-一般",
    "助動詞-助動詞-デス",
    "助詞-終助詞"
  ],
  "chunk": [
    "B",
    "I",
    "B",
    "B",
    "I",
    "I"
  ],
  "sentence": "今日はいい天気ですね"
}

You can also specify an output format ("bunsetsu-split" or "luw-split"):

monaka parse {model} 今日はいい天気ですね --output-format bunsetu-split

今日は いい 天気ですね
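The "chunk" labels in the JSON output appear to follow a B/I scheme, where "B" starts a new bunsetsu and "I" continues the current one. The following is a minimal Python sketch, under that assumption, of how the split output could be rebuilt from the JSON fields. The function name group_by_chunk is hypothetical and is not part of monaka, and the token list is written out in full (an assumption) so that the tokens concatenate to the sentence.

```python
import json

# Sample mirroring the JSON output above. The token list is written out in
# full (an assumption) so that the tokens concatenate to the sentence.
output = json.loads("""
{
  "tokens": ["今日", "は", "いい", "天気", "です", "ね"],
  "chunk": ["B", "I", "B", "B", "I", "I"],
  "sentence": "今日はいい天気ですね"
}
""")


def group_by_chunk(tokens, chunk):
    """Merge tokens into units, starting a new unit at each "B" label.

    Hypothetical helper for illustration; not part of monaka.
    """
    units = []
    for token, label in zip(tokens, chunk):
        if label == "B" or not units:
            units.append(token)   # "B" opens a new unit
        else:
            units[-1] += token    # "I" extends the current unit
    return units


bunsetsu = group_by_chunk(output["tokens"], output["chunk"])
print(" ".join(bunsetsu))
```

With the sample above, the grouped units match the split output shown for the command.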

Model download

The author will provide a trained model upon request. Please contact the author.

Training monaka model

LUW and Bunsetsu tokenizer/chunker

Creating Dataset

A dataset should be in JSON Lines (JSONL) format, and each line should contain the following fields:

    {
        "sentence": "str", 
        "tokens": ["a list of SUW"],
        "pos": ["POS-tag labels for each SUW"],
        "labels": ["Target labels for each SUW"]
    }
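Before training, it can help to check that each record has the expected shape. The sketch below is a hypothetical validator, not part of monaka. It assumes, based on the "for each SUW" wording above, that "pos" and "labels" must have exactly one entry per token. The B/I label values in the sample are also only illustrative.

```python
import json

REQUIRED_FIELDS = ("sentence", "tokens", "pos", "labels")


def validate_record(line):
    """Parse one JSONL line and return a list of problems (empty if none).

    Hypothetical helper for illustration; not part of monaka.
    """
    record = json.loads(line)
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems
    n = len(record["tokens"])
    # Assumption: one POS tag and one target label per SUW token.
    for name in ("pos", "labels"):
        if len(record[name]) != n:
            problems.append(f"{name} has {len(record[name])} items, expected {n}")
    return problems


# Illustrative records; the label values are hypothetical.
good = json.dumps({
    "sentence": "今日は",
    "tokens": ["今日", "は"],
    "pos": ["名詞-普通名詞-副詞可能", "助詞-係助詞"],
    "labels": ["B", "I"],
}, ensure_ascii=False)
bad = json.dumps({
    "sentence": "今日は",
    "tokens": ["今日", "は"],
    "pos": ["名詞-普通名詞-副詞可能", "助詞-係助詞"],
    "labels": ["B"],
}, ensure_ascii=False)

print(validate_record(good))
print(validate_record(bad))
```

Running such a check over every line of a dataset file before training is one way to catch misaligned records early.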

We provide a data conversion script for UD-Japanese data. Here is an example command to convert the UD-Japanese-GSD training data:

monaka_train ud2jsonl ja_gsd-ud-train.conllu ja_gsd-ud-train.jsonl

After creating the dataset files, create the label and POS-tag dictionaries:

monaka_train create-vocab [output_dir] ja_gsd-ud-train.jsonl ja_gsd-ud-dev.jsonl ja_gsd-ud-test.jsonl
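To illustrate what a label or POS-tag dictionary is conceptually, the sketch below collects the distinct values of a field across JSONL records and assigns each an integer id. This is only an illustration of the idea: the file layout and ordering that monaka_train create-vocab actually writes are not described here, and build_vocab is a hypothetical name.

```python
import json
from collections import Counter


def build_vocab(jsonl_lines, field):
    """Map each distinct value of `field` to an integer id.

    Hypothetical helper for illustration; the real create-vocab output
    format may differ.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if line:  # skip blank lines
            counts.update(json.loads(line)[field])
    # Most frequent first; ties broken by string order for stability.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}


# Illustrative records; the label values are hypothetical.
lines = [
    '{"sentence": "今日は", "tokens": ["今日", "は"], '
    '"pos": ["名詞", "助詞"], "labels": ["B", "I"]}',
    '{"sentence": "天気ですね", "tokens": ["天気", "です", "ね"], '
    '"pos": ["名詞", "助動詞", "助詞"], "labels": ["B", "I", "I"]}',
]

label_vocab = build_vocab(lines, "labels")
pos_vocab = build_vocab(lines, "pos")
print(label_vocab)
print(pos_vocab)
```

Building the vocabulary from the train, dev, and test files together, as in the command above, means that every label seen in any split receives an id.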

Creating training configuration JSON file