
Tokenizer Benchmarks

This repository has scripts for benchmarking Japanese tokenizers. It was originally created to make sure that fugashi wasn't slower than mecab-python3.

The benchmark task is to build a word frequency count of the Aozora Bunko edition of "I Am a Cat" (stored in the wagahai.txt file) using a Counter object.
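
For concreteness, here is a minimal sketch of that task using fugashi. The details (file handling, script name) are assumptions; the repo's actual benchmark-fugashi.py may differ:

#!/usr/bin/env python
# Sketch of the counting task with fugashi (assumed details,
# not necessarily the repo's exact script).
from collections import Counter
from fugashi import Tagger

tagger = Tagger()
counts = Counter()
with open("wagahai.txt", encoding="utf-8") as f:
    for line in f:
        # Calling the tagger on a string yields word objects;
        # .surface is the raw token text.
        counts.update(word.surface for word in tagger(line))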

I suggest using hyperfine for benchmarking, though any tool that can time the scripts will work.

To run:

# install hyperfine with your OS package manager
pip install -r requirements.txt
# benchmark each script; -w 10 runs 10 warmup iterations before timing
hyperfine -w 10 ./bench*.py
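
Each benchmark-*.py script implements the same counting task with a different tokenizer. As an illustration, a mecab-python3 version might look like the sketch below; again, this is an assumed shape, not necessarily the repo's exact script:

#!/usr/bin/env python
# Sketch of the same task with mecab-python3 (assumed details).
from collections import Counter
import MeCab

# -Owakati makes parse() return space-separated surface forms.
tagger = MeCab.Tagger("-Owakati")
counts = Counter()
with open("wagahai.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(tagger.parse(line).split())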

Results on my machine:

Command                       Mean [ms]      Min [ms]  Max [ms]  Relative
./benchmark-fugashi.py        332.0 ± 6.2    324.6     344.2     1.00
./benchmark-mecab-python3.py  353.7 ± 4.4    347.1     360.0     1.07 ± 0.02
./benchmark-sudachi.py        662.0 ± 7.9    652.2     680.5     1.99 ± 0.04
./benchmark-natto.py          1119.2 ± 16.6  1095.6    1148.7    3.37 ± 0.08
