ayaka14732/wordshk-parallel-corpus

Words.hk Cantonese-English Parallel Corpus

Design

TODO

Project Structure

all (41859) -> minus15 (29487)
            |
            -> plus15 -> train (9372)
                      |
                      -> dev (1500)
                      |
                      -> test (1500)
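
The tree above can be read as a two-stage partition. The following is only a minimal sketch of that idea, not the repository's actual scripts. It assumes that "15" is a length threshold on the Cantonese sentence (in characters) and that dev and test are drawn at random from plus15; neither is stated in this README. The function names, the `threshold` and `seed` parameters, and the `(cantonese, english)` tuple layout are hypothetical.

```python
import random


def split_by_length(pairs, threshold=15):
    """Partition (cantonese, english) pairs into (minus, plus).

    ASSUMPTION: "15" means a character-length threshold on the Cantonese
    side; the README does not say what the number refers to.
    """
    minus, plus = [], []
    for yue, eng in pairs:
        (plus if len(yue) >= threshold else minus).append((yue, eng))
    return minus, plus


def split_train_dev_test(pairs, dev_size=1500, test_size=1500, seed=42):
    """Shuffle with a fixed seed, then carve out dev and test; the rest is train.

    ASSUMPTION: random sampling with a fixed seed; the real script may
    select dev/test differently.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


# The counts in the tree are mutually consistent:
# minus15 + train + dev + test == all
assert 29487 + 9372 + 1500 + 1500 == 41859
```

Fixing the seed keeps the dev and test sets stable across rebuilds, which matters if results are compared between runs.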

Build

Download the latest version of words.hk data from the download page. Then run:

gzip -d all-*.csv.gz
python extract.py
python split_train_dev_test.py
python split_15.py
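
On systems without a `gzip` command line tool, the decompression step can be approximated with the Python standard library. This is a sketch under stated assumptions: `decompress_all` is a hypothetical helper, not part of this repository, and unlike `gzip -d` it leaves the original `.gz` file in place.

```python
import glob
import gzip
import shutil


def decompress_all(pattern="all-*.csv.gz"):
    """Decompress every file matching `pattern` next to its source.

    Hypothetical helper: writes `name.csv` for each `name.csv.gz`
    and returns the list of files written. The .gz files are kept.
    """
    outputs = []
    for src in sorted(glob.glob(pattern)):
        dst = src[:-len(".gz")]  # strip the trailing ".gz"
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        outputs.append(dst)
    return outputs


if __name__ == "__main__":
    for path in decompress_all():
        print("wrote", path)
```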

Special Credits

About

A Cantonese-English parallel corpus extracted from words.hk
