Welcome to my repository!
This repository is a C++ library that provides quality filtering, deduplication, and unnecessary-vocabulary removal for Japanese corpora.
The features are as follows.
- Normalizer: Sentence normalization based on the normalization rules of mecab-neologd
- URL Remover: Remove URLs that match a regular expression
- Special Characters Remover: Remove certain special characters (☀, ♡, ☆, etc.)
- Emoji Remover: Remove emoji characters in the range U+1F300 to U+1F9FF
- Quotes Remover: Remove quote markers such as [1] and {245}
- Length Filter: Remove sentences that are too long or too short
- Language Filter: Determine whether each document is Japanese
- Minhash Deduplicator: Deduplication using Minhash
- ZeroPunctuationFilter: Remove documents without punctuation
- NounRatioFilter: Remove documents in which more than 80% of the tokens are nouns, determined by morphological analysis
- Sentence Segmenter: Divide the corpus into sentences based on rules
- Perplexity Filter: Perplexity filtering using KenLM
git clone https://github.com/ce-lery/corpus-cleaner.git
cd corpus-cleaner
Build the environment using the Dockerfile.
docker build -t corpus-cleaner-image ./
docker run -v ./:/home/corpus-cleaner/ -it --gpus all corpus-cleaner-image
sudo apt-get update
sudo apt-get install cmake gdb libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev pkg-config curl wget build-essential nano flex bison
Run the shell script with the following command.
This script installs the third-party libraries.
bash scripts/setup.sh
Build the source code of corpus-cleaner.
bash scripts/build.sh
Please place the files to be cleaned in "./results/dataset/original". The file format is ".txt" (for example, "wiki.txt" or "cc100_train.txt").
mkdir -p results/dataset/original/
# Please place the files to be cleaned in "./results/dataset/original".
Run corpus_cleaner. This may take a while.
bash scripts/run.sh
The cleaned files will be created in "results/data/cleaned".
The output file format is JSONL.
The documentation is here.
If you want to see this tool's specifications and API reference, please refer to it.
The basic usage of corpus-cleaner is the same as in Getting Started.
If you want to disable the Sentence Segmenter, set "bool sentence_segment=false" and create an instance of the CorpusCleaner class.
CorpusCleaner corpus_cleaner(input_folder_path,
output_folder_path,
min_length,
max_length,
accept_language,
store_rejected,
execute_sentence_segment, // <--------- switch here to false
language_threshold,
perplexity_threshold,
&generate_dedup_lsh,
&deduplicator);
If you want to disable a filter, comment out the corresponding filter function in the "cleaner_list" variable of CorpusCleaner::CleanPipeline.
int32_t CorpusCleaner::CleanPipeline(void)
{
// Set CorpusCleaner process that will be executed.
// They will be executed in the order you set them.
vector<void (CorpusCleaner::*)(Document &)> cleaner_list = {
&CorpusCleaner::Normalizer,
&CorpusCleaner::URLRemover,
&CorpusCleaner::EmojiRemover,
// &CorpusCleaner::SpecialCharacterRemover,
// &CorpusCleaner::QuotesRemover,          // <- If you comment out a function in
// &CorpusCleaner::LengthFilter,           // <- cleaner_list, that step is disabled.
// &CorpusCleaner::ZeroPunctuationFilter,
&CorpusCleaner::LanguageFilter,
&CorpusCleaner::MinhashDeduplication,
&CorpusCleaner::PerplexityFilter,
};
// ~ omitted ~
}
This step may change in the future.
This repository is licensed under the Apache License, Version 2.0.
This repository uses the following third-party libraries.
Please note their licenses.
Library | License | Purpose |
---|---|---|
icu | UNICODE LICENSE V3 | For NFKC normalization of Normalizer. |
kenlm | LGPL license | For perplexity filtering. Since this tool is not embedded in this repository (it is installed only when used), I believe this repository is not covered by the LGPL, judging from the fact that cc_net, which also uses KenLM, is under the MIT license. |
SentencePiece | Apache-2.0 license | For tokenization in perplexity filtering. |
smhasher | MIT license | For hash value generation in Minhash processing. |
simdjson | Apache-2.0 license | For jsonl parsing. |
jagger | BSD 2-Clause license | For Japanese morphological analysis. |
fastText | MIT license | For language filtering. |
GoogleTest | BSD-3-Clause license | For tests. |
doxygen | GPL-2.0 license | For documentation. This license does not apply to works produced by doxygen. |
Run the tests with the following command.
bash scripts/test.sh
We welcome your contributions to this repository. To contribute, please see CONTRIBUTING.md.
- Set up GitHub Actions CI/CD (build)
- Morphological analysis (by jagger)
- Write document & create doxygen
- Remove blacklist of minhash
- (Implement pybind & python code)
- (Implement loader json & jsonl)
- (Remove ad header and footer)
- (Remove HTML mark)
- (Implement dumping a .txt-format file (only is_removed=false))
- (Remove repeated expressions)
- Speedup?