Welcome to my repository!
This repository is a C++ library that provides quality filtering, deduplication, and unnecessary-vocabulary removal for Japanese corpora.
The features are as follows.
- Normalizer: Sentence normalization based on the normalization rules of mecab-neologd
- URL Remover: Remove URLs that match a regular expression
- Special Characters Remover: Remove certain special characters (☀, ♡, ☆, etc.)
- Emoji Remover: Remove emoji characters in the range U+1F300 to U+1F9FF
- Quotes Remover: Remove quote markers such as [1] and {245}
- Length Filter: Remove sentences that are too long or too short
- Language Filter: Determine whether each document is Japanese
- Minhash Deduplicator: Deduplication using Minhash
- ZeroPunctuationFilter: Remove documents without punctuation
- NounRatioFilter: Remove documents in which more than 80% of the tokens are nouns, determined by morphological analysis
- Sentence Segmenter: Divide the corpus into sentences based on rules
- Perplexity Filter: Perplexity filtering using KenLM
git clone https://github.com/ce-lery/corpus-cleaner.git
cd corpus-cleaner
Build the environment using the Dockerfile.
docker build -t corpus-cleaner-image ./
docker run -v ./:/home/corpus-cleaner/ -it --gpus all corpus-cleaner-image
sudo apt-get update
sudo apt-get install cmake gdb libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev pkg-config curl wget build-essential nano flex bison
Run the shell script with the following command.
This script installs the third-party libraries.
bash scripts/setup.sh
Build the source code of corpus-cleaner.
bash scripts/build.sh
Please place the files to be cleaned in "./results/dataset/original". The file format is ".txt" (for example, "wiki.txt" or "cc100_train.txt").
mkdir -p results/dataset/original/
# Please place the files to be cleaned in "./results/dataset/original".
Run corpus_cleaner. This may take a while.
bash scripts/run.sh
The cleaned files will be created in "results/data/cleaned".
The output file format is JSONL.
The documentation is here.
If you want to see this tool's specifications and API reference, please refer to it.
The basic usage of corpus-cleaner is the same as in Getting Started.
If you want to disable the Sentence Segmenter, set "bool sentence_segment=false" and create an instance of the CorpusCleaner class.
CorpusCleaner corpus_cleaner(input_folder_path,
output_folder_path,
min_length,
max_length,
accept_language,
store_rejected,
execute_sentence_segment, // <--------- switch here to false
language_threshold,
perplexity_threshold,
&generate_dedup_lsh,
&deduplicator);
If you want to disable a filter, comment out the corresponding filter function in the "cleaner_list" variable of CorpusCleaner::CleanPipeline.
int32_t CorpusCleaner::CleanPipeline(void)
{
// Set CorpusCleaner process that will be executed.
// They will be executed in the order you set them.
vector<void (CorpusCleaner::*)(Document &)> cleaner_list = {
&CorpusCleaner::Normalizer,
&CorpusCleaner::URLRemover,
&CorpusCleaner::EmojiRemover,
// &CorpusCleaner::SpecialCharacterRemover,
// &CorpusCleaner::QuotesRemover,          // <- If you comment out a function in
// &CorpusCleaner::LengthFilter,           // <- cleaner_list, that step is disabled.
// &CorpusCleaner::ZeroPunctuationFilter,
&CorpusCleaner::LanguageFilter,
&CorpusCleaner::MinhashDeduplication,
&CorpusCleaner::PerplexityFilter,
};
// ~ omitted ~
}
This step may change in the future.
This repository is licensed under the Apache License, Version 2.0.
This repository uses the following third-party libraries.
Please note their licenses.
Library | License | Purpose |
---|---|---|
icu | UNICODE LICENSE V3 | For NFKC normalization of Normalizer. |
kenlm | LGPL license | For perplexity filtering. Since this tool is not embedded in this repository (it is installed only when used), I believe this repository is not covered by the LGPL, judging from the fact that cc_net, which also uses KenLM, is under the MIT license. |
SentencePiece | Apache-2.0 license | For tokenization in perplexity filtering. |
smhasher | MIT license | For hash value generation in Minhash processing. |
simdjson | Apache-2.0 license | For jsonl parsing. |
jagger | BSD 2-Clause license | For Japanese morphological analysis. |
fastText | MIT license | For language filtering. |
GoogleTest | BSD-3-Clause license | For tests. |
doxygen | GPL-2.0 license | For documentation. This license does not apply to works produced by doxygen. |
Run the tests with the following command.
bash scripts/test.sh
We welcome your contributions to this repository. To contribute, please see CONTRIBUTING.md.
- Set up GitHub Actions CI/CD (build)
- Morphological analysis (by jagger)
- Write document & create doxygen
- Remove blacklist of minhash
- (Implement pybind & python code)
- (Implement loader json & jsonl)
- (Remove ad header and footer)
- (Remove HTML mark)
- (Implement dumping a .txt-format file (only is_removed=false))
- (Remove repeated expressions)
- Speedup?