corpus-cleaner

Overview

Welcome to my repository!
This repository is a C++ library that provides quality filtering, deduplication, and removal of unnecessary vocabulary for Japanese corpora.
Its features are as follows.

  • Normalizer: Sentence normalization based on the mecab-neologd rules
  • URL Remover: Remove URLs matched by a regular expression
  • Special Characters Remover: Remove certain special characters (☀, ♡, ☆, etc.)
  • Emoji Remover: Remove emoji characters in the range U+1F300 to U+1F9FF (a sketch follows this list)
  • Quotes Remover: Remove quotes such as [1] and {245}
  • Length Filter: Remove sentences that are too long or too short
  • Language Filter: Determine whether a document is Japanese
  • Minhash Deduplicator: Deduplication using MinHash
  • ZeroPunctuationFilter: Remove documents without punctuation
  • NounRatioFilter: Remove documents in which more than 80% of the words are nouns, as judged by morphological analysis
  • Sentence Segmenter: Divide the corpus into sentences based on rules
  • Perplexity Filter: Perplexity filtering using KenLM
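
As a concrete illustration of the Emoji Remover rule, here is a minimal standalone sketch that applies the U+1F300 to U+1F9FF range check to a UTF-32 string. It only demonstrates the idea; it is not the repository's actual EmojiRemover implementation.

#include <iostream>
#include <string>

// Drop code points in the emoji block U+1F300..U+1F9FF,
// the range named in the feature list above.
std::u32string RemoveEmoji(const std::u32string &text)
{
    std::u32string result;
    for (char32_t c : text) {
        if (c >= 0x1F300 && c <= 0x1F9FF) continue;  // skip emoji
        result.push_back(c);
    }
    return result;
}

int main()
{
    // U"..." is a UTF-32 literal; 🌞 is U+1F31E, inside the emoji block.
    std::u32string input = U"今日は晴れ🌞です";
    std::u32string cleaned = RemoveEmoji(input);
    std::cout << "removed " << (input.size() - cleaned.size())
              << " emoji code point(s)" << std::endl;
    return 0;
}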

Getting Started

Clone Repository

git clone https://github.com/ce-lery/corpus-cleaner.git
cd corpus-cleaner

Install Step

Docker

Build an environment using the Dockerfile.

docker build -t corpus-cleaner-image ./
docker run -v ./:/home/corpus-cleaner/ -it --gpus all corpus-cleaner-image

Other (Local Install)

sudo apt-get update
sudo apt-get install cmake gdb libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev pkg-config curl wget build-essential nano flex bison

Common Step

Run the following shell script.
This script installs the third-party libraries.

bash scripts/setup.sh

Build the source code of corpus-cleaner.

bash scripts/build.sh

Place the files to be cleaned in "./results/dataset/original". The file format is ".txt": for example, "wiki.txt", "cc100_train.txt", and so on.

mkdir -p results/dataset/original/
# Please place the files to be cleaned in "./results/dataset/original".

Run corpus_cleaner. This may take a while.

bash scripts/run.sh

The cleaning result files will be created in "results/data/cleaned".
The file format is jsonl.
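
For reference, a jsonl file stores one JSON object per line. The record below is only a hypothetical illustration of what a cleaned line might look like; the field names ("text", "is_removed") are assumptions based on the TODO section, not a confirmed schema.

{"text": "今日は晴れです。", "is_removed": false}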

Specification

The documentation is here.
If you want to see this tool's specification and API reference, please refer here.

Usage

Basic Usage

The basic usage of corpus-cleaner is the same as in Getting Started.

Select Filtering Feature

If you want to disable the Sentence Segmenter, set the constructor argument execute_sentence_segment to false when creating an instance of the CorpusCleaner class.

CorpusCleaner corpus_cleaner(input_folder_path,
                             output_folder_path,
                             min_length,
                             max_length,
                             accept_language,
                             store_rejected,
                             execute_sentence_segment, // <--------- switch here to false
                             language_threshold,
                             perplexity_threshold,
                             &generate_dedup_lsh,
                             &deduplicator);
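
For context, here is a minimal sketch of one way this call might be set up. Every concrete value is an illustrative assumption, not a default confirmed by this README, and the construction of the two deduplication helpers is omitted.

// All values below are illustrative assumptions, not confirmed defaults.
string input_folder_path  = "./results/dataset/original/";
string output_folder_path = "./results/data/cleaned/";
uint32_t min_length = 5;                  // Length Filter lower bound (assumed)
uint32_t max_length = 5000;               // Length Filter upper bound (assumed)
string accept_language = "__label__ja";   // fastText label for Japanese (assumed type and value)
bool store_rejected = true;               // also keep rejected documents (assumed)
bool execute_sentence_segment = false;    // disables the Sentence Segmenter
double language_threshold = 0.3;          // Language Filter threshold (assumed)
double perplexity_threshold = 40000;      // Perplexity Filter threshold (assumed)
// generate_dedup_lsh and deduplicator are the deduplication helpers from
// the snippet above; their construction is omitted here.

CorpusCleaner corpus_cleaner(input_folder_path,
                             output_folder_path,
                             min_length,
                             max_length,
                             accept_language,
                             store_rejected,
                             execute_sentence_segment,
                             language_threshold,
                             perplexity_threshold,
                             &generate_dedup_lsh,
                             &deduplicator);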

If you want to disable a filter, comment out the corresponding filter function in the variable cleaner_list.

int32_t CorpusCleaner::CleanPipeline(void)
{
    // Set the CorpusCleaner processes that will be executed.
    // They will be executed in the order you set them.
    vector<void (CorpusCleaner::*)(Document &)> cleaner_list = { 
        &CorpusCleaner::Normalizer,
        &CorpusCleaner::URLRemover,
        &CorpusCleaner::EmojiRemover, 
        // &CorpusCleaner::SpecialCharacterRemover,
        // &CorpusCleaner::QuotesRemover,         // <- If you comment out a function in
        // &CorpusCleaner::LengthFilter,          // <- cleaner_list, it is disabled.
        // &CorpusCleaner::ZeroPunctuationFilter,
        &CorpusCleaner::LanguageFilter,
        &CorpusCleaner::MinhashDeduplication,
        &CorpusCleaner::PerplexityFilter,
    }; 
    // ~ omit ~
}
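
After editing cleaner_list, rebuild and rerun the tool with bash scripts/build.sh and bash scripts/run.sh, as in Getting Started.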

This step may change in the future.

License

This repository is licensed under the Apache License, Version 2.0.

Third Party Library

This repository uses the following third-party libraries.
Please note their licenses.

  • icu (Unicode License v3): for NFKC normalization in the Normalizer.
  • kenlm (LGPL): for perplexity filtering. Since this tool is not embedded in this repository (it is installed when used), I think this repository is not covered by the LGPL, judging also from the fact that cc_net, which uses KenLM as well, is under the MIT license.
  • SentencePiece (Apache-2.0): for tokenization in perplexity filtering.
  • smhasher (MIT): for hash value generation in MinHash processing.
  • simdjson (Apache-2.0): for jsonl parsing.
  • jagger (BSD 2-Clause): for Japanese morphological analysis.
  • fastText (MIT): for language filtering.
  • GoogleTest (BSD-3-Clause): for tests.
  • doxygen (GPL-2.0): for documentation. The GPL does not apply to works produced by doxygen.

Test

bash scripts/test.sh

Contribution

We welcome your contributions to this repository. To contribute, please see CONTRIBUTING.md.

TODO

ver.0.1.0

  • Set up GitHub Actions CI/CD (build)
  • Morphological analysis (by jagger)
  • Write documentation & generate doxygen pages

ver.0.2.0

  • Remove the MinHash blacklist
  • (Implement pybind & Python code)
  • (Implement a loader for json & jsonl)
  • (Remove ad headers and footers)
  • (Remove HTML markup)
  • (Implement dumping .txt format files (only is_removed=false))
  • (Remove repeated expressions)

ver.0.3.0

  • Speedup?
