Decontamination

This directory contains scripts to decontaminate data using:

Exact substring matching: find_substrings.py
Near matching with MinHash and LSH: minhash.py (see also https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis/decontamination)

Exact substring matching

This script was used to decontaminate the BigCode training datasets for SantaCoder and StarCoder. For example, to produce StarCoderData, we removed files that contained docstrings or solutions from HumanEval and MBPP, docstrings from APPS, questions from GSM8K, or prompts from the DS-1000 benchmark.

pip install -r requirements.txt
python find_substrings.py --dataset-name bigcode/the-stack-subset-py-js-java-450k --output-dir /path/to/output --num-proc 32
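The idea behind exact substring matching can be sketched in a few lines. This is a minimal illustration, not the code in find_substrings.py: the function names (`normalize`, `find_contaminated`), the lowercase and whitespace normalization, and the sample strings (invented for the example, not real benchmark content) are all assumptions.

```python
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting changes do not hide a match.

    This normalization is an assumption of the sketch, not necessarily what
    find_substrings.py does.
    """
    return " ".join(text.lower().split())


def find_contaminated(
    files: Iterable[str], benchmark_strings: Dict[str, List[str]]
) -> List[dict]:
    """Return one record per file that contains at least one benchmark string."""
    # Normalize the benchmark strings once and skip empty ones, since an
    # empty string is a substring of every file.
    needles = {
        name: [n for n in (normalize(s) for s in strings) if n]
        for name, strings in benchmark_strings.items()
    }
    flagged = []
    for index, content in enumerate(files):
        haystack = normalize(content)
        hits = sorted(
            name
            for name, strings in needles.items()
            if any(s in haystack for s in strings)
        )
        if hits:
            flagged.append({"index": index, "benchmarks": hits})
    return flagged


if __name__ == "__main__":
    # Invented sample data, used only to exercise the sketch.
    benchmarks = {
        "benchmark_a": ["Return the sum of two numbers x and y"],
        "benchmark_b": ["How many apples does Tom have left?"],
    }
    files = [
        'def add(x, y):\n    """Return the sum of two\n    numbers x and y"""\n    return x + y\n',
        "def mul(x, y):\n    return x * y\n",
    ]
    print(find_contaminated(files, benchmarks))
```

A plain `in` check is quadratic in the worst case; for many benchmark strings a multi-pattern matcher would scale better, but the filtering logic stays the same.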

Using a cached decontamination run

The results of a previous decontamination run can be reused to speed up the script, provided that both of the following conditions hold:

The new dataset is a subset of (or equal to) the previously decontaminated dataset.
The new set of strings to decontaminate contains all the strings from the previous run. (The code does not yet support the case where some strings are no longer decontaminated.)

python find_substrings.py --dataset-name bigcode/stack-dedup-alt-filter-no-pii --output-dir /path/to/output  --num-proc 32 --cached-decontamination-dir /path/to/previous/output/ --cache-retrieval-key content --split-languages
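The two conditions are what make reuse safe: if every file was already checked against every previous string, only the newly added strings need to be searched, and only in files that were not already flagged. The sketch below illustrates that reasoning. It is not the logic of find_substrings.py; the function name, the key-to-content mapping, and the way cached results are represented are hypothetical.

```python
from typing import Dict, Set


def decontaminate_with_cache(
    files: Dict[str, str],
    new_strings: Set[str],
    cached_strings: Set[str],
    cached_flagged: Set[str],
    cached_keys: Set[str],
) -> Set[str]:
    """Return the keys of contaminated files, reusing a previous run.

    files          -- hypothetical mapping from a retrieval key to file content
    new_strings    -- strings to decontaminate in this run
    cached_strings -- strings used in the previous run
    cached_flagged -- keys the previous run flagged as contaminated
    cached_keys    -- keys of all files the previous run examined
    """
    # Condition 1: every file in the new dataset was seen by the previous run.
    if not set(files) <= cached_keys:
        raise ValueError("cache unusable: new dataset is not a subset of the cached one")
    # Condition 2: no previously used string has been dropped.
    if not cached_strings <= new_strings:
        raise ValueError("cache unusable: some cached strings are missing")

    only_new = new_strings - cached_strings
    flagged = set()
    for key, content in files.items():
        if key in cached_flagged:
            # Already known to contain one of the previous strings.
            flagged.add(key)
        elif any(s in content for s in only_new):
            # Clean for the previous strings, so only the new ones are checked.
            flagged.add(key)
    return flagged
```

If a string were removed from the set, a file flagged only because of that string would have to be re-examined, which is why the second condition is required.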

Near Matching with MinHash and LSH

Instead of looking for exact matches from the test sets, this section looks for near duplicates. The approach is similar to the near-deduplication script data_analysis/near-deduplication/minhash_deduplication_alt.py, with one modification: the index is built from the benchmark datasets rather than from the dataset itself.
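To make the idea concrete, here is a minimal, standard-library-only sketch of MinHash with LSH banding in which the index holds benchmark texts and dataset files are used as queries. It is not the implementation in minhash.py: the parameter values (`NUM_PERM`, `BANDS`, `NGRAM`, the 0.7 threshold), the function names, and the use of salted BLAKE2b as the hash family are assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NUM_PERM = 64             # signature length (assumed value)
BANDS = 16                # number of LSH bands (assumed value)
ROWS = NUM_PERM // BANDS  # signature rows per band
NGRAM = 3                 # token n-gram size used as shingles (assumed value)


def shingles(text: str) -> Set[str]:
    """Split text into lowercase word tokens and return the set of token n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < NGRAM:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + NGRAM]) for i in range(len(tokens) - NGRAM + 1)}


def signature(text: str) -> Tuple[int, ...]:
    """MinHash signature: for each salted hash function, the minimum over all shingles."""
    grams = shingles(text)
    if not grams:
        return ()
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(sig)


def build_index(benchmarks: Dict[str, str]):
    """Index benchmark texts (not the dataset) by the bands of their signatures."""
    index = defaultdict(set)
    sigs = {}
    for name, text in benchmarks.items():
        sig = signature(text)
        if not sig:
            continue
        sigs[name] = sig
        for band in range(BANDS):
            index[(band, sig[band * ROWS:(band + 1) * ROWS])].add(name)
    return index, sigs


def query(text: str, index, sigs, threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Return (benchmark name, estimated Jaccard similarity) for near matches of a file."""
    sig = signature(text)
    if not sig:
        return []
    candidates = set()
    for band in range(BANDS):
        candidates |= index.get((band, sig[band * ROWS:(band + 1) * ROWS]), set())
    matches = []
    for name in sorted(candidates):
        # The fraction of equal signature positions estimates Jaccard similarity.
        estimate = sum(a == b for a, b in zip(sig, sigs[name])) / NUM_PERM
        if estimate >= threshold:
            matches.append((name, estimate))
    return matches
```

Because only benchmark texts are indexed, the index stays small no matter how large the dataset being checked is; each dataset file is hashed once and looked up band by band.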

Usage:

Edit the DATASETS_TO_CHECK variable in minhash.py to include any benchmark (from the Hugging Face Hub) that you want to check against. Be sure to create a global variable for the index using the same name as in that config. Benchmark columns must be of type string or sequence of strings so that they can be concatenated.
Then run the script:

pip install -r requirements_minhash.txt
# Quick example
python minhash.py \
  --dataset codeparrot/codeparrot-clean-valid \
  --split train \
  --column content \
  --cache-dir .cache \
  --verbose
# Check parameters with the help message
python minhash.py --help
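The requirement that benchmark columns be strings or sequences of strings makes sense if the text to index is obtained by joining the selected columns of each benchmark record. A hedged sketch of such a step follows; `concat_columns`, the newline separator, and the sample record are hypothetical and are not taken from minhash.py.

```python
from typing import Mapping, Sequence, Union

ColumnValue = Union[str, Sequence[str]]


def concat_columns(
    record: Mapping[str, ColumnValue], columns: Sequence[str], sep: str = "\n"
) -> str:
    """Join the selected columns of one benchmark record into a single string."""
    parts = []
    for column in columns:
        value = record[column]
        if isinstance(value, str):
            parts.append(value)
            continue
        # Otherwise the column is expected to be a sequence of strings.
        for item in value:
            if not isinstance(item, str):
                raise TypeError(f"column {column!r} contains a non-string item: {item!r}")
            parts.append(item)
    return sep.join(parts)


if __name__ == "__main__":
    # Invented record with one string column and one list-of-strings column.
    record = {
        "prompt": "Write a function that adds two numbers.",
        "tests": ["assert add(1, 2) == 3", "assert add(0, 0) == 0"],
    }
    print(concat_columns(record, ["prompt", "tests"]))
```

A column holding integers or nested dictionaries would raise a `TypeError` here, which mirrors why such columns cannot be used as they are.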
