A library for converting large language model (LLM) datasets from one format into another. Filters can be supplied as well, e.g., for cleaning up the data.
Via PyPI:

```bash
pip install llm-dataset-converter
```

The latest code straight from the repository:

```bash
pip install git+https://github.com/waikato-llm/llm-dataset-converter.git
```
Docker images are available from:
- Docker Hub: waikatodatamining/llm-dataset-converter
- In-house registry: public.aml-repo.cms.waikato.ac.nz:443/tools/llm-dataset-converter
The following repository contains a curated list of datasets for LLMs:
https://github.com/Zjh-819/LLMDataHub
The Hugging Face Hub has an abundance of datasets as well:
https://huggingface.co/datasets
The following dataset formats are supported:
Domain | Format | Read | Write | Compression |
---|---|---|---|---|
classification | CSV | from-csv-cl | to-csv-cl | Y |
classification | Jsonlines | from-jsonlines-cl | to-jsonlines-cl | Y |
classification | Parquet | from-parquet-cl | to-parquet-cl | N |
classification | TSV | from-tsv-cl | to-tsv-cl | Y |
pairs | Alpaca | from-alpaca | to-alpaca | Y |
pairs | CSV | from-csv-pr | to-csv-pr | Y |
pairs | Jsonlines | from-jsonlines-pr | to-jsonlines-pr | Y |
pairs | Parquet | from-parquet-pr | to-parquet-pr | N |
pairs | TSV | from-tsv-pr | to-tsv-pr | Y |
pairs | XTuner | from-xtuner | to-xtuner | Y |
pretrain | CSV | from-csv-pt | to-csv-pt | Y |
pretrain | Jsonlines | from-jsonlines-pt | to-jsonlines-pt | Y |
pretrain | Parquet | from-parquet-pt | to-parquet-pt | N |
pretrain | TSV | from-tsv-pt | to-tsv-pt | Y |
pretrain | TXT | from-txt-pt | to-txt-pt | Y 1 |
translation | CSV | from-csv-t9n | to-csv-t9n | Y |
translation | Jsonlines 2 | from-jsonlines-t9n | to-jsonlines-t9n | Y |
translation | Parquet 3 | from-parquet-t9n | to-parquet-t9n | N |
translation | TSV | from-tsv-t9n | to-tsv-t9n | Y |
translation | TXT | from-txt-t9n | to-txt-t9n | Y 1 |
1 Compression is not available when concatenating content into a single file.
2 Format defined here.
3 The translation data itself is stored as a JSON dictionary.
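The reader and writer names from the table plug directly into the `llm-convert` tool described further below. A minimal sketch (file names are invented, and the per-plugin `-i`/`--input` and `-o`/`--output` options are assumptions; check the plugin help):

```bash
# convert a pairs dataset from Alpaca format to JSON Lines
llm-convert \
  from-alpaca \
    -i alpaca_data.json \
  to-jsonlines-pr \
    -o pairs.jsonl
```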
In case a format supports compression, the following compression formats are automatically supported for loading/saving files (based on the file extension):

- bzip2: .bz2
- gzip: .gz
- xz: .xz
- zstd: .zst
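When reading, the compression is typically picked up from the file extension; when writing, the global `-c` option of `llm-convert` (see its help below) selects the compression if only an output directory is supplied. A sketch, with the same assumed `-i`/`-o` plugin options as above:

```bash
# read a gzip-compressed corpus, write bzip2-compressed output into a directory
llm-convert \
  -c bz2 \
  from-txt-pt \
    -i corpus.txt.gz \
  to-txt-pt \
    -o out_dir
```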
Most readers offer the `--encoding` option to override the automatically determined file encoding, as the detection can be wrong due to only inspecting a fixed number of bytes. The number of bytes inspected can be influenced via the following environment variable:

LDC_ENCODING_MAX_CHECK_LENGTH

A value of -1 means the complete file is inspected. However, that can be very slow, so a smaller value of less than 1MB is recommended.
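For instance (a sketch: the 64KB value is arbitrary, the file name invented, and it assumes the `from-txt-pt` reader is among those offering `--encoding`):

```bash
# inspect at most the first 64KB per file when auto-detecting encodings
export LDC_ENCODING_MAX_CHECK_LENGTH=65536
# force ISO-8859-1 instead of relying on auto-detection
llm-convert \
  from-txt-pt \
    -i latin1_corpus.txt \
    --encoding iso-8859-1 \
  to-txt-pt \
    -o out_dir
```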
usage: llm-convert [-h|--help|--help-all|--help-plugin NAME] [-u INTERVAL]
[-c {None,bz2,gz,xz,zstd}]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
reader
[filter [filter [...]]]
[writer]
Tool for converting between large language model (LLM) dataset formats.
readers (20):
from-alpaca, from-csv-cl, from-csv-pr, from-csv-pt, from-csv-t9n,
from-jsonlines-cl, from-jsonlines-pr, from-jsonlines-pt,
from-jsonlines-t9n, from-parquet-cl, from-parquet-pr,
from-parquet-pt, from-parquet-t9n, from-tsv-cl, from-tsv-pr,
from-tsv-pt, from-tsv-t9n, from-txt-pt, from-txt-t9n, from-xtuner
filters (38):
assemble-sentences, change-case, classification-label-map,
file-filter, find-substr, inspect, keyword, language,
llama2-to-pairs, max-length-pt, max-records, metadata,
metadata-from-name, pairs-to-llama2, pairs-to-pretrain,
pretrain-sentences-to-classification, pretrain-sentences-to-pairs,
randomize-records, record-files, record-window, remove-blocks,
remove-empty, remove-patterns, replace-patterns, require-languages,
reset-ids, sentences-pt, skip-duplicate-ids, skip-duplicate-text,
split-pt, split-records, tee, text-length, text-stats,
to-llama2-format, translation-to-pairs, translation-to-pretrain,
update-pair-data
writers (20):
to-alpaca, to-csv-cl, to-csv-pr, to-csv-pt, to-csv-t9n,
to-jsonlines-cl, to-jsonlines-pr, to-jsonlines-pt, to-jsonlines-t9n,
to-parquet-cl, to-parquet-pr, to-parquet-pt, to-parquet-t9n,
to-tsv-cl, to-tsv-pr, to-tsv-pt, to-tsv-t9n, to-txt-pt, to-txt-t9n,
to-xtuner
optional arguments:
-h, --help show basic help message and exit
--help-all show basic help message plus help on all plugins and exit
--help-plugin NAME show help message for plugin NAME and exit
-u INTERVAL, --update_interval INTERVAL
outputs the progress every INTERVAL records (default: 1000)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
the logging level to use (default: WARN)
-c {None,bz2,gz,xz,zstd}, --compression {None,bz2,gz,xz,zstd}
the type of compression to use when only providing an output
directory to the writer (default: None)
-b, --force_batch processes the data in batches
-U, --unescape_unicode unescape unicode characters in the command-line
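A complete pipeline chains a reader, any number of filters and a writer. The following sketch (invented file names; it assumes the `remove-empty` filter requires no options and the usual `-i`/`-o` plugin options) converts Alpaca data to XTuner format while dropping empty records:

```bash
llm-convert \
  -l INFO \
  from-alpaca \
    -i alpaca_data.json \
  remove-empty \
  to-xtuner \
    -o xtuner_data.json
```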
usage: llm-download [-h|--help|--help-all|--help-plugin NAME]
downloader
Tool for downloading data for large language models (LLMs).
downloaders:
huggingface
optional arguments:
-h, --help show basic help message and exit
--help-all show basic help message plus help on all plugins and exit
--help-plugin NAME show help message for plugin NAME and exit
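The options of the `huggingface` downloader itself are not part of the global help; they can be displayed via the plugin help:

```bash
llm-download --help-plugin huggingface
```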
usage: llm-append [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]]
[-t {csv,json,jsonlines,plain-text,tsv}] [-o FILE] [-p]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for combining multiple text files by appending them.
optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to append; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the data files to
append (default: None)
-t {csv,json,jsonlines,plain-text,tsv}, --file_type {csv,json,jsonlines,plain-text,tsv}
The type of files that are being processed. (default:
plain-text)
-o FILE, --output FILE
The path of the file to store the combined data in;
outputs it to stdout if omitted or a directory
(default: None)
-p, --pretty_print Whether to output the JSON in more human-readable
format. (default: False)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
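For example (invented file names), appending all JSON Lines files in the current directory into a single file:

```bash
# the glob is quoted so the tool's own glob handling resolves it
llm-append \
  -i "*.jsonl" \
  -t jsonlines \
  -o combined.jsonl
```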
usage: llm-paste [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]] [-o FILE]
[-s [SEP [SEP ...]]] [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]
Tool for combining multiple text files by placing them side-by-side.
optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to combine; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the data files to
combine (default: None)
-o FILE, --output FILE
The path of the file to store the combined data in;
outputs it to stdout if omitted or a directory
(default: None)
-s [SEP [SEP ...]], --separator [SEP [SEP ...]]
The separators to use between the files; uses TAB if
not supplied; use '{T}' as placeholder for tab
(default: None)
-l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
The logging level to use (default: WARN)
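For example (invented file names), combining two parallel text files into a tab-separated file:

```bash
# combine two files line by line; TAB is the default separator, so -s is omitted
llm-paste \
  -i english.txt german.txt \
  -o en-de.tsv
```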
The following tool allows you to determine the encoding of text files.
usage: llm-file-encoding [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]]
[-m MAX_CHECK_LENGTH] [-o FILE]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for determining the file encoding of text files.
optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to check; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the actual files to
check (default: None)
-m MAX_CHECK_LENGTH, --max_check_length MAX_CHECK_LENGTH
                        The maximum number of bytes to use for checking
(default: None)
-o FILE, --output FILE
The path of the file to store the determined encodings
in; outputs it to stdout if omitted or a directory
(default: None)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
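For example (invented paths), checking all text files in a directory while capping the number of inspected bytes:

```bash
# determine the encodings of all .txt files, inspecting at most 10KB each;
# results go to stdout since -o is omitted
llm-file-encoding \
  -i "docs/*.txt" \
  -m 10240
```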
Readers tend to support input via file lists. The `llm-find` tool can generate these.
usage: llm-find [-h] -i DIR [DIR ...] [-r] -o FILE [-m [REGEXP [REGEXP ...]]]
[-n [REGEXP [REGEXP ...]]]
[--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]]
[--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]]
[--split_name_separator SPLIT_NAME_SEPARATOR]
[-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]
Tool for locating files in directories that match certain patterns and storing
them in files.
optional arguments:
-h, --help show this help message and exit
-i DIR [DIR ...], --input DIR [DIR ...]
The dir(s) to scan for files. (default: None)
-r, --recursive Whether to search the directories recursively
(default: False)
-o FILE, --output FILE
The file to store the located file names in (default:
None)
-m [REGEXP [REGEXP ...]], --match [REGEXP [REGEXP ...]]
The regular expression that the (full) file names must
match to be included (default: None)
-n [REGEXP [REGEXP ...]], --not-match [REGEXP [REGEXP ...]]
The regular expression that the (full) file names must
match to be excluded (default: None)
--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]
The split ratios to use for generating the splits
(int; must sum up to 100) (default: None)
--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]
The split names to use as filename suffixes for the
generated splits (before .ext) (default: None)
--split_name_separator SPLIT_NAME_SEPARATOR
The separator to use between file name and split name
(default: -)
-l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
The logging level to use (default: WARN)
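For instance (invented directory and pattern), locating all `.jsonl` files recursively and generating an 80/10/10 split, which should yield `files-train.txt`, `files-val.txt` and `files-test.txt` given the default split name separator:

```bash
# recursively locate .jsonl files and write split file lists
llm-find \
  -i data \
  -r \
  -o files.txt \
  -m ".*\.jsonl$" \
  --split_ratios 80 10 10 \
  --split_names train val test
```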
usage: llm-help [-h] [-c [PACKAGE [PACKAGE ...]]] [-e EXCLUDED_CLASS_LISTERS]
[-p NAME] [-f FORMAT] [-L INT] [-o PATH] [-i FILE] [-t TITLE]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for outputting help for plugins in various formats.
optional arguments:
-h, --help show this help message and exit
-c [PACKAGE [PACKAGE ...]], --custom_class_listers [PACKAGE [PACKAGE ...]]
The names of the custom class listers, uses the
default ones if not provided. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
(default: None)
-p NAME, --plugin_name NAME
The name of the plugin to generate the help for,
generates it for all if not specified (default: None)
-f FORMAT, --help_format FORMAT
The output format to generate (default: text)
-L INT, --heading_level INT
The level to use for the heading (default: 1)
-o PATH, --output PATH
The directory or file to store the help in; outputs it
to stdout if not supplied; if pointing to a directory,
automatically generates file name from plugin name and
help format (default: None)
-i FILE, --index_file FILE
The file in the output directory to generate with an
overview of all plugins, grouped by type (in markdown
format, links them to the other generated files)
(default: None)
-t TITLE, --index_title TITLE
The title to use in the index file (default: llm-
dataset-converter plugins)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
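For example, outputting the help for a single plugin on stdout, or generating files for all plugins plus an index (the latter assumes `markdown` is among the supported values for `-f`):

```bash
# plain-text help for a single plugin, output on stdout
llm-help -p from-alpaca

# help files for all plugins plus an index file, in markdown
# (assumption: "markdown" is a valid help format)
llm-help -f markdown -o docs/plugins -i index.md
```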
usage: llm-registry [-h] [-c CUSTOM_CLASS_LISTERS] [-e EXCLUDED_CLASS_LISTERS]
[-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}]
For inspecting/querying the registry.
optional arguments:
-h, --help show this help message and exit
-c CUSTOM_CLASS_LISTERS, --custom_class_listers CUSTOM_CLASS_LISTERS
The comma-separated list of custom class listers to
use. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
(default: None)
-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}, --list {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}
For outputting various lists on stdout. (default:
None)
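For example, listing all registered readers:

```bash
# list all available readers on stdout
llm-registry -l readers
```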
See here for an overview of all plugins.
You can find examples for using the library (command-line and code) here:
https://waikato-llm.github.io/llm-dataset-converter-examples/
Additional libraries cover further integrations and formats:
- Audio transcription using faster-whisper
- Google integration
- HTML handling
- MS Word .doc integration
- MS Word .docx integration
- OpenAI integration
- PDF handling
- TinT
The llm-dataset-converter uses the class lister registry provided by the seppl library. Each module defines a function, typically called `list_classes`, that returns a dictionary of superclass names associated with a list of modules that should be scanned for derived classes. Here is an example:
```python
from typing import List, Dict


def list_classes() -> Dict[str, List[str]]:
    return {
        "ldc.api.Downloader": [
            "mod.ule1",
        ],
        "ldc.api.Reader": [
            "mod.ule2",
            "mod.ule3",
        ],
        "ldc.api.Filter": [
            "mod.ule4",
        ],
        "seppl.io.Writer": [
            "mod.ule5",
        ],
    }
```
Such a class lister gets referenced in the `entry_points` section of the `setup.py` file:
```python
entry_points={
    "class_lister": [
        "unique_string=module_name:function_name",
    ],
},
```
The `:function_name` part can be omitted if the function is called `list_classes`.
The following environment variables can be used to influence the class listers:

- LDC_CLASS_LISTERS
- LDC_CLASS_LISTERS_EXCL

Each variable is a comma-separated list of `module_name:function_name` entries, defining the class listers.
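For example (module and function names invented for illustration):

```bash
# only consider these two class listers...
export LDC_CLASS_LISTERS=mod.ule1:list_classes,mod.ule2:list_classes
# ...and explicitly exclude this one
export LDC_CLASS_LISTERS_EXCL=mod.ule3:list_classes
```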