A library for converting large language model (LLM) datasets from one format into another. Filters can be supplied as well, e.g., for cleaning up the data.
Via PyPI:

```bash
pip install llm-dataset-converter
```

The latest code straight from the repository:

```bash
pip install git+https://github.com/waikato-llm/llm-dataset-converter.git
```
Docker images are available from:
- Docker Hub: waikatodatamining/llm-dataset-converter
- In-house registry: public.aml-repo.cms.waikato.ac.nz:443/tools/llm-dataset-converter
The following repository contains a curated list of datasets for LLMs:
https://github.com/Zjh-819/LLMDataHub
The Hugging Face Hub has an abundance of datasets as well:
https://huggingface.co/datasets
The following dataset formats are supported:
Domain | Format | Read | Write | Compression |
---|---|---|---|---|
classification | CSV | from-csv-cl | to-csv-cl | Y |
classification | Jsonlines | from-jsonlines-cl | to-jsonlines-cl | Y |
classification | Parquet | from-parquet-cl | to-parquet-cl | N |
classification | TSV | from-tsv-cl | to-tsv-cl | Y |
pairs | Alpaca | from-alpaca | to-alpaca | Y |
pairs | CSV | from-csv-pr | to-csv-pr | Y |
pairs | Jsonlines | from-jsonlines-pr | to-jsonlines-pr | Y |
pairs | Parquet | from-parquet-pr | to-parquet-pr | N |
pairs | TSV | from-tsv-pr | to-tsv-pr | Y |
pairs | XTuner | from-xtuner | to-xtuner | Y |
pretrain | CSV | from-csv-pt | to-csv-pt | Y |
pretrain | Jsonlines | from-jsonlines-pt | to-jsonlines-pt | Y |
pretrain | Parquet | from-parquet-pt | to-parquet-pt | N |
pretrain | TSV | from-tsv-pt | to-tsv-pt | Y |
pretrain | TXT | from-txt-pt | to-txt-pt | Y 1 |
translation | CSV | from-csv-t9n | to-csv-t9n | Y |
translation | Jsonlines 2 | from-jsonlines-t9n | to-jsonlines-t9n | Y |
translation | Parquet 3 | from-parquet-t9n | to-parquet-t9n | N |
translation | TSV | from-tsv-t9n | to-tsv-t9n | Y |
translation | TXT | from-txt-t9n | to-txt-t9n | Y 1 |
1 Compression is not available when concatenating content into a single file.
2 Format defined here.
3 The translation data itself is stored as a JSON dictionary.
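The reader and writer names from the table plug directly into the `llm-convert` tool described further below. A minimal sketch (file names are invented, and the per-plugin `-i`/`--input` and `-o`/`--output` options are assumptions; check the plugin help):

```bash
# convert a pairs dataset from Alpaca format to JSON Lines
llm-convert \
  from-alpaca \
    -i alpaca_data.json \
  to-jsonlines-pr \
    -o pairs.jsonl
```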
In case a format supports compression, the following compression formats are automatically supported for loading/saving files (based on the file extension):

- bzip2: .bz2
- gzip: .gz
- xz: .xz
- zstd: .zst
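When reading, the compression is typically picked up from the file extension; when writing, the global `-c` option of `llm-convert` (see its help below) selects the compression if only an output directory is supplied. A sketch, with the same assumed `-i`/`-o` plugin options as above:

```bash
# read a gzip-compressed corpus, write bzip2-compressed output into a directory
llm-convert \
  -c bz2 \
  from-txt-pt \
    -i corpus.txt.gz \
  to-txt-pt \
    -o out_dir
```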
Most readers offer the `--encoding` option to override the automatically determined file encoding, as the detection can be wrong due to only inspecting a fixed number of bytes. The number of bytes inspected can be influenced via the following environment variable:

LDC_ENCODING_MAX_CHECK_LENGTH

A value of -1 means the complete file is inspected. However, that can be very slow, so a smaller value of less than 1MB is recommended.
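For instance (a sketch: the 64KB value is arbitrary, the file name invented, and it assumes the `from-txt-pt` reader is among those offering `--encoding`):

```bash
# inspect at most the first 64KB per file when auto-detecting encodings
export LDC_ENCODING_MAX_CHECK_LENGTH=65536
# force ISO-8859-1 instead of relying on auto-detection
llm-convert \
  from-txt-pt \
    -i latin1_corpus.txt \
    --encoding iso-8859-1 \
  to-txt-pt \
    -o out_dir
```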
usage: llm-convert [-h|--help|--help-all|--help-plugin NAME] [-u INTERVAL]
[-c {None,bz2,gz,xz,zstd}]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
reader
[filter [filter [...]]]
[writer]
Tool for converting between large language model (LLM) dataset formats.
readers (20):
from-alpaca, from-csv-cl, from-csv-pr, from-csv-pt, from-csv-t9n,
from-jsonlines-cl, from-jsonlines-pr, from-jsonlines-pt,
from-jsonlines-t9n, from-parquet-cl, from-parquet-pr,
from-parquet-pt, from-parquet-t9n, from-tsv-cl, from-tsv-pr,
from-tsv-pt, from-tsv-t9n, from-txt-pt, from-txt-t9n, from-xtuner
filters (38):
assemble-sentences, change-case, classification-label-map,
file-filter, find-substr, inspect, keyword, language,
llama2-to-pairs, max-length-pt, max-records, metadata,
metadata-from-name, pairs-to-llama2, pairs-to-pretrain,
pretrain-sentences-to-classification, pretrain-sentences-to-pairs,
randomize-records, record-files, record-window, remove-blocks,
remove-empty, remove-patterns, replace-patterns, require-languages,
reset-ids, sentences-pt, skip-duplicate-ids, skip-duplicate-text,
split-pt, split-records, tee, text-length, text-stats,
to-llama2-format, translation-to-pairs, translation-to-pretrain,
update-pair-data
writers (20):
to-alpaca, to-csv-cl, to-csv-pr, to-csv-pt, to-csv-t9n,
to-jsonlines-cl, to-jsonlines-pr, to-jsonlines-pt, to-jsonlines-t9n,
to-parquet-cl, to-parquet-pr, to-parquet-pt, to-parquet-t9n,
to-tsv-cl, to-tsv-pr, to-tsv-pt, to-tsv-t9n, to-txt-pt, to-txt-t9n,
to-xtuner
optional arguments:
-h, --help show basic help message and exit
--help-all show basic help message plus help on all plugins and exit
--help-plugin NAME show help message for plugin NAME and exit
-u INTERVAL, --update_interval INTERVAL
outputs the progress every INTERVAL records (default: 1000)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
the logging level to use (default: WARN)
-c {None,bz2,gz,xz,zstd}, --compression {None,bz2,gz,xz,zstd}
the type of compression to use when only providing an output
directory to the writer (default: None)
-b, --force_batch processes the data in batches
-U, --unescape_unicode unescape unicode characters in the command-line
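A complete pipeline chains a reader, any number of filters and a writer. The following sketch (invented file names; it assumes the `remove-empty` filter requires no options and the usual `-i`/`-o` plugin options) converts Alpaca data to XTuner format while dropping empty records:

```bash
llm-convert \
  -l INFO \
  from-alpaca \
    -i alpaca_data.json \
  remove-empty \
  to-xtuner \
    -o xtuner_data.json
```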
usage: llm-download [-h|--help|--help-all|--help-plugin NAME]
downloader
Tool for downloading data for large language models (LLMs).
downloaders:
huggingface
optional arguments:
-h, --help show basic help message and exit
--help-all show basic help message plus help on all plugins and exit
--help-plugin NAME show help message for plugin NAME and exit
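The options of the `huggingface` downloader itself are not part of the global help; they can be displayed via the plugin help:

```bash
llm-download --help-plugin huggingface
```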
usage: llm-append [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]]
[-t {csv,json,jsonlines,plain-text,tsv}] [-o FILE] [-p]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for combining multiple text files by appending them.
optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to append; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the data files to
append (default: None)
-t {csv,json,jsonlines,plain-text,tsv}, --file_type {csv,json,jsonlines,plain-text,tsv}
The type of files that are being processed. (default:
plain-text)
-o FILE, --output FILE
The path of the file to store the combined data in;
outputs it to stdout if omitted or a directory
(default: None)
-p, --pretty_print Whether to output the JSON in more human-readable
format. (default: False)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
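For example (invented file names), appending all JSON Lines files in the current directory into a single file:

```bash
# the glob is quoted so the tool's own glob handling resolves it
llm-append \
  -i "*.jsonl" \
  -t jsonlines \
  -o combined.jsonl
```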
usage: llm-paste [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]] [-o FILE]
[-s [SEP [SEP ...]]] [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]
Tool for combining multiple text files by placing them side-by-side.
optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to combine; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the data files to
combine (default: None)
-o FILE, --output FILE
The path of the file to store the combined data in;
outputs it to stdout if omitted or a directory
(default: None)
-s [SEP [SEP ...]], --separator [SEP [SEP ...]]
The separators to use between the files; uses TAB if
not supplied; use '{T}' as placeholder for tab
(default: None)
-l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
The logging level to use (default: WARN)
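For example (invented file names), combining two parallel text files into a tab-separated file:

```bash
# combine two files line by line; TAB is the default separator, so -s is omitted
llm-paste \
  -i english.txt german.txt \
  -o en-de.tsv
```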
The following tool allows you to determine the encoding of text files.
usage: llm-file-encoding [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]]
[-m MAX_CHECK_LENGTH] [-o FILE]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for determining the file encoding of text files.
optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to check; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the actual files to
check (default: None)
-m MAX_CHECK_LENGTH, --max_check_length MAX_CHECK_LENGTH
                        The maximum number of bytes to use for checking
(default: None)
-o FILE, --output FILE
The path of the file to store the determined encodings
in; outputs it to stdout if omitted or a directory
(default: None)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
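For example (invented paths), checking all text files in a directory while capping the number of inspected bytes:

```bash
# determine the encodings of all .txt files, inspecting at most 10KB each;
# results go to stdout since -o is omitted
llm-file-encoding \
  -i "docs/*.txt" \
  -m 10240
```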
Readers tend to support input via file lists. The `llm-find` tool can generate these.
usage: llm-find [-h] -i DIR [DIR ...] [-r] -o FILE [-m [REGEXP [REGEXP ...]]]
[-n [REGEXP [REGEXP ...]]]
[--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]]
[--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]]
[--split_name_separator SPLIT_NAME_SEPARATOR]
[-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]
Tool for locating files in directories that match certain patterns and storing
them in files.
optional arguments:
-h, --help show this help message and exit
-i DIR [DIR ...], --input DIR [DIR ...]
The dir(s) to scan for files. (default: None)
-r, --recursive Whether to search the directories recursively
(default: False)
-o FILE, --output FILE
The file to store the located file names in (default:
None)
-m [REGEXP [REGEXP ...]], --match [REGEXP [REGEXP ...]]
The regular expression that the (full) file names must
match to be included (default: None)
-n [REGEXP [REGEXP ...]], --not-match [REGEXP [REGEXP ...]]
The regular expression that the (full) file names must
match to be excluded (default: None)
--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]
The split ratios to use for generating the splits
(int; must sum up to 100) (default: None)
--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]
The split names to use as filename suffixes for the
generated splits (before .ext) (default: None)
--split_name_separator SPLIT_NAME_SEPARATOR
The separator to use between file name and split name
(default: -)
-l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
The logging level to use (default: WARN)
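For instance (invented directory and pattern), locating all `.jsonl` files recursively and generating an 80/10/10 split, which should yield `files-train.txt`, `files-val.txt` and `files-test.txt` given the default split name separator:

```bash
# recursively locate .jsonl files and write split file lists
llm-find \
  -i data \
  -r \
  -o files.txt \
  -m ".*\.jsonl$" \
  --split_ratios 80 10 10 \
  --split_names train val test
```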
usage: llm-help [-h] [-c [PACKAGE [PACKAGE ...]]] [-e EXCLUDED_CLASS_LISTERS]
[-p NAME] [-f FORMAT] [-L INT] [-o PATH] [-i FILE] [-t TITLE]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for outputting help for plugins in various formats.
optional arguments:
-h, --help show this help message and exit
-c [PACKAGE [PACKAGE ...]], --custom_class_listers [PACKAGE [PACKAGE ...]]
The names of the custom class listers, uses the
default ones if not provided. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
(default: None)
-p NAME, --plugin_name NAME
The name of the plugin to generate the help for,
generates it for all if not specified (default: None)
-f FORMAT, --help_format FORMAT
The output format to generate (default: text)
-L INT, --heading_level INT
The level to use for the heading (default: 1)
-o PATH, --output PATH
The directory or file to store the help in; outputs it
to stdout if not supplied; if pointing to a directory,
automatically generates file name from plugin name and
help format (default: None)
-i FILE, --index_file FILE
The file in the output directory to generate with an
overview of all plugins, grouped by type (in markdown
format, links them to the other generated files)
(default: None)
-t TITLE, --index_title TITLE
The title to use in the index file (default: llm-
dataset-converter plugins)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
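For example, outputting the help for a single plugin on stdout, or generating files for all plugins plus an index (the latter assumes `markdown` is among the supported values for `-f`):

```bash
# plain-text help for a single plugin, output on stdout
llm-help -p from-alpaca

# help files for all plugins plus an index file, in markdown
# (assumption: "markdown" is a valid help format)
llm-help -f markdown -o docs/plugins -i index.md
```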
usage: llm-registry [-h] [-c CUSTOM_CLASS_LISTERS] [-e EXCLUDED_CLASS_LISTERS]
[-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}]
For inspecting/querying the registry.
optional arguments:
-h, --help show this help message and exit
-c CUSTOM_CLASS_LISTERS, --custom_class_listers CUSTOM_CLASS_LISTERS
The comma-separated list of custom class listers to
use. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
(default: None)
-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}, --list {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}
For outputting various lists on stdout. (default:
None)
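For example, listing all registered readers:

```bash
# list all available readers on stdout
llm-registry -l readers
```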
See here for an overview of all plugins.
You can find examples for using the library (command-line and code) here:
https://waikato-llm.github.io/llm-dataset-converter-examples/
Additional libraries cover further integrations and formats:
- Audio transcription using faster-whisper
- Google integration
- HTML handling
- MS Word .doc integration
- MS Word .docx integration
- OpenAI integration
- PDF handling
- TinT
The llm-dataset-converter uses the class lister registry provided by the seppl library. Each module defines a function, typically called `list_classes`, that returns a dictionary of superclass names associated with a list of modules that should be scanned for derived classes. Here is an example:
```python
from typing import List, Dict


def list_classes() -> Dict[str, List[str]]:
    return {
        "ldc.api.Downloader": [
            "mod.ule1",
        ],
        "ldc.api.Reader": [
            "mod.ule2",
            "mod.ule3",
        ],
        "ldc.api.Filter": [
            "mod.ule4",
        ],
        "seppl.io.Writer": [
            "mod.ule5",
        ],
    }
```
Such a class lister gets referenced in the `entry_points` section of the `setup.py` file:
```python
entry_points={
    "class_lister": [
        "unique_string=module_name:function_name",
    ],
},
```
The `:function_name` part can be omitted if the function is called `list_classes`.
The following environment variables can be used to influence the class listers:

- LDC_CLASS_LISTERS
- LDC_CLASS_LISTERS_EXCL

Each variable is a comma-separated list of `module_name:function_name` entries, defining the class listers.
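For example (module and function names invented for illustration):

```bash
# only consider these two class listers...
export LDC_CLASS_LISTERS=mod.ule1:list_classes,mod.ule2:list_classes
# ...and explicitly exclude this one
export LDC_CLASS_LISTERS_EXCL=mod.ule3:list_classes
```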