Releases · e-p-armstrong/augmentoolkit

There's been a massive update: Augmentoolkit has been enhanced with a third pipeline. This one is built to make data at scale easier to work with and to give you a tool for sorting through it all: you can now create the dataset for, and train, any conceivable binary classification model quickly and at basically no cost.

Some other features added since the last release are also included here.

New pipeline: classifier creator. Generates data for, trains, evaluates, and iterates on a small compute-efficient binary classification model — all within a single script.
- Allows painless classification of massive amounts of unlabelled data using any conceivable labels.
- Achieves results comparable to classifiers trained on human-labelled data.
- Extremely cost-efficient (a classifier costs less than a coffee even when using APIs)
- Fast (takes less than an hour to generate the data and train the classifier; frankly, depending on your settings, often less than ten minutes).
- Fully documented
- Configurable: change the base classifier model you train on, set a cap on the maximum number of iterations you will perform, and classify based on any labels imaginable
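The generate, train, evaluate, and iterate loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Augmentoolkit's actual code: `llm_label` is a hypothetical stand-in for an LLM labelling call (here a keyword rule), the word-count model stands in for the real base classifier, and the names `build_classifier`, `batch_size`, `target_accuracy`, and `max_iterations` are invented for this example.

```python
import random
from collections import Counter


def llm_label(text):
    """Hypothetical stand-in for an LLM labelling call; here just a keyword rule."""
    return "refund" in text.lower()


def train(examples):
    """Fit a toy word-count model: count how often each word appears per class."""
    pos, neg = Counter(), Counter()
    for text, label in examples:
        (pos if label else neg).update(text.lower().split())
    return pos, neg


def predict(model, text):
    """Predict True when the words lean towards the positive class."""
    pos, neg = model
    return sum(pos[w] - neg[w] for w in text.lower().split()) > 0


def build_classifier(texts, batch_size=20, target_accuracy=0.9,
                     max_iterations=5, seed=0):
    """Label, train, evaluate, and repeat until good enough or the cap is hit."""
    rng = random.Random(seed)
    pool = list(texts)
    rng.shuffle(pool)
    # Hold out one batch for evaluation; it is never used for training.
    held_out, pool = pool[:batch_size], pool[batch_size:]
    truth = [(t, llm_label(t)) for t in held_out]

    train_set, model, accuracy, iterations = [], train([]), 0.0, 0
    while iterations < max_iterations and pool:
        batch, pool = pool[:batch_size], pool[batch_size:]
        train_set += [(t, llm_label(t)) for t in batch]  # generate more labelled data
        model = train(train_set)                          # retrain on everything so far
        hits = sum(predict(model, t) == y for t, y in truth)
        accuracy = hits / len(truth)                      # agreement with the labeller
        iterations += 1
        if accuracy >= target_accuracy:
            break
    return model, accuracy, iterations


texts = ([f"please refund order {i}" for i in range(50)]
         + [f"where is my order {i}" for i in range(50)])
model, accuracy, iterations = build_classifier(texts)
print(f"agreement with labeller: {accuracy:.0%} after {iterations} iteration(s)")
```

The cap on iterations mirrors the configurable maximum mentioned above: the loop stops either when the classifier agrees closely enough with the labeller or when the budget runs out.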
Pure synthetic data pipeline (EXPERIMENTAL): Don't have an input text? Describe the kind of conversations you want, and Augmentoolkit will use random combinations of labels and features to produce a diverse set of synthetic interactions. Useful for aligning the style of the model; not so good for adding facts.
- This pipeline first generates a pipeline for the specific type of conversations the user describes, then runs that pipeline. Currently, the generated pipeline needs slightly better prompts to be usable without modification. The pure synthetic pipeline is therefore usable, but you'll have to polish the contents of the ./pure_synthetic_pipeline/prompts folder first.
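The idea of drawing random combinations of labels and features can be illustrated with a short sketch. Everything here is assumed for illustration: the feature axes in `FEATURES`, the helper names `sample_scenarios` and `to_prompt`, and the prompt wording are hypothetical and are not taken from the pure synthetic pipeline itself.

```python
import random

# Hypothetical feature axes; a real run would derive these from the user's description.
FEATURES = {
    "tone": ["formal", "casual", "sarcastic"],
    "topic": ["cooking", "travel", "debugging"],
    "length": ["short", "long"],
}


def sample_scenarios(features, count, seed=None):
    """Pick one random value per axis, `count` times, to vary the generated data."""
    rng = random.Random(seed)
    return [
        {axis: rng.choice(options) for axis, options in features.items()}
        for _ in range(count)
    ]


def to_prompt(scenario):
    """Turn one sampled combination into an instruction for the generating model."""
    traits = ", ".join(f"{axis}: {value}" for axis, value in scenario.items())
    return f"Write a conversation with these traits ({traits})."


for scenario in sample_scenarios(FEATURES, count=3, seed=1):
    print(to_prompt(scenario))
```

Sampling each axis independently keeps the combinations varied even with a handful of values per axis, which is what drives stylistic diversity rather than factual coverage.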
Overhauls to generation for improved model training performance.
Prompt overrides for Augmentoolkit's default mode are included out of the box: generate long-response data or "negative data".
Improved local generation workflow: local generation no longer relies on two separate files. It now uses the main processing.py, and the section you're working through is controlled through config.yaml.
Miscellaneous fixes and improvements.
Axolotl training configs provided as part of the repo so that getting started creating your own LLM is easier.

Classifier Creator & General Overhaul

General QA Dataset Generation with Local LLMs (Publishing an Official Release)
