Releases: e-p-armstrong/augmentoolkit

Classifier Creator & General Overhaul

09 Jul 20:41

There's been a massive update: Augmentoolkit has been enhanced with a third pipeline. This one is specialized around making data at scale easier to work with, and giving you a tool to sort through it all: you can now make the dataset for, and train, any conceivable binary classification model quickly and at basically no cost.

Several other features added since the last release are also included here.

  • New pipeline: classifier creator. Generates data for, trains, evaluates, and iterates on a small compute-efficient binary classification model — all within a single script.
    • Allows painless classification of massive amounts of unlabelled data using any conceivable labels.
    • Achieves results comparable to classifiers trained on human-labelled data.
    • Extremely cost-efficient (a classifier costs less than a coffee even when using APIs)
    • Fast (takes less than an hour to generate the data and train the classifier; frankly, depending on your settings, often less than ten minutes).
    • Fully documented
    • Configurable: change the base classifier model you train on, set a cap on the maximum number of iterations you will perform, and classify based on any labels imaginable
  • Pure synthetic data pipeline (EXPERIMENTAL): Don't have an input text? Describe the kind of conversations you want, and Augmentoolkit will use random combinations of labels and features to make a diversity of synthetic interactions. Useful for aligning the style of the model; not so good for adding facts.
    • This pipeline first generates a pipeline for the specific type of conversations the user describes, then runs that pipeline. Currently the generated pipeline needs slightly better prompts to be usable without modification, so the pure synthetic pipeline is usable, but you'll have to polish the contents of the ./pure_synthetic_pipeline/prompts folder first.
  • Overhauls to generation for improved model training performance.
  • Prompt overrides for Augmentoolkit's default mode, available out of the box: generate long-response data or "negative data".
  • Improved local generation workflow: local generation no longer relies on two separate files. It now uses the main processing.py, and the section you're working through is controlled through config.yaml.
  • Miscellaneous fixes and improvements.
  • Axolotl training configs provided as part of the repo so that getting started creating your own LLM is easier.
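
For a rough feel of the generate, train, evaluate, iterate loop that the classifier creator bullet describes, here is a minimal sketch. It is not Augmentoolkit's actual code: every name in it (`stub_llm_label`, `train`, `predict`, `create_classifier`) is hypothetical, the LLM labelling call is replaced by a keyword rule, and a tiny word-count model stands in for the real base classifier. It also evaluates against one fixed held-out sample, which is a simplification.

```python
import random
from collections import Counter

def stub_llm_label(text):
    """Hypothetical stand-in for an LLM labelling call (assumption, not a real API)."""
    return 1 if "refund" in text.lower() else 0

def train(examples):
    """Fit a toy model: per-label word counts over the labelled examples."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(model, text):
    """Predict 1 if the words were seen more often under label 1 than label 0."""
    score = sum(model[1][w] - model[0][w] for w in text.lower().split())
    return 1 if score > 0 else 0

def create_classifier(texts, label_fn, batch_size=4, eval_size=4,
                      target_accuracy=0.9, max_iterations=5, seed=0):
    rng = random.Random(seed)
    pool = list(texts)
    rng.shuffle(pool)
    # Hold out a small sample and label it once to measure agreement with the labeller.
    eval_texts, pool = pool[:eval_size], pool[eval_size:]
    eval_labels = [label_fn(t) for t in eval_texts]
    labelled, history = [], []
    model = train(labelled)
    for _ in range(max_iterations):  # cap on the number of iterations
        batch, pool = pool[:batch_size], pool[batch_size:]
        if not batch:
            break  # no unlabelled data left to add
        labelled += [(t, label_fn(t)) for t in batch]  # generate more training data
        model = train(labelled)                        # retrain on everything so far
        correct = sum(predict(model, t) == y for t, y in zip(eval_texts, eval_labels))
        history.append(correct / len(eval_texts))      # evaluate
        if history[-1] >= target_accuracy:
            break  # good enough, stop iterating
    return model, history

TEXTS = [
    "i want a refund now", "please refund my order", "refund requested for item",
    "where is my refund", "great product thanks", "love the fast shipping",
    "how do i reset my password", "the app keeps crashing", "refund me please",
    "nice support team", "can i change my address", "refund still missing",
]

model, history = create_classifier(TEXTS, stub_llm_label)
print("accuracy per iteration:", history)
```

The loop stops when agreement with the labeller reaches `target_accuracy` or when `max_iterations` is hit, mirroring the configurable iteration cap mentioned above.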

General QA Dataset Generation with Local LLMs (Publishing an Official Release)

08 Jun 10:08
0974e3d

I am creating an official release for the Augmentoolkit project, which allows for QA dataset generation using open source models.

What's Changed

  • first "release" on GitHub, with all features and bugfixes
  • APIs, Local Models, OpenAI, Gemini all supported
  • simplification and rewrite by @darkacorn in #2
  • Gradio Web UI + Extended Input Folder by @cocktailpeanut in #16
  • feat: add gemini api support by @alexandreteles in #18

Full Changelog: https://github.com/e-p-armstrong/augmentoolkit/commits/v1.0.0