Releases: e-p-armstrong/augmentoolkit
Augmentoolkit 3.0
Augmentoolkit 3.0 is essentially an entirely new project.
Before we had 3 pipelines. Now we have 16.
Before we just generated data. Now it automatically trains whole LLMs with autogenerated training configs. Datagen can be done locally, efficiently, on consumer hardware, thanks to a custom-trained dataset generation model.
The factual finetuning process's quality has been completely revolutionized during development -- three separate times, each building on the one before it.
A full changelog is impractical, since everything has changed. Every abstraction has been improved. Every way in which the tool is used has been streamlined. Every pipeline is better. Every outcome is higher-quality and more efficiently delivered.
Instead of a changelog, refer to the documentation, since diffs don't mean much when the project has been effectively rewritten from the ground up.
However, if you've forked the project before to build your own data pipelines, do not despair: porting pipelines to New Augmentoolkit is easy, and the pipeline conventions, abstractions primer, and new pipeline primer in the documentation (docs/...) will guide you through the process. Alternatively, you can get help on the Discord.
Augmentoolkit is now the best way in the world to make custom data, and by extension, custom models.
Happy Hacking!
Augmentoolkit 2.5 "Farewell"
This is a tagged release with the final Augmentoolkit update before 3.0.
3.0 changes literally everything and is not at all backwards compatible with previous versions (though incorporating new pipelines into 3.0 isn't terribly hard).
Still, if you've forked off of the commits from 6 months ago, before I disappeared to go write 3.0, this is the tagged version you want for compatibility.
That said, I seriously recommend migrating to the newest release.
Augmentoolkit 2.0
- New pipeline: RPToolkit. Generate RP data from any conceivable fictional story!
- Augmentoolkit is no longer one isolated pipeline; it is now an extensible and modular project that can support any number of pipelines. Multiple pipeline executions can be scheduled in sequence.
- Complete refactor, with new abstractions, cleaner code, fewer bugs.
- New interface, with Streamlit, optimized for the new workflow.
- Massively overhauled documentation with a large number of tutorial videos.
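As a rough illustration of what "multiple pipeline executions scheduled in sequence" can mean, here is a minimal sketch. It is not Augmentoolkit's actual scheduler: the `run_in_sequence` function, the `Pipeline` type, the stub pipelines, and the `input_dir` config key are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a pipeline is any callable that takes a config dict.
Pipeline = Callable[[Dict], None]


def run_in_sequence(schedule: List[Tuple[str, Pipeline, Dict]]) -> List[str]:
    """Run each scheduled pipeline in order; return the names that completed."""
    completed = []
    for name, pipeline, config in schedule:
        pipeline(config)  # an exception here stops the remaining runs
        completed.append(name)
    return completed


def make_stub(label: str, log: List[Tuple[str, str]]) -> Pipeline:
    """Build a stand-in pipeline that only records that it was called."""
    def _run(config: Dict) -> None:
        log.append((label, config["input_dir"]))
    return _run


if __name__ == "__main__":
    log: List[Tuple[str, str]] = []
    schedule = [
        ("qa", make_stub("qa", log), {"input_dir": "./books"}),
        ("rp", make_stub("rp", log), {"input_dir": "./stories"}),
    ]
    print(run_in_sequence(schedule))
    print(log)
```

Each entry pairs a pipeline with its own config, so the same pipeline can appear more than once in a schedule with different inputs.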
What's Changed
- chore: Update file encoding to utf-8 for consistency by @1Etherl in #45
- Usability overhaul by @e-p-armstrong in #49
New Contributors
Full Changelog: v1.5.0...v2.0.0
Classifier Creator & General Overhaul
There's been a massive update: Augmentoolkit has been enhanced with a third pipeline. This one is specialized around making data at scale easier to work with, and giving you a tool to sort through it all: you can now make the dataset for, and train, any conceivable binary classification model quickly and at basically no cost.
Some other features made between now and the last release are also included here.
- New pipeline: classifier creator. Generates data for, trains, evaluates, and iterates on a small compute-efficient binary classification model — all within a single script.
- Allows painless classification of massive amounts of unlabelled data using any conceivable labels.
- Achieves results comparable to classifiers trained on human-labelled data.
- Extremely cost-efficient (a classifier costs less than a coffee even when using APIs)
- Fast (takes less than an hour to generate the data and train the classifier; frankly, depending on your settings, often less than ten minutes).
- Fully documented
- Configurable: change the base classifier model you train on, set a cap on the maximum number of iterations you will perform, and classify based on any labels imaginable
- Pure synthetic data pipeline (EXPERIMENTAL): Don't have an input text? Describe the kind of conversations you want, and Augmentoolkit will use random combinations of labels and features to make a diversity of synthetic interactions. Useful for aligning the style of the model; not so good for adding facts.
- This pipeline first generates a pipeline for the specific type of conversations the user describes, then runs that pipeline. Currently the generated pipeline needs slightly better prompts to be usable without modification. The pure synthetic pipeline is therefore usable, but you'll have to polish the contents of the ./pure_synthetic_pipeline/prompts folder first.
- Overhauls to generation for improved model training performance.
- Prompt overrides for Augmentoolkit's default mode are included out of the box, for generating long-response data and "negative data".
- Improved local generation workflow: local generation no longer relies on two separate files. It now uses the main processing.py; which section you're working through is controlled through config.yaml.
- Miscellaneous fixes and improvements.
- Axolotl training configs provided as part of the repo so that getting started creating your own LLM is easier.
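The "random combinations of labels and features" idea behind the pure synthetic pipeline can be illustrated with a small sketch. This is not Augmentoolkit's actual code: the function name `sample_scenarios`, the feature names, and their values are all hypothetical, chosen only to show how picking one value per feature yields varied conversation setups.

```python
import random


def sample_scenarios(features, n, seed=None):
    """Draw n random combinations, picking one value per feature.

    `features` maps a feature name to its list of possible values.
    Hypothetical helper for illustration; not part of Augmentoolkit.
    """
    rng = random.Random(seed)  # a seeded generator makes runs reproducible
    return [
        {name: rng.choice(values) for name, values in features.items()}
        for _ in range(n)
    ]


# Made-up features describing the kind of conversation to generate.
FEATURES = {
    "tone": ["formal", "casual", "playful"],
    "user_goal": ["ask for advice", "request a summary", "debug code"],
    "length": ["short", "long"],
}

if __name__ == "__main__":
    for scenario in sample_scenarios(FEATURES, n=3, seed=0):
        print(scenario)
```

Each sampled dictionary could then be turned into a prompt for a generation model. Because every feature is sampled independently, a handful of features already gives many distinct combinations.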
General QA Dataset Generation with Local LLMs (Publishing an Official Release)
I am creating an official release for the Augmentoolkit project, which allows for QA dataset generation using open source models.
What's Changed
- first "release" on GitHub, with all features and bugfixes
- APIs, Local Models, OpenAI, Gemini all supported
- simplification and rewrite by @darkacorn in #2
- Gradio Web UI + Extended Input Folder by @cocktailpeanut in #16
- feat: add gemini api support by @alexandreteles in #18
New Contributors
- @darkacorn made their first contribution in #2
- @cocktailpeanut made their first contribution in #16
- @alexandreteles made their first contribution in #18
Full Changelog: https://github.com/e-p-armstrong/augmentoolkit/commits/v1.0.0