Pure Synthetic Data -- For Alignment and Refusals

This pipeline is something of a meta-pipeline. You give it a description of the kind of data you want, and it first uses a large, powerful LLM to generate the few-shot examples for a new pipeline. It then writes that pipeline to a file, which you can execute with a smaller model (like Mixtral) to generate data. The data is purely synthetic -- the only source of variety is Faker inputs -- and you will probably have to manually edit the few-shot examples a bit before you actually run the generated pipeline. Even so, this saves a lot of time and effort when you want purely synthetic data with slight variation between scenarios, such as when you want to train a specific behavior like "apologize and do not answer if asked about a community member."
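To make the "Faker is the only source of variety" point concrete, here is a minimal, hypothetical sketch (the function and field names are invented for illustration, not taken from the actual generated pipeline) of how Faker-generated details can be swapped into otherwise-fixed scenario templates:

```python
# Hypothetical sketch: Faker provides the only variation between
# otherwise-identical synthetic scenarios. Not actual pipeline code.
from faker import Faker

fake = Faker()

def make_scenario_inputs() -> dict:
    """Fresh random details for one synthetic conversation."""
    return {
        "person_name": fake.name(),
        "city": fake.city(),
        "company": fake.company(),
    }

# Each dict would be interpolated into a few-shot prompt template,
# which the smaller model (e.g., Mixtral) then completes.
template = "Tell me about {person_name} from {city}."
print(template.format(**make_scenario_inputs()))
```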

The overall lack of polish is because this started as an abandoned project that I adapted for alignment purposes while building Verustoolkit. I decided to include it here, too.

Usage

Requirements are the same as for the main project, except that you will also need Faker:

```
pip install faker
```

You must first generate a pipeline, then run it. To generate the pipeline, run ai_loop.py; to run the pipeline, run the Python script set in the config as METAPIPELINE_PY_FILE -- by default, pipeline.py. Options are defined and thoroughly documented, line by line, in config.yaml.

So,

  1. Edit config.yaml to your liking.
  2. Run ai_loop.py to generate the pipeline.
  3. Run the generated pipeline.
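With the default config values, steps 2 and 3 look roughly like this on the command line (assuming METAPIPELINE_PY_FILE is left at pipeline.py):

```
python ai_loop.py   # generates pipeline.py from your description
# ...edit the generated few-shot examples as needed...
python pipeline.py  # runs the generated pipeline to produce data
```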

Note that config.yaml has your typical Augmentoolkit fields, plus some fields for pipeline generation. In the PATH section, the fields used for generation are marked by META in their names: the META fields are used to generate the pipeline, and the rest are used to run the generated pipeline.
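As a rough illustration of the split (only METAPIPELINE_PY_FILE is confirmed above; the other field names here are hypothetical stand-ins, so check the comments in config.yaml for the real ones):

```yaml
PATH:
  # META fields: read by ai_loop.py when generating the pipeline.
  METAPIPELINE_PY_FILE: pipeline.py  # where the generated pipeline is written
  META_FEW_SHOT_DIR: ./meta_prompts  # hypothetical example of a META field

  # Non-META fields: read when running the generated pipeline.
  OUTPUT: ./output                   # hypothetical example of a run-time field
```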

Existing folders

The prompt sets for the different refusal categories live in their own folders:

  - prompts_current/ -- refusals for information that changes all the time.
  - prompts_verus_absurd/ -- refusals for information about absurd things (e.g., "tell me about the Verus space elevator").
  - prompts_verus_community/ -- refusals for information about the Verus community, about which the model has no actual training data and would therefore hallucinate a ton.

As you can see, this light-handed 'alignment' is not about making the LLM stupid, but about making it a bit more reliable.
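For instance, a purely illustrative refusal pair (invented for this README, not taken from the pipeline's output) of the kind the community prompts aim to produce:

```
User: Tell me about John Smith from the Verus community.
Assistant: I'm sorry, but I don't have reliable information about
individual community members, so I'd rather not answer than guess.
```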