Synthetic data generation from LLMs #7

Open
ablack3 opened this issue Jul 14, 2023 · 1 comment

Comments

ablack3 commented Jul 14, 2023

I’m interested in using this approach to see if we can create accurate synthetic data from a pretrained LLM. The first step would be to have an evaluation framework. Opening this issue to discuss this use case.

haydenbspence commented Jul 17, 2023

I would suggest PandasAI and Synthea as starting points for this. A vector store with instructions may be all that is required to get similar or improved results over Synthea, with the added benefit of being less expensive than tuning a model and allowing the use of larger base models.

Another option is to take real data and generate synthetic data from it. Synthetic Data Vault is a good example of this. From, for example, 100 real records you can expand to more than 100 with GaussianCopula and CTGAN. It would be interesting to use this framework to add an LLM as a third method and evaluate between the three.
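
As a rough sketch of what evaluating between the three could look like, the snippet below scores each generator by how closely one categorical column of its output matches the real data, using total variation distance (0 means identical distributions, 1 means no overlap). Everything here is an assumption for illustration: the function names, the record layout (a list of dicts), and the choice of metric are hypothetical and not part of this project, SDV, or Synthea. A real evaluation would likely look at many columns and at relationships between them.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Total variation distance between two empirical categorical distributions."""
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(
        abs(real_counts[c] / len(real_values) - synth_counts[c] / len(synthetic_values))
        for c in categories
    )


def rank_generators(real_rows, candidates, column):
    """Rank generators by distance to the real data on one column, closest first.

    `candidates` maps a (hypothetical) generator name to its synthetic rows.
    """
    real_values = [row[column] for row in real_rows]
    scores = {
        name: total_variation_distance(real_values, [row[column] for row in rows])
        for name, rows in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Toy records only; the generator names are placeholders, not real outputs.
    real = [{"sex": "F"}, {"sex": "F"}, {"sex": "M"}, {"sex": "M"}]
    candidates = {
        "generator_a": [{"sex": "F"}, {"sex": "M"}],
        "generator_b": [{"sex": "F"}, {"sex": "F"}],
    }
    for name, score in rank_generators(real, candidates, "sex"):
        print(name, score)
```

The same ranking function could take the output of GaussianCopula, CTGAN, and an LLM-based generator side by side, as long as each is converted to the same row format first.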
