Synthetic data generation from LLMs #7

ablack3 · 2023-07-14T15:21:17Z

I’m interested in using this approach to see if we can create accurate synthetic data from a pretrained LLM. First step would be to have an evaluation framework. Opening this issue for discussion of this use case.

haydenbspence · 2023-07-17T14:26:42Z

I would suggest PandasAI and Synthea as starting points for this. A vector store with instructions may be all that is required to get similar or improved results over Synthea, with the added benefit of being less expensive than tuning a model and allowing the use of larger base models.

Another option is to take real data and generate synthetic data from it. Synthetic Data Vault is a good example of this. From, for example, 100 real records you can expand to more than 100 with GaussianCopula and CTGAN. It would be interesting to use this framework to add an LLM as a third method and evaluate between the three.
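
As a rough illustration of what evaluating between the three could look like, here is a minimal sketch in plain Python. It scores each generator by the average total variation distance between the real and synthetic value distributions of each categorical column (0 means identical marginals, 1 means no overlap). The function names, the column names, and the toy records are all hypothetical, and this is only one possible fidelity metric, not something this repo or any of the tools above prescribes.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Distance between two empirical categorical distributions (0 to 1)."""
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(
        abs(real_counts[c] / len(real_values) - synth_counts[c] / len(synthetic_values))
        for c in categories
    )


def score_generators(real_rows, synthetic_by_method, columns):
    """Average per-column distance for each generator; lower is closer to real."""
    scores = {}
    for method, rows in synthetic_by_method.items():
        per_column = [
            total_variation_distance(
                [r[col] for r in real_rows], [r[col] for r in rows]
            )
            for col in columns
        ]
        scores[method] = sum(per_column) / len(per_column)
    return scores


# Toy, made-up records purely to show the call shape.
real = [{"sex": "F", "smoker": "no"}, {"sex": "M", "smoker": "yes"}]
candidates = {
    "gaussian_copula": [{"sex": "F", "smoker": "no"}, {"sex": "M", "smoker": "yes"}],
    "ctgan": [{"sex": "F", "smoker": "no"}, {"sex": "F", "smoker": "no"}],
    "llm": [{"sex": "M", "smoker": "yes"}, {"sex": "F", "smoker": "yes"}],
}
scores = score_generators(real, candidates, ["sex", "smoker"])
```

This only compares one column at a time, so it says nothing about correlations between columns or about privacy leakage. A real framework would need metrics for those as well.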
