
davanstrien/data-for-fine-tuning-llms


Datasets for fine-tuning LLMs

This repo accompanies a session run as part of Mastering LLMs: A Conference For Developers & Data Scientists. The session focused on some of the data issues related to fine-tuning LLMs.

The notebooks aim to balance three requirements for fine-tuning data: sufficient diversity, high quality, and the right quantity (i.e. avoiding duplication).

Diagram showing goals of the notebooks
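
As a rough illustration of the "right quantity" goal, the sketch below removes exact duplicates from a list of text examples after light normalization. It is not taken from the notebooks; the function names `normalize` and `deduplicate` and the sample data are hypothetical, and near-duplicate detection (for example fuzzy or embedding-based matching) is out of scope here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(examples: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized example, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in examples:
        # Hash the normalized text so the set stays small for long examples.
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Hypothetical sample data: the first two entries differ only in case and spacing.
examples = [
    "Summarize this dataset card.",
    "summarize   this dataset card.",
    "Translate this sentence into French.",
]
unique_examples = deduplicate(examples)
print(unique_examples)
```

Hashing normalized text only catches exact or trivially different copies; it is a cheap first pass that can run before more expensive quality or diversity checks.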

Slides

Notebooks

Synthetic data pipelines

  • dataset-card-summaries: This folder contains a pipeline for generating a synthetic dataset of tl;dr summaries of datasets, based on their dataset cards.
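
To give a feel for one step such a pipeline might include, here is a minimal sketch that turns a dataset card into a summarization prompt. It is an assumption for illustration, not the code in the dataset-card-summaries folder: the function `build_summary_prompt`, the prompt wording, the `max_chars` limit, and the sample card are all hypothetical. It relies only on the fact that dataset cards are Markdown files that may begin with a YAML metadata block delimited by `---` lines.

```python
import re

# Matches a leading YAML metadata block delimited by '---' lines.
FRONT_MATTER = re.compile(r"\A---\s*\n.*?\n---\s*\n", re.DOTALL)


def build_summary_prompt(card_text: str, max_chars: int = 2000) -> str:
    """Strip YAML metadata, truncate the card body, and wrap it in a prompt."""
    body = FRONT_MATTER.sub("", card_text, count=1).strip()
    body = body[:max_chars]  # hypothetical limit to keep the prompt short
    return (
        "Write a tl;dr summary of the dataset described by the card below.\n\n"
        f"CARD:\n{body}\n\nTL;DR:"
    )


# Hypothetical dataset card used only for demonstration.
card = (
    "---\n"
    "license: mit\n"
    "---\n"
    "# Example dataset\n\n"
    "A toy dataset of example sentences.\n"
)
prompt = build_summary_prompt(card)
print(prompt)
```

The resulting prompt would then be sent to whichever model generates the summaries; that call is left out because it depends on the provider and setup used.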

Other resources for synthetic data generation
