Daniel van Strien PRO

davanstrien

AI & ML interests

Machine Learning Librarian

Posts 17

Post
Do you need a high-quality dataset to train a custom sentence transformer model? Look no further! I've developed a pipeline that leverages an LLM to create a synthetic dataset of negative and positive sentence pairs based on domain-specific anchors.

Here's what the pipeline offers:
- **Dataset Generation**: Automatically create synthetic sentence pairs.
- **Hard Negative Mining**: Use an existing embedding model to mine hard negatives.
- **Model Training**: Train a model using the latest release of Sentence Transformers.
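
To make the hard-negative step concrete, here is a minimal sketch of the idea, not the pipeline's actual code. It assumes the sentences have already been embedded; the tiny 2-D vectors and the helper name `pick_hard_negatives` are hypothetical stand-ins for what a real embedding model would produce. A hard negative is a candidate that scores close to the anchor but is not the known positive.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_hard_negatives(anchor, positive_idx, candidates, top_k=1):
    """Return indices of the candidates most similar to the anchor,
    skipping the known positive (hypothetical helper for illustration)."""
    scored = [
        (cosine(anchor, cand), idx)
        for idx, cand in enumerate(candidates)
        if idx != positive_idx
    ]
    # Most similar first: these are the "hardest" negatives.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy embeddings (assumed values, for illustration only).
anchor = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: the known positive
    [0.9, 0.4],   # 1: similar, but not the positive (hard negative)
    [0.0, 1.0],   # 2: unrelated (easy negative)
    [-1.0, 0.0],  # 3: opposite direction
]

hard = pick_hard_negatives(anchor, positive_idx=0, candidates=candidates, top_k=2)
print(hard)
```

In practice the vectors would come from an existing embedding model, and the selected negatives would be paired with each anchor and positive before training.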

Check out this collection (davanstrien/sentence-transformers-from-synthetic-data-66571a6133480d1b70066b70) to see an example of what you can achieve with this pipeline. It features a sentence transformer model that detects similar coding prompts in a @bigcode dataset.

Excited to get started? Find a tutorial here: https://github.com/davanstrien/awesome-synthetic-datasets/tree/main/examples/embedding-datasets.
Post
How can we use open LLMs to create data for training sentence similarity models?

One of the most exciting use cases for LLMs is generating synthetic datasets that can be used to train non-LLM models. In the past, gathering enough data was one of the most significant barriers to training task-specific models. LLMs can potentially help in this area.
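
As a rough illustration of what this can look like, the sketch below builds a prompt for one anchor sentence and parses the model's reply into a training record. It is only a sketch under stated assumptions: the prompt wording, the JSON reply format, and the helper names `build_prompt` and `parse_triplet` are all hypothetical and are not taken from the blog post or the paper. The reply is hard-coded so the example runs without calling any model.

```python
import json

# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    'Here is a sentence: "{anchor}"\n'
    "Write one sentence with a similar meaning and one sentence on a "
    "related topic with a different meaning.\n"
    'Reply with JSON only, using the keys "positive" and "negative".'
)


def build_prompt(anchor):
    """Fill the template with a single anchor sentence."""
    return PROMPT_TEMPLATE.format(anchor=anchor)


def parse_triplet(anchor, raw_reply):
    """Turn a model reply into an anchor/positive/negative record.

    Returns None when the reply is not the expected JSON, so malformed
    generations can be dropped instead of polluting the dataset.
    """
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    positive = data.get("positive")
    negative = data.get("negative")
    if not isinstance(positive, str) or not isinstance(negative, str):
        return None
    if not positive.strip() or not negative.strip():
        return None
    return {
        "anchor": anchor,
        "positive": positive.strip(),
        "negative": negative.strip(),
    }


anchor = "Write a Python function that reverses a string."
prompt = build_prompt(anchor)

# Stand-in for an LLM response (assumed format, not real model output).
fake_reply = (
    '{"positive": "Create a Python function to reverse a string.", '
    '"negative": "Write a Python function that sorts a list."}'
)
record = parse_triplet(anchor, fake_reply)
print(record)
```

Validating each reply before keeping it matters because LLM output does not always follow the requested format.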

I've just written a new blog post on using meta-llama/Meta-Llama-3-70B-Instruct to generate synthetic similarity data, following the approach from the paper "Retrieving Texts based on Abstract Descriptions" (arXiv:2305.12517).

https://huggingface.co/blog/davanstrien/synthetic-similarity-datasets