This repo is for our paper: Sim-GPT: Text Similarity via GPT Annotated Data.
In this repo you can find:
- Scripts to reproduce our results.
- Unlabeled and labeled data used in our paper.
- The best checkpoints reported in our paper.
- December 12, 2023: we released our scripts, checkpoints, and data.
- December 12, 2023: we released our paper on arXiv.
To address a longstanding issue with the STS task, namely the lack of a large collection of high-quality labeled training data, we propose Sim-GPT: an approach that uses GPT-4 to generate data with STS labels, on which an STS model is subsequently trained.
Sim-GPT does not directly ask LLMs (e.g., GPT-4) to provide STS scores for a newly-encountered sentence pair. Rather, it first asks LLMs to generate a relatively large set of training data; second, a smaller model (e.g., backboned by RoBERTa) is trained on the data synthesized by the LLMs; at test time, the trained model is used for inference.
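To make the first stage concrete, here is a minimal sketch of how GPT-4 could be asked to label one sentence pair. The prompt wording and the `parse_sts_label` helper are illustrative assumptions, not the repo's actual code; the released prompts live under `./prompts/`.

```python
import re

def build_annotation_prompt(sent1: str, sent2: str) -> str:
    # Ask GPT-4 for a semantic-similarity score on the standard 0-5 STS scale.
    # The exact wording here is a placeholder; see ./prompts/ for the real ones.
    return (
        "Rate the semantic similarity of the two sentences on a 0-5 scale.\n"
        f"Sentence 1: {sent1}\nSentence 2: {sent2}\nScore:"
    )

def parse_sts_label(reply: str) -> float:
    # Extract the first number from the model's reply, e.g. "Score: 4.2".
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return float(match.group())

# The actual GPT-4 call would look roughly like this (openai>=0.27 API):
# resp = openai.ChatCompletion.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_annotation_prompt(s1, s2)}],
# )
# label = parse_sts_label(resp["choices"][0]["message"]["content"])
```

The parsed (sentence1, sentence2, label) triples then form the training set for the smaller RoBERTa-backboned model.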
An illustration of Sim-GPT is shown below:
- python>=3.7.3
- openai>=0.27.2
For training SimCSE models, we followed SimCSE. For convenience, we have copied its requirements below:
transformers==4.2.1
scipy
datasets
pandas
scikit-learn
prettytable
gradio
torch
setuptools
For training PromCSE models, we followed PromCSE. For convenience, we have copied its requirements below:
transformers==4.2.1
scipy==1.5.4
datasets==1.2.1
pandas==1.1.5
scikit-learn==0.24.0
prettytable==2.1.0
gradio
torch
setuptools==49.3.0
In this part, we offer links to download the source data and provide prompts that guide GPT-4 in the annotation process.
Three types of data:
- Captions (Flickr30K)
- Questions (Quora Question Pairs)
- Multi-genre long sentences (for these, we only release the annotated version)
Prompts:
- Captions:
./prompts/captions.txt
- Questions:
./prompts/questions.txt
- Multi-genre Sentences:
./prompts/multi_genre_sentences.txt
Note that, due to several factors in accessing GPT-4 (e.g., batch size, network conditions), data re-created using the above prompts may vary slightly from the dataset we have released. As mentioned in our paper, even with significant variations in the prompt, the performance of a model trained on the generated data tends to remain consistent on the STS task.
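Since batch size is one of the factors that can make regenerated data differ, re-annotation amounts to loading one of the prompt files above and sending sentence pairs to GPT-4 in fixed-size chunks. A small sketch (the helper names and file layout are assumptions, not the repo's code):

```python
from pathlib import Path
from typing import Iterator, List

def load_prompt(path: str) -> str:
    # Read one of the released prompt templates, e.g. ./prompts/captions.txt.
    return Path(path).read_text(encoding="utf-8")

def batched(items: List, batch_size: int) -> Iterator[List]:
    # Yield fixed-size chunks of sentence pairs; the chosen batch size is one
    # reason regenerated annotations may differ slightly from the release.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```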
- Clone the related project (SimCSE or PromCSE):
- Download the backbone RoBERTa models:
- Fill the roberta path, input file, and output directory into our provided training scripts:
  - SimCSE-RoBERTa
    - base: ./training-parameters/simcse/sup_roberta_base.sh
    - large: ./training-parameters/simcse/sup_roberta_large.sh
  - PromCSE-RoBERTa
    - base: ./training-parameters/promcse/sup_roberta_base.sh
    - large: ./training-parameters/promcse/sup_roberta_large.sh
- Move the modified scripts to the directory of the related project, e.g.:
mv ./training-parameters/simcse/sup_roberta_base.sh SimCSE/
- Run the training script, e.g.:
bash SimCSE/sup_roberta_base.sh
We evaluate Sim-GPT on 7 STS tasks and report Spearman's correlation.
python SimCSE/evaluation.py \
--model_name_or_path simgpt-simcse-roberta-large \
--pooler cls \
--task_set sts \
--mode test
which is expected to output the results in a tabular format:
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 78.79 | 88.22 | 83.48 | 88.32 | 85.48 | 87.91 | 81.07 | 84.75 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
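All scores above are Spearman rank correlations between the model's predicted similarities and the gold labels. The repo's `evaluation.py` computes this through its own evaluation pipeline; as a self-contained illustration of the metric, a minimal pure-Python version:

```python
def _ranks(xs):
    # 1-based ranks, with tied values assigned the average of their positions,
    # as in the standard definition of Spearman's correlation.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, gold):
    # Pearson correlation computed on the ranks of the two score lists.
    rx, ry = _ranks(pred), _ranks(gold)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var
```

A perfectly monotone relationship gives a correlation of 1.0, regardless of the absolute score scale, which is why the metric suits STS evaluation.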
Table 1: Results reported in our paper on 7 STS tasks.
| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|---|---|---|---|---|---|---|---|---|
| *Supervised models* | | | | | | | | |
| InferSent-GloVe | 52.86 | 66.75 | 62.15 | 72.77 | 66.87 | 68.03 | 65.65 | 65.01 |
| Universal Sentence Encoder | 64.49 | 67.80 | 64.61 | 76.83 | 73.18 | 74.92 | 76.69 | 71.22 |
| SRoBERTa-base | 71.54 | 72.49 | 70.80 | 78.74 | 73.69 | 77.77 | 74.46 | 74.21 |
| SBERT-base | 70.97 | 76.53 | 73.19 | 79.09 | 74.30 | 77.03 | 72.91 | 74.89 |
| CT-SBERT-base | 74.84 | 83.20 | 78.07 | 83.84 | 77.93 | 81.46 | 76.42 | 79.39 |
| *SimGPT - SimCSE* | | | | | | | | |
| SimCSE-RoBERTa-base | 76.53 | 85.21 | 80.95 | 86.03 | 82.57 | 85.83 | 80.50 | 82.52 |
| SimGPT - SimCSE-RoBERTa-base | 77.65 (+1.12) | 86.15 (+0.94) | 80.58 (-0.37) | 86.47 (+0.44) | 84.08 (+1.51) | 86.20 (+0.37) | 80.88 (+0.38) | 83.14 (+0.62) |
| SimCSE-RoBERTa-large | 77.46 | 87.27 | 82.36 | 86.66 | 83.93 | 86.70 | 81.95 | 83.76 |
| SimGPT - SimCSE-RoBERTa-large | 78.79 (+1.33) | 88.22 (+0.95) | 83.48 (+1.12) | 88.32 (+1.66) | 85.48 (+1.55) | 87.91 (+1.21) | 81.07 (-0.88) | 84.75 (+0.99) |
| *SimGPT - PromCSE* | | | | | | | | |
| PromCSE-RoBERTa-base | 77.51 | 86.15 | 81.59 | 86.92 | 83.81 | 86.35 | 80.49 | 83.26 |
| SimGPT - PromCSE-RoBERTa-base | 77.74 (+0.23) | 86.82 (+0.77) | 81.36 (-0.23) | 87.01 (+0.09) | 84.58 (+0.77) | 86.98 (+0.63) | 80.48 (-0.01) | 83.57 (+0.31) |
| PromCSE-RoBERTa-large | 79.56 | 88.97 | 83.81 | 88.08 | 84.96 | 87.87 | 82.43 | 85.10 |
| SimGPT - PromCSE-RoBERTa-large | 79.92 (+0.36) | 88.87 (-0.10) | 84.29 (+0.48) | 88.64 (+0.56) | 85.94 (+0.98) | 88.18 (+0.31) | 82.79 (+0.36) | 85.52 (+0.42) |
| Model | Avg. STS |
|---|---|
| simgpt-simcse-roberta-base | 83.14 |
| simgpt-simcse-roberta-large | 84.75 |
| simgpt-promcse-roberta-base | 83.57 |
| simgpt-promcse-roberta-large | 85.52 |
Our released annotated data are:
If you have any issues or questions about this repo, feel free to contact [email protected]