Skip to content

Implementation of Simple Contrastive Learning-based Unsupervised approach to generate sentence embeddings and to perform text similarity in Tensorflow

License

Notifications You must be signed in to change notification settings

fork123aniket/Contrastive-Learning-for-Sentence-Embeddings

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Contrastive Learning for Sentence Embeddings

This repository provides a simple code to implement unsupervised contrastive learning framework for generating sentence embeddings from the paper "SimCSE: Simple Contrastive Learning of Sentence Embeddings".

Setup Environment Requirements

  • Python 3.9
  • transformers 4.16.2
  • tensorflow 2.8
  • tensorflow-addons 0.16.1
  • scikit-learn 1.0.2
  • numpy 1.22.2
  • pandas 1.4.1

Usage

Data

Dataset used can be downloaded directly by using the command, written in data/data.txt file, in Windows PowerShell. The dataset comprises 1 million sentences randomly sampled from English Wikipedia. While running the code, ensuring that the data folder has wiki1m_for_simcse.txt file is required.

Training

  • To train the unsupervised contrastive learning approach, run main.py.
  • All hyperparameters to control model training and the paths to input and output data directories are provided in the given main.py file. The values of these hyperparameters can be altered to see how the approach performs in different hyperparameter settings.
  • model.py file contains the implemented unsupervised contrastive learning-based language model approach that is being imported to main.py file in order to train and test the same approach.
  • This unsupervised contrastive learning approach can also be extended to perform text similarity using the preTrained language model, the example of which can be found in main.py file.
  • The average loss for the trained model is printed after every epoch.

Text Similarity

Input Sentences

"chocolates are my favourite items.",
"The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
"The person box was packed with jelly many dozens of months later.",
"white chocolates and dark chocolates are favourites for many people.",
"I love chocolates.",
"Let me help you.",
"There are some who influenced many.",
"Chips are getting more popular these days.",
"There are tools which help us get our work done.",
"Electric vehicles are worth buying given their mileage on the road.",
"NATO is the most powerful military alliance.",
"Gone are the days when people got worry about their diets."

Sentence to be compared with the other given sentences

"chocolates are my favourite items."

Similarity Scores

0.84871584, 0.8847387, 0.874104, 0.96159446, 0.87748206,
0.88612396, 0.9087229, 0.86401033, 0.90140533, 0.8532164,
0.8539922

Analysis

Results, mentioned above, show higher similarity between 1st and 5th sentences than any other possible 1st sentence combination with the remaining 10 input sentences.

About

Implementation of Simple Contrastive Learning-based Unsupervised approach to generate sentence embeddings and to perform text similarity in Tensorflow

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages