ValenTwin is a schema matching framework that trains a model with self-supervised contrastive learning, uses the model to generate embeddings of table columns, and then matches the column embeddings using different similarity measures.
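At a high level, the pipeline can be sketched as follows. This is an illustration only: the toy vectors stand in for the embeddings produced by the trained encoder, and none of these names are part of the actual ValenTwin API.

```python
import math

# Toy column "embeddings" standing in for the encoder's output vectors.
source_cols = {"title": [0.9, 0.1], "author": [0.1, 0.9]}
target_cols = {"book_name": [0.85, 0.2], "writer": [0.15, 0.88]}

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Match each source column to the closest target column.
matches = {
    s: min(target_cols, key=lambda t: euclidean(v, target_cols[t]))
    for s, v in source_cols.items()
}
print(matches)  # {'title': 'book_name', 'author': 'writer'}
```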
- Clone the repository:
```shell
git clone https://github.com/albertus-andito/valentwin.git
cd valentwin
```
- Install the required packages. It is recommended to use a virtual environment.
```shell
pip install -r requirements.txt
pip install -e .
```
The datasets can be downloaded from https://zenodo.org/records/11413479.
We provide two types of zip files for the datasets:
- `data.zip` contains the raw data files, the ground truth files, the sampled data (n = [100, 200, 300, 400, 500]) used in the experiments, as well as the contrastive data used to train the model.
- `data-raw.zip` contains only the raw data files and the ground truth files. You can sample the data and generate the contrastive dataset yourself by following steps 1 and 2 in the How to Run section.

Download and unzip one of the zip files to the `data` folder.
The sample data is in the `data` folder. If you don't already have it and want to sample the data yourself, you can run `split_and_sample_datasets.py` from the `scripts` folder, e.g.:
```shell
python scripts/split_and_sample_datasets.py --dataset_dir_paths data/magellan/books/formatted \
    --sample_dataset_dir_paths data/magellan/books/sample \
    --sample_sizes 100 \
    --split_ratio 0.4 0.2 0.4 \
    --include_all_samples \
    --drop_duplicates \
    --seed 42
```
or just run the bash script:
```shell
cd scripts/shell
sh sample_data.sh
```
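For reference, the `--split_ratio 0.4 0.2 0.4` flag splits each dataset into train/validation/test partitions. A minimal sketch of that ratio logic in plain Python (an illustration only, not the script's actual implementation):

```python
import random

def split_rows(rows, ratios=(0.4, 0.2, 0.4), seed=42):
    """Shuffle rows deterministically, then split by the given ratios."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train, val, test = split_rows(range(100))
print(len(train), len(val), len(test))  # 40 20 40
```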
The contrastive data is in the `data` folder. If you don't already have it and want to generate the contrastive data yourself, you can run `generate_contrastive_data.py` from the `scripts` folder, e.g.:
```shell
python scripts/generate_contrastive_data.py --input_dir_paths data/magellan/books/sample/100-train \
    --output_dir_paths data/magellan/books/contrastive-selective/100 \
    --hard_neg_size 10 \
    --with_col_table_names \
    --use_selective_negatives \
    --pretrained_model_name_or_path princeton-nlp/sup-simcse-roberta-base \
    --pooling cls \
    --device cuda:0 \
    --num_values_per_item 1
```
or just run the bash script:
```shell
cd scripts/shell
sh generate_contrastive_data.sh
```
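The `--hard_neg_size` flag controls how many hard negatives are mined per anchor: columns from other tables whose embeddings are most similar to the anchor's. A minimal sketch of that selection idea over toy vectors (all names here are hypothetical, and the actual script uses the pretrained model's embeddings):

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

anchor_vec = [0.9, 0.1, 0.2]  # embedding of the anchor column, e.g. "price"
candidates = {
    "cost":   [0.8, 0.2, 0.1],
    "isbn":   [0.1, 0.9, 0.3],
    "rating": [0.7, 0.3, 0.4],
}

# Hard negatives: the candidate columns most similar to the anchor (top 2 here).
hard_negs = sorted(candidates,
                   key=lambda c: cosine(anchor_vec, candidates[c]),
                   reverse=True)[:2]
print(hard_negs)  # ['cost', 'rating']
```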
We provide the trained models at the link in the Trained Models section, which you can use directly.
However, you can also train the model yourself.
To train the model, you can run `train_valentwin.py` from the `scripts` folder, e.g.:
```shell
python scripts/train_valentwin.py --model_name_or_path princeton-nlp/sup-simcse-roberta-base \
    --train_file data/magellan/books/contrastive-selective/100/train.csv \
    --validation_file data/magellan/books/contrastive-selective/100/val.csv \
    --eval_file data/magellan/books/contrastive-selective/100/test.csv \
    --output_dir scripts/result/valentwin-books \
    --num_train_epochs 3 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --pooler_type cls \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --label_names [] \
    --logging_strategy epoch \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --fp16 \
    --use_in_batch_instances_as_negatives \
    --restrictive_in_batch_negatives
```
or you can follow the multi-GPU training example in `scripts/shell/train_valentwin.sh`.
Once the model has finished training, you need to convert it from a SimCSE model to a HuggingFace-type model by running the `simcse_to_huggingface_and_push.py` script.
To use the model to match the schemas, you can run `valentwin-batch-matching.py` from the `scripts` folder, e.g.:
```shell
python scripts/valentwin-batch-matching.py \
    --pretrained_model_names_or_paths scripts/result/valentwin-books \
    --measures euc \
    --column_name_weights 0.4 \
    --column_name_measures euc \
    --holistic \
    --tables_root_dir data/magellan/books/sample/100-test \
    --output_root_dir data/magellan/books/output/100 \
    --device cuda:0
```
or you can follow the example in `scripts/shell/valentwin-matching.sh`.
To evaluate the matching results, you can run `calculate_metrics.py` from the `scripts` folder, e.g.:
```shell
python scripts/calculate_metrics.py \
    --input_dir_path data/magellan/books/output/100 \
    --output_file_path data/magellan/books/metrics/100.csv \
    --ground_truth_file_path data/magellan/books/ground-truth-mapping/ground-truth.csv \
    --do_annotate_tp_fp \
    --split_by_column_types \
    --parallel_workers -1
```
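Conceptually, the evaluation compares predicted column matches against the ground-truth pairs to derive precision/recall-style metrics. A minimal sketch of that comparison with hypothetical match pairs (not the script's actual file format):

```python
# Predicted and ground-truth matches as (source_column, target_column) pairs.
predicted = {("title", "book_name"), ("author", "writer"), ("price", "isbn")}
ground_truth = {("title", "book_name"), ("author", "writer"), ("price", "cost")}

tp = len(predicted & ground_truth)   # true positives: correctly predicted pairs
precision = tp / len(predicted)
recall = tp / len(ground_truth)
f1 = 2 * precision * recall / (precision + recall)
print(tp, round(f1, 3))  # 2 0.667
```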
Each subfolder in the `experiments` folder contains the scripts to reproduce the experiments reported in the paper.
It is assumed that you have already sampled the data and generated the contrastive data as described in the How to Run section.
This repository is largely based on Valentine. We also include code from SimCSE, with modifications for model training. The code for the competitor methods is also taken from their respective repositories: ALITE and Starmie.