ValenTwin is a schema matching framework that trains a model with self-supervised contrastive learning, uses the model to generate embeddings of table columns, and then matches the column embeddings using different similarity measures.
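At a high level, the pipeline can be sketched as follows. This is an illustration only: the toy vectors stand in for the embeddings produced by the trained encoder, and none of these names are part of the actual ValenTwin API.

```python
import math

# Toy column "embeddings" standing in for the encoder's output vectors.
source_cols = {"title": [0.9, 0.1], "author": [0.1, 0.9]}
target_cols = {"book_name": [0.85, 0.2], "writer": [0.15, 0.88]}

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Match each source column to the closest target column.
matches = {
    s: min(target_cols, key=lambda t: euclidean(v, target_cols[t]))
    for s, v in source_cols.items()
}
print(matches)  # {'title': 'book_name', 'author': 'writer'}
```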
- Clone the repository:
```shell
git clone https://github.com/albertus-andito/valentwin.git
cd valentwin
```
- Install the required packages. It is recommended to use a virtual environment.
```shell
pip install -r requirements.txt
pip install -e .
```
The datasets can be downloaded from https://zenodo.org/records/11413479.
We provide two types of zip files for the datasets:
- `data.zip` contains the raw data files, the ground truth files, the sampled data (n = [100, 200, 300, 400, 500]) used in the experiments, as well as the contrastive data used to train the model.
- `data-raw.zip` contains only the raw data files and the ground truth files. You can sample the data and generate the contrastive dataset yourself by following steps 1 and 2 in the How to Run section.

Download and unzip one of the zip files to the `data` folder.
The sample data is in the `data` folder. If you don't already have it and want to sample the data yourself, you can run `split_and_sample_datasets.py` from the `scripts` folder, e.g.:
```shell
python scripts/split_and_sample_datasets.py --dataset_dir_paths data/magellan/books/formatted \
    --sample_dataset_dir_paths data/magellan/books/sample \
    --sample_sizes 100 \
    --split_ratio 0.4 0.2 0.4 \
    --include_all_samples \
    --drop_duplicates \
    --seed 42
```
or just run the bash script:
```shell
cd scripts/shell
sh sample_data.sh
```
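For reference, the `--split_ratio 0.4 0.2 0.4` flag splits each dataset into train/validation/test partitions. A minimal sketch of that ratio logic in plain Python (an illustration only, not the script's actual implementation):

```python
import random

def split_rows(rows, ratios=(0.4, 0.2, 0.4), seed=42):
    """Shuffle rows deterministically, then split by the given ratios."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train, val, test = split_rows(range(100))
print(len(train), len(val), len(test))  # 40 20 40
```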
The contrastive data is in the `data` folder. If you don't already have it and want to generate the contrastive data yourself, you can run `generate_contrastive_data.py` from the `scripts` folder, e.g.:
```shell
python scripts/generate_contrastive_data.py --input_dir_paths data/magellan/books/sample/100-train \
    --output_dir_paths data/magellan/books/contrastive-selective/100 \
    --hard_neg_size 10 \
    --with_col_table_names \
    --use_selective_negatives \
    --pretrained_model_name_or_path princeton-nlp/sup-simcse-roberta-base \
    --pooling cls \
    --device cuda:0 \
    --num_values_per_item 1
```
or just run the bash script:
```shell
cd scripts/shell
sh generate_contrastive_data.sh
```
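The `--hard_neg_size` flag controls how many hard negatives are mined per anchor: columns from other tables whose embeddings are most similar to the anchor's. A minimal sketch of that selection idea over toy vectors (all names here are hypothetical, and the actual script uses the pretrained model's embeddings):

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

anchor_vec = [0.9, 0.1, 0.2]  # embedding of the anchor column, e.g. "price"
candidates = {
    "cost":   [0.8, 0.2, 0.1],
    "isbn":   [0.1, 0.9, 0.3],
    "rating": [0.7, 0.3, 0.4],
}

# Hard negatives: the candidate columns most similar to the anchor (top 2 here).
hard_negs = sorted(candidates,
                   key=lambda c: cosine(anchor_vec, candidates[c]),
                   reverse=True)[:2]
print(hard_negs)  # ['cost', 'rating']
```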
We provide the trained models at the link in the Trained Models section, which you can use directly.
However, you can also train the model yourself.
To train the model, you can run `train_valentwin.py` from the `scripts` folder, e.g.:
```shell
python scripts/train_valentwin.py --model_name_or_path princeton-nlp/sup-simcse-roberta-base \
    --train_file data/magellan/books/contrastive-selective/100/train.csv \
    --validation_file data/magellan/books/contrastive-selective/100/val.csv \
    --eval_file data/magellan/books/contrastive-selective/100/test.csv \
    --output_dir scripts/result/valentwin-books \
    --num_train_epochs 3 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --pooler_type cls \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --label_names [] \
    --logging_strategy epoch \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --fp16 \
    --use_in_batch_instances_as_negatives \
    --restrictive_in_batch_negatives
```
or you can follow the multi-GPU training example in `scripts/shell/train_valentwin.sh`.
Once the model has finished training, you need to convert it from a SimCSE model to a HuggingFace-type model by running the `simcse_to_huggingface_and_push.py` script.
To use the model to match the schemas, you can run `valentwin-batch-matching.py` from the `scripts` folder, e.g.:
```shell
python scripts/valentwin-batch-matching.py \
    --pretrained_model_names_or_paths scripts/result/valentwin-books \
    --measures euc \
    --column_name_weights 0.4 \
    --column_name_measures euc \
    --holistic \
    --tables_root_dir data/magellan/books/sample/100-test \
    --output_root_dir data/magellan/books/output/100 \
    --device cuda:0
```
or you can follow the example in `scripts/shell/valentwin-matching.sh`.
To evaluate the matching results, you can run `calculate_metrics.py` from the `scripts` folder, e.g.:
```shell
python scripts/calculate_metrics.py \
    --input_dir_path data/magellan/books/output/100 \
    --output_file_path data/magellan/books/metrics/100.csv \
    --ground_truth_file_path data/magellan/books/ground-truth-mapping/ground-truth.csv \
    --do_annotate_tp_fp \
    --split_by_column_types \
    --parallel_workers -1
```
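Conceptually, the evaluation compares predicted column matches against the ground-truth pairs to derive precision/recall-style metrics. A minimal sketch of that comparison with hypothetical match pairs (not the script's actual file format):

```python
# Predicted and ground-truth matches as (source_column, target_column) pairs.
predicted = {("title", "book_name"), ("author", "writer"), ("price", "isbn")}
ground_truth = {("title", "book_name"), ("author", "writer"), ("price", "cost")}

tp = len(predicted & ground_truth)   # true positives: correctly predicted pairs
precision = tp / len(predicted)
recall = tp / len(ground_truth)
f1 = 2 * precision * recall / (precision + recall)
print(tp, round(f1, 3))  # 2 0.667
```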
Each subfolder in the `experiments` folder contains the scripts to reproduce the experiments reported in the paper.
It is assumed that you have already sampled the data and generated the contrastive data as described in the How to Run section.
This repository is largely based on Valentine. We also include code from SimCSE, with modifications for model training. The code for the competitor methods is also taken from their respective repositories: ALITE and Starmie.