Data Preparation

Our datasets are generated using the following procedures.

SPMotif Datasets

We adopt the code of DIR to generate the SPMotif datasets. SPMotif-Struc is essentially the same as the SPMotif datasets in DIR and can be generated by running dataset_gen/gen_struc.py, with the bias configured through the value of global_b:

cd dataset_gen
python gen_struc.py

The generated data will be stored in ./data/SPMotif-{global_b} at the root directory of this repo. To use the dataset in main.py, set the --dataset option to SPMotif and the --bias option to the corresponding bias.

To generate the SPMotif-Mixed datasets, run the analogous script, again with the bias configured through the value of global_b:

cd dataset_gen
python gen_mixed.py

The generated data will be stored in ./data/mSPMotif-{global_b} at the root directory of this repo. gen_mixed.py adds the graph size shifts and structure-level shifts, while ./datasets/spmotif_dataset.py automatically adds node feature-level shifts during data preparation. To use the dataset in main.py, set the --dataset option to mSPMotif and the --bias option to the corresponding bias.
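As a quick sanity check before launching main.py, the directory naming convention above can be turned into a small lookup. This is only a sketch: the helper name spmotif_dir is hypothetical and not part of this repo, and it assumes the {global_b} part of the folder name is the bias written exactly as it was configured (for example 0.33). If the generated folder names differ, adjust accordingly.

```python
from pathlib import Path


def spmotif_dir(prefix: str, bias: str, root: str = "./data") -> Path:
    """Return the folder expected to hold a generated SPMotif dataset.

    Hypothetical helper mirroring the ./data/{prefix}-{global_b} convention
    described above; `prefix` is SPMotif (gen_struc.py) or mSPMotif
    (gen_mixed.py). The bias is kept as a string so that no number
    formatting is assumed.
    """
    if prefix not in ("SPMotif", "mSPMotif"):
        raise ValueError(f"unknown SPMotif variant: {prefix!r}")
    return Path(root) / f"{prefix}-{bias}"
```

For example, spmotif_dir("mSPMotif", "0.33").is_dir() tells you whether gen_mixed.py has already been run for that bias.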

DrugOOD Datasets

To obtain the DrugOOD datasets tested in our paper, i.e., drugood_lbap_core_ic50_assay, drugood_lbap_core_ic50_scaffold and drugood_lbap_core_ic50_size, we use the DrugOOD curation code at commit eeb00b8da7646e1947ca7aec93041052a48bd45e together with the chembl_29 database. After curating the datasets, put the corresponding json files under ./data/DrugOOD, and set the --dataset option to the corresponding dataset name, e.g., drugood_lbap_core_ic50_assay.
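A check like the following can confirm that the curated files are in place before training. It is a sketch under assumptions: the helper name find_drugood_json is hypothetical, and it assumes the curated json file name starts with the dataset name, which may not match what the curation code actually writes.

```python
from pathlib import Path
from typing import List


def find_drugood_json(dataset: str, root: str = "./data/DrugOOD") -> List[Path]:
    """List json files under `root` whose file name starts with `dataset`.

    Hypothetical helper; the "name starts with the dataset name" rule is an
    assumption, not something the curation code guarantees.
    Returns an empty list when the folder or the file is missing.
    """
    return sorted(Path(root).glob(f"{dataset}*.json"))
```

An empty result for drugood_lbap_core_ic50_assay suggests the file is either missing or named differently from what this sketch assumes.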

CMNIST-sp

The CMNIST dataset is generated following the procedure of Invariant Risk Minimization (IRM) and then converted into graphs using the SLIC superpixels algorithm. To generate the dataset, run the following commands:

cd dataset_gen
python prepare_mnist.py  --dataset 'cmnist'  -t 8 -s 'train'
python prepare_mnist.py  --dataset 'cmnist'  -t 8 -s 'test'

The generated data will be put into ./data/CMNISTSP at the root directory of this repo. Note that two auxiliary datasets, ./data/MNIST and ./data/ColoredMNIST, will also be created as the base for generating ./data/CMNISTSP. To use the dataset, set the --dataset option to CMNIST.
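For readers unfamiliar with the IRM coloring step, the sketch below illustrates how labels and colors are assigned in the Colored MNIST setup of the IRM paper: digits are binarized, the label is flipped with some probability, and the color follows the (noisy) label except for a per-environment flip. The function name and defaults are illustrative assumptions; prepare_mnist.py may use different probabilities or a different implementation, and the superpixel conversion is not shown.

```python
import random
from typing import Iterable, List, Tuple


def irm_color_assignments(
    digits: Iterable[int],
    color_flip_prob: float,
    label_flip_prob: float = 0.25,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Assign a (label, color_id) pair to each digit, IRM style.

    Illustrative sketch, not the repo's code:
    - the preliminary label is 0 for digits 0-4 and 1 for digits 5-9;
    - the label is flipped with probability `label_flip_prob`;
    - the color id equals the final label, flipped with probability
      `color_flip_prob` (the environment-specific spurious correlation).
    """
    rng = random.Random(seed)
    pairs = []
    for digit in digits:
        label = int(digit >= 5)
        if rng.random() < label_flip_prob:
            label = 1 - label
        color_id = label
        if rng.random() < color_flip_prob:
            color_id = 1 - color_id
        pairs.append((label, color_id))
    return pairs
```

In the IRM paper, the training environments use a small color flip probability (0.1 and 0.2) and the test environment a large one (0.9), so a model that relies on color fails at test time.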

Graph-SST5 and Twitter

Both Graph-SST5 and Twitter are based on the datasets provided by DIG. To get the datasets, download them via this link provided by DIG and the authors of the GNN explainability survey. Then unzip the data into ./data/Graph-SST2/raw and ./data/Graph-Twitter/raw. When --dataset is set to the dataset name in main.py, the data loading process adds the degree biases automatically.

NCI1, NCI109, PROTEINS and DD

We use the datasets provided by the size-invariant-GNNs authors, who already sampled the datasets with graph size distribution shifts injected. The datasets can be downloaded via this link. After downloading, unzip the datasets into ./data/TU. To use the datasets, set --dataset to the dataset name in main.py.
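If you prefer to script the unzip step, a minimal sketch using only the Python standard library is shown below. The helper name unzip_into is hypothetical, and the archive path is whatever file you downloaded; its name is not specified here.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip_into(archive: str, target: str) -> List[str]:
    """Extract a downloaded zip archive into `target`, creating it if needed.

    Hypothetical helper. Returns the sorted names of the top-level entries
    found in `target` afterwards so the layout can be inspected.
    """
    target_dir = Path(target)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    return sorted(p.name for p in target_dir.iterdir())
```

For instance, calling unzip_into with the path of the downloaded archive and "./data/TU" as the target extracts the files and lists what ended up in that folder.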