Forked from allenai/S2AND

Semantic Scholar's Author Disambiguation Algorithm & Evaluation Suite

S2AND

This repository provides access to the S2AND dataset and S2AND reference model described in the paper S2AND: A Benchmark and Evaluation System for Author Name Disambiguation by Shivashankar Subramanian, Daniel King, Doug Downey, Sergey Feldman.

The reference model is live on semanticscholar.org, and the trained model is available now as part of the data download (see below).

Installation

To install this package, run the following:

git clone https://github.com/atypon/S2AND.git
cd S2AND
conda create -y --name s2and python==3.7
conda activate s2and
pip install -e .

If you run into cryptic GCC errors on macOS while installing the requirements, try this instead:

CFLAGS='-stdlib=libc++' pip install -r requirements.in

Data

To obtain the S2AND dataset, run the following command after the package is installed (from inside the S2AND directory):

aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/

(Alternatively, you can run gsutil -m cp -r gs://pkg-datasets-regional-3da58327/datasets/S2AND/data data/)

Note that this software package comes with tools specifically designed to access and model the dataset.

For the data extended with PKG's info space, run the following command:

gsutil -m cp -r gs://pkg-datasets-regional-3da58327/datasets/S2AND/extended_data/ extended_data/

Configuration

Modify the config file at data/path_config.json. This file should look like this:

{
    "main_data_dir": "absolute path to wherever you downloaded the data to",
    "internal_data_dir": "ignore this one unless you work at AI2"
}

As the dummy file says, main_data_dir should point to wherever you downloaded the data, while internal_data_dir can be ignored; it is only used by scripts that rely on unreleased data internal to Semantic Scholar.
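If you prefer to generate the config programmatically, the snippet below is a minimal sketch that writes the two keys shown above. The helper name write_path_config is hypothetical (not part of the S2AND package); only the key names come from the dummy config.

```python
import json
from pathlib import Path

def write_path_config(config_path, data_dir):
    """Hypothetical helper: write path_config.json pointing at the data download."""
    config = {
        "main_data_dir": str(Path(data_dir).resolve()),  # absolute path to the data
        "internal_data_dir": "",  # unused outside of AI2
    }
    Path(config_path).parent.mkdir(parents=True, exist_ok=True)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config

cfg = write_path_config("data/path_config.json", "data")
print(cfg["main_data_dir"])
```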

Run

There are two main run scripts that perform the disambiguation process:

  • run_inference.py: Produces embeddings for every signature in the data using the selected transformer model, defined by an ONNX file.
  • run_and.py: Runs the complete AND (author name disambiguation) procedure by training a pairwise classifier and optimizing a clustering algorithm.
