S2AND

This repository provides access to the S2AND dataset and S2AND reference model described in the paper S2AND: A Benchmark and Evaluation System for Author Name Disambiguation by Shivashankar Subramanian, Daniel King, Doug Downey, Sergey Feldman.

The reference model is live on semanticscholar.org, and the trained model is available now as part of the data download (see below).

Installation

To install this package, run the following:

git clone https://github.com/atypon/S2AND.git
cd S2AND
conda create -y --name s2and python==3.7
conda activate s2and
pip install -e .

If you run into cryptic errors about GCC on macOS while installing the requirements, try this instead:

CFLAGS='-stdlib=libc++' pip install -r requirements.in

Data

To obtain the S2AND dataset, run the following command after the package is installed (from inside the S2AND directory):

aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/

(Alternatively, you can run gsutil -m cp -r gs://pkg-datasets-regional-3da58327/datasets/S2AND/data data/)

Note that this software package comes with tools specifically designed to access and model the dataset.

To obtain the data extended with PKG's info space, run the following command:

gsutil -m cp -r gs://pkg-datasets-regional-3da58327/datasets/S2AND/extended_data/ extended_data/

Configuration

Modify the config file at data/path_config.json. This file should look like this:

{
    "main_data_dir": "absolute path to wherever you downloaded the data to",
    "internal_data_dir": "ignore this one unless you work at AI2"
}

As the dummy file says, main_data_dir should be set to the absolute path of the directory you downloaded the data to. internal_data_dir can be ignored; it is only used by some scripts that rely on unreleased data internal to Semantic Scholar.
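If you would rather not edit the file by hand, the following is a minimal sketch of filling in main_data_dir from Python. The helper name write_path_config is hypothetical and not part of the s2and package; the sketch assumes only the two keys shown above and leaves any existing internal_data_dir value untouched.

```python
import json
from pathlib import Path


def write_path_config(data_dir, config_path=None):
    """Set main_data_dir in path_config.json to the absolute path of data_dir.

    Hypothetical helper, not part of s2and. By default the config file is
    assumed to live inside the data directory (data/path_config.json).
    """
    data_dir = Path(data_dir).resolve()  # the config expects an absolute path
    if config_path is None:
        config_path = data_dir / "path_config.json"
    config_path = Path(config_path)

    # Keep whatever is already in the file, e.g. an existing internal_data_dir.
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text())

    config["main_data_dir"] = str(data_dir)
    # Placeholder text copied from the dummy file; safe to ignore outside AI2.
    config.setdefault("internal_data_dir", "ignore this one unless you work at AI2")

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

Calling write_path_config("data") from inside the S2AND directory would point main_data_dir at the data/ folder created by the download step.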

Run

The disambiguation process is driven by three main run scripts; the two described here are:

run_inference.py: Produces embeddings for all signatures in the data, using the selected transformer model defined by an ONNX file.
run_and.py: Runs the complete AND (author name disambiguation) procedure by training a pairwise classifier and optimizing a clustering algorithm.
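To give a feel for the "pairwise classifier plus clustering" idea, here is a toy sketch in plain Python. It is not the code in run_and.py and does not use the s2and API: the scoring function is a hand-written stand-in for a trained classifier, the threshold is fixed rather than optimized, single-linkage is chosen only for brevity, and the signature records and names are made up for illustration.

```python
from itertools import combinations


def pairwise_score(sig_a, sig_b):
    """Hand-written stand-in for a trained pairwise classifier.

    Returns a score in [0, 1]; higher means "more likely the same author".
    """
    same_last = sig_a["last"].lower() == sig_b["last"].lower()
    same_initial = sig_a["first"][:1].lower() == sig_b["first"][:1].lower()
    if not (same_last and same_initial):
        return 0.0
    shared = set(sig_a["coauthors"]) & set(sig_b["coauthors"])
    return 0.9 if shared else 0.4


def cluster(signatures, threshold=0.5):
    """Single-linkage clustering: merge every pair scoring >= threshold."""
    parent = {sid: sid for sid in signatures}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(signatures, 2):
        if pairwise_score(signatures[a], signatures[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for sid in signatures:
        clusters.setdefault(find(sid), []).append(sid)
    return sorted(sorted(members) for members in clusters.values())


# Made-up signatures: one author mention on one paper each.
signatures = {
    "s1": {"first": "Jane", "last": "Doe", "coauthors": ["A. Example"]},
    "s2": {"first": "J.", "last": "Doe", "coauthors": ["A. Example", "B. Sample"]},
    "s3": {"first": "John", "last": "Doe", "coauthors": ["C. Other"]},
}
print(cluster(signatures))  # [['s1', 's2'], ['s3']]
```

Here s1 and s2 share a coauthor and are merged, while s3 matches only on name and stays in its own cluster. In the real pipeline the score comes from a classifier trained on the dataset, and the clustering settings are tuned rather than hard-coded.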
