Inference and benchmarking of higher-order community structure
This repository contains the implementation of the Hy-MMSBM model presented in

[1] Community Detection in Large Hypergraphs
    Nicolò Ruggeri, Martina Contisciani, Federico Battiston and Caterina De Bacco
    [Science Advances - ArXiv]

[2] A framework to generate hypergraphs with community structure
    Nicolò Ruggeri, Federico Battiston and Caterina De Bacco
    [ArXiv]
Respectively, these works present efficient inference and sampling methods based on
Hy-MMSBM, a stochastic block model for higher-order interactions.
This code is made publicly available; if you make use of it, please cite our work
in the form of the references above.
The code was developed with Python 3.9 and can be downloaded and used locally as-is.
To install the necessary packages, run the following command:
pip install -r requirements.txt
The inference of the affinity matrix w and of the community assignments u is
performed by running the code in main_inference.py.
The most basic run only needs a dataset, the number of communities K, and a path to store the results.
For example, to perform inference on the Senate Bills dataset with K=2
communities, one can run the following command:
python main_inference.py --K 2 --out_dir ./out_inference/senate_bills --real_dataset senate-bills
It is possible to provide the input dataset in three formats.
1. Real preprocessed dataset
We make available all the real datasets utilized in the
experiments in [1]; see the data release section below.
After the download, any of these datasets can be utilized for inference by running the
command above and specifying the dataset name using the --real_dataset
argument.
The complete list of such datasets is available at src.data.data_io.PREPROCESSED_DATASETS.
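For a quick look at the available names, this list can be printed directly from within the repository; a minimal sketch, assuming the repository root is the working directory and that PREPROCESSED_DATASETS is an iterable of dataset names:

# Print the names of all preprocessed real datasets.
# Assumes the script is run from the repository root, so that src is importable.
from src.data.data_io import PREPROCESSED_DATASETS

for name in PREPROCESSED_DATASETS:
    print(name)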
2. Text format
Alternatively, a hypergraph can be provided as input via two .txt files,
containing the list of hyperedges and their respective weights.
This allows the user to provide arbitrary datasets as inputs.
We provide examples for the format expected in ./data/examples/justice_dataset/hyperedges.txt
and ./data/examples/justice_dataset/weights.txt.
To perform inference on a dataset specified in text format, provide the paths to the two
files as follows:
python main_inference.py \
    --K 2 \
    --out_dir ./out_inference/justice \
    --hyperedge_file ./data/examples/justice_dataset/hyperedges.txt \
    --weight_file ./data/examples/justice_dataset/weights.txt
In this case, the command above is equivalent to running with --real_dataset justice.
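For custom datasets, the two text files can be produced with a few lines of Python; a minimal sketch, where the exact layout (one hyperedge per line with space-separated node ids, and the matching weight on the same line of the weights file) is an assumption to be checked against the example files above:

# Write a toy hypergraph in the two-file text format.
# Assumed layout (compare with the example files): one hyperedge per line,
# space-separated node ids, and the matching weight on the same line of
# the weights file.
hyperedges = [(0, 1, 2), (1, 3), (0, 2, 3, 4)]
weights = [1, 2, 1]

with open("hyperedges.txt", "w") as f:
    for hye in hyperedges:
        f.write(" ".join(map(str, hye)) + "\n")

with open("weights.txt", "w") as f:
    for weight in weights:
        f.write(f"{weight}\n")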
3. Pickle format
Finally, one can provide a Hypergraph instance, which is the main representation
utilized internally in the code (see src.data.representation), serialized via the
pickle Python library.
An example equivalent to the above is
python main_inference.py \
    --K 2 \
    --out_dir ./out_inference/justice \
    --pickle_file ./data/examples/justice_dataset/justice.pkl
Similarly to the text format, this allows the user to provide arbitrary hypergraphs as input.
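For reference, the example pickle can be inspected, and any Hypergraph instance serialized back, with the standard pickle API; a minimal sketch, assuming the repository code is importable (e.g. by running from the repository root), since unpickling requires the Hypergraph class definition:

import pickle

# Load the example hypergraph shipped with the repository.
with open("./data/examples/justice_dataset/justice.pkl", "rb") as f:
    hypergraph = pickle.load(f)
print(type(hypergraph))  # the internal representation from src.data.representation

# Conversely, any Hypergraph instance can be serialized the same way.
with open("my_hypergraph.pkl", "wb") as f:
    pickle.dump(hypergraph, f)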
Additional options can be specified; the full documentation is shown by running
python main_inference.py --help
Among the most important ones we list:

- --assortative: whether to run inference with a diagonal affinity matrix w.
- --max_hye_size: keep only hyperedges up to a given size for inference. If None, all hyperedges are utilized.
- --w_prior and --u_prior: the rates for the exponential priors on the parameters. A value of zero is equivalent to no prior; any positive value is utilized for MAP inference. For non-uniform priors, the path to a file containing a NumPy array can be specified, which will be loaded via numpy.load (see the sketch after this list).
- --em_rounds: the number of EM steps during optimization. Sometimes useful when the model doesn't converge rapidly.
- --training_rounds: the number of models to train with different random initializations. The one with the highest log-likelihood is returned and saved.
- --seed: integer random seed.
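For example, a non-uniform prior for w could be prepared as follows; a minimal sketch, where the KxK shape of the array is an assumption to be checked against the inference code (the only stated requirement is that the file be loadable via numpy.load):

import numpy as np

K = 2
# Hypothetical non-uniform exponential rates for the affinity matrix w;
# the K x K shape is an assumption, compare with the inference code.
w_prior = np.ones((K, K))
w_prior[0, 1] = w_prior[1, 0] = 2.0

# numpy.save produces a .npy file, which numpy.load reads back.
np.save("w_prior.npy", w_prior)

The resulting path can then be passed as --w_prior w_prior.npy.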
Sampling of synthetic data from the generative model can be performed by running the
code in main_sampling.py. As explained in the reference paper [2], various inputs can
be provided to condition the sampling procedure.
We also refer to the notebook for additional examples.
The simplest form of sampling only requires the affinity matrix w and the
community assignments u. These need to be provided as text files readable
via numpy.loadtxt, for example:
python main_sampling.py \
    --w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
    --u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
    --out_dir ./out_sampling
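Such files can be generated with numpy.savetxt; a minimal sketch for N=50 nodes and K=2 communities, where the shapes (K x K for w, N x K for u) are assumptions consistent with the model description and with the example files above:

import numpy as np

rng = np.random.default_rng(seed=123)
N, K = 50, 2

# Assortative-leaning K x K affinity matrix and random N x K soft assignments.
w = np.array([[10.0, 1.0],
              [1.0, 10.0]])
u = rng.random((N, K))

# Save in the text format expected by main_sampling.py (read via numpy.loadtxt).
np.savetxt("w.txt", w)
np.savetxt("u.txt", u)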
Samples can also be conditioned on the degree sequence (i.e. the array specifying each
node's degree), the size sequence (i.e. the count of hyperedges of any given size), or both.
These need to be stored in text files: the degree sequence via numpy.savetxt, and the
size sequence in a text file specifying, on every line, a hyperedge size and
the number of hyperedges of that size. Examples of these formats are given in
./data/examples at fixed_dimension_sequence_hard and justice_dataset.
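A minimal sketch producing the two files under the formats just described (the two-column "size count" layout of the size sequence file is best double-checked against the examples):

import numpy as np

# Degree sequence: one degree per node, saved via numpy.savetxt.
deg_seq = np.array([3, 2, 2, 4, 1])
np.savetxt("deg_seq.txt", deg_seq)

# Size sequence: on every line, a hyperedge size and the number of
# hyperedges of that size.
dim_seq = [(2, 4), (3, 2), (4, 1)]
with open("dim_seq.txt", "w") as f:
    for size, count in dim_seq:
        f.write(f"{size} {count}\n")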
The command line arguments are specified as follows:
python main_sampling.py \
    --w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
    --u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
    --out_dir ./out_sampling \
    --deg_seq ./data/examples/fixed_dimension_sequence_hard/K=2/deg_seq.txt \
    --dim_seq ./data/examples/fixed_dimension_sequence_hard/K=2/dim_seq.txt
Notice that, when conditioning on both sequences and the sequences are extracted from an already available dataset, it is more convenient and computationally faster to provide the dataset directly, as explained in the following section.
Conditioning the sampling on an existing dataset means having the MCMC procedure start
from the dataset itself, and is statistically equivalent to conditioning on its degree
and size sequences. However, it is computationally faster since the sequences don't need
to be arranged into a first hypergraph proposal (see reference paper [2]).
The format for providing a dataset as input is the same as the one specified in the
corresponding section of the inference procedure.
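As an illustration, conditioning on the justice example dataset could then look as follows; note that the dataset flag name is assumed to mirror the inference one, since the input format is stated to be the same (check python main_sampling.py --help for the actual arguments):

python main_sampling.py \
    --w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
    --u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
    --out_dir ./out_sampling \
    --real_dataset justice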
Additional arguments that can be provided are:
- --max_hye_size: the maximum hyperedge size allowed.
- --burn_in_steps and --intermediate_steps: the number of MCMC steps to be performed before returning the first sample, and in between returned samples.
- --exact_dyadic_sampling: whether to sample dyadic interactions (i.e. standard edges between two nodes) exactly from their Poisson distribution, as opposed to approximate Central Limit Theorem-based sampling.
- --allow_rescaling: in case the degree or dimension sequences are provided, whether to rescale the model parameters to match the expected sequences with the input ones.
- --n_samples: the number of generated samples.
- --seed: integer random seed.
All the preprocessed real datasets utilized in [1, 2], as well as some of the synthetic
data generated for various experiments, are publicly available.
The data is stored and distributed via Edmond, the Open Research Data Repository
of the Max Planck Society, and is available at the following link.
Alternatively, it can be downloaded directly (approximately 1.7GB compressed,
12.5GB uncompressed) by running the download script as
python download_data.py