
Hy-MMSBM
Hypergraph Mixed-Membership Stochastic Block Model

Inference and benchmarking of higher-order community structure


This repository contains the implementation of the Hy-MMSBM model presented in

   [1] Community Detection in Large Hypergraphs
        Nicolò Ruggeri, Martina Contisciani, Federico Battiston and Caterina De Bacco
        [ Science Advances - ArXiv ]

   [2] A framework to generate hypergraphs with community structure
        Nicolò Ruggeri, Federico Battiston and Caterina De Bacco
        [ArXiv]

Respectively, these works present efficient inference and sampling methods based on Hy-MMSBM, a stochastic block model for higher-order interactions.
This code is made publicly available; if you make use of it, please cite our work in the form of the references above.

Code installation

The code was developed using Python 3.9 and can be downloaded and used locally as-is.
To install the necessary packages, run the following command:

pip install -r requirements.txt

Inference of community structure

The inference of the affinity matrix w and community assignments u is performed by running the code in main_inference.py.

The most basic run only needs a dataset, the number of communities K, and a path to store the results.
For example, to perform inference on the Senate Bills dataset with K=2 communities, one can run the following command:

python main_inference.py --K 2 --out_dir ./out_inference/senate_bills --real_dataset senate-bills

Input dataset format

It is possible to provide the input dataset in three formats.

1. Real preprocessed dataset
We make available all the real datasets utilized in the experiments in [1]; see the data release section below.
After the download, any of these datasets can be utilized for inference by running the command above and specifying the dataset name using the --real_dataset argument.
The complete list of such datasets is available at src.data.data_io.PREPROCESSED_DATASETS.
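
For convenience, the available dataset names can also be printed directly from Python. A minimal sketch, assuming it is run from the repository root so that the src package is importable:

from src.data.data_io import PREPROCESSED_DATASETS

# print the names accepted by the --real_dataset argument
print(PREPROCESSED_DATASETS)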

2. Text format
Alternatively, a hypergraph can be provided as input via two .txt files containing the list of hyperedges and their corresponding weights. This allows the user to provide arbitrary datasets as input. Examples of the expected format are provided in ./data/examples/justice_dataset/hyperedges.txt and ./data/examples/justice_dataset/weights.txt. To perform inference on a dataset specified in text format, provide the paths to the two files as

python main_inference.py \
--K 2 \
--out_dir ./out_inference/justice \
--hyperedge_file ./data/examples/justice_dataset/hyperedges.txt \
--weight_file ./data/examples/justice_dataset/weights.txt

In this case, the command above is equivalent to running with --real_dataset justice.
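
For illustration, such files could be produced from Python as in the following sketch. The hyperedges and weights are hypothetical toy values, and the space-separated layout is an assumption; check the example files above for the exact format expected by the parser.

# toy hypergraph: one hyperedge per line, one weight per hyperedge (hypothetical values)
hyperedges = [[0, 1, 2], [1, 3], [0, 2, 3, 4]]
weights = [1, 2, 1]

# space-separated layout assumed; compare with ./data/examples/justice_dataset
with open("hyperedges.txt", "w") as f:
    f.writelines(" ".join(map(str, hye)) + "\n" for hye in hyperedges)

with open("weights.txt", "w") as f:
    f.writelines(f"{w}\n" for w in weights)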

3. Pickle format
Finally, one can provide a Hypergraph instance, which is the main representation utilized internally in the code (see src.data.representation), serialized via the pickle Python library.
An example equivalent to the above is

python main_inference.py \
--K 2 \
--out_dir ./out_inference/justice \
--pickle_file ./data/examples/justice_dataset/justice.pkl

As with the text format, this allows arbitrary hypergraphs to be provided as input.
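
For instance, the example hypergraph shipped with the repository can be loaded, optionally modified, and re-serialized with the standard pickle module. A minimal sketch, assuming it is run from the repository root so that the Hypergraph class in src.data.representation can be unpickled:

import pickle

# load the example Hypergraph instance
with open("./data/examples/justice_dataset/justice.pkl", "rb") as f:
    hypergraph = pickle.load(f)

# ... optionally inspect or modify the hypergraph here ...

# serialize it again, to be passed to main_inference.py via --pickle_file
with open("./my_hypergraph.pkl", "wb") as f:
    pickle.dump(hypergraph, f)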

Additional options

Additional options can be specified; the full documentation is shown by running

python main_inference.py --help

Among the important ones we list:

  • --assortative: whether to run inference with a diagonal affinity matrix w.
  • --max_hye_size: keep only hyperedges up to a given size for inference. If None, all hyperedges are utilized.
  • --w_prior and --u_prior: the rates of the exponential priors on the parameters. A value of zero is equivalent to no prior; any positive value is utilized for MAP inference.
    For non-uniform priors, the path to a file containing a NumPy array can be specified, which will be loaded via numpy.load.
  • --em_rounds: the number of EM steps during optimization. Increasing it is sometimes useful when the model doesn't converge rapidly.
  • --training_rounds: the number of models to train with different random initializations. The one with the highest log-likelihood is returned and saved.
  • --seed: integer random seed.
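
As an illustration, several of these options can be combined in a single run; the values below are placeholders, and the exact argument types and defaults can be checked via --help:

python main_inference.py \
--real_dataset senate-bills \
--K 2 \
--out_dir ./out_inference/senate_bills \
--max_hye_size 25 \
--em_rounds 100 \
--training_rounds 5 \
--seed 42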

Sampling of synthetic data

Sampling of synthetic data from the generative model can be performed by running the code in main_sampling.py. As explained in the reference paper [2], various inputs can be provided to condition the sampling procedure.
We also refer to the notebook for additional examples.

Vanilla sampling

The simplest form of sampling only requires the affinity matrix w and the community assignments u. These need to be provided as text files readable via numpy.loadtxt, for example:

python main_sampling.py \
--w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
--u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
--out_dir ./out_sampling
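
If custom parameters are needed, they can be generated and saved in a format readable by numpy.loadtxt. The following is a minimal sketch with toy values; it assumes w is a symmetric K x K affinity matrix and u an N x K non-negative membership matrix:

import numpy as np

K, N = 2, 100  # toy number of communities and nodes
rng = np.random.default_rng(seed=42)

# assumed shapes: K x K symmetric affinity, N x K non-negative memberships
w = rng.random((K, K))
w = (w + w.T) / 2
u = rng.random((N, K))

np.savetxt("./my_w.txt", w)
np.savetxt("./my_u.txt", u)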

Degree and size sequences

Samples can also be conditioned on the degree sequence (i.e. the array specifying each node's degree), the size sequence (i.e. the count of hyperedges of any given size), or both.
These need to be stored in text files: the degree sequence via numpy.savetxt, and the size sequence in a text file where every line specifies a hyperedge size and the number of hyperedges of that size. Examples of these formats are given in ./data/examples at fixed_dimension_sequence_hard and justice_dataset; a short script producing such files is sketched after the command below.
The command line arguments are specified as follows:

python main_sampling.py \
--w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
--u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
--out_dir ./out_sampling \
--deg_seq ./data/examples/fixed_dimension_sequence_hard/K=2/deg_seq.txt \
--dim_seq ./data/examples/fixed_dimension_sequence_hard/K=2/dim_seq.txt
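
The sequence files above could be produced with a script like the following sketch, which uses toy values; the space-separated layout of the size sequence file is an assumption, so compare with the examples in ./data/examples:

import numpy as np

# degree sequence: one entry per node, saved via numpy.savetxt
deg_seq = np.array([3, 5, 2, 4, 3])
np.savetxt("./deg_seq.txt", deg_seq)

# size sequence: every line contains a hyperedge size and the number of
# hyperedges of that size (toy counts, space-separated layout assumed)
size_counts = {2: 10, 3: 5, 4: 2}
with open("./dim_seq.txt", "w") as f:
    for size, count in size_counts.items():
        f.write(f"{size} {count}\n")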

Notice that, when conditioning on both sequences and the sequences are extracted from an already available dataset, it is more convenient and computationally faster to provide the dataset directly, as explained in the following section.

Conditioning on a dataset

Conditioning the sampling on an existing dataset means having the MCMC procedure start from the dataset itself, and it is statistically equivalent to conditioning on the dataset's degree and size sequences. However, it is computationally faster, since the sequences don't need to be arranged into an initial hypergraph proposal (see reference paper [2]).
The dataset is provided as input in the same formats described in the corresponding section of the inference procedure above.
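
For example, assuming main_sampling.py accepts the same dataset arguments as main_inference.py (e.g. --pickle_file; check python main_sampling.py --help to confirm), a run conditioned on the example dataset could look like the following, where ./my_w.txt and ./my_u.txt are placeholder parameter files compatible with the dataset:

python main_sampling.py \
--w ./my_w.txt \
--u ./my_u.txt \
--out_dir ./out_sampling \
--pickle_file ./data/examples/justice_dataset/justice.pkl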

Additional arguments

Additional arguments that can be provided are:

  • --max_hye_size: the maximum hyperedge size allowed.
  • --burn_in_steps and --intermediate_steps: the number of MCMC steps to be performed before returning the first sample, and in between returned samples.
  • --exact_dyadic_sampling: whether to sample dyadic interactions (i.e. standard edges between two nodes) exactly from their Poisson distribution, as opposed to approximate Central Limit Theorem-based sampling.
  • --allow_rescaling: in case the degree or size sequences are provided, whether to rescale the model parameters so that the expected sequences match the input ones.
  • --n_samples: the number of generated samples.
  • --seed: integer random seed.
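
For example, with illustrative values for the MCMC-related options:

python main_sampling.py \
--w ./data/examples/fixed_dimension_sequence_hard/K=2/w.txt \
--u ./data/examples/fixed_dimension_sequence_hard/K=2/u.txt \
--out_dir ./out_sampling \
--burn_in_steps 1000 \
--intermediate_steps 100 \
--n_samples 5 \
--seed 42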

Data release

All the preprocessed real datasets utilized in [1, 2], as well as some of the synthetic data generated for various experiments, are publicly available.
The data is stored and distributed via Edmond, the Open Research Data Repository of the Max Planck Society, and is available at the following link.

Alternatively, it can be downloaded directly (approximately 1.7GB compressed, 12.5GB uncompressed) by running the provided script:
python download_data.py
