Inference and theoretical study of higher-order community structures
This repository contains the implementation of the methods presented in
[1] Message-Passing on Hypergraphs: Detectability, Phase Transitions and
Higher-Order Information
Nicolò Ruggeri*, Alessandro Lonardi*, and Caterina De Bacco
[
arXiv
]
[
Publication
]
Below, we explain how to run the Message-Passing (MP), Expectation-Maximization (EM) and
sampling procedures presented in the paper.
The code is made available for the public, if you make use of it please cite our work
in the form of the reference [1] above.
The code was developed utilizing Python 3.9, and can be downloaded and used locally as-is.
To install the necessary packages, run the following command
pip install -r requirements.txt
It is possible to run message passing and infer the marginal probabilities of nodes to
belong to certain communities, while conditioning on given community priors n
and
affinity matrix p
. To do so, execute the command:
python main_message_passing.py --em_iter 1 --mp_iter 2000 --mp_thresh 1e-5
--mp_patience 50 --dropout 0.75 --seed 123 --save_dir ./mp_results
--hye_file ./data/synthetic/sample.txt --n ./data/synthetic/n_prior.txt
--p ./data/synthetic/p_phase_transition_1.txt --K 4
The results will be saved in the output folder, and contain the node marginal probabilities, messages from nodes to hyperedges, messages from hyperedges to nodes, and free energy. Also convergence diagnostics are included, namely the difference in estimated marginal between consecutive MP iterations.
The main arguments accepted are:
real_dataset
orhye_file
: the dataset to run inference on (see more on input data format below).max_hye_size
: the maximum hyperedge size accepted. All hyperedges with higher size are discarded during inference.n
,p
: fixed parameters given to the model, they will not be updated.n
is the prior distribution of the communities,p
the affinity matrix, both in a format to be loaded vianumpy.loadtxt
.K
: number of communities in the model. Ifn
and/orp
are provided, they must respectively have shapes(K,)
and(K, K)
.mp_iter
,mp_thresh
,mp_patience
: setting the maximum number of MP iterations. If for more than a consecutive number of iterations (specified bymp_patience
) the change in marginal probabilities falls below the threshold, inference is stopped. Otherwise, inference stops when the maximum number of iterationsmp_iter
has been reached.dropout
: the percentage of randomly dropped messages in the MP updates.save_dir
: the folder where results are saved.seed
: random seed.
Input data format
A hypergraph is saved as a list of hyperedges. Therefore, the
script accepts hyperedge lists both in pickle and text format, as in the example
files ./data/synthetic/sample.pkl
and ./data/synthetic/sample.txt
.
Importantly, nodes are numbered from 0 to N-1, where N is the total number of
nodes.
Running EM inference allows inferring all the parameters of the model directly from the data. Specifically, if K is the number of communities, one can infer the following values:
- n, the vector of length K where the entries specify the frequencies of the inferred communities
- p, the K x K symmetric affinity matrix
as well as the values inferred from message passing, see above. Convergence diagnostics for EM are included: the difference in n and p values between consecutive EM iterations.
Inference is run via the relative script main_message_passing.py
. For example, to run
inference on the High School dataset utilized in the paper, run
python main_message_passing.py --real_dataset high_school --seed 234 --max_hye_size 5
--em_iter 20 --em_thresh 1e-5 --mp_iter 2000 --mp_thresh 1e-5 --mp_patience 50 --K 10
--dropout 0.99 --save_dir ./high_school_results
In the paper, we also present a new sampling algorithm that generates hypergraphs from
the probabilistic model. For example, samples similar to those in ./data/synthetic
are
obtained as:
python main_sampling.py --p ./data/synthetic/p_phase_transition_1.txt --max_hye_size 50
--node_assignments ./data/synthetic/node_assignment_phase_transition.txt
--allow_repeated True --seed 345 --save_dir ./sampling_results
The main arguments are:
p
the affinity matrix.max_hye_size
the maximum size of the sampled hyperedges.node_assignments
the array of node assignments to the communities. See the referenced file for an example structure.allow_repeated
whether to allow repeated hyperedges. If False, it allows for faster sampling and the approximation is negligible in sparse regimes.seed
random seed.save_dir
output path specifying the directory where results are saved.