---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# fastRG
<!-- badges: start -->
[![Codecov test coverage](https://codecov.io/gh/RoheLab/fastRG/branch/main/graph/badge.svg)](https://app.codecov.io/gh/RoheLab/fastRG?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/fastRG)](https://CRAN.R-project.org/package=fastRG)
[![R-CMD-check](https://github.com/RoheLab/fastRG/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/RoheLab/fastRG/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
`fastRG` quickly samples a broad class of network models known as generalized random dot product graphs (GRDPGs). In particular, for matrices $X$, $S$ and $Y$, `fastRG` samples a matrix $A$ with expectation $X S Y^T$ where the entries are independently Poisson distributed conditional on $X$ and $Y$. This is primarily useful when $A$ is the adjacency matrix of a graph. Crucially, the sampling is $\mathcal O(m)$, where $m$ is the number of edges in the graph, as opposed to the naive sampling approach, which is $\mathcal O(n^2)$, where $n$ is the number of nodes in the network. For additional details, see the [paper](https://arxiv.org/abs/1703.02998) [1].
`fastRG` has two primary use cases:
1. Sampling enormous sparse graphs that cannot feasibly be sampled with existing samplers, and
1. Validating new methods for random dot product graphs (and variants).
`fastRG` makes the latent parameters of random dot product graphs readily available to users, so that simulation studies for community detection, subspace recovery, etc., are straightforward.
## Installation
You can install the released version of fastRG from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("fastRG")
```
And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("RoheLab/fastRG")
```
## Usage
There are two stages to sampling from generalized random dot product graphs. First, we sample the latent factors $X$ and $Y$. Then we sample $A$ conditional on those latent factors. `fastRG` mimics this two-stage sampling structure. For example, to sample from a stochastic blockmodel, we first create the latent factors.
```{r}
library(fastRG)
set.seed(27)
sbm <- sbm(n = 1000, k = 5, expected_density = 0.01)
```
You can specify the latent factors and the mixing matrix $B$ yourself (see the sketch below), but there are also defaults to enable fast prototyping. Here $B$ was randomly generated with `Uniform[0, 1]` entries, and nodes were assigned to communities uniformly at random. Printing the resulting object gives us some additional information:
```{r}
sbm
```
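If you would rather control the model yourself, you can pass the mixing matrix and membership probabilities explicitly. A minimal sketch, assuming `sbm()` accepts `B` and `pi` arguments as documented in `?sbm`:

``` r
# a planted-partition-style mixing matrix: dense within
# communities, sparse between them
B <- matrix(0.01, nrow = 3, ncol = 3)
diag(B) <- 0.1

# communities with unequal expected sizes, set via the
# membership probabilities pi (assumed argument names; see ?sbm)
custom_sbm <- sbm(
  n = 500,
  B = B,
  pi = c(0.5, 0.3, 0.2),
  expected_degree = 8
)
```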
Now, conditional on this latent representation, we can sample graphs. `fastRG` supports several different output formats, each specified by the suffix of the corresponding `sample_*()` function. For example, we can obtain an edgelist in a `tibble` with:
```{r}
sample_edgelist(sbm)
```
but we can just as easily obtain the graph as a sparse matrix:
```{r}
A <- sample_sparse(sbm)
A[1:10, 1:10]
```
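Since we asked for `expected_density = 0.01` when constructing `sbm`, the realized average degree should be close to $0.01 \cdot (n - 1) \approx 10$. A quick sanity check, up to sampling noise and the package's conventions for self-edges and multi-edges:

``` r
# average degree of the sampled graph; rowSums() here is the
# Matrix package's method for sparse matrices
mean(Matrix::rowSums(A))
```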
or as an `igraph` object:
```{r}
sample_igraph(sbm)
```
Note that every time we call `sample_*()` we draw a new sample.
```{r}
A <- sample_sparse(sbm)
B <- sample_sparse(sbm)
all(A == B) # random realizations from the SBM don't match!
```
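The stochastic blockmodel is just one of several model constructors; the same two-stage workflow applies to arbitrary latent positions. A hedged sketch, assuming `undirected_factor_model()` accepts `X` and `S` as described in the package documentation:

``` r
set.seed(27)

# arbitrary rank-2 latent positions X and middle matrix S,
# so that E[A | X] = X S X' (up to degree scaling)
X <- matrix(runif(100 * 2), nrow = 100, ncol = 2)
S <- diag(c(0.8, 0.2))

ufm <- undirected_factor_model(X, S, expected_degree = 5)
A <- sample_sparse(ufm)
```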
## Efficient spectral decompositions
If you would like to obtain a spectral decomposition of the population adjacency matrix conditional on the latent factors, that is straightforward:
```{r}
s <- eigs_sym(sbm)
s$values
```
Note that eigendecompositions (for undirected graphs) and SVDs (for directed graphs) use `RSpectra` and do not require explicitly forming large, dense population adjacency matrices; the population decompositions should be efficient in both time and space, even for large graphs.
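A common downstream use is spectral clustering: cluster the rows of the population (or sampled) eigenvectors to recover community memberships. A minimal sketch using base R's `kmeans()`, assuming `s$vectors` holds one eigenvector per column as in `RSpectra` output:

``` r
# cluster nodes using the leading population eigenvectors;
# with k = 5 blocks we look for 5 clusters
clusters <- kmeans(s$vectors, centers = 5)
table(clusters$cluster)
```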
## Key sampling options
There are several essential options for controlling graph sampling that you should know about. First, there are options that affect the latent factor sampling:
- `expected_degree`: Set the expected average degree of the graph by scaling sampling probabilities. We *strongly, strongly* recommend that you always set this option. If you do not, it is easy to accidentally sample a very large, dense graph.
- `expected_density`: Set the expected density of the graph by scaling sampling probabilities. You cannot specify both `expected_degree` and `expected_density` at the same time.
In the second stage of graph sampling, the options are:
- `poisson_edges`: Either `TRUE` or `FALSE` depending on whether you would like a Bernoulli graph or a Poisson multi-graph. Scaling via `expected_degree` assumes a Poisson multi-graph, with some limited exceptions.
- `allow_self_edges`: Whether nodes should be allowed to connect to themselves. Either `TRUE` or `FALSE`. Both second-stage options are illustrated in the sketch below.
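Putting these together, a sketch of a Bernoulli SBM with no self-edges and a controlled expected degree (argument names follow the options described above; check `?sbm` for your installed version):

``` r
bernoulli_sbm <- sbm(
  n = 10000,
  k = 5,
  expected_degree = 10,    # always set this (or expected_density)
  poisson_edges = FALSE,   # Bernoulli edges rather than Poisson multi-edges
  allow_self_edges = FALSE # no self-edges
)
A <- sample_sparse(bernoulli_sbm)
```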
## Related work
[`igraph`](https://igraph.org/r/) allows users to sample SBMs (in $\mathcal O(m + n + k^2)$ time) and random dot product graphs (in $\mathcal O(n^2 k)$ time).
You can find the original research code associated with `fastRG` [here](https://github.com/raningtky/sampleRDPG). There is also a Python translation of the original code [here](https://github.com/yunjhongwu/matrix-routines/blob/master/fastRG.py). Both of these implementations are bare-bones.
## References
[1] Rohe, Karl, Jun Tao, Xintian Han, and Norbert Binkiewicz. "A Note on Quickly Sampling a Sparse Matrix with Low Rank Expectation." Journal of Machine Learning Research 19(77):1-13, 2018. <https://www.jmlr.org/papers/v19/17-128.html>