Reorganizing pages
gvegayon committed May 11, 2024
1 parent 7d8a489 commit 85d712d
Showing 9 changed files with 311 additions and 295 deletions.
Binary file modified 03.rda
33 changes: 21 additions & 12 deletions _quarto.yaml
@@ -9,38 +9,46 @@ book:
google-analytics: UA-40608305-4
title: Applied Network Science with R
author: George G. Vega Yon, Ph.D.
date: 2024-05-07
date: today
cover-image: img/front-page-dalle.png
cover-image-alt: 'An AI image generated with Bing: Draw an image of a social network. Include a person examining the network and holding a laptop in one hand. The laptop should have the logo of the R programming language.'
page-footer: Applied Network Science with R - [https://ggvy.cl](https://ggvy.cl){target="_blank"}
repo-url: https://github.com/gvegayon/appliedsnar
repo-branch: master
repo-actions: [edit]
image: img/front-page-dalle.png
twitter-card:
description: 'This (WIP) book is a collection of examples using the R programming for network science. It includes examples of network data processing, visualization, simulation, and modeling.'
creator: "@gvegayon"
site-url: https://book.ggvy.cl
sharing: [twitter, linkedin]
navbar:
background: light
# background: light
search: true

chapters:
- index.qmd
- part: Applications
- part-01-01-intro.qmd
- part-01-02-the-basics.qmd
- part: "**Applications**"
chapters:
- part-01-01-intro.qmd
- part-01-02-the-basics.qmd
- part-01-03-week-1-sns-study.qmd
- part-01-06-network-simulation-and-viz.qmd
- part-01-07-egonets.qmd
- part-01-09-netdiffuser.qmd
- part: "**Statistical inference**"
chapters:
- part-01-04-ergms.qmd
- part-01-05-ergms-constrains.qmd
- part-01-05-stergm.qmd
- part-01-06-network-simulation-and-viz.qmd
- part-01-07-egonets.qmd
- part-01-08-netboot.qmd
- part-01-09-netdiffuser.qmd
- part-01-10-siena.qmd
- part-01-11-power.qmd
- part: Statistical Foundations
- part: "**Foundations**"
chapters:
- part-02-10-statistical-foundations.qmd
- part-02-11-power.qmd
- part: Appendix
- part: "**Appendix**"
chapters:
- part-03-12-data-appendix.qmd
- references.qmd
@@ -52,11 +60,12 @@ bibliography: book.bib

biblio-style: apalike

format:
format:
html:
html-math-method: mathjax
toc: true
number-sections: true
number-sections: false
theme: cerulean
pdf:
geometry:
- top=1in
10 changes: 9 additions & 1 deletion book.bib
@@ -309,4 +309,12 @@ @Manual{R
address = {Vienna, Austria},
year = {2024},
url = {https://www.R-project.org/},
}
}

@Manual{R-latticeExtra,
title = {latticeExtra: Extra Graphical Utilities Based on Lattice},
author = {Deepayan Sarkar and Felix Andrews},
year = {2022},
note = {R package version 0.6-30},
url = {https://latticeextra.r-forge.r-project.org/},
}
Binary file modified ergm.rda
6 changes: 3 additions & 3 deletions part-01-03-week-1-sns-study.qmd
@@ -1,10 +1,10 @@
---
date-modified: 2024-05-09
date-modified: 2024-05-10
---

# Network Nomination Data
# School networks

This chapter provides a start-to-finish example for processing survey-type data in R. The chapter features the Social Network Study [SNS] dataset. You can download the data for this chapter [here](https://cdn.rawgit.com/gvegayon/appliedsnar/fdc0d26f/03-sns.dta); and the codebook for the data provided here is in [the appendix](#sns-data).
This chapter provides a start-to-finish example for processing survey-type data in R. The chapter features the Social Network Study [SNS] dataset. You can download the data for this chapter [here](https://cdn.rawgit.com/gvegayon/appliedsnar/fdc0d26f/03-sns.dta), and the codebook for the data provided here is in [the appendix](#sns-data).

The goals for this chapter are:

115 changes: 64 additions & 51 deletions part-01-04-ergms.qmd
@@ -1,6 +1,10 @@
---
date-modified: 2024-05-11
---

# Exponential Random Graph Models

I strongly suggest reading the vignette included in the `ergm` R package.
I strongly suggest reading the vignette in the `ergm` R package.

:::{.content-hidden}
{{< include math.tex >}}
@@ -44,13 +48,13 @@ $$
is the normalizing factor that ensures that equation @eq-main-ergm is a legitimate probability distribution. Even after fixing $\mathcal{Y}$ to be all the networks that have size $n$, the size of $\mathcal{Y}$ makes this type of statistical model hard to estimate, as there are $N = 2^{n(n-1)}$ possible networks! [@Hunter2008]
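To get a sense of how fast $\mathcal{Y}$ grows, a quick back-of-the-envelope calculation in base R:

```r
# Number of possible directed networks (no self-ties) on n nodes: 2^(n * (n - 1))
n <- 5
2^(n * (n - 1))
#> 1048576 -- over a million networks on just five nodes
```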
Recent developments include new forms of dependency structures to take into account more general neighborhood effects. These models relax the one-step Markovian dependence assumptions, allowing investigation of longer-range configurations, such as longer paths in the network or larger cycles (Pattison and Robins 2002). Models for bipartite (Faust and Skvoretz 1999) and tripartite (Mische and Robins 2000) network structures have been developed. [@Hunter2008 p. 9]
Later developments include new dependency structures to consider more general neighborhood effects. These models relax the one-step Markovian dependence assumptions, allowing investigation of longer-range configurations, such as longer paths in the network or larger cycles (Pattison and Robins 2002). Models for bipartite (Faust and Skvoretz 1999) and tripartite (Mische and Robins 2000) network structures have been developed. [@Hunter2008 p. 9]
## A naïve example
In the simplest case, ERGMs equate a logistic regression. By simple, I mean cases in which there are no Markovian terms--motifs involving more than one edge--for example, the Bernoulli graph. In the Bernoulli graph, ties are independent of each other, so the presence/absence of a tie between nodes $i$ and $j$ won't affect the presence/absence of a tie between nodes $k$ and $l$.
In the simplest case, ERGMs equate a logistic regression. By simple, I mean cases with no Markovian terms--motifs involving more than one edge--for example, the Bernoulli graph. In the Bernoulli graph, ties are independent, so the presence/absence of a tie between nodes $i$ and $j$ won't affect the presence/absence of a tie between nodes $k$ and $l$.
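The equivalence is easy to verify numerically: for an edges-only (Bernoulli) model, the MLE of the edges coefficient is just the log-odds of the observed density. A minimal sketch using a simulated adjacency matrix (not the `sampson` data used below):

```r
set.seed(12)
n <- 20
A <- matrix(rbinom(n^2, 1, 0.3), n, n)  # random directed graph
diag(A) <- 0                            # no self-ties
dens <- sum(A) / (n * (n - 1))          # observed density
qlogis(dens)  # log-odds of density: the coefficient an edges-only ERGM returns
```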
Let's fit an ERGM using the `sampson` dataset included in the `ergm` package.
Let's fit an ERGM using the `sampson` dataset in the `ergm` package.
```{r part-01-04-loading-data, echo=TRUE, collapse=TRUE, message=FALSE}
@@ -100,7 +104,7 @@ Again, the same result. The Bernoulli graph is not the only ERGM model that can
## Estimation of ERGMs
The ultimate goal is to perform statistical inference on the proposed model. In a *standard* setting, we would be able to use Maximum-Likelihood-Estimation (MLE), which consists of finding the model parameters $\theta$ that, given the observed data, maximize the likelihood of the model. For the latter, we generally use [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization). Newton's method requires been able to compute the log-likelihood of the model, which in ERGMs can be challenging.
The ultimate goal is to perform statistical inference on the proposed model. In a *standard* setting, we could use Maximum Likelihood Estimation (MLE), which consists of finding the model parameters $\theta$ that, given the observed data, maximize the likelihood of the model. For the latter, we generally use [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization). Newton's method requires computing the model's log-likelihood, which can be challenging in ERGMs.
For ERGMs, since part of the likelihood involves a normalizing constant that is a function of all possible networks, this is not as straightforward as in the regular setting. Because of this, most estimation methods rely on simulations.
@@ -144,15 +148,15 @@ For more details, see [@Hunter2008]. A sketch of the algorithm follows:
1. Initialize the algorithm with an initial guess of $\theta$, call it $\theta^{(t)}$ (this should be a reasonably good guess)
2. While (no convergence) do:
a. Using $\theta^{(t)}$, simulate $M$ networks by means of small changes in the $\mathbf{Y}_{obs}$ (the observed network). This part is done by using an importance-sampling method which weights each proposed network by its likelihood conditional on $\theta^{(t)}$
2. While (no convergence) do:
b. With the networks simulated, we can do the Newton step to update the parameter $\theta^{(t)}$ (this is the iteration part in the `ergm` package): $\theta^{(t)}\to\theta^{(t+1)}$.
c. If convergence has been reached (which usually means that $\theta^{(t)}$ and $\theta^{(t + 1)}$ are not very different), then stop; otherwise, go to step a.
a. Using $\theta^{(t)}$, simulate $M$ networks by means of small changes in the $\mathbf{Y}_{obs}$ (the observed network). This part is done by using an importance-sampling method which weights each proposed network by its likelihood conditional on $\theta^{(t)}$
b. With the networks simulated, we can do the Newton step to update the parameter $\theta^{(t)}$ (this is the iteration part in the `ergm` package): $\theta^{(t)}\to\theta^{(t+1)}$.
c. If convergence has been reached (which usually means that $\theta^{(t)}$ and $\theta^{(t + 1)}$ are not very different), then stop; otherwise, go to step a.
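For the simplest case -- an edges-only model, where simulating a network reduces to independent Bernoulli draws -- the loop above can be sketched in a few lines of base R (a toy illustration, not how `ergm` implements MC-MLE):

```r
set.seed(7)
n_dyads <- 20 * 19   # directed network with 20 nodes
s_obs   <- 95        # hypothetical observed edge count
theta   <- 0         # step 1: initial guess
for (i in 1:100) {   # step 2: iterate until convergence
  # (a) simulate M = 500 networks; here only the edge count matters
  s_sim <- rbinom(500, n_dyads, plogis(theta))
  # (b) Newton-type update of theta
  theta_new <- theta + (s_obs - mean(s_sim)) / var(s_sim)
  # (c) stop when the update barely moves
  if (abs(theta_new - theta) < 1e-3) break
  theta <- theta_new
}
theta  # converges towards qlogis(95 / n_dyads)
```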
For more details see [@lusher2012;@admiraal2006;@Snijders2002;@Wang2009] provides details on the algorithm used by PNet (which is the same as the one used in `RSiena`). [@lusher2012] provides a short discussion on the differences between `ergm` and `PNet`.
[@lusher2012;@admiraal2006;@Snijders2002;@Wang2009] provide details on the algorithm used by PNet (the same as the one used in `RSiena`), and [@lusher2012] provides a short discussion on the differences between `ergm` and `PNet`.
## The `ergm` package
@@ -391,56 +395,65 @@ sample_uncentered <- coda::mcmc.list(sample_uncentered)
Under the hood:
1. _Empirical means and sd, and quantiles_:
```{r coda-summary}
summary(sample_uncentered)
```
2. _Cross correlation_:
```{r coda-corr}
coda::crosscorr(sample_uncentered)
```
3. _Autocorrelation_: For now, we will only look at autocorrelation for chain one. Autocorrelation should be small (in a general MCMC setting). If autocorrelation is high, then it means that your sample is not idd (no Markov property). A way out to solve this is *thinning* the sample.
```{r coda-autocorr}
coda::autocorr(sample_uncentered)[[1]]
```
4. _Geweke Diagnostic_: From the function's help file:
> "If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke's statistic has an asymptotically standard normal distribution. [...]
The Z-score is calculated under the assumption that the two parts of the chain are asymptotically independent, which requires that the sum of frac1 and frac2 be strictly less than 1.""
>
> ---?coda::geweke.diag
Let's take a look at a single chain:
```{r coda-geweke.diag}
coda::geweke.diag(sample_uncentered)[[1]]
```
5. _(not included) Gelman Diagnostic_: From the function's help file:
> Gelman and Rubin (1992) propose a general approach to monitoring convergence of MCMC output in which m > 1 parallel chains are run with starting values that are overdispersed relative to the posterior distribution. Convergence is diagnosed when the chains have ‘forgotten’ their initial values, and the output from all chains is indistinguishable. The gelman.diag diagnostic is applied to a single variable from the chain. It is based a comparison of within-chain and between-chain variances, and is similar to a classical analysis of variance.
> ---?coda::gelman.diag
As a difference from the previous diagnostic statistic, this uses all chains simulatenously:
```{r coda-gelman.diag}
coda::gelman.diag(sample_uncentered)
```
1. _Empirical means and sd, and quantiles_:
```{r coda-summary}
summary(sample_uncentered)
```
2. _Cross correlation_:
```{r coda-corr}
coda::crosscorr(sample_uncentered)
```
3. _Autocorrelation_: For now, we will only look at autocorrelation for chain one. Autocorrelation should be small (in a general MCMC setting). If autocorrelation is high, it means that your sample is not iid (no Markov property). A way to solve this is *thinning* the sample.
```{r coda-autocorr}
coda::autocorr(sample_uncentered)[[1]]
```
4. _Geweke Diagnostic_: From the function's help file:
> "If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke's statistic has an asymptotically standard normal distribution. [...]
The Z-score is calculated under the assumption that the two parts of the chain are asymptotically independent, which requires that the sum of frac1 and frac2 be strictly less than 1."
>
> ---?coda::geweke.diag
Let's take a look at a single chain:
```{r coda-geweke.diag}
coda::geweke.diag(sample_uncentered)[[1]]
```
5. _(not included) Gelman Diagnostic_: From the function's help file:
As a rule of thumb, values that are in the $[.9,1.1]$ are good.
> Gelman and Rubin (1992) propose a general approach to monitoring convergence of MCMC output in which m > 1 parallel chains are run with starting values that are overdispersed relative to the posterior distribution. Convergence is diagnosed when the chains have ‘forgotten’ their initial values, and the output from all chains is indistinguishable. The gelman.diag diagnostic is applied to a single variable from the chain. It is based on a comparison of within-chain and between-chain variances, and is similar to a classical analysis of variance.
> ---?coda::gelman.diag
As a difference from the previous diagnostic statistic, this uses all chains simultaneously:
```{r coda-gelman.diag}
coda::gelman.diag(sample_uncentered)
```
As a rule of thumb, values in the $[.9,1.1]$ are good.
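A quick way to see the rule of thumb in action is to run `gelman.diag` on two chains that mix well by construction (a toy example assuming the `coda` package is installed):

```r
library(coda)
set.seed(1)
# two independent chains drawn from the same (stationary) distribution
chains <- coda::mcmc.list(coda::mcmc(rnorm(1000)), coda::mcmc(rnorm(1000)))
coda::gelman.diag(chains)  # point estimate near 1, well inside [0.9, 1.1]
```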
One nice feature of the `mcmc.diagnostics` function is the nice trace and posterior distribution plots that it generates. If you have the R package `latticeExtra` [@R-latticeExtra], the function will override the default plots used by `coda::plot.mcmc` and use lattice instead, creating a nicer looking plots. The next code chunk calls the `mcmc.diagnostic` function, but we suppress the rest of the output (see figure \@ref(fig:coda-plots)).
One nice feature of the `mcmc.diagnostics` function is the nice trace and posterior distribution plots that it generates. If you have the R package `latticeExtra` [@R-latticeExtra], the function will override the default plots used by `coda::plot.mcmc` and use lattice instead, creating nicer-looking plots. The next code chunk calls the `mcmc.diagnostics` function, but we suppress the rest of the output (see figure @fig-coda-plots).
```{r coda-plots, fig.align='center', fig.height=8, cache=FALSE, echo=TRUE, results='hide', warning=FALSE, fig.cap=c("Trace and posterior distribution of sampled network statistics.", "Trace and posterior distribution of sampled network statistics (cont'd)."), fig.pos='!h'}
```{r}
#| label: fig-coda-plots
#| fig-align: center
#| fig-cap: "Trace and posterior distribution of sampled network statistics."
# [2022-03-13] This line is failing for what it could be an ergm bug
# mcmc.diagnostics(ans0, center = FALSE) # Suppressing all the output
```
If we call the function `mcmc.diagnostics`, this message appears at the end:
>
MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
> MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
>
> ---`mcmc.diagnostics(ans0)`
4 changes: 2 additions & 2 deletions part-01-05-stergm.qmd
@@ -1,3 +1,3 @@
# (Separable) Temporal Exponential Family Random Graph Models
# Temporal Exponential Family Random Graph Models

This tutorial is great! https://statnet.org/trac/raw-attachment/wiki/Sunbelt2016/tergm_tutorial.pdf
This tutorial is great! [https://statnet.org/trac/raw-attachment/wiki/Sunbelt2016/tergm_tutorial.pdf](https://statnet.org/trac/raw-attachment/wiki/Sunbelt2016/tergm_tutorial.pdf){target="_blank"}
2 changes: 1 addition & 1 deletion part-01-06-network-simulation-and-viz.qmd
@@ -1,4 +1,4 @@
# Simulating and visualizing networks
# Simulation and visualization

In this chapter, we will build and visualize artificial networks using Exponential
Random Graph Models [ERGMs]. Together with chapter 3, this will be an extended
