# Multiple Regression {#sec-multipleRegression}
## Getting Started {#sec-multipleRegressionGettingStarted}
### Load Packages {#sec-multipleRegressionLoadPackages}
```{r}
```
## Overview of Multiple Regression {#sec-multipleRegressionOverview}
Multiple regression examines the association between multiple predictor variables and one outcome variable.
It can yield a more accurate estimate of a given predictor's unique contribution by controlling for other variables ([covariates](#sec-covariates)).
Regression with one predictor takes the form of @eq-regression:
$$
y = \beta_0 + \beta_1x_1 + \epsilon
$$ {#eq-regression}
where $y$ is the [outcome variable](#sec-correlationalStudy), $\beta_0$ is the intercept, $\beta_1$ is the slope, $x_1$ is the [predictor variable](#sec-correlationalStudy), and $\epsilon$ is the error term.
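
As a minimal sketch of @eq-regression, the following chunk simulates data with a known intercept and slope and then estimates them with `lm()`. The data, the variable names (`x1`, `y`, `simpleFit`), and the true values (intercept of 2, slope of 0.5) are assumptions chosen only for illustration.

```{r}
# Simulated (hypothetical) data: true intercept = 2, true slope = 0.5
set.seed(1)
n <- 200
x1 <- rnorm(n, mean = 5, sd = 2)
y <- 2 + 0.5 * x1 + rnorm(n, mean = 0, sd = 1)

# Fit a regression with one predictor
simpleFit <- lm(y ~ x1)

# Estimated intercept (beta_0) and slope (beta_1)
coef(simpleFit)

# The residuals are the sample counterpart of the error term (epsilon):
# each observed value equals its fitted value plus its residual
all.equal(unname(fitted(simpleFit) + residuals(simpleFit)), y)
```

With a sample of this size, the estimates should land near, but not exactly on, the values used to generate the data.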
A regression line is depicted in @fig-regression.
```{r}
#| include: false
#| eval: false

# NOTE: complement() is not defined in this chapter; it is assumed to be a
# helper defined elsewhere that returns a variable correlated with its first
# argument at the specified value (here, r = .5).
set.seed(52242)

regression <- data.frame(outcome = rnorm(40, mean = 5, sd = 2))
regression$predictor <- complement(regression$outcome, .5)

# Shift the predictor upward by the absolute value of its minimum
regression$predictor <- regression$predictor + abs(min(regression$predictor))

# Fit the regression model
lm(
  outcome ~ predictor,
  data = regression)

# Plot the data with the best-fit line
ggplot2::ggplot(
  data = regression,
  ggplot2::aes(
    x = predictor,
    y = outcome
  )
) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(
    method = "lm",
    linewidth = 2,
    se = FALSE,
    fullrange = TRUE) +
  ggplot2::scale_x_continuous(
    limits = c(0, 8),
    breaks = seq(from = 0, to = 8, by = 2),
    expand = c(0, 0)
  ) +
  ggplot2::scale_y_continuous(
    limits = c(0, 8),
    breaks = seq(from = 0, to = 8, by = 2),
    expand = c(0, 0)
  ) +
  ggplot2::labs(
    x = "Predictor Variable",
    y = "Outcome Variable",
    title = "Regression Best-Fit Line"
  ) +
  ggplot2::theme_classic(
    base_size = 16) +
  ggplot2::theme(legend.title = ggplot2::element_blank())

# Saves a PDF; note that the figure below references images/regression.png
ggplot2::ggsave("./images/regression.pdf", width = 6, height = 6)
```
::: {#fig-regression}
![](images/regression.png)
A Regression Best-Fit Line.
:::
Regression with multiple predictors—i.e., multiple regression—takes the form of @eq-multipleRegression:
$$
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p + \epsilon
$$ {#eq-multipleRegression}
where $p$ is the number of [predictor variables](#sec-correlationalStudy).
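
To make @eq-multipleRegression concrete, here is a small sketch with $p = 2$ predictors. The data are simulated, and the names (`simData`, `multipleFit`) and generating values are hypothetical; the point is only to show how the formula maps onto an `lm()` call.

```{r}
# Simulated (hypothetical) data with two correlated predictors
set.seed(2)
n <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)  # x2 is correlated with x1
y <- 1 + 0.4 * x1 + 0.3 * x2 + rnorm(n)
simData <- data.frame(y, x1, x2)

# Each additional predictor is added to the right-hand side of the formula
multipleFit <- lm(y ~ x1 + x2, data = simData)

# One row per term: the intercept (beta_0), x1 (beta_1), and x2 (beta_2)
summary(multipleFit)$coefficients
```

Each slope in the output is the estimated association of that predictor with the outcome, holding the other predictor constant.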
## Components {#sec-multipleRegressionComponents}
- $B$ = unstandardized coefficient: direction and magnitude of the estimate (original scale)
- $\beta$ (beta) = standardized coefficient: direction and magnitude of the estimate (standard deviation scale)
- $SE$ = standard error: uncertainty of unstandardized estimate

The unstandardized regression coefficient ($B$) is interpreted such that, for every one-unit change in the [predictor variable](#sec-correlationalStudy), the outcome variable changes by $B$ units, holding the other predictors in the model constant.
For instance, when examining the association between age and fantasy points, if the unstandardized regression coefficient is 2.3, players score on average 2.3 more points for each additional year of age.
(In reality, we might expect a nonlinear, inverted-U-shaped association between age and fantasy points such that players tend to reach their peak in the middle of their careers.)
Unstandardized regression coefficients are tied to the metric of the raw data.
Thus, an unstandardized regression coefficient of a given size may mean completely different things for two different variables.
Holding the strength of the association constant, unstandardized regression coefficients tend to be larger when the predictor's values span a narrower numeric range (e.g., age in years rather than age in months) and smaller when the predictor's values span a wider numeric range.
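
To illustrate how the unstandardized coefficient is tied to the metric of the raw data, the sketch below fits the same simulated association twice: once with age in years and once with age in months. The data and variable names (`ageYears`, `ageMonths`, `fantasyPoints`) are hypothetical and chosen only to echo the running example.

```{r}
# Simulated (hypothetical) data: points increase with age
set.seed(3)
n <- 300
ageYears <- runif(n, min = 21, max = 35)
fantasyPoints <- 50 + 2.3 * ageYears + rnorm(n, sd = 20)

# The same predictor expressed in a different metric
ageMonths <- ageYears * 12

coefYears <- coef(lm(fantasyPoints ~ ageYears))[["ageYears"]]
coefMonths <- coef(lm(fantasyPoints ~ ageMonths))[["ageMonths"]]

coefYears   # points per additional year of age
coefMonths  # points per additional month of age

# The association is identical; only the metric differs,
# so the ratio equals 12 (up to floating-point error)
coefYears / coefMonths
```

The strength of the association is the same in both models, yet the two coefficients differ by a factor of 12.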
Standardized regression coefficients can be obtained by standardizing the variables to [*z*-scores](#sec-zScores) so they all have a mean of zero and standard deviation of one.
The standardized regression coefficient ($\beta$) is interpreted such that, for every one standard deviation change in the [predictor variable](#sec-correlationalStudy), the outcome variable changes by $\beta$ standard deviations, holding the other predictors in the model constant.
For instance, when examining the association between age and fantasy points, if the standardized regression coefficient is 0.1, players score on average 0.1 standard deviation more points for each additional standard deviation of age.
Standardized regression coefficients tend to fall within the range [−1, 1], though not in all instances.
Thus, standardized regression coefficients tend to be more comparable across variables and models than unstandardized regression coefficients.
In this way, standardized regression coefficients provide a meaningful index of [effect size](#sec-practicalSignificance).
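
One way to obtain a standardized coefficient, sketched here with simulated (hypothetical) data, is to convert the variables to *z*-scores with `scale()` before fitting the model. The variable names and generating values are assumptions for illustration only.

```{r}
# Simulated (hypothetical) data
set.seed(4)
n <- 300
age <- runif(n, min = 21, max = 35)
fantasyPoints <- 50 + 2.3 * age + rnorm(n, sd = 20)

# Unstandardized coefficient (B): original scale
unstandardizedFit <- lm(fantasyPoints ~ age)
B <- coef(unstandardizedFit)[["age"]]

# Standardized coefficient (beta): refit after converting to z-scores
zData <- data.frame(
  fantasyPoints = as.numeric(scale(fantasyPoints)),
  age = as.numeric(scale(age))
)
standardizedFit <- lm(fantasyPoints ~ age, data = zData)
beta <- coef(standardizedFit)[["age"]]

B
beta

# beta can also be recovered by rescaling B by the two standard deviations
B * sd(age) / sd(fantasyPoints)

# With a single predictor, beta equals the Pearson correlation
cor(age, fantasyPoints)
```

The last equality holds only for regression with one predictor; with multiple predictors, a standardized coefficient generally differs from the bivariate correlation.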
## Coefficient of Determination ($R^2$) {#sec-multipleRegressionRSquared}
::: {#fig-multipleRegressionRSquared}
![](images/multipleRegressionRSquared.png){width=50%}
Conceptual Depiction of Proportion of Variance Explained ($R^2$) in an Outcome Variable ($Y$) by Multiple Predictors ($X1$ and $X2$) in Multiple Regression. The size of each circle represents the variable's variance. The proportion of variance in $Y$ that is explained by the predictors is depicted by the areas in orange. The dark orange space ($G$) is where multiple predictors explain overlapping variance in the outcome. Overlapping variance that is explained in the outcome ($G$) will not be recovered in the regression coefficients when both predictors are included in the regression model. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
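
The idea of overlapping explained variance can be sketched with simulated (hypothetical) data. The chunk below computes $R^2$ for a model with two positively correlated predictors and compares it to the $R^2$ values from the two single-predictor models. All names and generating values are assumptions for illustration.

```{r}
# Simulated (hypothetical) data with two positively correlated predictors
set.seed(5)
n <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
y <- 0.5 * x1 + 0.5 * x2 + rnorm(n)

fitBoth <- lm(y ~ x1 + x2)
rSquaredBoth <- summary(fitBoth)$r.squared

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
rSquaredByHand <- 1 - sum(residuals(fitBoth)^2) / sum((y - mean(y))^2)
all.equal(rSquaredBoth, rSquaredByHand)

# R^2 from each predictor on its own
rSquaredX1 <- summary(lm(y ~ x1))$r.squared
rSquaredX2 <- summary(lm(y ~ x2))$r.squared

# In this simulation, the separate R^2 values sum to more than the joint R^2
# because the overlapping variance is counted once for each predictor
rSquaredX1 + rSquaredX2
rSquaredBoth
```

The gap between the summed and joint values reflects the variance in the outcome that both predictors explain in common.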
## Covariates {#sec-covariates}
Covariates are variables that you include in the statistical model to control for them, so that you can better isolate the unique contribution of the predictor variable(s) to the outcome variable.
Including covariates lets you examine the association between the predictor variable and the outcome variable while holding people's levels on the covariates constant.
Including confounds as covariates can yield a more accurate estimate of the causal effect of the predictor variable on the outcome variable.
Ideally, you want to include any and all confounds as covariates.
As described in @sec-correlationCausation, confounds are third variables that influence both the predictor variable and the outcome variable and explain their association.
Covariates are potentially (but not necessarily) confounds.
For instance, you might include the player's age as a covariate in a model that examines whether a player's 40-yard dash time at the NFL Combine predicts their fantasy points in their rookie year, but it may not be a confound.
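
The value of including a confound as a covariate can be sketched with simulated (hypothetical) data in which the true effect of the predictor is known. Here a third variable influences both the predictor and the outcome; the names and generating values (including the true effect of 0.2) are assumptions for illustration.

```{r}
# Simulated (hypothetical) data: the confound influences both variables
set.seed(6)
n <- 2000
confound <- rnorm(n)
predictor <- 0.7 * confound + rnorm(n)
outcome <- 0.2 * predictor + 0.7 * confound + rnorm(n)  # true effect = 0.2

# Without the covariate, the estimate absorbs the confound's influence
coefWithoutCovariate <- coef(lm(outcome ~ predictor))[["predictor"]]

# With the confound included as a covariate
coefWithCovariate <- coef(lm(outcome ~ predictor + confound))[["predictor"]]

coefWithoutCovariate
coefWithCovariate
```

In this simulation, the estimate from the model that includes the confound should sit much closer to the true value of 0.2 than the estimate from the model that omits it.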
## Multicollinearity {#sec-multipleRegressionMulticollinearity}
::: {#fig-multipleRegressionMulticollinearity}
![](images/multipleRegressionMulticollinearity.png){width=50%}
Conceptual Depiction of Multicollinearity in Multiple Regression. From @Petersen2024a and @PetersenPrinciplesPsychAssessment.
:::
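
As a rough illustration of one consequence of highly correlated predictors, the sketch below uses simulated (hypothetical) data to compare the standard error of the same coefficient under two conditions: when the second predictor is unrelated to the first, and when it is nearly redundant with the first. The names and generating values are assumptions for illustration.

```{r}
# Simulated (hypothetical) data
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2Independent <- rnorm(n)               # unrelated to x1
x2Collinear <- x1 + rnorm(n, sd = 0.1)  # nearly redundant with x1

yIndependent <- 0.5 * x1 + 0.5 * x2Independent + rnorm(n)
yCollinear <- 0.5 * x1 + 0.5 * x2Collinear + rnorm(n)

# Correlation between the predictors in each condition
cor(x1, x2Independent)
cor(x1, x2Collinear)

# Standard error of the x1 coefficient in each model
seIndependent <- summary(
  lm(yIndependent ~ x1 + x2Independent)
)$coefficients["x1", "Std. Error"]

seCollinear <- summary(
  lm(yCollinear ~ x1 + x2Collinear)
)$coefficients["x1", "Std. Error"]

seIndependent
seCollinear
```

When the predictors overlap almost completely, little unique variance remains for estimating each coefficient, so the standard error in the collinear model should be considerably larger.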
::: {.content-visible when-format="html"}
## Session Info {#sec-multipleRegressionSessionInfo}
```{r}
sessionInfo()
```
:::