---
title: "Learning Activity: Association--Quantitative versus Categorical"
author: "Tian Zheng"
date: "December 7, 2015"
output: html_document
---
### Association between a quantitative variable and a categorical variable.
Let's first visualize the `InsectSprays` data.
- Using the `plot` command, we ask `R` to plot `count` against `spray` from the dataset `InsectSprays`. `xlab` and `ylab` are options we can specify to label the X and Y axes.
- Because `spray` is stored as a factor (a categorical variable) and `count` is numeric (a quantitative variable), `R` uses side-by-side boxplots to visualize the variation in Y (`count`) for each value of the X variable (`spray`).
```{r}
plot(count ~ spray, data = InsectSprays,
     xlab = "Types of Insecticide",
     ylab = "Counts of Insects")
```
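The behaviour described above can be checked directly. The following is a minimal sketch, assuming only the built-in `InsectSprays` data: it confirms how the two variables are stored, calls `boxplot` explicitly (which should produce the same display as the `plot` call), and tabulates the mean `count` for each `spray`. The object name `spray_means` is chosen here for illustration.

```{r}
# spray is stored as a factor and count as a numeric vector;
# this is why plot() draws side-by-side boxplots for count ~ spray
is.factor(InsectSprays$spray)
is.numeric(InsectSprays$count)

# Requesting the boxplots explicitly
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Types of Insecticide",
        ylab = "Counts of Insects")

# Mean count within each spray group (illustrative name)
spray_means <- aggregate(count ~ spray, data = InsectSprays, FUN = mean)
spray_means
```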
- In this module, we learnt about the method of *Analysis of Variance*. In the output of the `anova` command below, you will see a column under the heading `Sum Sq`, which stands for sum of squares.
- The value in the row `spray` is the group sum of squares (SSG) as the groups are defined by the values of `spray`. The value in the row `Residuals` is the error sum of squares (SSE).
- The `anova` command carries out the Analysis of Variance test to see whether the association observed between `spray` and `count` is statistically significant. The null hypothesis is that the mean of `count` does not depend on the value of `spray`.
- In this example, the p-value (the value under `Pr(>F)`) is very small. Here `< 2.2e-16` means $<2.2\times 10^{-16}$. This indicates that the pattern we see in the plot above would be very unlikely to occur purely by chance if the mean of `count` did not depend on `spray`.
```{r}
anova(lm(count~spray, data=InsectSprays))
```
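To see where the numbers in the table come from, the sums of squares can be recomputed by hand. This is a sketch rather than part of the original activity. The variable names (`ssg`, `sse`, `f_stat`, and so on) are illustrative, and the formula used is the standard one-way ANOVA statistic $F = \frac{SSG/(k-1)}{SSE/(n-k)}$, with $k$ groups and $n$ observations.

```{r}
counts <- InsectSprays$count
sprays <- InsectSprays$spray

grand_mean  <- mean(counts)
group_means <- tapply(counts, sprays, mean)    # one mean per spray
group_sizes <- tapply(counts, sprays, length)  # observations per spray

# SSG: squared distance of each group mean from the grand mean,
# weighted by the number of observations in the group
ssg <- sum(group_sizes * (group_means - grand_mean)^2)

# SSE: squared distance of each observation from its own group mean;
# ave() returns the group mean repeated for every observation
sse <- sum((counts - ave(counts, sprays))^2)

k <- nlevels(sprays)  # number of groups
n <- length(counts)   # number of observations

# F statistic and its upper-tail p-value
f_stat  <- (ssg / (k - 1)) / (sse / (n - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Compare with the table produced by anova()
aov_table <- anova(lm(count ~ spray, data = InsectSprays))
all.equal(c(ssg, sse), aov_table[["Sum Sq"]])
all.equal(f_stat, aov_table[["F value"]][1])
p_value
```

Printing `p_value` shows the actual number behind the `< 2.2e-16` display, which `R` uses for any p-value too small to print at its default precision.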