R Learning Activities/LearningActivity-2.Rmd

---
title: "Learning Activity: Association--Quantitative versus Categorical"
author: "Tian Zheng"
date: "December 7, 2015"
output: html_document
---

### Association between a quantitative variable and a categorical variable. 
Let's first visualize the `InsectSprays` data. 

- Using the `plot` command, we ask `R` to plot `count` against `spray` from the dataset `InsectSprays`. `xlab` and `ylab` are options we can specify to label the X and Y axes. 

- `R` automatically detects that `spray` is a categorical variable and `count` is a quantitative variable. So it uses side-by-side boxplots to visualize variation in Y (`count`) given different values of the X variable (`spray`).

```{r}
plot(count~spray, data=InsectSprays,
     xlab="Types of Insecticide",
     ylab="Counts of Insects")
```

- In this module, we learnt about the method of *Analysis of Variance*. In the output, you will see a column under the heading `Sum Sq`, which stands for sum of squares. 

- The value in the row `spray` is the group sum of squares (SSG) as the groups are defined by the values of `spray`. The value in the row `Residuals` is the error sum of squares (SSE). 

- The `anova` command evaluate the Analysis of Variance test to see whether the association observed between `spray` and `count` is statistically significant. The null hypothesis is that the mean of `count` does not depend on values of `spray`. 

- In this example, the p value (value under `Pr(>F)`) is quite small. Here `<2.2 e -16` means $<2.2\times 10^{-16}$. This indicates the patterns we see in the above plot is quite unlikely to occur purely by chance. 

```{r}
anova(lm(count~spray, data=InsectSprays))
```