# Linear regression warm-up

# Goals

- Make sure everyone is comfortable with dplyr and ggplot2
- Practice interpreting basic linear regression models

# The Gapminder data set

We're going to work with some data from the Gapminder project. This is from a project led by Han Rosling to bring data to issues of health, economics, and global development. If you haven't heard of Han Rosling, you should see some of his TED talks, e.g. <http://www.ted.com/talks/hans_rosling_and_the_magic_washing_machine>.

We'll work with a few variables from their full data sets (<https://www.gapminder.org/>), which have been included in an R package, `gapminder`, put together by Jenny Bryan at UBC <https://github.com/jennybc/gapminder>.

```{r, message=FALSE}
library(tidyverse)
library(broom) # for tidy()
d <- gapminder::gapminder # load the data into `d`
```

Let's look at the data:

```{r}
d
```

The column headers should be self-explanatory.

# Modeling life expectancy over time 

Let's start with a simple model asking how global life expectancy has changed over time. 

Use ggplot2 to visualize life expectancy (`lifeExp`) on the y-axis and time (`year`) on the x-axis. 

Bonus: If you want to get fancy, you might consider lowering the `alpha` argument in `geom_point()` or adding some jittering to help with the overplotting (see `?position_jitter`).

```{r}
ggplot(d, aes(year, lifeExp)) + # exercise
  geom_point(alpha = 0.5, position = position_jitter(width = 0.6)) # exercise
```

Try fitting a linear regression to predict `lifeExp` based on `year`:

```{r}
m1 <- lm(
  lifeExp ~ year, data = d) # exercise
summary(m1)
coef(m1)
tidy(m1)
```

How do we interpret the intercept? 

How do we interpret the slope? How would you fill in the following blank? With the passing of each year, life expectancy increases by `___` years.

## Working with a transformed predictor and a log-transformed response variable 

What if we created a new variable `decade_1990` as the following: 

```{r}
d <- mutate(d, decade_1990 = (year - 1990)/10)
ggplot(d, aes(decade_1990, log(lifeExp))) + 
  geom_point(alpha = 0.5, position = position_jitter(width = 0.1))
```

The choice of the year 1990 was completely arbitrary simply to make the intercept in the following models more meaningful. We divided by 10 so that the effects are per decade instead of per year. This keeps us from working with really tiny numbers. 

And then we fit the following model. Note that this time, for the purposes of this exercise, we are modeling the *log* of life expectancy as the response. 

```{r}
m2 <- lm(log(lifeExp) ~ decade_1990, data = d)
summary(m2)
coef(m2)
```

What do the intercept and slope represent now? 

How would you answer the following questions based on the coefficients? 

In 1990, the global expected life expectancy was:

```{r}
exp(coef(m2)[1]) # exercise
```

For every decade, global life expectancy increased on average `___` fold or `___`%.

```{r}
round(exp(coef(m2)[2]), 2) # exercise
round(exp(coef(m2)[2]) * 100 - 100, 0) # exercise
```

# Adding an interaction

Run the following model. I have added an interaction term with the factor variable `continent`:

```{r}
m3 <- lm(log(lifeExp) ~ decade_1990 * continent, data = d)
summary(m3)
```

The continent Africa is the base level of the factor variable `continent`. How would you answer the following questions? You don't need to fill in the actual R code, unless you'd like to. Just know how you would obtain these values. 

What was the mean life expectancy in Africa in 1990?

```{r}
exp(coef(m3)["(Intercept)"]) # exercise
```

What was the mean life expectancy in Asia in 1990?

```{r}
exp(coef(m3)["(Intercept)"] + coef(m3)["continentAsia"]) # exercise
```

Over a decade in Europe, what is the expected fold increase in life expectancy?

```{r}
exp(coef(m3)["decade_1990"] + coef(m3)["decade_1990:continentEurope"]) # exercise
```

# Modeling life expectancy as a function of GDP per capita

Use functions from the dplyr package combined with the pipe `%>%` symbol to calculate the mean life expectancy and the mean GDP per capita for each country. Call your new columns `lifeExp` and `gdpPercap`. I've started it for you: (hint, you'll need to use the `summarise()` function)

```{r}
life <- d %>% 
  group_by(country) %>%
  summarise(lifeExp = mean(lifeExp), gdpPercap = mean(gdpPercap)) # exercise
```

Let's look at the summarized data:

```{r}
ggplot(life, aes(log(gdpPercap), log(lifeExp))) + geom_point()
```

Fit a linear regression predicting log life expectancy from log GDP per capita with our new data set `life`.

```{r}
m4 <- lm(log(lifeExp) ~ log(gdpPercap), data = life) # exercise
summary(m4)
coef(m4)
```

How do we interpret the intercept and slope in this model?

How would you fill in the following blank?

For a 1% increase in GDP per capita, we expect a `___`% increase in life expectancy.

```{r}
round(coef(m4)[2], 2) # exercise %>%
```

# Addendum 

These models are very basic and likely violate a number of statistical assumptions. (This exercise was focused only on interpretation, and we never even checked model fit.) We will come back to this data set later and fit some more realistic models.

For now, what are some of the major assumptions in the above models that are probably not valid or not ideal? How would you start checking those assumptions?

# Bonus

If you got this far, great! Try plotting the data creatively with ggplot2. Dig into the multiple dimensions. There are many patterns to be found in the data set and we will be coming back to it again later.