data_intro.Rmd

---
title: "Data Introduction & Cleaning"
output:
  distill::distill_article:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r echo=F}
library(tidyverse)
library(rmarkdown)
```

## Overview

Nearly all of the data originates from one of two sources: a Qualtrics survey disseminated to kind survey respondents (students) in Stiles, and some housing data from the college. While this project is a living document and is still taking in student responses, at current publication, 71 student responses have been recorded.

For the sake of tidiness, most of the coding setup and data wrangling for this project's visualizations will be developed in the following two pages. For all readers, the pages to come contain the ~raw data tables which can be explored, as well as an interpretation of the overall survey data. For those interested in following the code, these pages should also elucidate the various datasets and data formats employed.

A copy of the survey questions may be found [here](site-data/MooseMapping_Qualtrics.pdf).

## Survey Responses

### Cleanup

The data table below gives a first look at the Qualtrics survey responses! (Excepting a few personally-identifying data points.)

```{r}
data.qualtrics <- read.csv("data/qualtrics_20201227_233000.csv") %>%
    slice(3:n()) %>%  # First two rows do not have respondent data; remove them
    na_if("")  # Qualtrics saves NA in CSV as blank strings; reset those to NA

paged_table(data.qualtrics %>%
                select(-c(IPAddress, LocationLongitude, LocationLatitude, Q25)))
```

A number of data-cleaning steps become quickly apparent: some rows are empty or partially empty (representing those who opened the survey but did not complete it); the columns are named quite unhelpfully (e.g. "Q20.1_1_1"), and there seem to be many columns that are not relevant. As a first cleaning step, we rename the relevant columns, remove all unwanted columns, and trim empty rows (keeping partial responses).

```{r}
data.students <- data.qualtrics %>%
    transmute(
        # Student info
        Year=Q24, Email=Q25,
        Enrolled.F20=case_when(
            Q23=="Enrolled" ~ T,
            Q23=="On Leave" ~ F,
            TRUE ~ NA),
        # Location/movement data at time t
        Loc1=Q4, Dorm1=Q5, Loc2=Q6, Loc2.5=Q26, Loc3=Q7, Loc4=Q8, Loc5=Q9, Dorm5=Q10,
        Loc6=Q11,
        # Meaning/nostalgia data (in expanded form)
        Place1.n=Q20.1_1_1, Place1.t=Q20.2_1, Place2.n=Q20.1_2_1, Place2.t=Q20.2_2,
        Place3.n=Q20.1_3_1, Place3.t=Q20.2_3, Place4.n=Q20.1_4_1, Place4.t=Q20.2_4,
        Place5.n=Q20.1_5_1, Place5.t=Q20.2_5, Place6.n=Q20.1_6_1, Place6.t=Q20.2_6,
        Place7.n=Q20.1_7_1, Place7.t=Q20.2_7, Place8.n=Q20.1_8_1, Place8.t=Q20.2_8,
        Place9.n=Q20.1_9_1, Place9.t=Q20.2_9, Place10.n=Q20.1_10_1, Place10.t=Q20.2_10,
        # Connectedness data
        Conn1=Q22.1_1, Conn2=Q22.1_2, Conn3=Q22.1_3, Conn4=Q22.1_4,
        Conn5=Q22.1_5, Conn6=Q22.1_6,
        # Useful Qualtrics data
        ID=ResponseId
    ) %>%
    # Remove 100% empty responses (empty Loc1)
    filter(!is.na(Loc1) | !is.na(Loc6)) %>%
    # Set appropriate columns to numeric
    mutate(Year=as.numeric(Year), Conn1=as.numeric(Conn1), Conn2=as.numeric(Conn2),
           Conn3=as.numeric(Conn3), Conn4=as.numeric(Conn4), Conn5=as.numeric(Conn5),
           Conn6=as.numeric(Conn6)
    )
```

In addition, there remain some discrepancies in the location and place data. In order to plot students' locations or meaningful places on a map, we will want to programatically convert the text-strings of location names (e.g. "Ashley's Ice Cream") to GPS coordinates; however, not all respondents gave specific-enough location names for something like the Google Maps service to find the location's coordinates. Therefore, some by-hand reformatting will be (unfortunately) necessary. As a first step, we can start some of this work for the Spring and Fall semesters (pre-break) by inferring precise location names from dorm data:

```{r}
data.students <- data.students %>%
    mutate(Loc1 = case_when(
        grepl("Stiles", Dorm1) ~ "Ezra Stiles College New Haven",
        grepl("Lawrance", Dorm1) ~ "Lawrance Hall New Haven",
        T ~ Loc1
    )) %>%
    mutate(Loc5 = case_when(
        grepl("Stiles", Dorm5) ~ "Ezra Stiles College New Haven",
        grepl("Lawrance", Dorm5) ~ "Lawrance Hall New Haven",
        grepl("Vanderbilt", Dorm5) ~ "Vanderbilt Hall New Haven",
        grepl("LW", Dorm5) ~ "Lanman-Wright Hall New Haven",
        T ~ Loc5
    ))
```

We also note that some respondents chose "Off-Campus" for their dorm when the more correct option was "Not in New Haven"—resolving this will come in the next page with some more by-hand work. For now, the data table below represents the cleaned survey responses.

### Data Table (Cleaned Survey Responses)

```{r}
paged_table(data.students %>% select(!Email), options=list(rows.print=25))
colnames(data.students)
```

### Interpretation

The data frame above features a wealth of information, and it is now much easier to make some sense of the survey responses. As a guide: note that each *row* represents one student's full survey response, while each *column* (named immediately above) represents that student's answer to a question, or part of an answer to a question.

In particular, for each student, we have the following response data, given in response to the corresponding survey questions:

* **Loc<sub>t</sub>**: The student's approximate location through various time periods `t` of 2020.
* **Dorm<sub>t</sub>**: The student's dorm in the Spring (`t=1`) and Fall (`t=5`) semesters, if applicable.
    * Original question: _Many of us moved, en masse and in diaspora, over a series of memorable time periods this year. Please list the city/town where you were over each time period. (If you traveled to multiple places over each time period, just write them in order separated by semicolons!)_
* **Place<sub>i</sub>.n** and **Place<sub>i</sub>.t**: A list of Yale or Yale-adjacent places which either have been a recent site for meaningful experiences or which that student has recently been nostalgic for. Suffix `.n` gives the place's name while `.t` gives whether the place is physical or virtual (its "type"), and the survey had space for up to `i=10` places.
    * Original question: _**If you lived in the New Haven this Fall**: looking back on life in New Haven this Fall (perhaps for the first time!), in which places on and around campus did you end up having meaningful experiences? **If you have been remote in the Fall**: Know that you have been missed. Although you have been remote, perhaps some Yale places have been on your mind. Which places on and around campus do you feel most nostalgic for? Have any virtual Yale spaces been meaningful for you? List as many as you would like, but please be specific to help plot themmeaningfully on a map. If some of these locations are virtual, that’s cool too!_
* **Conn<sub>t</sub>**: The student's reported feelings of being connected to other people in Stiles through various time periods `t` of 2020.
    * Original question: _On a scale of 1-10: On average, how connected have you felt to other people in the Stiles community during each time period? (1 = very disconnected, 10 = very connected.) Just answer to your best estimation/memory. First-years, it's ok if your answers are very low for the periods before Fall semester :)._
* **Student data**: year, enrollment status in Fall 2020, and an internal ID given by Qualtrics.

As hinted, some of the data above is tracked through various time periods or *epochs* `t` which are particularly relevant to the Yale calendar and the effects of the pandemic. The epochs `t` are numbered `1` through `6`, but in particular, each epoch corresponds to the following meaningful time frame:

| Epoch (`t=`value) | Time frame | Date range (2020) |
|-------------------|------------|------------|
| `t1` | Before Spring Break | < 3/7 |
| `t2` | Spring Break week 1 | 3/7 - 3/15 |
| `t2.5` | Spring Break week 2 | 3/16 - 3/22 |
| `t3` | Remainder of the Spring semester| 3/23 - 5/6 |
| `t4` | Summer | 5/7 - Late August |
| `t5` | Fall semester, before Break | 8/31 - 11/21 |
| `t6` | Remainder of the Fall semester | 11/22 - 2020 |

In the sections which follow, we extract and clean a couple of other, more specific data sets.

## Meaningful/Nostalgic Places

The dataset above focuses on students by letting each row be the student response; hence many columns represent that student's meaningful/nostalgic places. In particular, the survey has a fixed number of inputs expecting up to 10 places, and since each place has a name (`.n`) and a type (`.t`), this results in 20 columns total per student. This is a very ungainly format for the **Key Places** page, as we want to focus on each *place* rather than each student.

Hence, we extract the place data from the larger survey dataset. We `l e n g t h e n` the data by pivoting the `Place` column-pairs (Place<sub>i</sub>.n, Place<sub>i</sub>.t) to individual rows.

```{r}
data.places <- data.students %>%
    # Work only with Place data and (internal) student id
    select(ID, Place1.n, Place1.t, Place2.n, Place2.t, Place3.n, Place3.t,
           Place4.n, Place4.t, Place5.n, Place5.t, Place6.n, Place6.t, Place7.n,
           Place7.t, Place8.n, Place8.t, Place9.n, Place9.t, Place10.n, Place10.t
    ) %>%
    # https://tidyr.tidyverse.org/articles/pivot.html#multiple-observations-per-row-1
    pivot_longer(!ID,
                 names_to=c("order", ".value"),
                 names_pattern="Place(.)\\.(.)"
    ) %>%
    # Clean up
    rename(Name=n, Type=t) %>%
    filter(!is.na(Name)) %>%  # One student wrote-in "N/A"
    mutate(order=as.numeric(order))
```

### Data Table (Key Places)

```{r}
paged_table(sample_n(data.places, n()), options=list(rows.print=25))
```

In the table above, each unique ID corresponds to one respondent, and `order` represents the order in which they named the respective place (between 1 and 10). It was not explicitly stated that they order their places from most to least meaningful/nostalgic.

Note that—as noted when cleaning the survey data above—some of these locations will need to be modified manually to provide for correct *geocoding* (converting location name/address to GPS coordinates).

```{r eval=F, echo=F}
# Kept in this notebook for posterity, but not actually needed or used for Moose Mapping.
places_list <- function(Place1, Place1.t, Place2, Place2.t, Place3, Place3.t,
    Place4, Place4.t, Place5, Place5.t, Place6, Place6.t, Place7, Place7.t,
    Place8, Place8.t, Place9, Place9.t, Place10, Place10.t, ...) {
    # https://github.com/jennybc/row-oriented-workflows/blob/master/ex06_runif-via-pmap.md reference on pmap
    places <- list()
    j <- 1
    for (i in 1:10) {
        place_i <- eval(as.name(paste("Place", i, sep="")))
        place_i.t <- eval(as.name(paste("Place", i, ".t", sep="")))
        # Eval https://stackoverflow.com/questions/9057006/getting-strings-recognized-as-variable-names-in-r
        if (!is.na(place_i)) {
            places[[j]] <- c(name=place_i, is.physical=(!is.na(place_i.t) & place_i.t=="Physical"))
            j <- j + 1
        }
    }
    return(places)
}
```

## Housing Counts

Finally, for the **Missing Moose** page (*forthcoming*), we will visualize the change in housing density from Spring to Fall. Ezra Stiles College provided the relevant housing counts to assist with this section. The data is saved in the slightly more readable "wider" format, which focuses on the two terms:

```{r}
# Read housing counts
data.housing_wide <- read.csv("data/housing_S20_F20_wide.csv")
paged_table(data.housing_wide)
```

However, as with the previous *pivot*, we want the data in a "longer" format which places each entryway/dorm in a separate row:

```{r}
# Lengthen housing counts
data.housing_long <- data.housing_wide %>%
    pivot_longer(!Term,
                 names_to="Entryway",
                 values_to="Count")
paged_table(data.housing_long)
```

In the table above, `Count` gives the number of students housed in that particular entryway in that term.

## Before we move on...

The initial data cleaning has run its course; we have now done all that we can do, short of manually cleaning the `Location` and `Place` data!

We save the above datasets to separate files:

```{r eval=F}
# t <- format(Sys.time(), "%Y%m%d_%H%M%S")
t <- "20201227_235052"
write.csv(data.students, paste0("data/students_", t, ".csv", sep=""), row.names=F)
write.csv(data.places, paste0("data/places_", t, ".csv", sep=""), row.names=F)
write.csv(data.housing_long, "data/housing_S20_F20_long.csv", row.names=F)
```

Lastly, note that the full and original survey data is not made public in order to respect students' privacy. Email me at [cove.geary@yale.edu](mailto:cove.geary@yale.edu) if you have questions.