Skip to content

A case study that analyzes historical data from Divvy, a bicycle-sharing system in the Chicago metropolitan area, to identify trends in how customers use Divvy bikes differently.

Notifications You must be signed in to change notification settings

j-dz/divvy-case-study

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Divvy Bike-Share Case Study

In this case study, I analyze historical data from Divvy, a bicycle-sharing system in the Chicago metropolitan area, in order to identify trends in how their customers use bikes differently. The tools I use are spreadsheets, R and Tableau.

This case study served as my capstone project for the Google Data Analytics Professional Certificate course - an eight-part professional certificate program in data analytics offered by Google through Coursera.

Throughout this case study, I performed numerous real-world tasks of a junior data analyst, including:

  • Documentation,
  • Data collection,
  • Data cleaning and manipulation,
  • Preparing data for analysis,
  • Analysis, and
  • Visualization.

I outline each step in the process in detail below.

Background

Launched in 2013, Divvy is a bicycle-sharing system in the Chicago metropolitan area that provides a convenient and affordable way for residents and visitors to get around the city using bicycles. It is operated by the Chicago Department of Transportation (CDOT) in partnership with Lyft.

Divvy features a network of docking stations located throughout Chicago, where users can rent and return bicycles. The system consists of sturdy, specially designed bikes that are available for short-term use. Users can access the bikes by purchasing a pass or membership, which allows them to unlock a bike from any available docking station and ride it to their desired destination.

Divvy offers various pass options, including single-ride, day pass, and annual membership. Once a pass or membership is obtained, users can check out a bike from any station, ride it for a specified duration, and return it to any available docking station in the system.

The program has been popular among commuters, tourists, and residents alike, as it provides an eco-friendly and convenient transportation option for short trips around the city. Divvy has expanded its bike network and continues to play a significant role in promoting sustainable transportation and reducing traffic congestion in Chicago.

Business task

The case study business task is as follows:

How do annual members and casual riders use Divvy bikes differently?

Collecting the data

We’ll be using Divvy’s historical bike trip data from the last 12 months, which is publicly available here.

The data is made available under this license by Lyft Bikes and Scooters, LLC, which operates the City of Chicago’s Divvy bicycle-sharing service. The data is stored in spreadsheets. There are 12 .CSV files in total:

2022-05_divvy_trip-data.csv

2022-06_divvy_trip-data.csv

2022-07_divvy_trip-data.csv

2022-08_divvy_trip-data.csv

2022-09_divvy_trip-data.csv

2022-10_divvy_trip-data.csv

2022-11_divvy_trip-data.csv

2022-12_divvy_trip-data.csv

2023-01_divvy_trip-data.csv

2023-02_divvy_trip-data.csv

2023-03_divvy_trip-data.csv

2023-04_divvy_trip-data.csv

Data cleaning and manipulation

First, let's begin by manipulating the spreadsheets as follows:

  1. Unzip the files.
  2. Create a folder on our desktop to house the files, using appropriate file-naming conventions. For example, "divvy_data".
  3. Create subfolders with "divvy_data" for the .CSV files so that we have a copy of the original data. Name one of the folders "raw" - this will house our original, raw data.
  4. Move the downloaded files to the "raw" subfolder.
  5. Launch Excel, open each file, and chose to Save As an Excel Workbook file. Put these .XLSX files in a new subfolder for .XLSX files. Name this subfolder "xlsx".
  6. Open each spreadsheet and create a column called “ride_length.” Calculate the length of each ride by subtracting the column “started_at” from the column “ended_at” (for example, =D2-C2) and format as HH:MM:SS using Format > Cells > Time > 37:30:55.
  7. Create a column called “day_of_week,” and calculate the day of the week that each ride started using the “WEEKDAY” command (for example, =WEEKDAY(C2,1)) in each file. Formated the cell as "General" (or as a number with no decimals), noting that 1 = Sunday and 7 = Saturday.

Prepare data for analysis on R

Now that our data is stored appropriately, we will start putting it to work using R.

First, open R Studio, create a new R script, and set up the R environment by loading the following key packages.

library(tidyverse)  #helps wrangle data
library(lubridate)  #helps wrangle date attributes
library(ggplot2)  #helps visualize data

Import data

# Upload Divvy datasets (xlsx files) here

may_2022 <- read_xlsx("202205-divvy-tripdata.xlsx")
june_2022 <- read_xlsx("202206-divvy-tripdata.xlsx")
july_2022 <- read_xlsx("202207-divvy-tripdata.xlsx")
aug_2022 <- read_xlsx("202208-divvy-tripdata.xlsx")
sept_2022 <- read_xlsx("202209-divvy-tripdata.xlsx")
oct_2022 <- read_xlsx("202210-divvy-tripdata.xlsx")
nov_2022 <- read_xlsx("202211-divvy-tripdata.xlsx")
dec_2022 <- read_xlsx("202212-divvy-tripdata.xlsx")
jan_2023 <- read_xlsx("202301-divvy-tripdata.xlsx")
feb_2023 <- read_xlsx("202302-divvy-tripdata.xlsx")
mar_2023 <- read_xlsx("202303-divvy-tripdata.xlsx")
apr_2023 <- read_xlsx("202304-divvy-tripdata.xlsx")

Make columns consistent and merge them into a single data frame

Now that our data has been imported into R, we need to ensure that the new data frames have uniform column names prior to merging them into a single data frame.

# Compare column names each of the files

colnames(may_2022)
colnames(june_2022)
colnames(july_2022)
colnames(aug_2022)
colnames(sept_2022)
colnames(oct_2022)
colnames(nov_2022)
colnames(dec_2022)
colnames(jan_2023)
colnames(feb_2023)
colnames(mar_2023)
colnames(apr_2023)

# Inspect the data frames and look for incongruencies

str(may_2022)
str(june_2022)
str(july_2022)
str(aug_2022)
str(sept_2022)
str(oct_2022)
str(nov_2022)
str(dec_2022)
str(jan_2023)
str(feb_2023)
str(mar_2023)
str(apr_2023)

Next, we will convert the ride_id and rideable_type columns to string types.

# Convert ride_id and rideable_type to character strings so that they can stack correctly

may_2022 <-  mutate(may_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
june_2022 <-  mutate(june_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
july_2022 <-  mutate(july_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
aug_2022 <-  mutate(aug_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
sept_2022 <-  mutate(sept_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
oct_2022 <-  mutate(oct_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
nov_2022 <-  mutate(nov_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
dec_2022 <-  mutate(dec_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
jan_2023 <-  mutate(jan_2023, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
feb_2023 <-  mutate(feb_2023, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
mar_2023 <-  mutate(mar_2023, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
apr_2023 <-  mutate(apr_2023, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))

Now, we can go ahead and merge the individual month's data frames into one big data frame called "all_trips"

all_trips <- bind_rows(may_2022, june_2022, july_2022, aug_2022, sept_2022, oct_2022, nov_2022, dec_2022, jan_2023, feb_2023, mar_2023, apr_2023)

Uh-oh! We've received the following error, which has prevented us from merging our data frames.

## Error: Can't combine `..1$end_station_id` <character> and `..5$end_station_id` <double>

The error appears to be related to the vector type of our variable end_station_id. To fix this, let's convert the variable into a character, and for symmetry, let's do the same for start_station_id.

may_2022 <- mutate(may_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
june_2022 <- mutate(june_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
july_2022 <- mutate(july_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
aug_2022 <- mutate(aug_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))

sept_2022 <- mutate(sept_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
oct_2022 <- mutate(oct_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
nov_2022 <- mutate(nov_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
dec_2022 <- mutate(dec_2022, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
jan_2023 <- mutate(jan_2023, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
feb_2023 <- mutate(feb_2023, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
mar_2023 <- mutate(mar_2023, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))
apr_2023 <- mutate(apr_2023, start_station_id = as.character(start_station_id) ,
                   end_station_id = as.character(end_station_id))

Ok, now let's retry merging our data frames.

all_trips <- bind_rows(may_2022, june_2022, july_2022, aug_2022, sept_2022, oct_2022, nov_2022, dec_2022, jan_2023, feb_2023, mar_2023, apr_2023)

The Error was confirmed to be fixed by the successful merger of our data frames into the new aggregate data frame all_trips. At this point, we can go ahead and save our progress.

save(all_trips, file = "all_trips.RData")

Clean up and prepare new data frame for analysis

Next, we will inspect the new table that has been created.

# Inspect the new table that has been created
colnames(all_trips)  #List of column names

nrow(all_trips)  #How many rows are in data frame?

dim(all_trips)  #Dimensions of the data frame?
head(all_trips)  #See the first 6 rows of data frame.  str(all_trips)  #See list of columns and data types (numeric, character, etc)
summary(all_trips)  #Statistical summary of data. Mainly for numerics

After inspecting the data frame, there are a few problems we noticed that we will need to fix:

(1) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data -- such as day, month, year -- that provide additional opportunities to aggregate the data.

(2) We will want to add a calculated field for length of ride. Since the data does not have this, we will add a ride_length column to the data frame.

(3) There are some rides where ride_length shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

Let's address these problems in order.

# (1)
# Add columns that list the date, month, day, and year of each ride
# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
# (2)
# Add a "ride_length" calculation to all_trips (in seconds)
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

# Inspect the structure of the columns
str(all_trips)

# Convert "ride_length" from Factor to numeric so we can run calculations on the data
is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
# (3)
# Remove "bad" data
# The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative
# We will create a new version of the dataframe (v2) since data is being removed

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

# Filter rows that contain duplicate values

all_trips_v2 <- all_trips_v2 %>% distinct()

Ok, now let's go ahead and save our progress.

save(all_trips_v2, file = "all_trips_v2.RData")

Analysis

It's now time to conduct descriptive analysis on "ride_length" (all figures in seconds)

mean(all_trips_v2$ride_length) #straight average (total ride length / rides)
median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length) #longest ride

min(all_trips_v2$ride_length) #shortest ride

You can also condense the four lines above to one line using summary() on the specific attribute.

summary(all_trips_v2$ride_length)

Let's compare members and casual riders.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)

And let's see the average ride time by each day for members vs casual riders.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

The days of the week are out of order. Let's fix that.

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Now, let's run the average ride time by each day for members vs casual users.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

Lastly, let's analyze ridership data by type and weekday.

all_trips_v2 %>% 
        mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()
        group_by(member_casual, weekday) %>%  #groups by usertype and weekday
        summarise(number_of_rides = n()							#calculates the number of rides and average duration 
                  ,average_duration = mean(ride_length)) %>% 		# calculates the average duration
        arrange(member_casual, weekday)								# sorts

Visualization on R

Now let's being creating some visualization on R. First, let's visualize the number of rides by rider type.

all_trips_v2 %>% 
        mutate(weekday = wday(started_at, label = TRUE)) %>% 
        group_by(member_casual, weekday) %>% 
        summarise(number_of_rides = n()
                  ,average_duration = mean(ride_length)) %>% 
        arrange(member_casual, weekday)  %>% 
        ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
        geom_col(position = "dodge")

Now, let's create a visualization for the average duration.

all_trips_v2 %>% 
        mutate(weekday = wday(started_at, label = TRUE)) %>% 
        group_by(member_casual, weekday) %>% 
        summarise(number_of_rides = n()
                  ,average_duration = mean(ride_length)) %>% 
        arrange(member_casual, weekday)  %>% 
        ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
        geom_col(position = "dodge")

Great, that wraps up our work in R. Now let's export this file for further analysis and visualization on Tableau.

# Create a csv file that we will use to create a visualization in Tableau

write.csv(all_trips_v2, file = '~/Desktop/Coursera/divvy_trip_data/xlsx/all_trips_v2.csv')

Visualation on Tableau

Now we will continue our work in Tableau Public.

First, open a new Tableau workspace, and upload the file we just exported from R (all_trips_v2.csv) as a new data source.

From here, we can create several dynamic visualizations.

Here's a link to the Tableau dashboard I created.

This dashboard includes visualizations such as:

  • A map viz of trips throughout the Chicago metropolitan area
  • A side-by-side bar viz of number of trips per Day of the Week by rider type
  • A line viz of number of trips per Calendar Month by rider type
  • A side-by-side bar viz of average ride length per Day of the Week by rider type
  • A line viz of average ride length per Calendar Month by rider type

Key takeaways

  • Annual members tend to use Divvy bikes more frequently as they have unlimited access throughout the year. They are more likely to use the service for regular commuting, running errands, or leisure rides.

  • Casual riders often use Divvy bikes for occasional or one-time trips, such as sightseeing or short-distance transportation needs.

  • Annual members are more likely to consistently use the service throughout the year, in contrast to casual clients, whose usage drops more significantly during the winter months.

  • Annual members are also more likely to use the service throughout the week, whereas casual riders are most active on the weekend.

  • Lastly, while the average ride length for casual riders is much greater than that of annual members (?11 mins vs ?28 mins), I attribute this to an unintended effect of the service's subscription model, which allows casual riders to purchase single-day unlimited passes - and thus, these riders may leave their bikes "undocked" throughout the day.

Recommendations

Launch an email campaign that targets casual riders during the Spring and Summer months (March-September).

As casual riders are particularly more adverse to riding in colder weather, it would be a great idea to schedule different promotions and offerings throughout the months of their peak ridership, such as annual membership discounts, to entice them to continue riding throughout the year.

Build and launch a digital media campaign that raises awareness about the benefits of using Divvy for weekday commutes.

While ridership figures for casual riders are strong on the weekend, they remain quite low during the week compared to that of annual members. One way to narrow this gap is a digital media campaign that highlights the many benefits of opting to commute on bike, such as health benefits and reducing one's carbon footprint.

About

A case study that analyzes historical data from Divvy, a bicycle-sharing system in the Chicago metropolitan area, to identify trends in how customers use Divvy bikes differently.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages