CAPSTONE PROJECT - GOOGLE DATA ANALYTICS PROFESSIONAL CERTIFICATE

TABLE OF CONTENT

Summary

1.1. About Company
1.2. Services
1.3. Products

Study Questions
Phase 1: Ask

3.1. Business Task
3.2. Key Stakeholders

Phase 2: Prepare
Phase 3: Process
Phase 4: Analyze
Phase 5: Share
Phase 6: Conclusion (Act)

1. SUMMARY

1.1. About Company

Cyclistic is a bike-sharing service with 5,824 bicycles and 692 docking stations. Cyclistic distinguishes itself by providing reclining bikes, hand tricycles, and cargo bikes, making bike-share more accessible to people with disabilities and riders who cannot use a standard two-wheeled bike. Nearly 92% of riders choose traditional bikes, while nearly 8% use assistive options. Cyclists are more likely to ride for pleasure, but approximately 30% use them to commute to work every day.

Annual members are significantly more profitable than casual riders, according to Cyclistic's financial analysts. Although the pricing flexibility aids Cyclistic in gaining more clients, Moreno (Director of Marketing) thinks that increasing the number of annual members will be essential for long-term success.

1.2. Services

Casual Riders:
- Single-Ride Passes
- Full-Day Passes
Annual Membership

Products

Traditional Bikes (2-Wheeled Bikes)
Assistive Bikes:
- Reclining Bikes
- Cargo Bikes
- Hand Tricycles

NOTE: Annual Membership Profit > Casual Riders Profit

2. STUDY QUESTIONS

These 3 questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

3. PHASE 1: ASK

3.1. Business Task

The objective of this study is to design a new marketing strategy with my team in order to convert casual riders (single-ride passes & full-day passes) into annual members (annual membership subscribers).

3.2. Key Stakeholders

Lily Moreno: Director of Marketing and my manager.
Cyclistic Marketing Analytics Team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic Executive Team: The detail-oriented executive team will decide whether to approve the recommended marketing program.

4. Phase 2: Prepare

Dataset Used for this Analysis: The data source for our case study is from Cyclistic's historical trip data. This dataset is stored in Kaggle and was made available through Motivate International Inc. under this license.
Privacy and Accessibility of the Data: The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law. We can copy and modify the dataset without asking permission.
About the Dataset: Lyft Bikes and Scooters, LLC (“Bikeshare”) operates the City of Chicago’s (“City”) Divvy bicycle sharing service. Bikeshare and the City are committed to supporting bicycling as an alternative transportation option. As part of that commitment, the City permits Bikeshare to make certain Divvy system data owned by the City (“Data”) available to the public, subject to the terms and conditions of this License Agreement (“Agreement”). By accessing or using any of the Data, you agree to all of the terms and conditions of this Agreement.
Data Organization and Verification: Available to us are 12 CSV documents. Each datasets contains 100,000-800,000 observations and 13 columns. Some of the column names include - ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, end_lat, start_lng, end_lng, member_casual.

5. Phase 3: Process

To conduct my analysis, I used Python because it allows me to clean, consolidate, analyze and visualize my data. Specifically, I used the "pandas", "matplotlib", "numpy" and "seaborn" packages to help manipulate the data. For added visualizations and dashboards, i used Power BI.

To begin with, i have downloaded the previous 12 month datasets then used Python for the data preparation and exploration. I have chosen this tool because of the versatility and ability to analyze large datasets which was a hurdle to open with Excel.

Before importing the datasets in Python, new columns were added to the dataset.

New Columns:

ride_length = df['ended_at'] - df['started_at']
day_of_week = df['started_at'].dt.weekday. Note: Monday = 0 and Sunday = 6
days:
ride_length_minutes: Shows the ride length in minutes
seasons: Shows the season in which the ride took place
daytime: Shows times of the day for each ride
month: Extracts the month in which each ride took place
trip_distance: Extracts the distance travelled by taking the coordinates into consoderation

To preceed with the exploration with Python, I have imported the relevant python packages for loading datasets, cleaning, wrangling, manipulation, visualization, etc...

# importing packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime

# loading monthly datasets

df1 = pd.read_csv('202108-divvy-tripdata.csv')
df2 = pd.read_csv('202109-divvy-tripdata.csv')
df3 = pd.read_csv('202110-divvy-tripdata.csv')
df4 = pd.read_csv('202111-divvy-tripdata.csv')
df5 = pd.read_csv('202112-divvy-tripdata.csv')
df6 = pd.read_csv('202201-divvy-tripdata.csv')
df7 = pd.read_csv('202202-divvy-tripdata.csv')
df8 = pd.read_csv('202203-divvy-tripdata.csv')
df9 = pd.read_csv('202204-divvy-tripdata.csv')
df10 = pd.read_csv('202205-divvy-tripdata.csv')
df11 = pd.read_csv('202206-divvy-tripdata.csv')
df12 = pd.read_csv('202207-divvy-tripdata.csv')
df1.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual
0	99103BB87CC6C1BB	electric_bike	2021-08-10 17:15:49	2021-08-10 17:22:44	NaN	NaN	NaN	NaN	41.77	-87.68	41.77	-87.68	member
1	EAFCCCFB0A3FC5A1	electric_bike	2021-08-10 17:23:14	2021-08-10 17:39:24	NaN	NaN	NaN	NaN	41.77	-87.68	41.77	-87.63	member
2	9EF4F46C57AD234D	electric_bike	2021-08-21 02:34:23	2021-08-21 02:50:36	NaN	NaN	NaN	NaN	41.95	-87.65	41.97	-87.66	member
3	5834D3208BFAF1DA	electric_bike	2021-08-21 06:52:55	2021-08-21 07:08:13	NaN	NaN	NaN	NaN	41.97	-87.67	41.95	-87.65	member
4	CD825CB87ED1D096	electric_bike	2021-08-19 11:55:29	2021-08-19 12:04:11	NaN	NaN	NaN	NaN	41.79	-87.60	41.77	-87.62	member

# comparing uniformity in dataset structure by column names

print(df1.columns == df2.columns)
print(df1.columns == df3.columns)
print(df1.columns == df4.columns)
print(df1.columns == df5.columns)
print(df1.columns == df6.columns)
print(df1.columns == df7.columns)
print(df1.columns == df8.columns)
print(df1.columns == df9.columns)
print(df1.columns == df10.columns)
print(df1.columns == df11.columns)
print(df1.columns == df12.columns)

# all datasets have the same structure which makes it easier to merge for easy analysis.

[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True]

# merging all datasets to form a single dataset

trip_data = pd.concat([df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12])

# snapshot of the first 5 rows of the new concatenated dataset (trip_data)
trip_data.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual
0	99103BB87CC6C1BB	electric_bike	2021-08-10 17:15:49	2021-08-10 17:22:44	NaN	NaN	NaN	NaN	41.77	-87.68	41.77	-87.68	member
1	EAFCCCFB0A3FC5A1	electric_bike	2021-08-10 17:23:14	2021-08-10 17:39:24	NaN	NaN	NaN	NaN	41.77	-87.68	41.77	-87.63	member
2	9EF4F46C57AD234D	electric_bike	2021-08-21 02:34:23	2021-08-21 02:50:36	NaN	NaN	NaN	NaN	41.95	-87.65	41.97	-87.66	member
3	5834D3208BFAF1DA	electric_bike	2021-08-21 06:52:55	2021-08-21 07:08:13	NaN	NaN	NaN	NaN	41.97	-87.67	41.95	-87.65	member
4	CD825CB87ED1D096	electric_bike	2021-08-19 11:55:29	2021-08-19 12:04:11	NaN	NaN	NaN	NaN	41.79	-87.60	41.77	-87.62	member

Summary of Dataset

The trip_data DataFrame has 5901463 rows and 13 columns. The columns include various data types, such as object (string) and float64 (numeric). The ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, and member_casual columns are of type object, while the start_lat, start_lng, end_lat, and end_lng columns are of type float64.

# checking merged data structure and types
print(trip_data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5901463 entries, 0 to 823487
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 630.3+ MB
None

Cleaning and Checking Data for Errors

Checking for Duplicates: The program shows there are no entries of duplicated data.

Checking for Missing/Null Values: Only 5 columns has some missing values. These are:

start_station_name: 860786 rows
start_station_id: 860784 rows
end_station_name: 919896 rows
end_station_id: 919896 rows
end_lat: 5590 rows
end_lng: 5590 rows

All cells with object data types and float data type missing values will be replaced with Not Available and mean (average) value respectively. None of the rows with the missing entries will be deleted since they contain other relevant column entries for our analysis.

Checking for Unique Values: We can comfirm that there is consistency with unique values for columns: rideable_type, member_casual, day_of_week and days.

Change Data Types (dtypes): Converting ride_length and day_of_week column dtypes to datetime and Categorical dtypes respectively.

# checking for duplicates
trip_duplicates = trip_data.duplicated(subset=None, keep=False)
trip_data[trip_duplicates].head().sort_values(by='ride_id')
print(trip_data[trip_duplicates].info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ride_id             0 non-null      object 
 1   rideable_type       0 non-null      object 
 2   started_at          0 non-null      object 
 3   ended_at            0 non-null      object 
 4   start_station_name  0 non-null      object 
 5   start_station_id    0 non-null      object 
 6   end_station_name    0 non-null      object 
 7   end_station_id      0 non-null      object 
 8   start_lat           0 non-null      float64
 9   start_lng           0 non-null      float64
 10  end_lat             0 non-null      float64
 11  end_lng             0 non-null      float64
 12  member_casual       0 non-null      object 
dtypes: float64(4), object(9)
memory usage: 0.0+ bytes
None

# checking for missing values
trip_data.isnull().sum()

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    860786
start_station_id      860784
end_station_name      919896
end_station_id        919896
start_lat                  0
start_lng                  0
end_lat                 5590
end_lng                 5590
member_casual              0
dtype: int64

# replacing missing values for object data types
replace_NA = trip_data[['start_station_name', 'start_station_id', 'end_station_name', 'end_station_id']].fillna('Not Available')
trip_data[['start_station_name', 'start_station_id', 'end_station_name', 'end_station_id']] = replace_NA
replace_NA.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	start_station_name	start_station_id	end_station_name	end_station_id
0	Not Available	Not Available	Not Available	Not Available
1	Not Available	Not Available	Not Available	Not Available
2	Not Available	Not Available	Not Available	Not Available
3	Not Available	Not Available	Not Available	Not Available
4	Not Available	Not Available	Not Available	Not Available

# replacing missing values for float data types
replace_mean = trip_data[['end_lat', 'end_lng']].fillna(trip_data[['end_lat', 'end_lng']]).mean()
trip_data[['end_lat', 'end_lng']] = replace_mean
replace_mean.head()

end_lat    41.901021
end_lng   -87.647608
dtype: float64

# all missing data have been successfully replaced
trip_data.isnull().sum()

ride_id               0
rideable_type         0
started_at            0
ended_at              0
start_station_name    0
start_station_id      0
end_station_name      0
end_station_id        0
start_lat             0
start_lng             0
end_lat               0
end_lng               0
member_casual         0
dtype: int64

# converting started_at and ended_at columns into datetime data types

trip_data[['started_at', 'ended_at']] = trip_data[['started_at', 'ended_at']].apply(pd.to_datetime)

# creating ride_length and day_of_week columns

trip_data['ride_length'] = trip_data['ended_at'] - trip_data['started_at']

# day_of_week where Monday = 0 and Sunday = 6
trip_data['day_of_week'] = trip_data['started_at'].dt.weekday

# Creating a new column (days) and replacing day_of_week values from numerical to name of week days

trip_data['days'] = trip_data['day_of_week'].replace({0:'Monday', 
                                                      1:'Tuesday', 2:'Wednesday', 
                                                      3:'Thursday', 4:'Friday', 
                                                      5:'Saturday', 6:'Sunday'})

# Convert day_of_week into a categorical data type

trip_data['day_of_week'] = trip_data['day_of_week'].astype('category')
trip_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5901463 entries, 0 to 823487
Data columns (total 16 columns):
 #   Column              Dtype          
---  ------              -----          
 0   ride_id             object         
 1   rideable_type       object         
 2   started_at          datetime64[ns] 
 3   ended_at            datetime64[ns] 
 4   start_station_name  object         
 5   start_station_id    object         
 6   end_station_name    object         
 7   end_station_id      object         
 8   start_lat           float64        
 9   start_lng           float64        
 10  end_lat             float64        
 11  end_lng             float64        
 12  member_casual       object         
 13  ride_length         timedelta64[ns]
 14  day_of_week         category       
 15  days                object         
dtypes: category(1), datetime64[ns](2), float64(4), object(8), timedelta64[ns](1)
memory usage: 726.0+ MB

# checking for unique values

print('rideable_type: ', trip_data['rideable_type'].unique(), "\n")
print('member_casual: ', trip_data['member_casual'].unique(), "\n")
print('day_of_week: ', sorted(trip_data['day_of_week'].unique()), "\n")
print('days: ', trip_data['days'].unique())

rideable_type:  ['electric_bike' 'classic_bike' 'docked_bike'] 

member_casual:  ['member' 'casual'] 

day_of_week:  [0, 1, 2, 3, 4, 5, 6] 

days:  ['Tuesday' 'Saturday' 'Thursday' 'Friday' 'Wednesday' 'Sunday' 'Monday']

6. Phase 4: Analyze

After data has been cleaned or processed, we can now proceed to organizing and performing calculations to identify relations and trends that helps answer the business questions.

I will be using mostly Python for the all analytical aspects of this data including visualizations. Tableau will be used to create more appealing visualizations to effectively communicate the results.

# creating ride_length_minutes column from ride_length

trip_data["ride_length_minutes"] = trip_data["ride_length"].dt.total_seconds() / 60
trip_data.head(3)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual	ride_length	day_of_week	days	ride_length_minutes
0	99103BB87CC6C1BB	electric_bike	2021-08-10 17:15:49	2021-08-10 17:22:44	Not Available	Not Available	Not Available	Not Available	41.77	-87.68	41.901021	-87.647608	member	0 days 00:06:55	1	Tuesday	6.916667
1	EAFCCCFB0A3FC5A1	electric_bike	2021-08-10 17:23:14	2021-08-10 17:39:24	Not Available	Not Available	Not Available	Not Available	41.77	-87.68	41.901021	-87.647608	member	0 days 00:16:10	1	Tuesday	16.166667
2	9EF4F46C57AD234D	electric_bike	2021-08-21 02:34:23	2021-08-21 02:50:36	Not Available	Not Available	Not Available	Not Available	41.95	-87.65	41.901021	-87.647608	member	0 days 00:16:13	5	Saturday	16.216667

# Using descriptive statistics to get an overview of the entire dataset
trip_data.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	start_lat	start_lng	end_lat	end_lng	ride_length	ride_length_minutes
count	5.901463e+06	5.901463e+06	5.901463e+06	5.901463e+06	5901463	5.901463e+06
mean	4.190077e+01	-8.764742e+01	4.190102e+01	-8.764761e+01	0 days 00:19:53.376440723	1.988961e+01
std	4.725668e-02	3.093877e-02	4.450215e-09	9.468210e-09	0 days 02:27:59.109436692	1.479852e+02
min	4.164000e+01	-8.784000e+01	4.190102e+01	-8.764761e+01	-1 days +21:42:35	-1.374167e+02
25%	4.188103e+01	-8.766154e+01	4.190102e+01	-8.764761e+01	0 days 00:06:10	6.166667e+00
50%	4.189964e+01	-8.764335e+01	4.190102e+01	-8.764761e+01	0 days 00:10:57	1.095000e+01
75%	4.192957e+01	-8.762872e+01	4.190102e+01	-8.764761e+01	0 days 00:19:49	1.981667e+01
max	4.563503e+01	-7.379648e+01	4.190102e+01	-8.764761e+01	28 days 21:49:10	4.162917e+04

The descriptive statistics for ride_length shows a negative min result. This could be as a result of data entry errors where the start time and end time were interchanged. This will be fixed in the next cell to eliminate any ride_length_minutes row with a value less than 1 minute or more than 1440 minutes.

# checking for inconsistent values for ride_length
outliers_ride_length = (trip_data["ride_length_minutes"] < 1) | (trip_data["ride_length_minutes"] > 1440)
outliers_ride_length.value_counts()

False    5791938
True      109525
Name: ride_length_minutes, dtype: int64

# Removing rows with outliers

trip_data.drop(trip_data.index[outliers_ride_length], inplace=True)
print('All outliers have been deleted')

All outliers have been deleted

# all rows with outliers have been removed

((trip_data["ride_length_minutes"] < 1) | (trip_data["ride_length_minutes"] > 1440)).value_counts()

False    5071194
Name: ride_length_minutes, dtype: int64

Summary Statistics

Based on the descriptive statistics provided below, we can see that there are significant differences in the way that casual riders and annual members use the bike-share service.

Some key observations include:

Member riders take more rides than Casual members: there are around 2.2 million rides taken by casual riders, compared to around 2.9 million rides taken by annual members.
Casual riders take longer rides than annual members: on average, casual riders take rides that are 24.3 minutes long, while annual members take rides that are 12.9 minutes long.
There is more variation in the trip duration taken by casual riders: the standard deviation of ride lengths for casual riders is 42.5 minutes, while the standard deviation for annual members is 19.1 minutes. This suggests that casual riders may be taking longer, more diverse types of rides than annual members.
The longest trip duration is a little higher for member riders at 1439.95 as compared to 1439.92 for casual riders. The shortest trip duration for both riders is 1 minutes.

These differences suggest that casual riders may be using the bike-share service for different purposes than annual members. For example, casual riders may be more likely to use the service for longer, leisurely rides or for transportation to and from destinations that are further away. Annual members, on the other hand, may be more likely to use the service for shorter, more frequent trips, such as commuting to work or running errands.

To design a marketing strategy that targets casual riders and aims to convert them into annual members, it will be important to understand these differences in more detail and consider how they might influence the types of incentives or promotions that are most likely to be effective.

# Group the data by the member_casual column and calculate summary statistics
summary_stats = trip_data.groupby("member_casual")["ride_length_minutes"].describe()
summary_stats

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	count	mean	std	min	25%	50%	75%	max
member_casual
casual	2179148.0	24.295795	42.502919	1.0	8.450000	14.65	26.600000	1439.916667
member	2892046.0	12.920055	19.111587	1.0	5.416667	9.20	15.883333	1439.950000

# creating trip_distance column using geo data

trip_data["trip_distance"] = np.sqrt(((trip_data["end_lat"] - trip_data["start_lat"])**2) + ((trip_data["end_lng"] - trip_data["start_lng"])**2))
trip_data["trip_distance"] = trip_data["trip_distance"] * 111.139

# Summary statistics of dataset

trip_data.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	start_lat	start_lng	end_lat	end_lng	ride_length	ride_length_minutes	trip_distance
count	5.071194e+06	5.071194e+06	5.071194e+06	5.071194e+06	5071194	5.071194e+06	5.071194e+06
mean	4.190087e+01	-8.764739e+01	4.190102e+01	-8.764761e+01	0 days 00:17:48.500153612	1.780834e+01	4.961660e+00
std	4.715035e-02	3.036223e-02	3.203617e-09	7.084524e-09	0 days 00:31:52.752662185	3.187921e+01	3.772230e+00
min	4.164000e+01	-8.784000e+01	4.190102e+01	-8.764761e+01	0 days 00:01:00	1.000000e+00	1.996711e-01
25%	4.188103e+01	-8.766150e+01	4.190102e+01	-8.764761e+01	0 days 00:06:24	6.400000e+00	2.501338e+00
50%	4.189993e+01	-8.764335e+01	4.190102e+01	-8.764761e+01	0 days 00:11:11	1.118333e+01	3.617545e+00
75%	4.192957e+01	-8.762872e+01	4.190102e+01	-8.764761e+01	0 days 00:20:04	2.006667e+01	5.963345e+00
max	4.207000e+01	-8.752000e+01	4.190102e+01	-8.764761e+01	0 days 23:59:57	1.439950e+03	3.137809e+01

7. Phase 5: Share

This phase of the project is to identify patterns and share findings from the data. Effective visualization will be shared in this section. As mentioned ealier, Tableau will be used for the visuals.

Before proceeding to create our visualizations with Tableau, we first have to export our cleaned and transformed data as a CSV file as illustrated in the cell below:

Percentage Distribution of Trips Taken

The distribution of customer types shows that 57.03% of trips taken by Cyclistic customers are members, while 42.74% (2,522,226) are casual riders. This indicates that the majority of the trips taken are by customers who have purchased an annual membership, while the other proportion of customers prefer to pay per ride.

# Percentage of Trips Taken by Customer Types

# Count the number of rides for each customer type
customer_type = trip_data['member_casual'].value_counts()
print(customer_type)

plt.style.use('seaborn')
plt.pie(customer_type, labels = customer_type.index, radius = 1.4, autopct = '%0.2f%%', colors = ['teal', 'grey'], 
        textprops = {'fontsize' : 12})
plt.axis('equal')
plt.title('Pie Distribution of Customer Types', fontsize = 14 )
plt.legend(loc = 'upper right', fontsize = 12)

circle = plt.Circle(xy=(0,0), radius = 0.65, facecolor = 'white')
plt.gca().add_artist(circle)

member    2892046
casual    2179148
Name: member_casual, dtype: int64





<matplotlib.patches.Circle at 0x25fc47b6370>

Distribution of Rideable Types by Users

Based on the given data, the distribution of rideable types by user type (member_casual) can be summarized as follows:

Classic bikes:

Casual riders: 1,132,892 trips (44.92%)
Member riders: 1,922,749 trips (56.90%)

Docked bikes:

Casual riders: 226,728 trips (8.99%)

Electric bikes:

Casual riders: 1,162,606 trips (46.09%)
Member riders: 1,456,488 trips (43.10%)

These percentages represent the proportion of trips taken by each rideable type within each user type category. It shows that classic bikes are more popular among member riders, while electric bikes are more commonly used by casual riders. Docked bikes are primarily used by casual riders.

# Count the number of trips taken by rideable_type and group them by member_casual
trip_counts = trip_data.groupby(['rideable_type', 'member_casual']).size().reset_index(name='count')

# Calculate the total count for each member_casual group
total_counts = trip_counts.groupby('member_casual')['count'].transform('sum')

# Calculate the percentage for each trip count
trip_counts['percentage'] = (trip_counts['count'] / total_counts) * 100
trip_counts.sort_values('member_casual')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	rideable_type	member_casual	count	percentage
0	classic_bike	casual	981619	45.046000
2	docked_bike	casual	195924	8.990853
3	electric_bike	casual	1001605	45.963147
1	classic_bike	member	1647662	56.972192
4	electric_bike	member	1244384	43.027808

# Group by member_casual and rideable_type
rideable_type_grouped = trip_data.groupby(['member_casual', 'rideable_type']).size().reset_index(name='count')

# Plot the bar chart
plt.figure(figsize=(8, 6))
sns.barplot(x='member_casual', y='count', hue='rideable_type', data=rideable_type_grouped)
plt.xlabel('Member/Casual')
plt.ylabel('Count')
plt.title('Distribution of Rideable Types by Users')
plt.legend(title='Rideable Type')
    
# Show the plot
plt.show()

Trips Taken per Day

These findings suggest that there may be different usage patterns and preferences between Member and Casual riders, with Member riders tending to use the service more frequently on weekdays (Monday - Friday), while Casual riders have a higher proportion of rides on weekends (Saturday and Sunday).

# Plotting Number of Trips Taken per Day

days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

ax = sns.countplot(data=trip_data, x='days', hue='member_casual', color=sns.set_palette(["teal", "grey"]), order=days_order)
plt.title('Number of Rides per Day')
plt.ylabel('Number of Daily Rides')
plt.xlabel('Days')

# Adding labels to the x-axis
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), 
            fontsize=8, color='black', ha='center', va='bottom')
    
# Placing the legend on the top right corner
ax.legend(title='User Type', bbox_to_anchor=(1.02, 1), loc='upper left')

plt.show()

# Group by 'member_casual' and 'day_of_week', and calculate the mean distance
mean_distance = trip_data.groupby(['member_casual', 'day_of_week'])['trip_distance'].mean().reset_index()

# Plot the mean distances using a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_week', y='trip_distance', hue='member_casual', data=mean_distance)
plt.xlabel('Day of Week')
plt.ylabel('Mean Distance')
plt.title('Mean Distance Traveled Within the Week')
plt.legend(title='Member/Casual')

# Show the plot
plt.show()

Trips per Season

As observed from the diagram, most rides occurs during summer. Member riders accounts for most ride with a total sum of 1,209,267 rides whilst casual riders accounts for 1,187,777 rides during the summer.

# Plotting Number of Rides per Season

trip_data['seasons'] = trip_data["started_at"].dt.month

# assign seasons to the season column
trip_data['seasons'] = trip_data['seasons'].apply(lambda x: "Winter" if x in [12, 1, 2] 
                        else "Spring" if x in [3, 4, 5] 
                        else "Summer" if x in [6, 7, 8] 
                        else "Fall")

season_order = ["Fall", "Summer", "Spring", "Winter"]
ax = sns.countplot(data=trip_data, x='seasons', hue='member_casual', color=sns.set_palette(["teal", "grey"]), order=season_order)
plt.title('Number of Rides per Season')
plt.ylabel('Number of Seasonal Rides (Million)')
plt.xlabel('Seasons')

# Adding labels to the x-axis
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), 
            fontsize=8, color='black', ha='center', va='bottom')

plt.show()

Trips per Daytime

We can depict from the diagram below that most trips occurs during the afternoon of which majority of the rides are from member riders as opposed to casual riders. The second most daytime trips occurs during the night then followed by morning trips.

# Plotting the Number of Rides per Daytime

# create a new column for the day time
trip_data['daytime'] = trip_data["started_at"].dt.hour

# assign day times to the day_time column
trip_data['daytime'] = trip_data['daytime'].apply(lambda x: "Morning" if x in range(6, 12) 
                        else "Afternoon" if x in range(12, 18) 
                        else "Night")

daytime_order = ["Afternoon", "Night", "Morning"]
ax = sns.countplot(data=trip_data, x='daytime', hue='member_casual', color=sns.set_palette(["teal", "grey"]), order=daytime_order)
plt.title('Number of Rides per Day')
plt.xlabel('Day Time')
plt.ylabel('Number of Day Time Rides (Million)')

# Adding labels to the x-axis
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), 
            fontsize=8, color='black', ha='center', va='bottom')
    
plt.show()

Trips per Month

# Plotting the Number of Rides per Month

# creating a month column
trip_data['month'] = trip_data["started_at"].dt.month
month_dict = {1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June', 7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'}
trip_data['month'] = trip_data['month'].map(month_dict)
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

ax = sns.countplot(data=trip_data, x='month', hue='member_casual', color=sns.set_palette(["teal", "grey"]), order=month_order)
plt.title('Number of Rides per Month')
plt.xlabel('Month')
plt.xticks(rotation=90)
plt.ylabel('Number of Monthly Rides')

# Adding labels to the x-axis
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), 
            fontsize=6, color='black', ha='center', va='bottom')
    
plt.show()
print(ax)

AxesSubplot(0.125,0.11;0.775x0.77)

# creating a month column
trip_data['month'] = trip_data["started_at"].dt.month
month_dict = {1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June', 
              7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'}
trip_data['month'] = trip_data['month'].map(month_dict)
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 
               'August', 'September', 'October', 'November', 'December']

# Get the count of rides for each month and member_casual
monthly_rides = trip_data.groupby(['month','member_casual'])['ride_length_minutes'].size()

# convert the series into dataframe for seaborn plot
monthly_rides = monthly_rides.reset_index(name='count')

# Convert the 'month' column to a categorical variable with the desired order
monthly_rides['month'] = pd.Categorical(monthly_rides['month'], categories=month_order, ordered=True)

# Create the line plot with the ordered month variable as the x-axis
sns.lineplot(x='month', y='count', data=monthly_rides, hue='member_casual', color=sns.set_palette(["teal", "grey"]))
plt.title('Number of Rides per Month')
plt.xlabel('Month')
plt.xticks(rotation=90)
plt.ylabel('Number of Monthly Rides')

# Get the maximum and minimum values
max_value = monthly_rides['count'].max()
min_value = monthly_rides['count'].min()

# Annotate the maximum value
plt.annotate(f'Max: {max_value}', xy=(monthly_rides.loc[monthly_rides['count'] == max_value, 'month'].item(), max_value), xytext=(-10, 10), textcoords='offset points',
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=.2'))

# Annotate the minimum value
plt.annotate(f'Min: {min_value}', xy=(monthly_rides.loc[monthly_rides['count'] == min_value, 'month'].item(), min_value), xytext=(-10, -20), textcoords='offset points',
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=-.2'))


plt.show()

# Group the data by month and member_casual columns
monthly_rides3 = trip_data.groupby(['month', 'member_casual'])['ride_length_minutes'].mean().reset_index()

# Use the seaborn library to create a bar plot
import seaborn as sns
sns.barplot(x='month', y='ride_length_minutes', hue='member_casual', data=monthly_rides3)

# Add x and y labels
plt.xlabel('Month')
plt.ylabel('Average Ride Length (minutes)')

# Rotate the x-axis labels to be vertical
plt.xticks(rotation=90)

# Show the plot
plt.show()

trip_data

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	...	end_lng	member_casual	ride_length	day_of_week	days	ride_length_minutes	trip_distance	seasons	daytime	month
0	99103BB87CC6C1BB	electric_bike	2021-08-10 17:15:49	2021-08-10 17:22:44	Not Available	Not Available	Not Available	Not Available	41.77	-87.68	...	-87.647608	member	0 days 00:06:55	1	Tuesday	6.916667	14.999919	Summer	Afternoon	August
1	EAFCCCFB0A3FC5A1	electric_bike	2021-08-10 17:23:14	2021-08-10 17:39:24	Not Available	Not Available	Not Available	Not Available	41.77	-87.68	...	-87.647608	member	0 days 00:16:10	1	Tuesday	16.166667	14.999919	Summer	Afternoon	August
2	9EF4F46C57AD234D	electric_bike	2021-08-21 02:34:23	2021-08-21 02:50:36	Not Available	Not Available	Not Available	Not Available	41.95	-87.65	...	-87.647608	member	0 days 00:16:13	5	Saturday	16.216667	5.449995	Summer	Night	August
3	5834D3208BFAF1DA	electric_bike	2021-08-21 06:52:55	2021-08-21 07:08:13	Not Available	Not Available	Not Available	Not Available	41.97	-87.67	...	-87.647608	member	0 days 00:15:18	5	Saturday	15.300000	8.060094	Summer	Morning	August
4	CD825CB87ED1D096	electric_bike	2021-08-19 11:55:29	2021-08-19 12:04:11	Not Available	Not Available	Not Available	Not Available	41.79	-87.60	...	-87.647608	member	0 days 00:08:42	3	Thursday	8.700000	13.425361	Summer	Morning	August
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
823483	605787F70B3B9FD3	electric_bike	2022-07-09 08:32:15	2022-07-09 08:56:10	Not Available	Not Available	Not Available	Not Available	41.74	-87.68	...	-87.647608	member	0 days 00:23:55	5	Saturday	23.916667	18.254188	Summer	Morning	July
823484	F0EECBEE637DF028	electric_bike	2022-07-17 13:27:57	2022-07-17 13:33:09	Not Available	Not Available	Not Available	Not Available	41.87	-87.66	...	-87.647608	member	0 days 00:05:12	6	Sunday	5.200000	3.712511	Summer	Afternoon	July
823485	B8B091DC72DDAB9D	electric_bike	2022-07-17 14:51:37	2022-07-17 14:57:01	Not Available	Not Available	Not Available	Not Available	41.95	-87.64	...	-87.647608	member	0 days 00:05:24	6	Sunday	5.400000	5.508790	Summer	Afternoon	July
823486	B4D3FFCC1F3AF5EC	electric_bike	2022-07-28 13:41:50	2022-07-28 13:47:17	Not Available	Not Available	Not Available	Not Available	41.88	-87.63	...	-87.647608	member	0 days 00:05:27	3	Thursday	5.450000	3.047559	Summer	Afternoon	July
823487	AA9217C8DA3BAC0B	electric_bike	2022-07-29 13:05:20	2022-07-29 13:10:40	Not Available	Not Available	Not Available	Not Available	41.90	-87.66	...	-87.647608	member	0 days 00:05:20	4	Friday	5.333333	1.381877	Summer	Afternoon	July

5071194 rows × 21 columns

# Exporting dataset as a csv file

trip_data.to_csv('trip_data.csv')
print("Dataset exported as CSV file successfully.")

Dataset exported as CSV file successfully.

pd.read_csv('trip_data.csv')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Unnamed: 0	ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	...	end_lng	member_casual	ride_length	day_of_week	days	ride_length_minutes	trip_distance	seasons	daytime	month
0	0	99103BB87CC6C1BB	electric_bike	2021-08-10 17:15:49	2021-08-10 17:22:44	Not Available	Not Available	Not Available	Not Available	41.77	...	-87.647608	member	0 days 00:06:55	1	Tuesday	6.916667	14.999919	Summer	Afternoon	August
1	1	EAFCCCFB0A3FC5A1	electric_bike	2021-08-10 17:23:14	2021-08-10 17:39:24	Not Available	Not Available	Not Available	Not Available	41.77	...	-87.647608	member	0 days 00:16:10	1	Tuesday	16.166667	14.999919	Summer	Afternoon	August
2	2	9EF4F46C57AD234D	electric_bike	2021-08-21 02:34:23	2021-08-21 02:50:36	Not Available	Not Available	Not Available	Not Available	41.95	...	-87.647608	member	0 days 00:16:13	5	Saturday	16.216667	5.449995	Summer	Night	August
3	3	5834D3208BFAF1DA	electric_bike	2021-08-21 06:52:55	2021-08-21 07:08:13	Not Available	Not Available	Not Available	Not Available	41.97	...	-87.647608	member	0 days 00:15:18	5	Saturday	15.300000	8.060094	Summer	Morning	August
4	4	CD825CB87ED1D096	electric_bike	2021-08-19 11:55:29	2021-08-19 12:04:11	Not Available	Not Available	Not Available	Not Available	41.79	...	-87.647608	member	0 days 00:08:42	3	Thursday	8.700000	13.425361	Summer	Morning	August
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5071189	823483	605787F70B3B9FD3	electric_bike	2022-07-09 08:32:15	2022-07-09 08:56:10	Not Available	Not Available	Not Available	Not Available	41.74	...	-87.647608	member	0 days 00:23:55	5	Saturday	23.916667	18.254188	Summer	Morning	July
5071190	823484	F0EECBEE637DF028	electric_bike	2022-07-17 13:27:57	2022-07-17 13:33:09	Not Available	Not Available	Not Available	Not Available	41.87	...	-87.647608	member	0 days 00:05:12	6	Sunday	5.200000	3.712511	Summer	Afternoon	July
5071191	823485	B8B091DC72DDAB9D	electric_bike	2022-07-17 14:51:37	2022-07-17 14:57:01	Not Available	Not Available	Not Available	Not Available	41.95	...	-87.647608	member	0 days 00:05:24	6	Sunday	5.400000	5.508790	Summer	Afternoon	July
5071192	823486	B4D3FFCC1F3AF5EC	electric_bike	2022-07-28 13:41:50	2022-07-28 13:47:17	Not Available	Not Available	Not Available	Not Available	41.88	...	-87.647608	member	0 days 00:05:27	3	Thursday	5.450000	3.047559	Summer	Afternoon	July
5071193	823487	AA9217C8DA3BAC0B	electric_bike	2022-07-29 13:05:20	2022-07-29 13:10:40	Not Available	Not Available	Not Available	Not Available	41.90	...	-87.647608	member	0 days 00:05:20	4	Friday	5.333333	1.381877	Summer	Afternoon	July

5071194 rows × 22 columns

Download Cyclistic Dashboard PDF

8. Phase 6: Conclusion (Act)

The purchase of this phase is to discuss the findings from the analysis and suggest possible optional decisions that the key stakeholders should act on. The following study questions as mentioned previously which we seek to answer are as follow:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

8.1. Summary of Findings

The findings from our analysis reveals the following:

Note: All figures used in this section are approximated to the nearest whole number.

Distribution of User Types: Cyclistic has more member riders than casual riders. This inherently means that a higher chunk of the company's profit comes from annual member riders. This result matches with the finance analyst observation that annual members are much more profitable than casual riders. To put the figures into perpectives, 57% of riders are annual members, whilst 43% of the riders are casual riders.
Distribution of Ride Types: Cyclistic has 3 main types of rides: classic bikes, electric bikes and docked bikes. According to our analysis, 52% of all rides chosen are classic bikes which are basically 2-wheeled traditional bikes. Electric bikes has the second highest distribution with 44%, whilst the least used bike form are docked bikes with a usage share of 4%.
Number of Rides per Day
- 3.1. Member riders have an almost equally distributed usage rate per day with peak days Tuesday, Wednesday, Thursday, and Saturday averaging 450,000 rides. Sunday has the lowest amount of rides at 360,000, whilst Monday and Friday are both averaging 400,000 rides.
- 3.2. The highest number of rides taken by casual riders is recorded on Saturday with a total of 460,000. This record not only applies to casual riders but is the highest recorded ride in a day for Cyclistic during the entire 12 month period. The second highest rides are taken on Sundays with a total of 410,000. Tuesdays and Wenesdays records the lowest number of rides at an average of 240,000.
Number of Ride Type per User
- 4.1. A higher proportion of member riders choose classic bikes followed by electric bikes. Within the analyzed 12 month period, non of the member riders chose docked bikes.
- 4.2. A slighly higher proportion of casual riders chose electric bikes followed by classic bikes. All rides taken with docked bikes comes from casual riders with the total number of 200,000 rides which represents the 4% distribution of ride type within the period.
Percentage of Total Rides per Month: Both member and casual riders record their highest number of rides on July with an average of 7%. The second and third highest number of rides are on August and June, respectively for casual riders. For member riders, the second and third rides were taken on June and August. Both customer types have their lowest number of rides recorded on January with 1% for members and 0.3% for casual riders.
Number of Rides per Season: Most rides are taken during summer for both members and casual riders at an average of 1 million. The second highest number of rides is recorded during Fall, followed by Spring and then Winter as the lowest.
Ride Length: It is revealed that casual riders take longer rides on average than member riders.

8.2. Recommendations

The supported recommendations from the author is as follow:

Since casual riders take longer rides on average than member riders, marketing should be targeted towards showing the cost savings that casual riders will benefit from switching to annual membership. It should be clear from the ad campaign that the monthly expense from casual riders is more as compared to the standard monthly cost of subscriptions.
Further incentives like discount offers can be included in the campaign to drive interest for clients to make the switch from casual to annual membership.
Winter season has the lowest number of rides taken. In order to improve rides during this season, we recommend that Cyclistic should improve riding comfort by offering safety gears like gloves, nose masks, winter tires, and winter jackets. These offerings can be made more effective through partnership with local retailers.
Cyclistic should offer specialized Saturday and Sunday discounts to attract members to take more rides during these days. The point-earning system can be introduced which gives people an opportunity to earn points to attractive discounts to online retailers like food delivery, fashion, and supermarkets.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
Certificates		Certificates
Datasets		Datasets
Certificate Info.md		Certificate Info.md
Cyclistic Dashboard.pdf		Cyclistic Dashboard.pdf
Cyclistic Dashboard.png		Cyclistic Dashboard.png
README.md		README.md
cyclistic_data_analytics.ipynb		cyclistic_data_analytics.ipynb
output_37_2.png		output_37_2.png
output_40_0.png		output_40_0.png
output_42_0.png		output_42_0.png
output_43_0.png		output_43_0.png
output_45_0.png		output_45_0.png
output_47_0.png		output_47_0.png
output_49_0.png		output_49_0.png
output_50_0.png		output_50_0.png
output_51_0.png		output_51_0.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CAPSTONE PROJECT - GOOGLE DATA ANALYTICS PROFESSIONAL CERTIFICATE

TABLE OF CONTENT

1. SUMMARY

1.1. About Company

1.2. Services

Products

2. STUDY QUESTIONS

3. PHASE 1: ASK

3.1. Business Task

3.2. Key Stakeholders

4. Phase 2: Prepare

5. Phase 3: Process

New Columns:

Summary of Dataset

Cleaning and Checking Data for Errors

6. Phase 4: Analyze

Summary Statistics

7. Phase 5: Share

Percentage Distribution of Trips Taken

Distribution of Rideable Types by Users

Trips Taken per Day

Trips per Season

Trips per Daytime

Trips per Month

8. Phase 6: Conclusion (Act)

8.1. Summary of Findings

8.2. Recommendations

About

Releases

Packages

Languages

CypranA/Google-Data-Analytics-Professional-Certificate

Folders and files

Latest commit

History

Repository files navigation

CAPSTONE PROJECT - GOOGLE DATA ANALYTICS PROFESSIONAL CERTIFICATE

TABLE OF CONTENT

1. SUMMARY

1.1. About Company

1.2. Services

Products

2. STUDY QUESTIONS

3. PHASE 1: ASK

3.1. Business Task

3.2. Key Stakeholders

4. Phase 2: Prepare

5. Phase 3: Process

New Columns:

Summary of Dataset

Cleaning and Checking Data for Errors

6. Phase 4: Analyze

Summary Statistics

7. Phase 5: Share

Percentage Distribution of Trips Taken

Distribution of Rideable Types by Users

Trips Taken per Day

Trips per Season

Trips per Daytime

Trips per Month

8. Phase 6: Conclusion (Act)

8.1. Summary of Findings

8.2. Recommendations

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages