Hey!
This is my take on the capstone project for Google’s Data Analytics Professional Certificate. As part of the final course, I will be showcasing all the skills I learned throughout the program into a practice case study. Using a real-world business problem, I will be tackling it with the Ask, Prepare, Process, Share, and Act methodology. Any feedback would be greatly appreciated!
You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.
● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals—as well as how you, as a junior data analyst, can help Cyclistic achieve them.
● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
A clear statement of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of your analysis
Supporting visualizations and key findings
Your top three recommendations based on your analysis
The data that will be used in this analysis is provided by Divvy, a real bike-share company, from which Cyclistic is based on and is organized into csv files by month. Details regarding its public use can be found on Divvy’s License Agreement.
To keep this analysis as real as possible, I will be using the last 12 months of data available to me (at this time is April 2023 - March 2024).
We will first begin by loading in the libraries needed for analyzing Cyclistic’s historical data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(geosphere)
library(lubridate)
library(leaflet)
Next we will import the 12 most recent datasets (April 2023 - April 2024).
CyclisticData202304 <- read.csv("~/Data/202304-divvy-tripdata.csv")
CyclisticData202305 <- read.csv("~/Data/202305-divvy-tripdata.csv")
CyclisticData202306 <- read.csv("~/Data/202306-divvy-tripdata.csv")
CyclisticData202307 <- read.csv("~/Data/202307-divvy-tripdata.csv")
CyclisticData202308 <- read.csv("~/Data/202308-divvy-tripdata.csv")
CyclisticData202309 <- read.csv("~/Data/202309-divvy-tripdata.csv")
CyclisticData202310 <- read.csv("~/Data/202310-divvy-tripdata.csv")
CyclisticData202311 <- read.csv("~/Data/202311-divvy-tripdata.csv")
CyclisticData202312 <- read.csv("~/Data/202312-divvy-tripdata.csv")
CyclisticData202401 <- read.csv("~/Data/202401-divvy-tripdata.csv")
CyclisticData202402 <- read.csv("~/Data/202402-divvy-tripdata.csv")
CyclisticData202403 <- read.csv("~/Data/202403-divvy-tripdata.csv")
CyclisticData202404 <- read.csv("~/Data/202404-divvy-tripdata.csv")
After examining the data, we see that the relative structure of columns and data types between each report remains consistent. Therefore, we can combine these datasets into a new dataframe.
CyclisticMerged <- rbind(CyclisticData202304, CyclisticData202305, CyclisticData202306, CyclisticData202307, CyclisticData202308, CyclisticData202309, CyclisticData202310, CyclisticData202311, CyclisticData202312, CyclisticData202401, CyclisticData202402, CyclisticData202403, CyclisticData202404)
head(CyclisticMerged)
## ride_id rideable_type started_at ended_at
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## start_station_name start_station_id end_station_name end_station_id start_lat
## 1 41.80
## 2 41.87
## 3 41.93
## 4 41.92
## 5 41.91
## 6 41.91
## start_lng end_lat end_lng member_casual
## 1 -87.60 41.79 -87.60 member
## 2 -87.65 41.93 -87.68 member
## 3 -87.66 41.93 -87.66 member
## 4 -87.65 41.91 -87.65 member
## 5 -87.65 41.91 -87.63 member
## 6 -87.63 41.92 -87.65 member
As we can see our merged dataset was a success. Before we begin cleaning the data, let’s get some important metrics about our dataset.
glimpse(CyclisticMerged)
## Rows: 6,165,202
## Columns: 13
## $ ride_id <chr> "8FE8F7D9C10E88C7", "34E4ED3ADF1D821B", "5296BF07A2…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <chr> "2023-04-02 08:37:28", "2023-04-19 11:29:02", "2023…
## $ ended_at <chr> "2023-04-02 08:41:37", "2023-04-19 11:52:12", "2023…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat <dbl> 41.80, 41.87, 41.93, 41.92, 41.91, 41.91, 41.93, 42…
## $ start_lng <dbl> -87.60, -87.65, -87.66, -87.65, -87.65, -87.63, -87…
## $ end_lat <dbl> 41.79, 41.93, 41.93, 41.91, 41.91, 41.92, 41.91, 41…
## $ end_lng <dbl> -87.60, -87.68, -87.66, -87.65, -87.63, -87.65, -87…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
summary(CyclisticMerged)
## ride_id rideable_type started_at ended_at
## Length:6165202 Length:6165202 Length:6165202 Length:6165202
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:6165202 Length:6165202 Length:6165202 Length:6165202
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.63 Min. :-87.94 Min. : 0.00 Min. :-88.16
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.46 Max. :42.18 Max. : 0.00
## NA's :8045 NA's :8045
## member_casual
## Length:6165202
## Class :character
## Mode :character
##
##
##
##
Key takeaways:
However, there is one big concern we need to address. Namely, rides are not all grouped by the rider_id created, so we cannot measure how frequently the person behind the ID(s) uses it.
With this in mind, let’s move on to the cleaning step.
Let’s start by using the distinct function to remove duplicate rider_ids while preserving the columns.
CyclisticMergedDupRemoved <- CyclisticMerged %>%
distinct(ride_id, .keep_all = TRUE)
print(paste("Removed", nrow(CyclisticMerged)-nrow(CyclisticMergedDupRemoved), "duplicate rows"))
## [1] "Removed 0 duplicate rows"
Now let’s create a new column to measure the ride time by finding the difference between the end and start time columns.
CyclisticMergedDupRemoved$ride_time_calc <- (as.double(difftime(CyclisticMerged$ended_at, CyclisticMerged$started_at))) / 60
Now let’s see a summary of the data:
summary(CyclisticMergedDupRemoved$ride_time_calc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16716.52 5.47 9.60 18.29 17.07 98489.07
Now from our newly created ride_time_calc column, we should remove any outliers such as rides under 60 seconds and any especially lengthy rides so that we do not severely skew our data. Let’s first see how our data is skewed and move on from there.
quantile_ride_time_calc = quantile(CyclisticMergedDupRemoved$ride_time_calc, seq(0, 1, by=0.01))
quantile_ride_time_calc
## 0% 1% 2% 3% 4%
## -1.671652e+04 2.833333e-01 7.000000e-01 1.333333e+00 1.833333e+00
## 5% 6% 7% 8% 9%
## 2.166667e+00 2.433333e+00 2.666667e+00 2.866667e+00 3.050000e+00
## 10% 11% 12% 13% 14%
## 3.233333e+00 3.400000e+00 3.566667e+00 3.716667e+00 3.883333e+00
## 15% 16% 17% 18% 19%
## 4.033333e+00 4.166667e+00 4.316667e+00 4.466667e+00 4.616667e+00
## 20% 21% 22% 23% 24%
## 4.750000e+00 4.900000e+00 5.033333e+00 5.183333e+00 5.333333e+00
## 25% 26% 27% 28% 29%
## 5.466667e+00 5.616667e+00 5.766667e+00 5.900000e+00 6.050000e+00
## 30% 31% 32% 33% 34%
## 6.200000e+00 6.350000e+00 6.500000e+00 6.650000e+00 6.816667e+00
## 35% 36% 37% 38% 39%
## 6.966667e+00 7.116667e+00 7.283333e+00 7.450000e+00 7.600000e+00
## 40% 41% 42% 43% 44%
## 7.766667e+00 7.950000e+00 8.116667e+00 8.283333e+00 8.466667e+00
## 45% 46% 47% 48% 49%
## 8.650000e+00 8.833333e+00 9.016667e+00 9.216667e+00 9.400000e+00
## 50% 51% 52% 53% 54%
## 9.600000e+00 9.800000e+00 1.001667e+01 1.021667e+01 1.043333e+01
## 55% 56% 57% 58% 59%
## 1.066667e+01 1.088333e+01 1.111667e+01 1.136667e+01 1.161667e+01
## 60% 61% 62% 63% 64%
## 1.186667e+01 1.213333e+01 1.240000e+01 1.268333e+01 1.298333e+01
## 65% 66% 67% 68% 69%
## 1.328333e+01 1.358333e+01 1.391667e+01 1.425000e+01 1.460000e+01
## 70% 71% 72% 73% 74%
## 1.496667e+01 1.535000e+01 1.575000e+01 1.616667e+01 1.660000e+01
## 75% 76% 77% 78% 79%
## 1.706667e+01 1.755000e+01 1.805000e+01 1.860000e+01 1.916667e+01
## 80% 81% 82% 83% 84%
## 1.976667e+01 2.041667e+01 2.111667e+01 2.186667e+01 2.266667e+01
## 85% 86% 87% 88% 89%
## 2.355000e+01 2.448333e+01 2.553333e+01 2.670000e+01 2.800000e+01
## 90% 91% 92% 93% 94%
## 2.946667e+01 3.115000e+01 3.310000e+01 3.545000e+01 3.826667e+01
## 95% 96% 97% 98% 99%
## 4.183333e+01 4.675000e+01 5.453333e+01 6.815000e+01 9.975000e+01
## 100%
## 9.848907e+04
Seeing this, we can see that we should remove any values less than 1 minute, and any values greater than 120 minutes to avoid skew
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
filter(ride_time_calc >=1) %>%
filter(ride_time_calc <=120)
summary(CyclisticMergedDupRemoved$ride_time_calc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.717 9.783 14.049 17.100 120.000
Now we can create a two new columns to show what day of the week each ride was and what month it took place.
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
mutate(weekday = strftime(CyclisticMergedDupRemoved$started_at, "%A"))
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
mutate(month = strftime(CyclisticMergedDupRemoved$started_at, "%B"))
head(CyclisticMergedDupRemoved)
## ride_id rideable_type started_at ended_at
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## start_station_name start_station_id end_station_name end_station_id start_lat
## 1 41.80
## 2 41.87
## 3 41.93
## 4 41.92
## 5 41.91
## 6 41.91
## start_lng end_lat end_lng member_casual ride_time_calc weekday month
## 1 -87.60 41.79 -87.60 member 4.150000 Sunday April
## 2 -87.65 41.93 -87.68 member 23.166667 Wednesday April
## 3 -87.66 41.93 -87.66 member 2.000000 Wednesday April
## 4 -87.65 41.91 -87.65 member 3.650000 Wednesday April
## 5 -87.65 41.91 -87.63 member 4.833333 Wednesday April
## 6 -87.63 41.92 -87.65 member 4.066667 Wednesday April
We can then change this new column to a factor for faster analysis.
CyclisticMergedDupRemoved$weekday <- as.factor(CyclisticMergedDupRemoved$weekday)
CyclisticMergedDupRemoved$month <-
as.factor(CyclisticMergedDupRemoved$month)
class(CyclisticMergedDupRemoved$weekday)
## [1] "factor"
nlevels(CyclisticMergedDupRemoved$weekday)
## [1] 7
summary(CyclisticMergedDupRemoved$weekday)
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 863022 781974 932384 771390 895379 863811 859323
class(CyclisticMergedDupRemoved$month)
## [1] "factor"
nlevels(CyclisticMergedDupRemoved$month)
## [1] 12
summary(CyclisticMergedDupRemoved$month)
## April August December February January July June March
## 811052 746786 218246 217885 139529 740487 695001 293432
## May November October September
## 582422 353484 522055 646904
Let’s also add a column that stores the start time of these rides.
CyclisticMergedDupRemoved$started_at <- ymd_hms(CyclisticMergedDupRemoved$started_at)
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
mutate(hour_started = floor_date(started_at, unit = "hour"))
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
mutate(hour_started = strftime(CyclisticMergedDupRemoved$hour_started, "%R"))
head(CyclisticMergedDupRemoved)
## ride_id rideable_type started_at ended_at
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## start_station_name start_station_id end_station_name end_station_id start_lat
## 1 41.80
## 2 41.87
## 3 41.93
## 4 41.92
## 5 41.91
## 6 41.91
## start_lng end_lat end_lng member_casual ride_time_calc weekday month
## 1 -87.60 41.79 -87.60 member 4.150000 Sunday April
## 2 -87.65 41.93 -87.68 member 23.166667 Wednesday April
## 3 -87.66 41.93 -87.66 member 2.000000 Wednesday April
## 4 -87.65 41.91 -87.65 member 3.650000 Wednesday April
## 5 -87.65 41.91 -87.63 member 4.833333 Wednesday April
## 6 -87.63 41.92 -87.65 member 4.066667 Wednesday April
## hour_started
## 1 03:00
## 2 06:00
## 3 03:00
## 4 08:00
## 5 07:00
## 6 07:00
Likewise, let’s make this into a factor for faster analysis later on. Note, NA’s occur for starting stations due to electric bikes as they do not need to be docked.
CyclisticMergedDupRemoved$hour_started <- as.factor(CyclisticMergedDupRemoved$hour_started)
class(CyclisticMergedDupRemoved$hour_started)
## [1] "factor"
nlevels(CyclisticMergedDupRemoved$hour_started)
## [1] 24
summary(CyclisticMergedDupRemoved$hour_started)
## 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00
## 64686 164208 273398 311507 243032 253594 307059 348156 354852 373699 443645
## 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00
## 545428 588123 478015 343576 246006 195066 153651 103296 70012 42942 25359
## 22:00 23:00
## 16325 21648
Lastly, let’s create a new column to measure the distance between start and ending coordinates. Note, some rows contain NA for ending station and starting due to electric bikes not being required to be docked.
CyclisticMergedDupRemoved$distance_km <-
distGeo(matrix(c(CyclisticMergedDupRemoved$start_lng, CyclisticMergedDupRemoved$start_lat), ncol=2), matrix(c(CyclisticMergedDupRemoved$end_lng, CyclisticMergedDupRemoved$end_lat), ncol=2)) / 1000
summary(CyclisticMergedDupRemoved$distance_km)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.921 1.594 2.160 2.798 9814.022 302
We see that the max is particularly high, let’s see what the head of a descended sort of this data looks like so we can remove high outliers.
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
arrange(desc(distance_km))
head(CyclisticMergedDupRemoved$distance_km, n = 20)
## [1] 9814.02222 49.02223 40.85935 36.90903 32.22815 32.22256
## [7] 32.22119 31.60031 31.09615 30.34938 30.15654 29.52220
## [13] 28.76031 27.73879 26.84829 26.61580 26.55284 26.55284
## [19] 26.32181 26.31305
We see that there is only one extremely high distance value so we can remove it to avoid skew.
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
slice(-(1))
head(CyclisticMergedDupRemoved$distance_km, n = 10)
## [1] 49.02223 40.85935 36.90903 32.22815 32.22256 32.22119 31.60031 31.09615
## [9] 30.34938 30.15654
summary(CyclisticMergedDupRemoved$distance_km)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.921 1.594 2.158 2.798 49.022 302
Let’s also change member_casual to a factor for better analysis later on.
CyclisticMergedDupRemoved$member_casual <- as.factor(CyclisticMergedDupRemoved$member_casual)
class(CyclisticMergedDupRemoved$member_casual)
## [1] "factor"
nlevels(CyclisticMergedDupRemoved$member_casual)
## [1] 2
summary(CyclisticMergedDupRemoved$member_casual)
## casual member
## 2109065 3858217
For the final step, we will remove irrelevant columns such as the started_at and ended_at due to us already finding the time differences.
CyclisticCleaned <- CyclisticMergedDupRemoved %>%
select(-c(started_at, ended_at))
head(CyclisticCleaned)
## ride_id rideable_type start_station_name start_station_id
## 1 500867C891161912 electric_bike State St & Van Buren St TA1305000035
## 2 587C8B39F4D4F488 electric_bike St. Clair St & Erie St 13016
## 3 EFE648FE9A5670D5 electric_bike
## 4 49FC4504C708157A electric_bike Wabash Ave & Wacker Pl TA1307000131
## 5 21DFEEDFA4539B5F electric_bike Wabash Ave & Wacker Pl TA1307000131
## 6 D1D23BC897600D5B electric_bike
## end_station_name end_station_id start_lat start_lng end_lat end_lng
## 1 41.87727 -87.62792 42.07 -88.16
## 2 41.89441 -87.62271 41.65 -87.99
## 3 41.66000 -87.64000 41.97 -87.80
## 4 41.88679 -87.62625 42.15 -87.79
## 5 41.88685 -87.62624 42.15 -87.79
## 6 42.01000 -87.67000 41.72 -87.66
## member_casual ride_time_calc weekday month hour_started distance_km
## 1 casual 99.48333 Thursday June 20:00 49.02223
## 2 member 2.60000 Friday October 18:00 40.85935
## 3 casual 118.01667 Sunday June 06:00 36.90903
## 4 casual 115.76667 Sunday April 11:00 32.22815
## 5 casual 117.43333 Sunday April 11:00 32.22256
## 6 member 90.08333 Friday August 06:00 32.22119
summary(CyclisticCleaned)
## ride_id rideable_type start_station_name start_station_id
## Length:5967282 Length:5967282 Length:5967282 Length:5967282
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## end_station_name end_station_id start_lat start_lng
## Length:5967282 Length:5967282 Min. :41.63 Min. :-87.94
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.46
##
## end_lat end_lng member_casual ride_time_calc
## Min. :41.61 Min. :-88.16 casual:2109065 Min. : 1.000
## 1st Qu.:41.88 1st Qu.:-87.66 member:3858217 1st Qu.: 5.717
## Median :41.90 Median :-87.64 Median : 9.783
## Mean :41.90 Mean :-87.65 Mean : 14.049
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 17.100
## Max. :42.18 Max. :-87.44 Max. :120.000
## NA's :302 NA's :302
## weekday month hour_started distance_km
## Friday :863022 April : 811052 12:00 : 588123 Min. : 0.000
## Monday :781974 August : 746786 11:00 : 545428 1st Qu.: 0.921
## Saturday :932384 July : 740487 13:00 : 478015 Median : 1.594
## Sunday :771390 June : 695000 10:00 : 443645 Mean : 2.158
## Thursday :895378 September: 646904 09:00 : 373699 3rd Qu.: 2.798
## Tuesday :863811 May : 582422 08:00 : 354852 Max. :49.022
## Wednesday:859323 (Other) :1744631 (Other):3183520 NA's :302
With all that done, we will now move on to the analysis step!
**
Now it is time for my favorite step, analysis. Using the packages and libraries we installed above, we will be grouping the data to create visualizations for potential insights.
First, let’s begin by looking at the distribution of member types.
member_casual_view <- CyclisticCleaned %>%
group_by(member_casual) %>%
summarize(
count = n(),
percentage = round(count / nrow(CyclisticCleaned) * 100, 2),
.groups = "drop"
)
options(scipen = 999, repr.plot.width = 12, repr.plot.height = 8)
ggplot(member_casual_view, aes(x = "", y = count, fill = member_casual)) +
geom_col(width = 0.6, color = "white") +
geom_text(
aes(label = paste0(count, " (", percentage, "%)")),
position = position_stack(vjust = 0.5),
size = 5,
fontface = "bold"
) +
labs(
title = "Ride Nums. by User Class",
subtitle = "April 2023 - March 2024",
) +
theme_void(base_size = 17) +
coord_polar(theta = "y") +
scale_fill_manual(values = c("#FF6F61", "#6B5B95"), name = NULL) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(size = 12, hjust = 0.5),
legend.position = "right",
legend.spacing.x = unit(0.5, "cm")
)
There is a large difference in the number of casual vs. member users, approximately 1.75 million. Let’s delve deeper and separate each month by its rides and separate those rides by the membership status of users.
CyclisticCleaned$month <- ordered(CyclisticCleaned$month, levels =
c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "February", "March"))
member_casual_monthly <- CyclisticCleaned %>%
group_by(month, member_casual) %>%
summarize(count = n(), .groups = "drop") %>%
arrange(month, member_casual)
options(repr.plot.width = 12, repr.plot.height = 6)
custom_colors <- c("#FF6F61", "#6B5B95")
ggplot(data = member_casual_monthly, aes(x = month, y = count, fill = member_casual)) +
geom_col(position = "dodge", width = 0.7) +
labs(title = "Monthly Ride Counts by Membership Class", subtitle = "(April 2023 - March 2024)", x = "Month", y = "Num. of Rides") +
scale_fill_manual(values = custom_colors, name = "User Type:") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "top",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10))
There are a few key takeaways from this data:
Let’s now look deeper through the popularity of rides through each weekday.
CyclisticCleaned$weekday <- ordered(CyclisticCleaned$weekday, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
weekday_member_casual <- CyclisticCleaned %>%
group_by(weekday, member_casual) %>%
summarize(count = n(), .groups = "drop") %>%
arrange(weekday, member_casual)
custom_colors <- c("#FF6F61", "#6B5B95")
ggplot(data = weekday_member_casual, aes(x = weekday, y = count, fill = member_casual), show.legend = FALSE) +
geom_col(position = "dodge", width = .8) +
labs(title = "Ride Counts Each Day of the Week", x = "Day", y = "Num. of Rides", subtitle = "(April 2023 - March 2024)") +
scale_fill_manual(values = custom_colors, name = "User Class") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "top",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10))
As we can see through the data, throughout the work week the number of members dominates the number of rides. During the weekends, the number of casuals increases and almost reaches the number of members.
This makes sense as those with set schedules would likely plan to use these bikes and thus secure a membership while weekend plans are more spontaneous and encourages tourism, leading to a shift in the numbers.
Let’s further examine the times that these bikes are used during each weekday.
hourly_member_casual_weekday <- CyclisticCleaned%>%
filter(weekday %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")) %>%
group_by(hour_started, member_casual)%>%
summarize(count = n(), .groups = "drop")%>%
arrange(hour_started, member_casual)
custom_colors <- c("#FF6F61", "#6B5B95")
ggplot(data = hourly_member_casual_weekday, aes(x = hour_started, y = count, group = member_casual, color = member_casual)) +
geom_point() +
geom_line() +
scale_color_manual(values = custom_colors) +
labs(title = "Ride Count by Time of Day Weekdays", subtitle = "(April 2023 - March 2024)", x = "Hour", y = "Num. of Rides") +
guides(color = guide_legend(title = "User Class")) +
theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "top",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10))
hourly_member_casual_weekend <- CyclisticCleaned %>%
filter(weekday %in% c("Saturday", "Sunday")) %>%
group_by(hour_started, member_casual) %>%
summarize(count = n(), .groups = "drop") %>%
arrange(hour_started, member_casual)
custom_colors <- c("#FF6F61", "#6B5B95")
ggplot(data = hourly_member_casual_weekend, aes(x = hour_started, y = count, group = member_casual, color = member_casual)) +
geom_point() +
geom_line() +
scale_color_manual(values = custom_colors) +
labs(title = "Ride Count by Time of Day Weekends", subtitle = "(April 2023 - March 2024)", x = "Hour", y = "Num. of Rides") +
guides(color = guide_legend(title = "User Class")) +
theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "top",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10))
As we can see, the weekends tend to be more rounded, indicating a loose structure in when bikes are used the most, while the weekdays are more rigid and sharp, indicating scheduled times. We can also see that during the weekdays, 12:00 seems to be a peak.
Now let’s see which bike type is more favored by member types. Note, docked bikes were later changed to be included into classic bikes.
casual_classic <- CyclisticCleaned %>%
filter(member_casual == "casual" & (rideable_type == "classic_bike" | rideable_type == "docked_bike"))
casual_electric <- CyclisticCleaned %>%
filter(member_casual == "casual" & rideable_type == "electric_bike")
count_casual_classic <- nrow(casual_classic)
count_casual_electric <- nrow(casual_electric)
total_casual_riders <- count_casual_classic + count_casual_electric
pie_data <- data.frame(
category = c("Classic", "Electric Bike"),
count = c(count_casual_classic, count_casual_electric),
percentage = c(count_casual_classic/total_casual_riders, count_casual_electric/total_casual_riders)
)
ggplot(pie_data, aes(x = "", y = count, fill = category)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(count, " (", round(percentage * 100), "%)")), position = position_stack(vjust = 0.5), fontface = "bold") +
labs(title = "Casual Riders by Bike Type") +
scale_fill_manual(values = c("#0077FF", "#FF7700"), name = "Bike Type") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))
member_classic <- CyclisticCleaned %>%
filter(member_casual == "member" & (rideable_type == "classic_bike" | rideable_type == "docked_bike"))
member_electric <- CyclisticCleaned %>%
filter(member_casual == "member" & rideable_type == "electric_bike")
count_member_classic <- nrow(member_classic)
count_member_electric <- nrow(member_electric)
total_member_riders <- count_member_classic + count_member_electric
pie_data <- data.frame(
category = c("Classic", "Electric Bike"),
count = c(count_member_classic, count_member_electric),
percentage = c(count_member_classic/total_member_riders, count_member_electric/total_member_riders)
)
ggplot(pie_data, aes(x = "", y = count, fill = category)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(count, " (", round(percentage * 100), "%)")), position = position_stack(vjust = 0.5), fontface = "bold") +
labs(title = "Member Riders by Bike Type") +
scale_fill_manual(values = c("#00CC66", "#9966FF"), name = "Bike Type") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))
They seem to be relatively similar across the board. Let’s see if comparing ride lengths for both member types on each bike type can offer better insights.
casual_bikes <- CyclisticCleaned %>%
filter(member_casual == "casual") %>%
mutate(bike_type = ifelse(rideable_type %in% c("classic_bike", "docked_bike"), "classic/docked", "electric_bike")) %>%
group_by(hour_started, bike_type) %>%
summarize(average_ride_time = mean(ride_time_calc), .groups = "drop") %>%
arrange(hour_started, bike_type)
custom_colors <- c("classic/docked" = "#0077FF", "electric_bike" = "#FF7700")
ggplot(data = casual_bikes, aes(x = hour_started, y = average_ride_time, group = bike_type, color = bike_type)) +
geom_point() +
geom_line() +
scale_color_manual(values = custom_colors) +
labs(title = "Average Ride Time by Hour Started for Casual Riders",
subtitle = "(April 2023 - March 2024)",
x = "Hour",
y = "Average Ride Time (minutes)",
color = "Bike Type") +
theme_minimal() +
theme(text = element_text(size = 12),
axis.text.x = element_text(angle = 90),
plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "top",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10))
member_bikes <- CyclisticCleaned %>%
filter(member_casual == "member") %>%
mutate(bike_type = ifelse(rideable_type %in% c("classic_bike", "docked_bike"), "classic/docked", "electric_bike")) %>%
group_by(hour_started, bike_type) %>%
summarize(average_ride_time = mean(ride_time_calc), .groups = "drop") %>%
arrange(hour_started, bike_type)
custom_colors <- c("classic/docked" = "#00CC66", "electric_bike" = "#9966FF")
ggplot(data = member_bikes, aes(x = hour_started, y = average_ride_time, group = bike_type, color = bike_type)) +
geom_point() +
geom_line() +
scale_color_manual(values = custom_colors) +
labs(title = "Average Ride Time by Hour Started for Members",
subtitle = "(April 2023 - March 2024)",
x = "Hour",
y = "Average Ride Time (minutes)",
color = "Bike Type") +
theme_minimal() +
theme(text = element_text(size = 12),
axis.text.x = element_text(angle = 90),
plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.title = element_text(size = 12),
legend.position = "top",
legend.title = element_text(size = 12),
legend.text = element_text(size = 10))
Interestingly enough, it seems casual riders that chose to use classic bikes would spend much more time to electric bike casual riders than member riders. This can be especially seen between 5-10 am.
One possible cause is that members are more experienced with riding bikes and thus can maintain a higher speed. Both membership types were consistent in showing classic bikes being slower than electric, which makes sense due to their differences in speed.
Lastly, let’s take a look at the geospatial differences between starting points of casual riders and member riders for round trips. We will be using the leaflet library and we will first separate and summarize distances where starting and ending were the same and different respectively before combining them to create a display.
Since we made the coordinate columns factors, we can group them by these coordinates and remove any that are less than specified values to make visualization easier and more efficient.
distance_summarized <- CyclisticCleaned %>%
drop_na(distance_km)
distance_0 <- distance_summarized %>%
filter(distance_km==0)
route_summary_0 <- distance_0 %>%
group_by(start_lng, start_lat, end_lng, end_lat, member_casual) %>%
summarize(total = n(), .groups="drop") %>%
filter(total>75) %>%
arrange(desc(total))
route_summarized <- distance_summarized %>%
filter(!distance_km==0) %>%
group_by(start_lng, start_lat, end_lng, end_lat, member_casual) %>%
summarize(total = n(), .groups = "drop") %>%
filter(total>100) %>%
arrange(desc(total))
data_for_clustering <- bind_rows(route_summary_0, route_summarized)
normalize_radius <- function(total, min_radius=2, max_radius=20) {
min_total <- min(total)
max_total <- max(total)
normalized <- ((total - min_total) / (max_total - min_total)) * (max_radius - min_radius) + min_radius
return(normalized)
}
casual_data <- data_for_clustering %>%
filter(member_casual=="casual") %>%
mutate(radius = normalize_radius(total))
leaflet(data = casual_data) %>%
addTiles() %>%
addCircleMarkers(~start_lng, ~start_lat,
radius = ~radius,
fillOpacity = .3,
stroke = TRUE,
color = "#FF6F61")
member_data <- data_for_clustering %>%
filter(member_casual == "member") %>%
mutate(radius = normalize_radius(total))
leaflet(data = member_data) %>%
addTiles() %>%
addCircleMarkers(
~start_lng, ~start_lat,
radius = ~radius,
fillOpacity = 0.3,
stroke = TRUE,
color = "#6B5B95"
)
It seems that for both user types, the area near Lake Michigan seems to be the most used starting area with casual members having large amounts of users starting in areas with high tourism such as ports.
Through these visualizations, we have determined several key insights into the differences of behaviors between casual and member users. Now it is time to summarize these findings and present our data-driven recommendations for a targeted marketing campaign into a digestable manner.
Create a marketing campaign and fitness reward program targeted at casual users. A significant portion of these riders see Cyclistic as a fun and cost-effective way to explore the city and get exercise done during the weekends and other holidays. By marketing Cyclistic as THE platform for introducing people to the world of biking, we can push more users to try biking through our platform and convert to a membership to hit goals. We can also add a rewards plan and refer-a-friend bonuses so people feel more inclined to continue to use Cyclistic. Furthermore, as these users tend to explore and start from tourist heavy areas in downtown and the Lake Michigan area, advertisements positioned in this locations would be most impactful. To amplify these efforts, we can explore collaborations with popular tourist influencers to better promote our platform and appeal for this target audience.
Highlight the cost effectiveness of member plans vs. casual riders who frequent Cyclistic as their main mode of commuting transportation through weekly metrics. Currently, there exists a small subset of casual riders whose behavior mimics members very closely during weekdays in terms of ride times. Despite this, they seem to continue to rely on casual membership plans. By underscoring the potential savings and additional benefits offered by member plans, such as lower rates per ride and access to exclusive features, this initiative aims to encourage these riders to consider upgrading to membership plans, enhancing user satisfaction and increasing the membership base.
Introduce seasonal offerings, marketing campaigns, and membership types for both casual and existing members. Our data shows that summer months, particularly April and May-August, have the highest number of riders from both user types. We can leverage this into creating a seasonal membership option with additional benefits that is only active during these months so casual riders not wanting to commit to an annual membership will be more likely to switch to a seasonal membership. These benefits could include discounted rates, bonus ride credits, or exclusive access to summer events or promotions. Additionally, we can use the knowledge that the influx of casual riders in summer months likely arises from tourists and offer them a tourist membership bundled with benefits with other attractions. To effectively promote these offerings, customized marketing campaigns will be implemented, utilizing targeted advertising on social media platforms, email newsletters, and partnerships with local tourist attractions. By implementing these strategies, Cyclistic can effectively cater to the preferences and behaviors of its diverse user base, drive membership conversion, and enhance overall user satisfaction.
That about wraps up this case study. I had a lot of fun making this and would love to hear any feedback from anyone who happened to make it to the end. Thanks!