INTRODUCTION

Hey!

This is my take on the capstone project for Google’s Data Analytics Professional Certificate. As part of the final course, I will be showcasing all the skills I learned throughout the program into a practice case study. Using a real-world business problem, I will be tackling it with the Ask, Prepare, Process, Share, and Act methodology. Any feedback would be greatly appreciated!


SCENARIO

You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Characters and teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.

Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals—as well as how you, as a junior data analyst, can help Cyclistic achieve them.

Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.


ASK

Three questions will guide the future marketing program:

  1. How do annual members and casual riders use Cyclistic bikes differently?

  2. Why would casual riders buy Cyclistic annual memberships?

  3. How can Cyclistic use digital media to influence casual riders to become members?

Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?


DELIVERABLES

  1. A clear statement of the business task

  2. A description of all data sources used

  3. Documentation of any cleaning or manipulation of data

  4. A summary of your analysis

  5. Supporting visualizations and key findings

  6. Your top three recommendations based on your analysis


Let’s get started!


PREPARATION

The data that will be used in this analysis is provided by Divvy, a real bike-share company, from which Cyclistic is based on and is organized into csv files by month. Details regarding its public use can be found on Divvy’s License Agreement.

To keep this analysis as real as possible, I will be using the last 12 months of data available to me (at this time is April 2023 - March 2024).


PROCESSING

We will first begin by loading in the libraries needed for analyzing Cyclistic’s historical data.

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(geosphere) 
library(lubridate)
library(leaflet)


Next we will import the 12 most recent datasets (April 2023 - April 2024).

CyclisticData202304 <- read.csv("~/Data/202304-divvy-tripdata.csv")
CyclisticData202305 <- read.csv("~/Data/202305-divvy-tripdata.csv")
CyclisticData202306 <- read.csv("~/Data/202306-divvy-tripdata.csv")
CyclisticData202307 <- read.csv("~/Data/202307-divvy-tripdata.csv")
CyclisticData202308 <- read.csv("~/Data/202308-divvy-tripdata.csv")
CyclisticData202309 <- read.csv("~/Data/202309-divvy-tripdata.csv")
CyclisticData202310 <- read.csv("~/Data/202310-divvy-tripdata.csv")
CyclisticData202311 <- read.csv("~/Data/202311-divvy-tripdata.csv")
CyclisticData202312 <- read.csv("~/Data/202312-divvy-tripdata.csv")
CyclisticData202401 <- read.csv("~/Data/202401-divvy-tripdata.csv")
CyclisticData202402 <- read.csv("~/Data/202402-divvy-tripdata.csv")
CyclisticData202403 <- read.csv("~/Data/202403-divvy-tripdata.csv")
CyclisticData202404 <- read.csv("~/Data/202404-divvy-tripdata.csv")

After examining the data, we see that the relative structure of columns and data types between each report remains consistent. Therefore, we can combine these datasets into a new dataframe.

CyclisticMerged <- rbind(CyclisticData202304, CyclisticData202305, CyclisticData202306, CyclisticData202307, CyclisticData202308, CyclisticData202309, CyclisticData202310, CyclisticData202311, CyclisticData202312, CyclisticData202401, CyclisticData202402, CyclisticData202403, CyclisticData202404)

head(CyclisticMerged)
##            ride_id rideable_type          started_at            ended_at
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.80
## 2                                                                         41.87
## 3                                                                         41.93
## 4                                                                         41.92
## 5                                                                         41.91
## 6                                                                         41.91
##   start_lng end_lat end_lng member_casual
## 1    -87.60   41.79  -87.60        member
## 2    -87.65   41.93  -87.68        member
## 3    -87.66   41.93  -87.66        member
## 4    -87.65   41.91  -87.65        member
## 5    -87.65   41.91  -87.63        member
## 6    -87.63   41.92  -87.65        member

As we can see our merged dataset was a success. Before we begin cleaning the data, let’s get some important metrics about our dataset.

glimpse(CyclisticMerged)
## Rows: 6,165,202
## Columns: 13
## $ ride_id            <chr> "8FE8F7D9C10E88C7", "34E4ED3ADF1D821B", "5296BF07A2…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2023-04-02 08:37:28", "2023-04-19 11:29:02", "2023…
## $ ended_at           <chr> "2023-04-02 08:41:37", "2023-04-19 11:52:12", "2023…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.80, 41.87, 41.93, 41.92, 41.91, 41.91, 41.93, 42…
## $ start_lng          <dbl> -87.60, -87.65, -87.66, -87.65, -87.65, -87.63, -87…
## $ end_lat            <dbl> 41.79, 41.93, 41.93, 41.91, 41.91, 41.92, 41.91, 41…
## $ end_lng            <dbl> -87.60, -87.68, -87.66, -87.65, -87.63, -87.65, -87…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
summary(CyclisticMerged)
##    ride_id          rideable_type       started_at          ended_at        
##  Length:6165202     Length:6165202     Length:6165202     Length:6165202    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:6165202     Length:6165202     Length:6165202     Length:6165202    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.63   Min.   :-87.94   Min.   : 0.00   Min.   :-88.16  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :42.07   Max.   :-87.46   Max.   :42.18   Max.   :  0.00  
##                                   NA's   :8045    NA's   :8045    
##  member_casual     
##  Length:6165202    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Key takeaways:

However, there is one big concern we need to address. Namely, rides are not all grouped by the rider_id created, so we cannot measure how frequently the person behind the ID(s) uses it.

With this in mind, let’s move on to the cleaning step.


CLEANING

Let’s start by using the distinct function to remove duplicate rider_ids while preserving the columns.

CyclisticMergedDupRemoved <- CyclisticMerged %>%
  distinct(ride_id, .keep_all = TRUE)

print(paste("Removed", nrow(CyclisticMerged)-nrow(CyclisticMergedDupRemoved), "duplicate rows"))
## [1] "Removed 0 duplicate rows"

Now let’s create a new column to measure the ride time by finding the difference between the end and start time columns.

CyclisticMergedDupRemoved$ride_time_calc <- (as.double(difftime(CyclisticMerged$ended_at, CyclisticMerged$started_at))) / 60

Now let’s see a summary of the data:

summary(CyclisticMergedDupRemoved$ride_time_calc)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -16716.52      5.47      9.60     18.29     17.07  98489.07

Now from our newly created ride_time_calc column, we should remove any outliers such as rides under 60 seconds and any especially lengthy rides so that we do not severely skew our data. Let’s first see how our data is skewed and move on from there.

quantile_ride_time_calc = quantile(CyclisticMergedDupRemoved$ride_time_calc, seq(0, 1, by=0.01))
quantile_ride_time_calc
##            0%            1%            2%            3%            4% 
## -1.671652e+04  2.833333e-01  7.000000e-01  1.333333e+00  1.833333e+00 
##            5%            6%            7%            8%            9% 
##  2.166667e+00  2.433333e+00  2.666667e+00  2.866667e+00  3.050000e+00 
##           10%           11%           12%           13%           14% 
##  3.233333e+00  3.400000e+00  3.566667e+00  3.716667e+00  3.883333e+00 
##           15%           16%           17%           18%           19% 
##  4.033333e+00  4.166667e+00  4.316667e+00  4.466667e+00  4.616667e+00 
##           20%           21%           22%           23%           24% 
##  4.750000e+00  4.900000e+00  5.033333e+00  5.183333e+00  5.333333e+00 
##           25%           26%           27%           28%           29% 
##  5.466667e+00  5.616667e+00  5.766667e+00  5.900000e+00  6.050000e+00 
##           30%           31%           32%           33%           34% 
##  6.200000e+00  6.350000e+00  6.500000e+00  6.650000e+00  6.816667e+00 
##           35%           36%           37%           38%           39% 
##  6.966667e+00  7.116667e+00  7.283333e+00  7.450000e+00  7.600000e+00 
##           40%           41%           42%           43%           44% 
##  7.766667e+00  7.950000e+00  8.116667e+00  8.283333e+00  8.466667e+00 
##           45%           46%           47%           48%           49% 
##  8.650000e+00  8.833333e+00  9.016667e+00  9.216667e+00  9.400000e+00 
##           50%           51%           52%           53%           54% 
##  9.600000e+00  9.800000e+00  1.001667e+01  1.021667e+01  1.043333e+01 
##           55%           56%           57%           58%           59% 
##  1.066667e+01  1.088333e+01  1.111667e+01  1.136667e+01  1.161667e+01 
##           60%           61%           62%           63%           64% 
##  1.186667e+01  1.213333e+01  1.240000e+01  1.268333e+01  1.298333e+01 
##           65%           66%           67%           68%           69% 
##  1.328333e+01  1.358333e+01  1.391667e+01  1.425000e+01  1.460000e+01 
##           70%           71%           72%           73%           74% 
##  1.496667e+01  1.535000e+01  1.575000e+01  1.616667e+01  1.660000e+01 
##           75%           76%           77%           78%           79% 
##  1.706667e+01  1.755000e+01  1.805000e+01  1.860000e+01  1.916667e+01 
##           80%           81%           82%           83%           84% 
##  1.976667e+01  2.041667e+01  2.111667e+01  2.186667e+01  2.266667e+01 
##           85%           86%           87%           88%           89% 
##  2.355000e+01  2.448333e+01  2.553333e+01  2.670000e+01  2.800000e+01 
##           90%           91%           92%           93%           94% 
##  2.946667e+01  3.115000e+01  3.310000e+01  3.545000e+01  3.826667e+01 
##           95%           96%           97%           98%           99% 
##  4.183333e+01  4.675000e+01  5.453333e+01  6.815000e+01  9.975000e+01 
##          100% 
##  9.848907e+04

Seeing this, we can see that we should remove any values less than 1 minute, and any values greater than 120 minutes to avoid skew

CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
  filter(ride_time_calc >=1) %>%
  filter(ride_time_calc <=120)

summary(CyclisticMergedDupRemoved$ride_time_calc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.717   9.783  14.049  17.100 120.000

Now we can create a two new columns to show what day of the week each ride was and what month it took place.

CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
  mutate(weekday = strftime(CyclisticMergedDupRemoved$started_at, "%A"))

CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
  mutate(month = strftime(CyclisticMergedDupRemoved$started_at, "%B"))

head(CyclisticMergedDupRemoved)
##            ride_id rideable_type          started_at            ended_at
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.80
## 2                                                                         41.87
## 3                                                                         41.93
## 4                                                                         41.92
## 5                                                                         41.91
## 6                                                                         41.91
##   start_lng end_lat end_lng member_casual ride_time_calc   weekday month
## 1    -87.60   41.79  -87.60        member       4.150000    Sunday April
## 2    -87.65   41.93  -87.68        member      23.166667 Wednesday April
## 3    -87.66   41.93  -87.66        member       2.000000 Wednesday April
## 4    -87.65   41.91  -87.65        member       3.650000 Wednesday April
## 5    -87.65   41.91  -87.63        member       4.833333 Wednesday April
## 6    -87.63   41.92  -87.65        member       4.066667 Wednesday April

We can then change this new column to a factor for faster analysis.

CyclisticMergedDupRemoved$weekday <- as.factor(CyclisticMergedDupRemoved$weekday)

CyclisticMergedDupRemoved$month <-
  as.factor(CyclisticMergedDupRemoved$month)

class(CyclisticMergedDupRemoved$weekday)
## [1] "factor"
nlevels(CyclisticMergedDupRemoved$weekday)
## [1] 7
summary(CyclisticMergedDupRemoved$weekday)
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##    863022    781974    932384    771390    895379    863811    859323
class(CyclisticMergedDupRemoved$month)
## [1] "factor"
nlevels(CyclisticMergedDupRemoved$month)
## [1] 12
summary(CyclisticMergedDupRemoved$month)
##     April    August  December  February   January      July      June     March 
##    811052    746786    218246    217885    139529    740487    695001    293432 
##       May  November   October September 
##    582422    353484    522055    646904

Let’s also add a column that stores the start time of these rides.

CyclisticMergedDupRemoved$started_at <- ymd_hms(CyclisticMergedDupRemoved$started_at)

CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
  mutate(hour_started = floor_date(started_at, unit = "hour"))
CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
  mutate(hour_started = strftime(CyclisticMergedDupRemoved$hour_started, "%R"))

head(CyclisticMergedDupRemoved)
##            ride_id rideable_type          started_at            ended_at
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.80
## 2                                                                         41.87
## 3                                                                         41.93
## 4                                                                         41.92
## 5                                                                         41.91
## 6                                                                         41.91
##   start_lng end_lat end_lng member_casual ride_time_calc   weekday month
## 1    -87.60   41.79  -87.60        member       4.150000    Sunday April
## 2    -87.65   41.93  -87.68        member      23.166667 Wednesday April
## 3    -87.66   41.93  -87.66        member       2.000000 Wednesday April
## 4    -87.65   41.91  -87.65        member       3.650000 Wednesday April
## 5    -87.65   41.91  -87.63        member       4.833333 Wednesday April
## 6    -87.63   41.92  -87.65        member       4.066667 Wednesday April
##   hour_started
## 1        03:00
## 2        06:00
## 3        03:00
## 4        08:00
## 5        07:00
## 6        07:00

Likewise, let’s make this into a factor for faster analysis later on. Note, NA’s occur for starting stations due to electric bikes as they do not need to be docked.

CyclisticMergedDupRemoved$hour_started <- as.factor(CyclisticMergedDupRemoved$hour_started)

class(CyclisticMergedDupRemoved$hour_started)
## [1] "factor"
nlevels(CyclisticMergedDupRemoved$hour_started)
## [1] 24
summary(CyclisticMergedDupRemoved$hour_started)
##  00:00  01:00  02:00  03:00  04:00  05:00  06:00  07:00  08:00  09:00  10:00 
##  64686 164208 273398 311507 243032 253594 307059 348156 354852 373699 443645 
##  11:00  12:00  13:00  14:00  15:00  16:00  17:00  18:00  19:00  20:00  21:00 
## 545428 588123 478015 343576 246006 195066 153651 103296  70012  42942  25359 
##  22:00  23:00 
##  16325  21648

Lastly, let’s create a new column to measure the distance between start and ending coordinates. Note, some rows contain NA for ending station and starting due to electric bikes not being required to be docked.

CyclisticMergedDupRemoved$distance_km <-
  distGeo(matrix(c(CyclisticMergedDupRemoved$start_lng, CyclisticMergedDupRemoved$start_lat), ncol=2), matrix(c(CyclisticMergedDupRemoved$end_lng, CyclisticMergedDupRemoved$end_lat), ncol=2)) / 1000

summary(CyclisticMergedDupRemoved$distance_km)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.921    1.594    2.160    2.798 9814.022      302

We see that the max is particularly high, let’s see what the head of a descended sort of this data looks like so we can remove high outliers.

CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
  arrange(desc(distance_km))
head(CyclisticMergedDupRemoved$distance_km, n = 20)
##  [1] 9814.02222   49.02223   40.85935   36.90903   32.22815   32.22256
##  [7]   32.22119   31.60031   31.09615   30.34938   30.15654   29.52220
## [13]   28.76031   27.73879   26.84829   26.61580   26.55284   26.55284
## [19]   26.32181   26.31305

We see that there is only one extremely high distance value so we can remove it to avoid skew.

CyclisticMergedDupRemoved <- CyclisticMergedDupRemoved %>%
  slice(-(1))

head(CyclisticMergedDupRemoved$distance_km, n = 10)
##  [1] 49.02223 40.85935 36.90903 32.22815 32.22256 32.22119 31.60031 31.09615
##  [9] 30.34938 30.15654
summary(CyclisticMergedDupRemoved$distance_km)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.921   1.594   2.158   2.798  49.022     302

Let’s also change member_casual to a factor for better analysis later on.

CyclisticMergedDupRemoved$member_casual <- as.factor(CyclisticMergedDupRemoved$member_casual)
 
 class(CyclisticMergedDupRemoved$member_casual)
## [1] "factor"
 nlevels(CyclisticMergedDupRemoved$member_casual)
## [1] 2
 summary(CyclisticMergedDupRemoved$member_casual)
##  casual  member 
## 2109065 3858217

For the final step, we will remove irrelevant columns such as the started_at and ended_at due to us already finding the time differences.

CyclisticCleaned <- CyclisticMergedDupRemoved %>%
  select(-c(started_at, ended_at))

head(CyclisticCleaned)
##            ride_id rideable_type      start_station_name start_station_id
## 1 500867C891161912 electric_bike State St & Van Buren St     TA1305000035
## 2 587C8B39F4D4F488 electric_bike  St. Clair St & Erie St            13016
## 3 EFE648FE9A5670D5 electric_bike                                         
## 4 49FC4504C708157A electric_bike  Wabash Ave & Wacker Pl     TA1307000131
## 5 21DFEEDFA4539B5F electric_bike  Wabash Ave & Wacker Pl     TA1307000131
## 6 D1D23BC897600D5B electric_bike                                         
##   end_station_name end_station_id start_lat start_lng end_lat end_lng
## 1                                  41.87727 -87.62792   42.07  -88.16
## 2                                  41.89441 -87.62271   41.65  -87.99
## 3                                  41.66000 -87.64000   41.97  -87.80
## 4                                  41.88679 -87.62625   42.15  -87.79
## 5                                  41.88685 -87.62624   42.15  -87.79
## 6                                  42.01000 -87.67000   41.72  -87.66
##   member_casual ride_time_calc  weekday   month hour_started distance_km
## 1        casual       99.48333 Thursday    June        20:00    49.02223
## 2        member        2.60000   Friday October        18:00    40.85935
## 3        casual      118.01667   Sunday    June        06:00    36.90903
## 4        casual      115.76667   Sunday   April        11:00    32.22815
## 5        casual      117.43333   Sunday   April        11:00    32.22256
## 6        member       90.08333   Friday  August        06:00    32.22119
summary(CyclisticCleaned)
##    ride_id          rideable_type      start_station_name start_station_id  
##  Length:5967282     Length:5967282     Length:5967282     Length:5967282    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5967282     Length:5967282     Min.   :41.63   Min.   :-87.94  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.46  
##                                                                        
##     end_lat         end_lng       member_casual    ride_time_calc   
##  Min.   :41.61   Min.   :-88.16   casual:2109065   Min.   :  1.000  
##  1st Qu.:41.88   1st Qu.:-87.66   member:3858217   1st Qu.:  5.717  
##  Median :41.90   Median :-87.64                    Median :  9.783  
##  Mean   :41.90   Mean   :-87.65                    Mean   : 14.049  
##  3rd Qu.:41.93   3rd Qu.:-87.63                    3rd Qu.: 17.100  
##  Max.   :42.18   Max.   :-87.44                    Max.   :120.000  
##  NA's   :302     NA's   :302                                        
##       weekday             month          hour_started      distance_km    
##  Friday   :863022   April    : 811052   12:00  : 588123   Min.   : 0.000  
##  Monday   :781974   August   : 746786   11:00  : 545428   1st Qu.: 0.921  
##  Saturday :932384   July     : 740487   13:00  : 478015   Median : 1.594  
##  Sunday   :771390   June     : 695000   10:00  : 443645   Mean   : 2.158  
##  Thursday :895378   September: 646904   09:00  : 373699   3rd Qu.: 2.798  
##  Tuesday  :863811   May      : 582422   08:00  : 354852   Max.   :49.022  
##  Wednesday:859323   (Other)  :1744631   (Other):3183520   NA's   :302

With all that done, we will now move on to the analysis step!


ANALYSIS

**

Now it is time for my favorite step, analysis. Using the packages and libraries we installed above, we will be grouping the data to create visualizations for potential insights.

First, let’s begin by looking at the distribution of member types.

member_casual_view <- CyclisticCleaned %>%
  group_by(member_casual) %>%
  summarize(
    count = n(),
    percentage = round(count / nrow(CyclisticCleaned) * 100, 2),
    .groups = "drop"
  )

options(scipen = 999, repr.plot.width = 12, repr.plot.height = 8)

ggplot(member_casual_view, aes(x = "", y = count, fill = member_casual)) +
  geom_col(width = 0.6, color = "white") +
  geom_text(
    aes(label = paste0(count, " (", percentage, "%)")),
    position = position_stack(vjust = 0.5),
    size = 5,
    fontface = "bold"
  ) +
  labs(
    title = "Ride Nums. by User Class",
    subtitle = "April 2023 - March 2024",
  ) +
  theme_void(base_size = 17) +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("#FF6F61", "#6B5B95"), name = NULL) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    legend.position = "right",
    legend.spacing.x = unit(0.5, "cm")
  )

There is a large difference in the number of casual vs. member users, approximately 1.75 million. Let’s delve deeper and separate each month by its rides and separate those rides by the membership status of users.

CyclisticCleaned$month <- ordered(CyclisticCleaned$month, levels = 
                                    c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "February", "March"))

member_casual_monthly <- CyclisticCleaned %>%
  group_by(month, member_casual) %>%
  summarize(count = n(), .groups = "drop") %>%
  arrange(month, member_casual)

options(repr.plot.width = 12, repr.plot.height = 6)

custom_colors <- c("#FF6F61", "#6B5B95")

ggplot(data = member_casual_monthly, aes(x = month, y = count, fill = member_casual)) +
  geom_col(position = "dodge", width = 0.7) +
  labs(title = "Monthly Ride Counts by Membership Class", subtitle = "(April 2023 - March 2024)", x = "Month", y = "Num. of Rides") +
  scale_fill_manual(values = custom_colors, name = "User Type:") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
        axis.title = element_text(size = 12),
        legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

There are a few key takeaways from this data:

Let’s now look deeper through the popularity of rides through each weekday.

CyclisticCleaned$weekday <- ordered(CyclisticCleaned$weekday, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

weekday_member_casual <- CyclisticCleaned %>%
  group_by(weekday, member_casual) %>%
  summarize(count = n(), .groups = "drop") %>%
  arrange(weekday, member_casual)

custom_colors <- c("#FF6F61", "#6B5B95")

ggplot(data = weekday_member_casual, aes(x = weekday, y = count, fill = member_casual), show.legend = FALSE) +
  geom_col(position = "dodge", width = .8) +
  labs(title = "Ride Counts Each Day of the Week", x = "Day", y = "Num. of Rides", subtitle = "(April 2023 - March 2024)") +
  scale_fill_manual(values = custom_colors, name = "User Class") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
        axis.title = element_text(size = 12),
        legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

As we can see through the data, throughout the work week the number of members dominates the number of rides. During the weekends, the number of casuals increases and almost reaches the number of members.

This makes sense as those with set schedules would likely plan to use these bikes and thus secure a membership while weekend plans are more spontaneous and encourages tourism, leading to a shift in the numbers.

Let’s further examine the times that these bikes are used during each weekday.

hourly_member_casual_weekday <- CyclisticCleaned%>%
  filter(weekday %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")) %>%
  group_by(hour_started, member_casual)%>%
  summarize(count = n(), .groups = "drop")%>%
  arrange(hour_started, member_casual)

custom_colors <- c("#FF6F61", "#6B5B95")

ggplot(data = hourly_member_casual_weekday, aes(x = hour_started, y = count, group = member_casual, color = member_casual)) + 
  geom_point() + 
  geom_line() + 
  scale_color_manual(values = custom_colors) +
  labs(title = "Ride Count by Time of Day Weekdays", subtitle = "(April 2023 - March 2024)", x = "Hour", y = "Num. of Rides") + 
  guides(color = guide_legend(title = "User Class")) + 
  theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
        axis.title = element_text(size = 12),
        legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

hourly_member_casual_weekend <- CyclisticCleaned %>%
  filter(weekday %in% c("Saturday", "Sunday")) %>%
  group_by(hour_started, member_casual) %>%
  summarize(count = n(), .groups = "drop") %>%
  arrange(hour_started, member_casual)

custom_colors <- c("#FF6F61", "#6B5B95")

ggplot(data = hourly_member_casual_weekend, aes(x = hour_started, y = count, group = member_casual, color = member_casual)) + 
  geom_point() + 
  geom_line() +
  scale_color_manual(values = custom_colors) +
  labs(title = "Ride Count by Time of Day Weekends", subtitle = "(April 2023 - March 2024)", x = "Hour", y = "Num. of Rides") + 
  guides(color = guide_legend(title = "User Class")) + 
  theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5, size = 16, face = "bold"), plot.subtitle = element_text(hjust = .5, size = 12, face = "bold"),
        axis.title = element_text(size = 12),
        legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

As we can see, the weekends tend to be more rounded, indicating a loose structure in when bikes are used the most, while the weekdays are more rigid and sharp, indicating scheduled times. We can also see that during the weekdays, 12:00 seems to be a peak.

Now let’s see which bike type is more favored by member types. Note, docked bikes were later changed to be included into classic bikes.

casual_classic <- CyclisticCleaned %>%
  filter(member_casual == "casual" & (rideable_type == "classic_bike" | rideable_type == "docked_bike"))

casual_electric <- CyclisticCleaned %>%
  filter(member_casual == "casual" & rideable_type == "electric_bike")

count_casual_classic <- nrow(casual_classic)
count_casual_electric <- nrow(casual_electric)

total_casual_riders <- count_casual_classic + count_casual_electric

pie_data <- data.frame(
  category = c("Classic", "Electric Bike"),
  count = c(count_casual_classic, count_casual_electric),
  percentage = c(count_casual_classic/total_casual_riders, count_casual_electric/total_casual_riders)
)

ggplot(pie_data, aes(x = "", y = count, fill = category)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(count, " (", round(percentage * 100), "%)")), position = position_stack(vjust = 0.5), fontface = "bold") +
  labs(title = "Casual Riders by Bike Type") +
  scale_fill_manual(values = c("#0077FF", "#FF7700"), name = "Bike Type") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))

member_classic <- CyclisticCleaned %>%
  filter(member_casual == "member" & (rideable_type == "classic_bike" | rideable_type == "docked_bike"))

member_electric <- CyclisticCleaned %>%
  filter(member_casual == "member" & rideable_type == "electric_bike")

count_member_classic <- nrow(member_classic)
count_member_electric <- nrow(member_electric)

total_member_riders <- count_member_classic + count_member_electric

pie_data <- data.frame(
  category = c("Classic", "Electric Bike"),
  count = c(count_member_classic, count_member_electric),
  percentage = c(count_member_classic/total_member_riders, count_member_electric/total_member_riders)
)

ggplot(pie_data, aes(x = "", y = count, fill = category)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(count, " (", round(percentage * 100), "%)")), position = position_stack(vjust = 0.5), fontface = "bold") +
  labs(title = "Member Riders by Bike Type") +
  scale_fill_manual(values = c("#00CC66", "#9966FF"), name = "Bike Type") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))

They seem to be relatively similar across the board. Let’s see if comparing ride lengths for both member types on each bike type can offer better insights.

casual_bikes <- CyclisticCleaned %>%
  filter(member_casual == "casual") %>%
  mutate(bike_type = ifelse(rideable_type %in% c("classic_bike", "docked_bike"), "classic/docked", "electric_bike")) %>%
  group_by(hour_started, bike_type) %>%
  summarize(average_ride_time = mean(ride_time_calc), .groups = "drop") %>%
  arrange(hour_started, bike_type)

custom_colors <- c("classic/docked" = "#0077FF", "electric_bike" = "#FF7700")

ggplot(data = casual_bikes, aes(x = hour_started, y = average_ride_time, group = bike_type, color = bike_type)) + 
  geom_point() + 
  geom_line() + 
  scale_color_manual(values = custom_colors) +
  labs(title = "Average Ride Time by Hour Started for Casual Riders",
       subtitle = "(April 2023 - March 2024)",
       x = "Hour",
       y = "Average Ride Time (minutes)",
       color = "Bike Type") + 
  theme_minimal() +
  theme(text = element_text(size = 12),
        axis.text.x = element_text(angle = 90),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12, face = "bold"),
        axis.title = element_text(size = 12),
        legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

member_bikes <- CyclisticCleaned %>%
  filter(member_casual == "member") %>%
  mutate(bike_type = ifelse(rideable_type %in% c("classic_bike", "docked_bike"), "classic/docked", "electric_bike")) %>%
  group_by(hour_started, bike_type) %>%
  summarize(average_ride_time = mean(ride_time_calc), .groups = "drop") %>%
  arrange(hour_started, bike_type)

custom_colors <- c("classic/docked" = "#00CC66", "electric_bike" = "#9966FF")

ggplot(data = member_bikes, aes(x = hour_started, y = average_ride_time, group = bike_type, color = bike_type)) + 
  geom_point() + 
  geom_line() + 
  scale_color_manual(values = custom_colors) +
  labs(title = "Average Ride Time by Hour Started for Members",
       subtitle = "(April 2023 - March 2024)",
       x = "Hour",
       y = "Average Ride Time (minutes)",
       color = "Bike Type") + 
  theme_minimal() +
  theme(text = element_text(size = 12),
        axis.text.x = element_text(angle = 90),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12, face = "bold"),
        axis.title = element_text(size = 12),
        legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

Interestingly enough, it seems casual riders that chose to use classic bikes would spend much more time to electric bike casual riders than member riders. This can be especially seen between 5-10 am.

One possible cause is that members are more experienced with riding bikes and thus can maintain a higher speed. Both membership types were consistent in showing classic bikes being slower than electric, which makes sense due to their differences in speed.

Lastly, let’s take a look at the geospatial differences between starting points of casual riders and member riders for round trips. We will be using the leaflet library and we will first separate and summarize distances where starting and ending were the same and different respectively before combining them to create a display.

Since we made the coordinate columns factors, we can group them by these coordinates and remove any that are less than specified values to make visualization easier and more efficient.

distance_summarized <- CyclisticCleaned %>%
  drop_na(distance_km)

distance_0 <- distance_summarized %>%
  filter(distance_km==0)

route_summary_0 <- distance_0 %>%
  group_by(start_lng, start_lat, end_lng, end_lat, member_casual) %>%
  summarize(total = n(), .groups="drop") %>%
  filter(total>75) %>%
  arrange(desc(total))

route_summarized <- distance_summarized %>%
  filter(!distance_km==0) %>%
  group_by(start_lng, start_lat, end_lng, end_lat, member_casual) %>%
  summarize(total = n(), .groups = "drop") %>%
  filter(total>100) %>%
  arrange(desc(total))

data_for_clustering <- bind_rows(route_summary_0, route_summarized)
normalize_radius <- function(total, min_radius=2, max_radius=20) {
  min_total <- min(total)
  max_total <- max(total)
  normalized <- ((total - min_total) / (max_total - min_total)) * (max_radius - min_radius) + min_radius
  return(normalized)
}

casual_data <- data_for_clustering %>%
  filter(member_casual=="casual") %>%
  mutate(radius = normalize_radius(total))

leaflet(data = casual_data) %>%
  addTiles() %>%
  addCircleMarkers(~start_lng, ~start_lat, 
             radius = ~radius,
             fillOpacity = .3,
             stroke = TRUE,
             color = "#FF6F61")
member_data <- data_for_clustering %>%
  filter(member_casual == "member") %>%
  mutate(radius = normalize_radius(total))

leaflet(data = member_data) %>%
  addTiles() %>%
  addCircleMarkers(
    ~start_lng, ~start_lat,
    radius = ~radius,
    fillOpacity = 0.3,
    stroke = TRUE,
    color = "#6B5B95"
  )

It seems that for both user types, the area near Lake Michigan seems to be the most used starting area with casual members having large amounts of users starting in areas with high tourism such as ports.

Through these visualizations, we have determined several key insights into the differences of behaviors between casual and member users. Now it is time to summarize these findings and present our data-driven recommendations for a targeted marketing campaign into a digestable manner.


SHARE & ACT

Our goal was to find insights and report findings on how casual and member riders use Cyclistic bicycles differently. From our analysis of rides from April 2023 to March 2024, we found several similarities and differences between these two groups.

Similarities:

Differences:

With these similarities and differences, we can create character profiles for both casual and member riders.

Casual Riders:

Member Riders:


With this in mind, let’s move on to the recommendations


RECOMMENDATIONS

  1. Create a marketing campaign and fitness reward program targeted at casual users. A significant portion of these riders see Cyclistic as a fun and cost-effective way to explore the city and get exercise done during the weekends and other holidays. By marketing Cyclistic as THE platform for introducing people to the world of biking, we can push more users to try biking through our platform and convert to a membership to hit goals. We can also add a rewards plan and refer-a-friend bonuses so people feel more inclined to continue to use Cyclistic. Furthermore, as these users tend to explore and start from tourist heavy areas in downtown and the Lake Michigan area, advertisements positioned in this locations would be most impactful. To amplify these efforts, we can explore collaborations with popular tourist influencers to better promote our platform and appeal for this target audience.

  2. Highlight the cost effectiveness of member plans vs. casual riders who frequent Cyclistic as their main mode of commuting transportation through weekly metrics. Currently, there exists a small subset of casual riders whose behavior mimics members very closely during weekdays in terms of ride times. Despite this, they seem to continue to rely on casual membership plans. By underscoring the potential savings and additional benefits offered by member plans, such as lower rates per ride and access to exclusive features, this initiative aims to encourage these riders to consider upgrading to membership plans, enhancing user satisfaction and increasing the membership base.

  3. Introduce seasonal offerings, marketing campaigns, and membership types for both casual and existing members. Our data shows that summer months, particularly April and May-August, have the highest number of riders from both user types. We can leverage this into creating a seasonal membership option with additional benefits that is only active during these months so casual riders not wanting to commit to an annual membership will be more likely to switch to a seasonal membership. These benefits could include discounted rates, bonus ride credits, or exclusive access to summer events or promotions. Additionally, we can use the knowledge that the influx of casual riders in summer months likely arises from tourists and offer them a tourist membership bundled with benefits with other attractions. To effectively promote these offerings, customized marketing campaigns will be implemented, utilizing targeted advertising on social media platforms, email newsletters, and partnerships with local tourist attractions. By implementing these strategies, Cyclistic can effectively cater to the preferences and behaviors of its diverse user base, drive membership conversion, and enhance overall user satisfaction.


CLOSING NOTES

That about wraps up this case study. I had a lot of fun making this and would love to hear any feedback from anyone who happened to make it to the end. Thanks!