In 2016, Cyclistic, fictional company, launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.
Design marketing strategies aimed at converting casual riders into annual members.
library(tidyverse)
library(dplyr)
library(skimr)
library(janitor)
library(readr)
library(ggplot2)
library(knitr)
colnames(X202301_divvy_tripdata)
colnames(X202302_divvy_tripdata)
colnames(X202303_divvy_tripdata)
colnames(X202304_divvy_tripdata)
colnames(X202305_divvy_tripdata)
colnames(X202306_divvy_tripdata)
colnames(X202307_divvy_tripdata)
colnames(X202308_divvy_tripdata)
colnames(X202309_divvy_tripdata)
colnames(X202310_divvy_tripdata)
colnames(X202311_divvy_tripdata)
colnames(X202312_divvy_tripdata)
str(X202301_divvy_tripdata)
str(X202302_divvy_tripdata)
str(X202303_divvy_tripdata)
str(X202304_divvy_tripdata)
str(X202305_divvy_tripdata)
str(X202306_divvy_tripdata)
str(X202307_divvy_tripdata)
str(X202308_divvy_tripdata)
str(X202309_divvy_tripdata)
str(X202310_divvy_tripdata)
str(X202311_divvy_tripdata)
str(X202312_divvy_tripdata)
combined_tripdata_2023 <- list.files(path="/Users/johnrama/Documents/data_analytics_capstone_bikeshare_2023",
pattern = "*.csv",full.names = TRUE) %>%
lapply((read_csv)) %>%
bind_rows()
## Create dataframe
combined_tripdata_2023_df <- as.data.frame(combined_tripdata_2023)
## Investigate new dataframe
colnames(combined_tripdata_2023_df) #List of column names
nrow(combined_tripdata_2023_df) #Number of rows in df
dim(combined_tripdata_2023_df) #Dimensiions of df
head(combined_tripdata_2023_df) #First 6 lines of df
tail(combined_tripdata_2023_df) #Last 6 lines of df
glimpse(combined_tripdata_2023_df) #List of columns and datatypes
str(combined_tripdata_2023_df) #List of columns and datatypes
summary(combined_tripdata_2023_df) #Statistical summary of data.
##count number of distinct ride_ids
n_distinct(combined_tripdata_2023_df$ride_id)
## Trim white space from df for ride_id, member_casual, column, rideable_type,
## and start_station_name columns
combined_tripdata_2023_df$ride_id <- str_squish(combined_tripdata_2023_df$ride_id)
combined_tripdata_2023_df$member_casual <- str_squish(combined_tripdata_2023_df$member_casual)
combined_tripdata_2023_df$rideable_type <- str_squish(combined_tripdata_2023_df$rideable_type)
combined_tripdata_2023_df$start_station_name <- str_squish(combined_tripdata_2023_df$start_station_name)
##Count the number of distinct values in each column
sapply(combined_tripdata_2023_df, function(x) n_distinct(x))
# Note: There are multiple spellings for the same start_station_name.
# This results in a discrepancy in the number of start/end station names and id's.
# Further cleaning would be necessary if analyzing start_station_names or ids.
## Add columns for date, month, day, and year of each ride taken
## This will allow us to aggregate ride data by month, day, or year
combined_tripdata_2023_df$date <- as.Date(combined_tripdata_2023_df$started_at)
# format is yyyy-mm-dd
combined_tripdata_2023_df$month <- format(as.Date(combined_tripdata_2023_df$date), "%m")
combined_tripdata_2023_df$day <- format(as.Date(combined_tripdata_2023_df$date), "%d")
combined_tripdata_2023_df$year <- format(as.Date(combined_tripdata_2023_df$date), "%Y")
combined_tripdata_2023_df$day_of_week <- format(as.Date(combined_tripdata_2023_df$date), "%A")
## Add column for start time for each ride taken
## This will allow us to aggregate ride data by time of day
combined_tripdata_2023_df$start_time <- format(as.POSIXct(combined_tripdata_2023_df$
started_at), format = "%H:%M:%S")
## Add trip duration column to dataset (minutes)
combined_tripdata_2023_df$trip_duration <- difftime(combined_tripdata_2023_df$
ended_at, combined_tripdata_2023_df$started_at, units = "mins")
## Create dataframe with trip duration column
comb_tripdata_2023_duration_df <- as.data.frame(combined_tripdata_2023_df)
## Inspect the structure of the columns in our new df
str(comb_tripdata_2023_duration_df)
## Convert trip_duration to numeric
is.numeric(comb_tripdata_2023_duration_df$trip_duration)
comb_tripdata_2023_duration_df$trip_duration <-
as.numeric(comb_tripdata_2023_duration_df$trip_duration)
is.numeric(comb_tripdata_2023_duration_df$trip_duration)
## Inspect trip_duration column values
summary(comb_tripdata_2023_duration_df$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16656.52 5.50 9.65 17.21 17.15 98489.07
Note: Negative trip duration values exist as do null values as start station names.
## Remove bad data, including negative values for trip duration and null values
## for start_station_name
## Null values occurr where bikes were taken out of circulation by DIVVY
comb_tripdata_2023_duration_clean <- comb_tripdata_2023_duration_df %>%
filter(!(is.na(start_station_name) | start_station_name == ""), !(is.na(end_station_name) |
end_station_name == ""), !(trip_duration <= 0))
# removed start and end station names as NA | blanks and negative trip_duration values
## Look for duplicate values in df
comb_tripdata_2023_duration_clean %>%
get_dupes(ride_id, rideable_type, started_at, start_station_id, member_casual, trip_duration)
## Confirm bad data was removed from dataset
comb_tripdata_2023_duration_clean %>%
filter(is.na(start_station_name) | start_station_name == "", is.na(end_station_name) |
end_station_name == "", trip_duration <= 0)
comb_tripdata_2023_cleaned <- as.data.frame(comb_tripdata_2023_duration_clean)
Note: Start time is rounded to the nearest hour and formatted as time.
## Add column for start time rounded to nearest hour
comb_tripdata_2023_cleaned$rounded_nearest_start_time <-
round_date(comb_tripdata_2023_cleaned$started_at,"hour")
## Format rounded start time column to strickly be a time stamp
comb_tripdata_2023_cleaned$rounded_nearest_start_time <- format(as.POSIXct(
comb_tripdata_2023_cleaned$rounded_nearest_start_time), format = "%H:%M")
## Summary statistics for trip_duration for cleaned data set
summary_trip_duration <- comb_tripdata_2023_cleaned %>%
summarize(mean_trip_duration = mean(trip_duration), max_trip_duration =
max(trip_duration), min_trip_duration = min(trip_duration), median_trip_duration =
median(trip_duration))
##Make table to display summary statistics
summary_trip_duration %>%
knitr::kable(col.names = c("Mean", "Max", "Min", "Median"),
caption = "Summary Stats for Trip Duration (minutes)")
Mean | Max | Min | Median |
---|---|---|---|
15.95478 | 12136.3 | 0.0166667 | 9.8 |
Note: All values are given in minutes.
## Compare trip_duration for members and casual riders
agg_mean <- aggregate(comb_tripdata_2023_cleaned$trip_duration~
comb_tripdata_2023_cleaned$member_casual, FUN = mean)
agg_median <- aggregate(comb_tripdata_2023_cleaned$trip_duration~
comb_tripdata_2023_cleaned$member_casual, FUN = median)
agg_max <- aggregate(comb_tripdata_2023_cleaned$trip_duration~
comb_tripdata_2023_cleaned$member_casual, FUN = max)
agg_min <- aggregate(comb_tripdata_2023_cleaned$trip_duration~
comb_tripdata_2023_cleaned$member_casual, FUN = min)
## Make tables of aggregate data
agg_mean %>%
knitr::kable(col.names = c("Member Type", "Mean Trip Duration"))
Member Type | Mean Trip Duration |
---|---|
casual | 22.94186 |
member | 12.13228 |
agg_median %>%
knitr::kable(col.names = c("Member Type", "Median Trip Duration"))
Member Type | Median Trip Duration |
---|---|
casual | 12.750000 |
member | 8.616667 |
agg_max %>%
knitr::kable(col.names = c("Member Type", "Max Trip Duration"))
Member Type | Max Trip Duration |
---|---|
casual | 12136.300 |
member | 1497.867 |
agg_min %>%
knitr::kable(col.names = c("Member Type", "Min Trip Duration"))
Member Type | Min Trip Duration |
---|---|
casual | 0.0166667 |
member | 0.0166667 |
## Order days of week
comb_tripdata_2023_cleaned$day_of_week <- ordered(comb_tripdata_2023_cleaned$day_of_week,
levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
## Average (mean) trip duration by day of week for member and casual riders
agg_mean_weekday_member <- comb_tripdata_2023_cleaned %>%
aggregate(comb_tripdata_2023_cleaned$trip_duration ~
comb_tripdata_2023_cleaned$member_casual +
comb_tripdata_2023_cleaned$day_of_week, FUN = mean)
## Avg (median) trip duration by day of week for member and casual riders
agg_med_weekday_member <- comb_tripdata_2023_cleaned %>%
aggregate(comb_tripdata_2023_cleaned$trip_duration ~
comb_tripdata_2023_cleaned$member_casual +
comb_tripdata_2023_cleaned$day_of_week, FUN = median)
## Make tables of avg trip duration by day of week and member type
agg_mean_weekday_member %>%
knitr::kable(col.names = c("Member Type", "Weekday", "Mean Trip Duration"),
caption = "Mean Trip Duration (minutes)")
Member Type | Weekday | Mean Trip Duration |
---|---|---|
casual | Sunday | 26.56936 |
member | Sunday | 13.61416 |
casual | Monday | 22.53504 |
member | Monday | 11.55048 |
casual | Tuesday | 20.51681 |
member | Tuesday | 11.64962 |
casual | Wednesday | 19.60035 |
member | Wednesday | 11.58680 |
casual | Thursday | 20.00623 |
member | Thursday | 11.60257 |
casual | Friday | 22.32226 |
member | Friday | 12.03977 |
casual | Saturday | 25.92237 |
member | Saturday | 13.58449 |
agg_med_weekday_member %>%
knitr::kable(col.names = c("Member Type", "Weekday", "Median Trip Duration"),
caption = "Median Trip Duration (minutes)")
Member Type | Weekday | Median Trip Duration |
---|---|---|
casual | Sunday | 15.083333 |
member | Sunday | 9.333333 |
casual | Monday | 12.100000 |
member | Monday | 8.216667 |
casual | Tuesday | 11.316667 |
member | Tuesday | 8.400000 |
casual | Wednesday | 10.933333 |
member | Wednesday | 8.433333 |
casual | Thursday | 11.200000 |
member | Thursday | 8.450000 |
casual | Friday | 12.433333 |
member | Friday | 8.466667 |
casual | Saturday | 15.050000 |
member | Saturday | 9.483333 |
## Create plot to analyze number of rides by weekday, for member_casual
comb_tripdata_2023_cleaned %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #create weekday field
group_by(member_casual, weekday) %>% # group by usertype and weekday
summarize(number_rides = n(), average_duration = mean(trip_duration)) %>%
# num of rides & avg duration
arrange(member_casual, weekday) %>% # arrange
ggplot() + geom_col(mapping = aes(x = weekday, y = number_rides, fill = member_casual),
position = "dodge") + labs(title = "Rider Habits", subtitle = "Ride frequency by
day of week", caption = "www.divvybikes.com/data") +
scale_fill_brewer(palette = "Set1")
# column plot num rides v weekday, adjusted color
## Analyze & plot the percentage of rides by weekday by member type
comb_tripdata_2023_cleaned %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% # create weekday field
group_by(member_casual, weekday) %>% # group by member type & weekday
summarise(count=n()) %>% #count the number of each type of member
mutate(percent_of_rides = count/sum(count) *100) %>% # Calculate percent rides by weekday
arrange(member_casual, weekday) %>% # arrange
ggplot() + geom_col(mapping = aes(x = weekday, y = percent_of_rides,
fill = member_casual), position = "dodge") + labs(title = "Breakdown of Member Rides", subtitle = "Percent of rides by day of week", caption = "www.divvybikes.com/data") + scale_fill_brewer(palette = "Set1")
# plot percent of rides by weekday for each type of member v season, adjust color
Key Finding #1
The plot on the left shows that casual riders prefer to ride on weekend days whereas members tend to ride more during the week.
More casual rides occur on Saturday than any other day of the week and more member rides occur between Tuesday and Thursday than any other days of the week.
The plot on the right shows that, although more members ride on any given day of the week, when compared to casual riders, a greater percentage of casual riders ride from Friday through Sunday.
Slightly over 50% of all casual rides occur on the weekend (Fri.-Sun.) compared to only 36.9% of member rides.
Takeaway Casual riders tend to ride more on weekend days suggesting that incentives should be targeting casual weekend riders. Possible incentives might include lowering the per minute cost of weekend rides, currenlty set at $0.18/min. (classic bike), for a set time period for anyone purchasing a new membership. Additionally, new members might receive a limited number of guest passes that may only be used on weekend days.
## Create plot to analyze median ride duration by weekday, for member_casual
comb_tripdata_2023_cleaned %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% # create weekday field
group_by(member_casual, weekday) %>% # group by usertype and weekday
summarize(number_rides = n(), med_duration_minutes = median(trip_duration)) %>%
# num of rides & median ride duration
arrange(member_casual, weekday) %>% # arrange
ggplot() + geom_col(mapping = aes(x = weekday, y = med_duration_minutes,
fill = member_casual), position = "dodge") + labs(title = "Daily Ride Length",
subtitle = "Median ride length by day of week", caption = "www.divvybikes.com/data") +
scale_fill_brewer(palette = "Set1") + theme(axis.text.x = element_text(size = 10))
# column plot median ride duration v weekday, adjusted color
Key Finding #2
Takeaway Look to incentivize longer rides on weekends, with the purchase of a new membership. In 2023 alone, close to 100,000 weekend rides (Fri.-Sun.) were taken, by casual riders that lasted longer than 45 minutes. Members are currently charged $0.18/minute (classic bike) for rides lasting over 45 minutes. One incentive might be to extend the length of the free ride time to 60 minutes for new members allowing them to ride longer on the weekend. Additionally, new members could receive bonus points for participating in the Bike Angels program on weekend days and use those points toward purchasing extra riding minutes, DIVVY swag, discounted rates on scooters and e-bikes, or reduced membership renewal costs.
comb_tripdata_2023_cleaned %>%
group_by(member_casual, rounded_nearest_start_time) %>% # group by member_casual & time
summarize(number_rides = n(), avg_trip_duration = mean(trip_duration)) %>%
# num of rides and median trip duration
arrange(member_casual, rounded_nearest_start_time) %>% # arrange member_casual & time
ggplot() + geom_line(mapping = aes(x = rounded_nearest_start_time, y = number_rides,
group = member_casual, color = member_casual)) + labs(title = "Ride Times", subtitle = "Ride habits by time of day (CST)", caption = "www.divvybikes.com/data ") +
scale_color_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# line plot of number of rides by time of day by member type
## Analyze and plot ride duration by time of day
comb_tripdata_2023_cleaned %>%
group_by(member_casual, rounded_nearest_start_time) %>% # group by member_casual & time
summarize(number_rides = n(), med_duration_minutes = median(trip_duration)) %>%
# num of rides and median trip duration
arrange(member_casual, rounded_nearest_start_time) %>% # arrange member_casual & time
ggplot() + geom_line(mapping = aes(x = rounded_nearest_start_time, y = med_duration_minutes,
group = member_casual, color = member_casual)) + labs(title = "Hourly Ride Length",
subtitle = "Median ride length by hour (CST - rounded to nearest hour)",
caption = "www.divvybikes.com/data ") + scale_color_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# line plot of number of rides by time of day by member type
Key Finding #3
The plot on the left shows the number of casual rides, occurring on any given day, is highest between the hours of 4pm and 6pm (CST).
The number of member rides spikes between the hours of 7am and 10am and then again between the 4pm and 7pm (CST).
The plot on the right indicates that the length of casual rides dramatically increases between the hours of 9am and 11am (CST).
Casual riders ride for longer periods of time between the hours of 11am and 3pm (CST). It is during this period of time that casual rider ride times peak for the day.
Member ride lengths are much more consistent throughout the course of the day, with the longest rides occurring between the hours of 4pm and 7pm (CST).
Takeaway From this analysis we learn that more casual riders ride between the hours of 4pm and 6pm than another part of the day. Additionally, many casual riders are riding for longer periods of time between the hours of 11am and 3pm. With these two thoughts in mind new member incentives should be focused on the 289,257 rides that occured between 4pm and 6pm as well as the 398,147 riders who rode for longer periods of time between the hours of 11am and 3pm. One incentive might include earning extra reward points for rides taken between the hours of 4pm and 6pm, for new members, that could then be applied to purchasing extra riding minutes, DIVVY swag, or reduced membership renewal costs. In addition, new members would also be granted, for a limited period of time, 60 minutes of free ride time (up from 45 minutes), before per minute rates apply, and double the points for participating in the Bike Angel program, during the hours 11am - 3pm.
###Summary###
The Business Task: Design marketing strategies aimed at converting casual riders into annual members.
Questions to be answered from analysis: How do annual members and casual riders use Cyclistic bikes differently?
Marketing strategy recommendations aimed at converting casual riders into annual members: Recommendations are based on these findings:
Top three marketing strategy recommendations: