Bike Share Analysis

What Makes a Member?

I dug into the data and found some interesting insights on how casual riders and member riders differed. Click the image to view my recommendations to stakeholders. Expand the sections below to dive deeper into my analysis process.

Formulas Used

Calculations for Bike Share Analysis

To read about my step-by-step process of analysis, including troubleshooting and rationales, please expand the menus below. You can view a detailed presentation HERE.

Data Source

All data used in this analysis was obtained from Motivate International Inc. Motivate International Inc. gives a limited license to use their data as source material for analyses. The Data License Agreement is available to view here.

To give the bike-sharing company the most accurate data that also takes seasonality into account, I decided to download the most recent 12 months worth of data. At the time of download, this was data from May of 2021 through April of 2022.

Analysis Questions

How do annual members and casual riders use the company’s bikes differently?

Why would casual riders buy an annual membership?

Prepare - Load the Data

I used Excel for initial analysis. For each month’s data, I ran the remove duplicates tool and then used filters to find any blank cells in the data.

Data Issue #1

When I filtered the data, I noticed that many of the observations were missing values for start_station_name, start_station_id, end_station_name, or end_station_id.

Solution #1

I sorted the sheet by start_lat to see if I could determine the stations based on recorded longitude and latitude. I discovered another error in the data.

Data Issue #2

I noticed that some longitude and latitude values varied in decimal length even within the same start and end station.

Hypothetical Solutions

If I were doing an analysis for an actual company where I could contact stakeholders, I would reach out to my stakeholders to see how they would like me to address these issues. Since I am completing this case study on my own time for myself, I decided to proceed with the data as-is, making sure I take the missing values and slight variation into account when completing my analysis. I used COUNTIF=(range, “”) and then divided the result by the total number of observations in that month to find the percentage of all trips that were missing start or end station information. I also applied this formula to each column to see if there were any other missing entries and discovered that there were a small number of missing end_lat and end_lng entries (0.1% or less for each month).

Process - Adding additional calculations

Day of the Week

I created a new column using the function =WEEKDAY(D2) populated down the column. The output gives a number 1 through 7, with 1 being Sunday. This new column shows the share of riders of each type on different days of the week. This will allow me to see if different types of riders have different weekly behavior.

Start Time

I created a new column using the function =TIME(HOUR(D2),MINUTE(D2),SECOND(D2)) populated down the column. This gives me the HMS output of the start time so I can see what time of day riders of each type are riding.

Trip Duration

I created a new column using the formula =D3-D2 populated down the column. This gives me the duration of the trip, allowing me to see if different types of users tend to take longer or shorter trips. I filtered columns to show only trips with duration less than 1 second and removed any rows that had a duration of 0 or negative seconds.

Start Station Same as End Station

I created a new column using the formula =IFS(OR(J2="",H2=""),2,H2=J2,1,H2<>J2,0) populated down the column. This tells me whether the rider returned the bike to the original station (1), a different station (0), or either start or end station information is missing (2)

Analysis - Trip Distribution By Day

Question: Do member and casual riders have different trip distribution across days of the week?

I created a pivot table using row labels member_casual and day_of_week to list all the days of the week for members and for casual riders. Then I added Count of ride_id to the values pane to find the total number of rides taken by each group on each day of the week. I graphed the results in two pie charts for easier viewing.

Answer: Yes, the distributions are significantly different between the two groups.

Most months, members were fairly evenly distributed across days. Casual riders were heavily concentrated on the weekend, with summer months seeing more than 50% of total casual riders riding during the weekend.

Yearly averages show approximately 41% of casual riders ride during the weekend compared with just 26% of member riders. This suggests that many casual riders may be using the bikes for leisure. Member riders may use the bikes as part of their commute or other necessary tasks.

Analysis - Trip Duration

Question: Do ride durations differ between each group?

A longer average trip for casual members would support the idea that they were using the bikes for leisure. Shorter trips would more likely involve necessary travel between destinations. I added the average of trip_duration to the values on my pivot table.

Answer: Yes, casual riders have significantly longer trip_duration averages than member riders.

The trip_duration average for casual riders was more than double the average for member riders. Casual riders averaged 29 minutes and 52 seconds. Member riders averaged only 12 minutes and 45 seconds.

The range of averages per month was wider for Casual Riders than Member Riders. Casual riders saw a difference of 15 minutes and seven seconds between the highest average month (May) and the lowest average month (November). Member riders had a difference of only 3 minutes and 38 seconds between the high in May and the low in December. Members may take more discretionary trips in spring and summer months when the weather is nice, but a larger portion of their use is necessary trips.

Analysis - Return to Start

Question: Are casual riders more likely to return the bike to the same station?

Next, I wanted to see how frequently rider types returned to their start station. Riders who begin and end at the same station are more likely riding for leisure.

For this pivot table, I ended up iterating on my column formula several times. Initially, I wanted to create a pivot table using the start_station names that would tell me if the stop_station was the same. I realized I would not be able to do this using a pivot table, so I researched to find a solution that would give me the information I needed.

I created a new start_is_end column with a formula using an IF() statement to return a value dependent on whether start_station and end_station were the same. I realized that a simple IF() statement would not account for the missing station entries (more than 20% of total records for some months). I modified it to IFS(OR(J2="",H2=""),2,H2=J2,1,H2<>J2,0) to show a third value for rows that were missing either start_station or end_station data. I excluded the third value in my calculations.

Next, I created a pivot table using member_casual for rows and start_is_end for columns. I used the count of start_is_end column for values to get the total count of 0 values (trips where the end station was different from the start station) and 1 values (trips where start and end stations were the same). I then divided the number of 1 values by the number of 0 values plus 1 values to find the percentage of trips with the same start and end station for each rider type.

Answer: Yes, casual riders returned to the same station at more than two and a half times the rate of member riders.

The majority of both rider types ended at a different station than they started. Casual riders returned to the same station 10.54% of the time, compared to just 4.20% for member riders. This does not mean that only 10.54% of casual rides and 4.20% of member rides were for leisure. The high concentration of stations could mean that the rider returned the bike to a nearby station, but not the exact one that they started at. Alternately, riders could have returned the bike to a station near an activity they wanted to do and then booked another bike after completing the activity to continue their leisure ride. The difference in the rate for casual riders as opposed to member riders does support the theory that casual riders are more likely to ride for leisure.

Analysis - Trip Distribution by Month

Question: Are casual riders more likely to ride in warmer months than member riders?

I combined the pivot table data of ride_id per user group for all 12 months to find the total number of rides for each group in each month.

Answer: Yes, casual riders are much more heavily distributed in summer than member riders.

Half of all casual rides occurred from June to August. Member riders had a more even distribution throughout most of the year. Member rides were lower than expected in January and February.

I was curious why Member riders made so many fewer trips in January and February. I looked at the historical weather data for Chicago from December 2021 through March 2022. I used Weatherspark’s historical data and found that January and February were the two coldest months in that timeframe. High temperatures were below freezing for more than half of the days in those two months. It stands to reason that even people who normally commute by bike might find alternate modes of transportation on those days. Weatherspark’s Terms of Use do not permit me to reproduce their graphs here, but their historical weather data can be viewed on their website here.


Processing in R Studio

I chose R Studio to combine the monthly data into seasonal and year-long data frames because Excel was unable to handle the volume of data for the combined sets.

First, I combined months into seasonal data using rbind() to create a new data frame for each season. Next, I attempted to combine the seasonal data frames into a year-long data frame using rbind() again. That command returned an error indicating a lossy cast from <character> to <hms>. To troubleshoot where the problem was, I tried combining the seasonal data frames one at a time. Fall, winter, and spring all combined without errors. When I tried to add summer, I again got the lossy cast error. This process told me that the problem was in the summer data frame.

I compared the schema of the summer data frame with the combined data frame for the other seasons. I saw that the start_time column for summer was in character format, while it was in hms format for the other data frames. I used parse_hms() to convert the summer$start_time column to hms format. I completed the rbind() to give me a data frame with the entire year’s data.

Analysis - Start Time and Trip Duration

I used ggplot(data=year) + geom_smooth(mapping=aes(x=start_time,y=trip_duration,color=member_casual) to create a double trendline graph comparing the average trip duration to start time. I expected to see a slight dip in trip_duration for member riders before 9am, when many people would be commuting to work. I did see a small dip for members, but I was surprised to see a much larger dip in average trip duration for casual riders during the same time. Casual riders also had another smaller dip in trip_duration starting around 5pm, when many people are leaving work.

My initial hypothesis was not entirely correct. Casual riders do seem to use the bikes for leisure more than member riders. Another subset of casual riders use the bikes occasionally for commuting. This could be a great population to target for membership.


Analysis - User Type and Bike Type

I entered ggplot(data=year) + geom_smooth(mapping=aes(x=start_time,y=trip_duration,color=rideable_type)) + facet_wrap(~member_casual) to show bike types by member group. It seemed that member users used classic bikes at a higher proportion than casual users, but when I combined classic bikes and docked bikes for casual users, the proportion was about the same as for member users. Member users did not use any docked bikes, leading me to believe that this rideable_type is only registered when a user does a one time transaction. If I were performing this analysis for an actual company, I would confirm this theory with my stakeholders.

I wondered about riders’ usage of the different bike types, so I decided to combine my earlier plot with the data on bike types by entering ggplot(data=year) + geom_smooth(mapping=aes(x=start_time,y=trip_duration,color=rideable_type)) + facet_wrap(~member_casual).

The second graph supports my theory that the docked_bike type was a one time transaction. The average trip duration for that type was much higher compared to the other bike types. Longer trips would be consistent with a person who decided to use the bike for leisure or exercise without planning ahead and booking in advance. For both groups, the average trip duration for the electric_bike type was lowest, indicating that users may prefer the electric bike for commuting or running errands. This would make sense, since the electric bike required less physical exertion, meaning the user would not arrive at work or another important destination sweaty and tired.


Processing For Tableau

Tableau has a maximum file upload size of 1GB, so my original year.csv file was too large to upload. I returned to my data in R Studio and created a new data frame named year_tableau using subset() to remove columns I was not planning to use with Tableau. I then did the same for each seasonal data frame.

When I uploaded the data, I realized another problem. Because the formats of the latitudes and longitudes were inconsistent, Tableau was reading entries that were from the same location as being different, making it impossible to visualize the most frequent start and stop locations. To fix this, I returned to R Studio and used the round() function to round all values for longitude and latitude to 3 decimal places. I chose 3 places because that would give me accuracy to within 111 meters, which was accurate enough for general mapping purposes without being so specific as to cause the same problem I was trying to fix.

Tableau - Mapping Starts and Ends

I wanted to map out where each casual and member rider started and ended their trip so that I could determine the most common start and end locations for each user type. I created a separate map for start and end locations for the year long data as well as for each season. To visually show the most common locations, I put longitude and latitude values as dimensions on details, then the count of longitude as size and count of latitude as color. This gave me a graph where locations with more rides showed up larger and a different color than locations with fewer rides. I then added a filter to toggle between member type.

There were several locations that stood out as having a high volume of riders. I used Google Maps to look up the longitude and latitude values for those locations and added this information to my visualization as annotations. The full visualizations can be viewed here.

New Insights From Mapping

One thing I noticed right away was that casual riders were overwhelmingly clustered in more tourist-oriented locations, especially around the Navy Pier. Member riders had clusters of end locations throughout the city, but by far their highest volume start location was on the campus of the University of Chicago. Interestingly, none of the other major universities in Chicago had similar numbers, even though several of the other universities have a higher enrollment than University of Chicago.

The seasonal maps were mostly similar to the year long map with two key differences. First, for member riders, the summer months saw a wider distribution of start locations. This makes sense since there aren't as many classes offered in the summer, so not as many people were on the University of Chicago campus. Since the number of member riders also increased in the summer compared to other months, this suggests that more member riders were riding for leisure or other reasons during those months.

Second, and more importantly for the purposes of our business task, there was a large uptick in the proportion of casual riders at the University of Chicago in winter compared to other months. This suggests that there is still a market of casual riders at the University of Chicago that could be turned into members. They are likely there all year long and are just overshadowed by the tourist riders during the warmer months.

Key Findings

After completing my analysis, I have several ideas to share with stakeholders to help increase membership rates. You can view my presentation to stakeholders by clicking here or clicking the picture at the top of the page.

1. Promote Leisure, Especially in Summer

Casual riders tend to take long rides, averaging over 30 minutes in summer months. They clearly enjoy the rides enough to continue for a long period of time, so they may be open to signing up for a membership to continue enjoying rides if given targeted advertising after their ride.

2. Target Universities

The University of Chicago is the only university in the area with a significant member population. Targeted advertising at other universities citing the convenience of the bike share program could generate additional interest and membership.

3. Commuters Use More Electric Bikes

Casual riders who use electric bikes have a lower ride length on average, indicating that they are more likely using the bikes for their commute or errands. Targeted advertising citing the convenience of bike share for those purposes after a casual rider uses an electric bike could convince some to sign up for a membership.