I. Introduction

Bikeshare system is wildly adopted by major cities in the United States. In the bikeshare system, users can pickup and return bikes at designated bike sharing stations with a finite number of docks. Unfortunately, user behavior results in spatial imbalance of the bike inventory over time. The system equilibrium is often characterized by unacceptably low availability of bikes or open docks, for pickups or returns respectively.

D.C.’s Capital Bikeshare was one of the first big bike share systems in the United States, offering docks full of bikes all over the central part of the District, with some additional service in outlying neighborhoods and in the suburbs. During the COVID-19 pandemic, two methods of re-balancing could be adopted. The first one is to move bicycles with trucks, which can create job positions. The second one is offering membership discount to encourage subscription, considering cycling is a rather safe outdoor activity.

II. Data Loading and Feature Engineering

2.1 Import Bike Share Data

Import DC Capital Bikeshare trip data over the period of September 2020 and October 2020. The data includes: * Duration – Duration of trip * Start Date – Includes start date and time * End Date – Includes end date and time * Start Station – Includes starting station name and number * End Station – Includes ending station name and number * Bike Number – Includes ID number of bike used for the trip * Member Type – Indicates whether user was a “registered” member.

From the start time of a trip, we generated: * interval60 - represents the hour of start time * interval15 - represents 15-minute interval of the start time. * week - week number of the trip * dotw - day of the trip

2.2 Import Census Data

This analysis uses data from the 2014-2018 ACS 5-year estimates. The following demographic variables were selected from ACS 2018 for census tracts in Washington D.C.:

  • Total population

  • Median household income

  • White population percentage

  • Travel time

  • Number of commuter

  • Means of transportation

  • Total public transportation

2.3 Import Weather Data

Import weather data from Ronald Reagan Washington National Airport (code DCA) using riem_measures function. We can mutate the data to get temperature, wind speed, precipitation on an hourly basis and plot the temperature and precipitation trends over our study period.

weather.Panel <- 
  riem_measures(station = "DCA", date_start = "2020-09-01", date_end = "2020-10-30") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

2.4 Create Distance Feature

In this section, we calculate the distance of bikestations to three features imported from DC Opendata portal:

  • DC metro station: This dataset contains points representing Metro facilities in DC, 40 records in total.

  • DC grocery stores: A point feature class of grocery stores in the District. There are 79 records in the dataset.

  • DC valet parking lot: A point feature class of locations and businesses permitted to offer valet parking. There are in total 1.987 records.

2.5 Create Full Space-Time Panel

We create the full panel by summarizing counts by station for each time interval, keep census info and lat/lon information along for joining later to other data. We remove data for station IDs that are FALSE.

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(., dat_census %>%
              dplyr::select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, dcCensus %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))

# Integrate built features into whole dataset
ride.panel <- merge(x = ride.panel, y = bike, by = 'start_station_id',
                    all.y  = TRUE) %>%
    filter(start_station_id >12) %>%
  na.omit(ride.panel$interval60)

2.6 Create Time Lags

Creating time lag variables will add additional nuance about the demand during a given time period - hours before and during that day.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval60))

2.7 Split Training Set and Test Set

This analysis is training on 3 weeks of data, weeks 37-39, and testing on the preceding 2 weeks, 35-36.

III. Exploratory Analysis

In this section, the ride share data is explored for time, space, weather, and demographic relationships.

3.1 Serial Autocorrelation

3.1.1 Bikeshare count by week

From Figure 3.1.1 we can see that during our study period, Sep. 1st to Sep 29th, although each day generally follows the same trend of peak time and low time, total trip count varies. Specifically, trip count in test set is lower than training set.

mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 



rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing")) %>% 
    group_by(Legend, interval60) %>% 
      summarize(Trip_Count = sum(Trip_Count)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        geom_vline(data = mondays, aes(xintercept = monday), linetype = "dotted", color = '#596a71') +
        labs(title="Bikeshare trips by week in DC: September-October",
             subtitle="Dotted Lines for Each Monday\n",
             caption = 'Figure 3.1.1',
             x="Day", y="Trip Count") +
        plotTheme() 

3.1.2 Bike share trips on weekend vs weekday, DC

Figure 3.1.2.1 shows the plots of bikeshare trip counts by days of the week. We can see that Monday to Friday generally follows the same trend, that a small peak is around 8am-9am, a major peak is around 8pm, and then decrese through midnight. While for Saturday and Sunday, trip counts gradually increase and peak at noon, and decrease through midnight. Figure 3.1.2.2 presents similar information with the classification of weekday and weekend.

ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  scale_color_manual(values = palette7,
                          name = 'Weekday') +
  labs(title="Bike share trips in DC, by day of the week, Sept & Oct, 2020\n",
       caption = 'Figure 3.1.2.1',
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

ggplot(dat_census %>% mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))) +
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
    scale_color_manual(values = palette2,
                          name = 'Period') +
  labs(title="Bike share trips in DC - weekend vs weekday, Sept & Oct, 2020\n",
       caption = 'Figure 3.1.2.2',
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

3.1.3 Mean Number of Hourly Trips Per Station

In Figure 3.1.3, time of the day were divided into chunks. Mid-day (10:00-15:00) and night rush hour (15:00-18:00) are more active than morning rush time (7:00-10:00) and overnight (18:00-24:00), when most stations have zero trips or one trip.

dat_census %>%
  filter(start_station_id <= 626) %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), fill = '#596a71',binwidth = 0.5)+
  labs(title="Mean Number of Hourly Trips Per Station. DC, Sept & Oct, 2020\n",
       caption = 'Figure 3.1.3',
       x="Trip Count", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()

3.1.4 the time lag features are tested for correlation with Trip_Count.

Overall, the correlation between time lag and trip is minor. The strongest correlation is for 2 hour lag time, with 0.11 coefficient. The correlation coefficient diminish to 0 for time lag larger that 4 hours.

plotData.lag <-
  filter(as.data.frame(ride.panel), week == 39) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                          "lag4Hours","lag12Hours","lag1day"))


correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2)) 

ggplot(plotData.lag, aes(Value, Trip_Count)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.lag, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
  facet_wrap(~Variable) +
  labs(title = "Bikeshare trip counts as a function of time lags",
       subtitle = 'Week 39 in 2020, DC\n',
       caption = 'Figure 3.1.4') +
  plotTheme()

3.2 Spatial Autocorrelation

This section create multiple plots to show bike share trips per hour by station. We can conclude that weekday overnight is more active, then follows weekend overnight, weekday mid-day and night rush. Despite the time of the day, more trips occurred in the center city and there are less trips in edge DC area.

ggplot()+
  geom_sf(data = dcTracts %>%
          st_transform(crs=4326))+
  geom_point(data = ride.panel %>%

            mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              na.omit(ride.panel$interval60) %>%
              group_by(start_station_id, start_lat, start_lng, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lng, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.4, size = 0.05)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+

  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bikeshare trips per hour by station, Week35-Week39, 2020",
       subtitle = "DC",
       caption = 'Figure 3.2.4')+
  mapTheme +
  plotTheme()

3.3 Space/Time Correlation: Animation on Week39

An animated map was created based on one-week data in the training set. There is no clear pattern of bikeshare activity with existing trips scattered across DC and most of the time no trip occurred in most stations.

week39 <-
  filter(ride.panel , week == 39 ) 

animation.data <- 
  week39 %>% 
  group_by(interval60, start_station_id) %>%
  summarize(Trip_Count = sum(Trip_Count, na.rm=T)) 

dock <- bike %>%
  dplyr::select(start_station_id) %>% 
  filter(start_station_id > 12)
  

animation.data <- 
  merge(x = animation.data, y = dock, by = 'start_station_id',
                    all.y  = TRUE) %>%
  na.omit(animation.data$Trip_Count) 

animation.data <- animation.data %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trip",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 4 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 ~ "6+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trip","1-3 trips","4-6 trips",
                                       "6+ trips"))

animation <- 
  ggplot() +
  geom_sf(data = dcTracts %>%
          st_transform(crs=4326))+
geom_sf(data = animation.data , aes(color = Trips)) +
  scale_color_manual(values = palette4) + 
    labs(title = "Bikeshare trip counts for Week39 in 2020, DC",
         subtitle = "1 hour intervals: {current_frame}\n",
         caption = 'Figure 3.3') +
  plotTheme() + 
    transition_manual(interval60) +
    mapTheme

gganimate::animate(animation,  duration=30,renderer = gifski_renderer())

3.4 Weather Coorelation

The dataset includes three weather-related variables: precipitation, temperature and wind speed. This section presents how the three weather features effect bikeshare trip respectively.

3.4.1 Precipitation

Figure 3.4.1.2 shows that although the mean trip count is minor, precipitation significantly reduced the number of bikeshare trips.

ggplot(ride.panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation Across Time\n", 
       caption = 'Figure 3.4.1.1',
       x="Hour", y="Perecipitation") + 
  plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(isPercip = ifelse(Precipitation > 0,"Precipitation", "None")) %>%
  group_by(isPercip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
    ggplot(aes(isPercip, Mean_Trip_Count)) + geom_bar(stat = "identity",width = 0.5, fill = '#485123' ) +
      labs(title="Does Bikeshare trip count vary with precipitation?\n", 
       caption = 'Figure 3.4.1.2',
           x="Precipitation", y="Mean Trip Count") +
      plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Precipitation, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE, color = 'red') +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as A Function of Precipitation by Week\n", 
       caption = 'Figure 3.4.1.3',
         x="Precipitation", y="Mean Trip Count") +
    plotTheme() 

3.4.2 Temperature

With the temperature in DC fluctuated daily and slight decreased over weeks, the number of bikeshare trips seemed to increased.

ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature Across Time\n", 
       caption = 'Figure 3.4.2.1',
       x="Hour", y="Temperature") + plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE,color = 'red') +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as A Function of Temperature by Week\n", 
       caption = 'Figure 3.4.2.2',
         x="Temperature", y="Mean Trip Count") +
    plotTheme() 

3.4.3 Wind Speed

The overall wind speed did not change a lot during the study period. As shown in Figure 3.4.3.2, the wind speed did not seem to affect the number of bikeshare trip.

ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed Across Time\n", 
       caption = 'Figure 3.4.3.1',
       x="Hour", y="Wind Speed") + plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Wind_Speed = first(Wind_Speed)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind_Speed, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE,color = 'red') +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as A Function of Wind Speed by Week\n", 
       caption = 'Figure 3.4.3.2',
         x="Wind Speed", y="Mean Trip Count") +
    plotTheme() 

IV. Model Building and Prediction

In this section, models are estimated from the training set and tested on the test set to gauge how well the space/time features forecast ride share demand.

4.1 Model Building

This section created an initial model reg0 to help determine to rule out the features that are obviously less unrelated to the bikeshare counts. Upon the initial model, features such as travel time, means of transportation, White population percentage, number of bikes/empty docks, nearest parking lot were excluded due to irrelevance.

Then, this sections designs four different linear regressions are estimated on ride.Train, each with different fixed effects:

  • reg1 focuses on just time, including hour and weekday fixed effects, and other critical features
  • reg2 focuses on just space effects with the station and and other critical fixed effects
  • reg3 includes both time and space fixed effects.
  • reg4 adds the time lag features.

4.2 Predict on Testing Set

V. Model Validation

In this section, Mean Absolute Error (MAE) is calculated on ride.Test for each model.

5.1 MAE Examination by Model

Moving forward, let’s stick with reg1 and reg4 since the former one shows the smallest MAE while ’reg4` captures most of fluctuations across time.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week\n",
         caption = "Figure 5.1") +
  plotTheme()

5.2 Error Examination

This section includes error examinations on time, space and the combination of space and time across the four models. Figure 5.2.1 presents the plots of observed and predicted data and how they overlapped. We can conclude that the model did not do well on predicting the occurrence of bikeshare trips at peak time.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
  scale_color_manual(values = palette2,
                          name = ' ') +
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "DC; A test set of Week35 - Week36",  x = "Hour", y= "Station Trips",
           caption = "Figure 5.2.1") +
      plotTheme()

Figure 5.2.2 mapped the mean error by station. Four models have similar performance, with MAE close to 0 in most area and MAE slightly higher in Southeast DC.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng)) %>%
    dplyr::select(interval60, start_station_id, start_lng, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%

  group_by(Regression, start_station_id, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
        facet_wrap(~Regression, ncol=4) +
  labs(title="Mean Abs Error by Station",
       subtitle = "DC; A test set of Week35 - Week36\n",
       caption = 'Figure 5.2.2')+
  mapTheme +
  plotTheme()

The following plot sets present the MAE on test set with space and time combined. Model 1 performed better than model 4. MAE in most areas are close to 0, in few stations MAE are around 0.04.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    dplyr::select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model1")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title = "Error Examination by Space + Time, Model 1",
       subtitle ="Mean Absolute Errors on Test Set\n",
       caption = "Figure 5.2.3") +
  mapTheme +
  plotTheme()

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    dplyr::select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model2")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title = "Error Examination by Space + Time, Model 4",
       subtitle ="Mean Absolute Errors on Test Set\n",
       caption = "Figure 5.2.4") +
  mapTheme +
  plotTheme()

5.3 Cross Validation

This section conducts “k-fold” cross-validation to test the goodness of fit on the whole dataset. To save the operation cost, we set 50 folds to test. If the model generalized well, the distribution of errors would cluster tightly together. As shown in Figure 5.3.1 and Figure 5.3.2, neither of the two models predict consistently, indicating neither of the models generalizes well.

fitControl <- caret:: trainControl(method = "cv", 
                                   number = 10,
                                   savePredictions = TRUE)

set.seed(717)

reg1.cv <- 
  caret::train(Trip_Count ~ ., data = ride.panel %>% 
                 dplyr::select(Trip_Count, interval60, dotw , Temperature , Precipitation , Wind_Speed , Med_Inc ,  Med_Age ,  Mean_Commute_Time , Percent_Taking_Public_Trans ,  metro_nn1 , metro_nn2 , grocery_nn1 , grocery_nn2) ,
               method = "lm", 
               trControl = fitControl, 
               na.action = na.pass)

kable(reg1.cv$resample) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 5.1 Cross-validation Test on Model 1: Summary of RMSE, R Squared and MAE\n",
           general_title= '\n')
RMSE Rsquared MAE Resample
0.0501636 0.0001310 0.0036701 Fold01
0.0380371 0.0000678 0.0031838 Fold02
0.0694172 0.0002979 0.0039842 Fold03
0.0576087 0.0006870 0.0035147 Fold04
0.1000711 0.0017033 0.0039193 Fold05
0.0450967 0.0001137 0.0038390 Fold06
0.0387222 0.0002460 0.0032275 Fold07
0.0527696 0.0001793 0.0035265 Fold08
0.0373173 0.0001831 0.0032516 Fold09
0.0414038 0.0000019 0.0035215 Fold10

Table 5.1 Cross-validation Test on Model 1: Summary of RMSE, R Squared and MAE
ggplot(data = reg1.cv$resample) +
  geom_histogram(aes(x = reg1.cv$resample$MAE), fill = '#88aab8') +
  labs(title="Distribution of Cross-validation MAE on Model 1",
       subtitle = "K = 10\n",
       caption = "Figure 5.3.1 ") +
  xlab('MAE of Model 1') +
  ylab('Count') +
  plotTheme()

reg4.cv <- 
  caret::train(Trip_Count ~ ., data = ride.panel %>% 
                 dplyr::select(Trip_Count, start_station_name,interval60, dotw , Temperature , Precipitation , Wind_Speed , Med_Inc ,  Med_Age ,  Mean_Commute_Time , Percent_Taking_Public_Trans ,  metro_nn1 , metro_nn2 , grocery_nn1 , grocery_nn2 ,
                   lagHour , lag2Hours ,lag3Hours , lag12Hours , lag1day) ,
               method = "lm", 
               trControl = fitControl, 
               na.action = na.pass)


kable(reg4.cv$resample) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 5.2. Cross-validation Test on Model 4: Summary of RMSE, R Squared and MAE\n",
           general_title= '\n')
RMSE Rsquared MAE Resample
0.0365032 0.0023376 0.0031095 Fold01
0.0502414 0.0074009 0.0038990 Fold02
0.0407899 0.0000392 0.0034911 Fold03
0.0415637 0.0000995 0.0035788 Fold04
0.0561904 0.0198370 0.0034580 Fold05
0.0417301 0.0227899 0.0034276 Fold06
0.0545024 0.0013299 0.0036813 Fold07
0.0995913 0.1083535 0.0041574 Fold08
0.0588255 0.0006548 0.0036825 Fold09
0.0571165 0.0035832 0.0035673 Fold10

Table 5.2. Cross-validation Test on Model 4: Summary of RMSE, R Squared and MAE
ggplot(data = reg4.cv$resample) +
  geom_histogram(aes(x = reg4.cv$resample$MAE), fill = '#88aab8') +
  labs(title="Distribution of Cross-validation MAE on Model 4",
       subtitle = "K = 10\n",
       caption = "Figure 5.3.2 ") +
  xlab('MAE of Model 4') +
  ylab('Count') +
  plotTheme()

VI. Conclusion

Overall, the current algorithm need refinements before applying to bike share re-balancing plan. First, it’s not clear how the COVID-19 pandemic affect users bike share trips. Before the pandemic is under control, outdoor activities were overall discouraged while people prefer cycling over walking as a safer exercise. Second, when analyzing trip count at small unit (hour), there is no sufficient data to support the analysis, most data being 1 or 0 is not helping. This ties to the third improvement, that despite the minor error rate of the model, the model is not doing a good job at predicting existing data.