In this post we’re departing a little from the replication of papers in economics and trying our hand at visualising networks followed by modelling passenger footfall across the London Underground during a week in November 2009. We’ll very briefly explore two different methods to predict passenger footfall, an LTSM neural net identical to a recent Rstudio Keras blog post and Facebook’s Prophet forecasting library. We will be visualising model results without making explicit formal comparisons between the two.

Cleaning Data

The dataset we use comes from TFL’s open data portal and represents a random sample (I hope) of TFL’s Oyster Card data during one week in November - the data provided represents approximately 5% of all journeys during this period.

First, we need to isolate Tube journeys from the Oyster Card CSV TFL provides. Next, we remove any journeys where we don’t know the start or end station and finally we remove any journeys that have an unknown start or end time.

library(readr)
library(dplyr)
library(knitr)
oyster <- read_csv('assets/Original Data/Nov09JnyExport.csv') 
oyster_underground <- oyster %>% 
  filter(SubSystem == "LUL") # LUL is the SubSystem code for Tube journeys
rm(oyster)

oyster_underground <- oyster_underground %>% 
  filter(StartStn != "Unstarted") %>% 
  filter(EndStation != "Unfinished") %>% 
  filter(!is.na(EntTimeHHMM)) %>% 
  filter(!is.na(EXTimeHHMM)) %>% 
  select(-SubSystem, -FinalProduct)

oyster_underground %>% 
  head() %>% 
  kable()
downodaytypeStartStnEndStationEntTimeEntTimeHHMMExTimeEXTimeHHMMZVPPTJNYTYPDailyCappingFFareDFareRouteID
2MonGoodge StreetTotteridge100016:40:00104117:21:00Z0110TKTN00XX
5ThuPreston RoadNorthwood100016:40:00102417:04:00Z0110TKTN00XX
5ThuHolbornBounds Green100016:40:00102817:08:00Z0104TKTN00XX
1SunEarls CourtPimlico100016:40:00102117:01:00——-PPYN160160XX
3TueVictoriaBethnal Green100016:40:00102717:07:00Z0102TKTN00XX
2MonHighburyStratford100016:40:00102017:00:00——-PPYN110110XX

Next we encode daytype as a factor with levels running from Monday to Sunday, we also add a weekday and weekend factor that will make histograms featuring aggregated data easier to order. Finally, we convert the timestamp TFL use to a more accessible format using lubridate’s ymd_hms() and create journey_time using lubridate’s interval() function.

library(forcats)
library(lubridate)
oyster_underground$daytype <- factor(oyster_underground$daytype, c("Mon",
                                                                   "Tue",
                                                                   "Wed",
                                                                   "Thu",
                                                                   "Fri",
                                                                   "Sat",
                                                                   "Sun",
                                                                   "Weekend",
                                                                   "Weekday"))

oyster_underground_lubridate <- oyster_underground %>% 
  mutate(entry_time = EntTimeHHMM %>% 
           as.POSIXct() %>% 
           ymd_hms(),
         exit_time = EXTimeHHMM %>% 
           as.POSIXct() %>% 
           ymd_hms(),
         journey_time = interval(entry_time, exit_time) %>% 
           time_length(unit = "minute")) %>% 
  filter(journey_time > 0)

Initial Visualisations

Plotly and GitHub Pages haven’t been playing together nicely recently so we’ll stick with ggplot2 for now. A binwidth of 600 seconds corresponds to ten minute intervals, geom_histogram creates a histogram corresponding to the density of time observations which is effectively a barchart of footfall against time for our purposes here.

library(ggplot2)
p <- ggplot(oyster_underground_lubridate,
            aes(x = EntTimeHHMM)) +
  geom_histogram(fill = "firebrick2", binwidth = 600, alpha = 0.2, color = "white") +
  theme_minimal() +
  xlab("Tap In Time") +
  ggtitle("Tap In Time for London Tube Stations") 
p
q <- ggplot(oyster_underground_lubridate,
            aes(x = EXTimeHHMM)) +
  geom_histogram(fill = "springgreen2", binwidth = 600, colour = "white") +
  theme_minimal() +
  ggtitle("Tap Out Time for London Tube Stations") +
  xlab("Tap Out Time")
q

Combining the information from each plot into an overlaid density plot clearly demonstrates the lead and lag time between commuters tapping in and tapping out upon completing their journey.

pq <- oyster_underground_lubridate %>% 
               select("Tap In" = EntTimeHHMM,"Tap Out" = EXTimeHHMM) %>%
               gather(type, time, "Tap In", "Tap Out") %>% 
  ggplot(aes(x = time, fill = type)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(values = c("firebrick2", "springgreen2"), "Type") +
  xlab("Time") +
  ggtitle("Tap In And Tap Out Time", subtitle = "London Tube Stations") +
  theme_minimal()
pq

Now, we bin each time period within a given day into five minute intervals and plot mean journey time as a function of tap in time. Here, we bin the intervals using two dplyr pipes and lubridate, alternatively we could have used the tibbletime library which we’ll explore briefly later.

library(RColorBrewer)
oyster_j_time <- oyster_underground_lubridate %>% 
  group_by(entry_time, daytype, downo) %>% 
  summarise(average_journey = mean(journey_time))
oyster_j_time_5 <- oyster_j_time %>% 
  mutate(five_min = round_date(entry_time, unit = "5 min")) %>% 
  group_by(five_min, daytype, downo) %>% 
  summarise(average_journey_5 = mean(average_journey))
p <- oyster_j_time_5 %>% 
  ggplot(aes(x = five_min,
             y = average_journey_5,
             colour = daytype)) +
  geom_point() +
  scale_color_brewer(palette = "OrRd", direction = -1, "Day") +
  theme_minimal() +
  xlab("Time") +
  ggtitle("Mean Journey Length For a Given Start Time") +
  ylim(0, 75) +
  labs(subtitle = "Lighter colours indicate days later in the week")
p

q <- oyster_j_time_5 %>% 
  ggplot(aes(x = five_min,
             y = average_journey_5,
             colour = daytype)) +
  geom_smooth(se = FALSE, size = 2) +
  scale_color_brewer(palette = "OrRd", direction = -1, "Day") +
  theme_minimal()
q