In this post we’re departing a little from replicating economics papers and trying our hand at visualising networks, followed by modelling passenger footfall across the London Underground during a week in November 2009. We’ll very briefly explore two different methods of predicting passenger footfall: an LSTM neural net essentially identical to one from a recent RStudio Keras blog post, and Facebook’s Prophet forecasting library. We’ll visualise the results of each model without making explicit formal comparisons between the two.

Cleaning Data

The dataset we use comes from TFL’s open data portal and represents a random sample (I hope) of TFL’s Oyster Card data during one week in November - the data provided represents approximately 5% of all journeys during this period.

First, we need to isolate Tube journeys from the Oyster Card CSV TFL provides. Next, we remove any journeys where we don’t know the start or end station and finally we remove any journeys that have an unknown start or end time.

library(readr)
library(dplyr)
library(knitr)
oyster <- read_csv('assets/Original Data/Nov09JnyExport.csv') 
oyster_underground <- oyster %>% 
  filter(SubSystem == "LUL") # LUL is the SubSystem code for Tube journeys
rm(oyster)

oyster_underground <- oyster_underground %>% 
  filter(StartStn != "Unstarted") %>%    # unknown start station
  filter(EndStation != "Unfinished") %>% # unknown end station
  filter(!is.na(EntTimeHHMM)) %>%        # missing tap-in time
  filter(!is.na(EXTimeHHMM)) %>%         # missing tap-out time
  select(-SubSystem, -FinalProduct)

oyster_underground %>% 
  head() %>% 
  kable()
| downo | daytype | StartStn | EndStation | EntTime | EntTimeHHMM | ExTime | EXTimeHHMM | ZVPPT | JNYTYP | DailyCapping | FFare | DFare | RouteID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Mon | Goodge Street | Totteridge | 1000 | 16:40:00 | 1041 | 17:21:00 | Z0110 | TKT | N | 0 | 0 | XX |
| 5 | Thu | Preston Road | Northwood | 1000 | 16:40:00 | 1024 | 17:04:00 | Z0110 | TKT | N | 0 | 0 | XX |
| 5 | Thu | Holborn | Bounds Green | 1000 | 16:40:00 | 1028 | 17:08:00 | Z0104 | TKT | N | 0 | 0 | XX |
| 1 | Sun | Earls Court | Pimlico | 1000 | 16:40:00 | 1021 | 17:01:00 | ——- | PPY | N | 160 | 160 | XX |
| 3 | Tue | Victoria | Bethnal Green | 1000 | 16:40:00 | 1027 | 17:07:00 | Z0102 | TKT | N | 0 | 0 | XX |
| 2 | Mon | Highbury | Stratford | 1000 | 16:40:00 | 1020 | 17:00:00 | ——- | PPY | N | 110 | 110 | XX |

Next we encode daytype as a factor with levels running from Monday to Sunday; we also add Weekend and Weekday levels, which will make histograms of aggregated data easier to order. Finally, we convert the timestamps TFL use into a more accessible format using lubridate’s ymd_hms() and create journey_time using lubridate’s interval() function.

library(forcats)
library(lubridate)
oyster_underground$daytype <- factor(oyster_underground$daytype, c("Mon",
                                                                   "Tue",
                                                                   "Wed",
                                                                   "Thu",
                                                                   "Fri",
                                                                   "Sat",
                                                                   "Sun",
                                                                   "Weekend",
                                                                   "Weekday"))

oyster_underground_lubridate <- oyster_underground %>% 
  mutate(entry_time = EntTimeHHMM %>% 
           as.POSIXct() %>% 
           ymd_hms(),
         exit_time = EXTimeHHMM %>% 
           as.POSIXct() %>% 
           ymd_hms(),
         journey_time = interval(entry_time, exit_time) %>% 
           time_length(unit = "minute")) %>% 
  filter(journey_time > 0) # drop journeys with non-positive durations (exit recorded at or before entry)
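Out of curiosity, here’s a quick way to see how many journeys that final filter drops - presumably these are mostly trips whose recorded exit time is at or before the entry time (journeys crossing midnight, touch-in/touch-out glitches and the like):

# the only difference between the two data frames is the journey_time > 0 filter
nrow(oyster_underground) - nrow(oyster_underground_lubridate)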

Initial Visualisations

Plotly and GitHub Pages haven’t been playing together nicely recently, so we’ll stick with ggplot2 for now. A binwidth of 600 seconds corresponds to ten-minute intervals; geom_histogram then counts tap-in observations per bin, which for our purposes is effectively a bar chart of footfall against time.

library(ggplot2)
p <- ggplot(oyster_underground_lubridate,
            aes(x = EntTimeHHMM)) +
  geom_histogram(fill = "firebrick2", binwidth = 600, alpha = 0.2, color = "white") +
  theme_minimal() +
  xlab("Tap In Time") +
  ggtitle("Tap In Time for London Tube Stations") 
p
q <- ggplot(oyster_underground_lubridate,
            aes(x = EXTimeHHMM)) +
  geom_histogram(fill = "springgreen2", binwidth = 600, colour = "white") +
  theme_minimal() +
  ggtitle("Tap Out Time for London Tube Stations") +
  xlab("Tap Out Time")
q

Combining the information from both plots into an overlaid density plot clearly shows the lag between commuters tapping in and then tapping out at the end of their journeys.

library(tidyr) # for gather(), unite() and separate()
pq <- oyster_underground_lubridate %>% 
               select("Tap In" = EntTimeHHMM,"Tap Out" = EXTimeHHMM) %>%
               gather(type, time, "Tap In", "Tap Out") %>% 
  ggplot(aes(x = time, fill = type)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(values = c("firebrick2", "springgreen2"), "Type") +
  xlab("Time") +
  ggtitle("Tap In And Tap Out Time", subtitle = "London Tube Stations") +
  theme_minimal()
pq

Now we bin each time period within a given day into five-minute intervals and plot mean journey time as a function of tap-in time. Here we bin the intervals using two dplyr pipes and lubridate’s round_date(); alternatively, we could have used the tibbletime library, which we’ll explore briefly later (a quick sketch of that approach follows the plots below).

library(RColorBrewer)
oyster_j_time <- oyster_underground_lubridate %>% 
  group_by(entry_time, daytype, downo) %>% 
  summarise(average_journey = mean(journey_time))
oyster_j_time_5 <- oyster_j_time %>% 
  mutate(five_min = round_date(entry_time, unit = "5 min")) %>% 
  group_by(five_min, daytype, downo) %>% 
  summarise(average_journey_5 = mean(average_journey))
p <- oyster_j_time_5 %>% 
  ggplot(aes(x = five_min,
             y = average_journey_5,
             colour = daytype)) +
  geom_point() +
  scale_color_brewer(palette = "OrRd", direction = -1, "Day") +
  theme_minimal() +
  xlab("Time") +
  ggtitle("Mean Journey Length For a Given Start Time") +
  ylim(0, 75) +
  labs(subtitle = "Lighter colours indicate days later in the week")
p

q <- oyster_j_time_5 %>% 
  ggplot(aes(x = five_min,
             y = average_journey_5,
             colour = daytype)) +
  geom_smooth(se = FALSE, size = 2) +
  scale_color_brewer(palette = "OrRd", direction = -1, "Day") +
  theme_minimal()
q
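As an aside, here’s a rough sketch of the tibbletime alternative mentioned above. Note that it isn’t a drop-in replacement: collapse_by() floors each timestamp to the start of its five-minute period rather than rounding to the nearest one, and here we average the raw journey times directly instead of taking a mean of per-minute means, so the numbers will differ slightly.

library(tibbletime)
oyster_j_time_5_tt <- oyster_underground_lubridate %>% 
  select(entry_time, daytype, downo, journey_time) %>% 
  arrange(entry_time) %>%         # tbl_time expects an ordered index
  as_tbl_time(index = entry_time) %>% 
  collapse_by("5 minute") %>%     # floor entry_time to its five-minute bin
  group_by(entry_time, daytype, downo) %>% 
  summarise(average_journey_5 = mean(journey_time))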

The LOESS curves show a little more clearly that weekends seem to have higher average journey times. This is somewhat counter-intuitive: at first glance we’d expect weekday journeys to be longer because of rush-hour crowding. However, a simple explanation (and a classic confounder relationship) is that journey purposes differ systematically between weekdays and weekends - short commutes to work will have different mean journey times than weekend sightseeing trips, for instance.

We can test this a little more formally by creating a weekend dummy and regressing average journey time on the five-minute bin and that dummy. We throw away a lot of information by averaging journey times within each bin, but this is only meant as a quick and dirty check anyway:

library(broom)
library(sandwich)
library(lmtest)
oyster_j_time_5 <- oyster_j_time_5 %>% 
  mutate(weeknd = ifelse((daytype == "Sat") | (daytype == "Sun"), 1, 0))
model <- lm(average_journey_5 ~ five_min + weeknd,
            data = oyster_j_time_5)
results <- coeftest(model, vcov = vcovHAC) %>% tidy()
results %>% kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 33.715 | 2.222 | 15.171 | 0 |
| five_min | 0.000 | 0.000 | -3.516 | 0 |
| weeknd | 1.751 | 0.317 | 5.520 | 0 |

Network Visualisations

Now we move on to visualising the network aspect of London’s Tube data. Here we group_by StartStn and EndStation and compute the number of journeys, mean journey length and mean fare for each station pair.

stations_data <- oyster_underground_lubridate %>% 
  group_by(StartStn, EndStation) %>% 
  summarise(n = n(),
            ave_time = mean(journey_time, na.rm = TRUE),
            FFare_mean = mean(FFare, na.rm = TRUE)) %>% 
  arrange(-n)

Next, we identify the 15 largest station pairings by journey count and keep only journeys that start or end at one of these “hub” stations (and that have more than 250 recorded trips). We also add a random normal variable x and use it to order the dataframe so that large stations aren’t bunched together in our plots.

Then, we use tidygraph and ggraph to plot a circular network graph.

library(tidygraph)
library(ggraph)
stations_filtered <- stations_data %>%
  ungroup() %>% 
  top_n(wt = n, 15) %>% 
  select(StartStn, EndStation) %>% 
  gather(type, station) %>% 
  select(station)
station_data_filtered <- stations_data %>% 
  mutate(x = rnorm(n())) %>% 
  arrange(x) %>% 
  filter((StartStn %in% stations_filtered$station) | (EndStation %in% stations_filtered$station)) %>% 
  filter(n > 250)
oyster_graph <- as_tbl_graph(station_data_filtered, directed  = TRUE)
ggraph(oyster_graph, layout = "linear", circular = TRUE) + 
    geom_edge_arc(aes(width = n, color = n, alpha = n,
                      start_cap = label_rect(node1.name),
                      end_cap = label_rect(node2.name))) +
    guides(edge_width = 'none', edge_color = "none") +
    geom_node_text(aes(label = name), size = 3, repel = FALSE, color = "white") +
  ggtitle("Most Common Tube Journeys") +
  scale_edge_colour_distiller(palette = "Reds", direction = 1) +
  theme_graph(background = 'grey20', text_colour = 'white') +
  theme(legend.position = "none")

Here we use a different type of network graph to display the directed relationships in the data. Even with an extra-large plot it’s still a little messy.

ggraph(oyster_graph, layout = "fr") + 
    geom_edge_fan(aes(width = n, color = n, alpha = n,
                      start_cap = label_rect(node1.name),
                       end_cap = label_rect(node2.name)),
                  arrow = arrow(length = unit(1.5, 'mm'))) + 
    guides(edge_width = "none",
           edge_colour = "none",
           alpha = "none") +
    geom_node_text(aes(label = name), size = 3, repel = FALSE, colour = "white") +
    scale_edge_colour_distiller(palette = "Reds", direction = 1) +
    theme_graph(background = "grey20", text_colour = "white") +
  theme(legend.position = "none")

[EDIT: I’m not sure what’s going on with the large connection that seems to point into empty space in the south-west quadrant - it doesn’t show up in RStudio’s viewer locally.]

I’m not sure if this one is informative but it looks funky I guess.

ggraph(oyster_graph %>% 
         activate(edges) %>% 
         mutate(mean_journey_length = mean(ave_time),
                short_journey = ifelse((ave_time < mean_journey_length),
                                       1, 0)), layout = "grid") + 
    geom_edge_fan(aes(width = n, color = n),
                  spread = 5,
                  arrow = arrow(length = unit(2, 'mm')), 
                  alpha = 0.5,
                   end_cap = circle(2, 'mm')) + 
    guides(edge_width = 'none',
           edge_colour = "none") +
    geom_node_text(aes(label = name), size = 3, repel = FALSE) +
    scale_edge_colour_distiller(palette = "Reds", direction = 1) +
    theme_graph()

Visualising Footfall

In this section we’ll explore how to predict Tube footfall using a couple of different techniques; rather than drilling down into one particular method, we’ll briefly compare model performance.

First, however, some more visualisations:

library(magrittr)
library(scales)
wknd <- oyster_underground_lubridate %>% 
  mutate(daytype = ifelse((daytype == "Sat") | (daytype == "Sun"), "Weekend", "Weekday"))
wknd$daytype <- factor(wknd$daytype,
                 c("Mon",
                   "Tue",
                   "Wed",
                   "Thu",
                   "Fri",
                   "Sat",
                   "Sun",
                   "Weekend",
                   "Weekday"))
  
duplicated_data <- bind_rows(oyster_underground_lubridate, wknd)
duplicated_data %>% 
  filter(StartStn == "Waterloo JLE") %>% 
  ggplot(aes(x = EntTimeHHMM %>% as.POSIXct(), fill = daytype)) +
  geom_histogram(colour = "black", bins = 24) +
  facet_wrap(~daytype, scale = "free_y") +
  theme_minimal() +
  guides(fill = "none") +
  scale_fill_brewer(palette = "Set3", direction = -1, "Day") +
  theme(axis.text.x = element_text(angle = -90, hjust = 0)) +
  scale_x_time(labels = date_format("%H:%M")) +
  xlab("Hour of Day") +
  labs(title = "Footfall Per Hour",
       subtitle = "Waterloo Tube Station",
       caption = "Note: Rescaled y-axes")

Above, we use duplicated_data so that each journey appears twice in the dataset: once under its actual day of the week and again under an aggregated weekday/weekend label - this makes facet_wrap easier to use.

duplicated_data %>% 
  ggplot(aes(x = EntTimeHHMM %>% as.POSIXct(), fill = daytype)) +
  geom_histogram(colour = "black", bins = 24) +
  facet_wrap(~daytype, scale = "free_y") +
  theme_minimal() +
  guides(fill = "none") +
  scale_fill_brewer(palette = "Set3", direction = -1, "Day") +
  theme(axis.text.x = element_text(angle = -90, hjust = 0)) +
  scale_x_time(labels = date_format("%H:%M")) +
  xlab("Hour of Day") +
  labs(title = "Footfall Per Hour",
       subtitle = "All Stations",
       caption = "Note: Rescaled y-axes")

Time Series Creation

Now we ‘knit’ the disparate day and time observations together into a single time series. We need to do this because TFL record only a day-of-week number and a time of day for each journey rather than a full date, and there is no explicit observation of 0 footfall when a station is closed, which is what we’d like for modelling.

To do this we first exploit a little trick of unite() that lets us choose the separator placed between the united columns - we use “-11-09 ” as the separator since this, combined with the zero-padded day-of-week number in downo, gives us a full datetime variable1.

oyster_knit <- oyster_underground %>%
  filter(!(is.na(EntTimeHHMM) & is.na(EXTimeHHMM))) %>% 
  mutate(downo = sub("^", "0", downo)) %>% 
  unite(col = entry_datetime, sep = "-11-09 ", downo, EntTimeHHMM, remove = FALSE)
oyster_knit <- oyster_knit %>% 
  unite(col = exit_datetime, sep = "-11-09 ", downo, EXTimeHHMM, remove = FALSE)
  
oyster_dates <- oyster_knit %>% 
  mutate(entry_datetime = dmy_hms(entry_datetime),
         exit_datetime = dmy_hms(exit_datetime))
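As a quick sanity check on the trick (see also the footnote), we can confirm that 1 November 2009 really was a Sunday, so downo equal to 1 pastes into a valid date:

wday(ymd("2009-11-01"), label = TRUE) # should print Sun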

Next we aggregate observations by time and date using group_by, followed by as_tbl_time to create a “time-aware” tibble. Once we’ve done this we find the earliest and latest datetimes and create an empty one-minute series covering the full range, which we join onto our Oyster data, filling in 0 footfall for any minute with no recorded journeys.

library(tibbletime)
pooled_data <- oyster_dates %>% 
  group_by(entry_datetime) %>% 
  summarise(y = n(),
            daytype = first(daytype),
            fare = mean(DFare, na.rm = TRUE)) %>% 
  rename(date = entry_datetime) %>% 
  as_tbl_time(index = date)

earliest_time <- pooled_data$date %>% min()
latest_time <- pooled_data$date %>% max()

empty_ts <- create_series(earliest_time ~ latest_time, "1 minutes")
pooled_time_data <- left_join(empty_ts, pooled_data, by = "date") %>% 
  mutate(y = ifelse(is.na(y), 0, y),
         fare = ifelse(is.na(fare), 0, fare),
         daytype = wday(date, label = TRUE))

pooled_weekday <- pooled_time_data %>% 
  filter(!((daytype == "Sat") | (daytype == "Sun"))) %>% 
  select(index = date, y) %>% 
  as_tbl_time(index = index)

pooled_time_data %>% 
  ggplot(aes(x = date,
             y = y,
             colour = y)) + 
  geom_point() +
  theme_minimal() +
  ylab("Footfall") +
  xlab("Date") +
  ggtitle("Total Tube Station Footfall")

Modelling Footfall

In this section we’re going to try to predict weekday footfall using two methods. First, we’ll use an LSTM neural net - we’ll draw heavily on RStudio’s Keras blog post here. We’re going to fit the neural net with minimal tweaking or tuning and visually inspect the results rather than formally comparing it with alternative methods. Second, we’ll use Facebook’s Prophet library to estimate footfall too. Both models are used pretty much straight out of the box - this is definitely a brief overview rather than an in-depth exploration.2

Much of the code below is almost a verbatim copy from the linked article above, so be sure to check that out for a thorough explanation of what’s happening here:

library(tidyquant)
library(timetk)
library(glue)
library(cowplot)
library(rsample)
library(yardstick)
library(recipes)

periods_train <- 60 * 24 * 2
periods_test <- 60 * 24 * 1.5
skip_span <- 60 * 6

rolling_origin_resamples <- rolling_origin(pooled_weekday,
                                           initial = periods_train,
                                           assess = periods_test,
                                           cumulative = FALSE,
                                           skip = skip_span)

plot_split <- function(split,
                       expand_y_axis = TRUE,
                       alpha = 1,
                       size = 1,
                       base_size = 14,
                       main_data = NULL){
  train_tbl <- training(split) %>% 
    add_column(key = "training")
  
  test_tbl <- testing(split) %>% 
    add_column(key = "testing")
  
  data_manipulated <- bind_rows(train_tbl,
                                test_tbl) %>% 
    as_tbl_time(index = index) %>% 
    mutate(key = fct_relevel(key,
                             "training",
                             "testing"))
  
  train_time_summary <- train_tbl %>% 
    tk_index() %>% 
    tk_get_timeseries_summary()
  
  test_time_summary <- test_tbl %>% 
    tk_index() %>% 
    tk_get_timeseries_summary()

  g <- data_manipulated %>% 
    ggplot(aes(x = index, y = y,
               color = key)) +
    geom_line(size = size,
              alpha = alpha) +
    theme_tq(base_size) +
    scale_color_tq() +
    labs(title = glue("Split: {split$id}"),
         subtitle = glue("{train_time_summary$start} to ",
                         "{test_time_summary$end}"),
         y = "",
         x = "") +
    theme(legend.position = "none")
  
  if (expand_y_axis == TRUE){
    time_summary <- main_data %>% 
      tk_index() %>% 
      tk_get_timeseries_summary()
    
    g <- g +
      scale_x_datetime(limits = c(time_summary$start,
                              time_summary$end))
  }
  
  g
}

rolling_origin_resamples$splits[[1]] %>% 
  plot_split(expand_y_axis = TRUE, main_data = pooled_weekday) +
  theme(legend.position = "bottom")

Now we plot the entire sampling plan:

plot_sampling_plan <- function(sampling_tbl,
                               expand_y_axis = TRUE,
                               main_data = NULL,
                               ncol = 3,
                               alpha = 1,
                               size = 1,
                               base_size = 14,
                               title = "Sampling Plan"){
  
  sampling_tbl_with_plots <- sampling_tbl %>% 
    mutate(gg_plots = map(splits,
                          plot_split,
                          expand_y_axis = expand_y_axis,
                          main_data = main_data,
                          alpha = alpha,
                          base_size = base_size))
  
  plot_list <- sampling_tbl_with_plots$gg_plots
  
  p_temp <- plot_list[[1]] + theme(legend.position = "bottom")
  legend <- get_legend(p_temp)
  
  p_body <- plot_grid(plotlist = plot_list,
                      ncol = ncol)
  p_title <- ggdraw() +
    draw_label(title,
               size = 14,
               fontface =  "bold",
               colour = palette_light()[[1]])
  g <- plot_grid(p_title,
                 p_body,
                 legend,
                 ncol = 1,
                 rel_heights =  c(0.05, 1, 0.05))
  g
}

rolling_origin_resamples %>% 
  plot_sampling_plan(main_data = pooled_weekday,
                     ncol = 3)

Originally I’d intended to use cross-validation and fit six LSTM models; however, this became far too time-consuming, and on reflection each fold is very similar to the others since we only have one week’s worth of observations. Therefore we’ll use only the initial example split for the rest of this post.

Single Split Model

Creating the single example split to train the model:

example_split    <- rolling_origin_resamples$splits[[1]]
example_split_id <- rolling_origin_resamples$id[[1]]

df_trn <- analysis(example_split)[1:2000, , drop = FALSE]
df_val <- analysis(example_split)[2001:2880, , drop = FALSE]
df_tst <- assessment(example_split)


df <- bind_rows(df_trn %>% add_column(key = "training"),
                df_val %>% add_column(key = "validation"),
                df_tst %>% add_column(key = "testing")) %>% 
  as_tbl_time(index = index)


df %>%
  head() %>% 
  kable()
| index | y | key |
|---|---|---|
| 2009-11-02 00:00:00 | 0 | training |
| 2009-11-02 00:01:00 | 0 | training |
| 2009-11-02 00:02:00 | 0 | training |
| 2009-11-02 00:03:00 | 0 | training |
| 2009-11-02 00:04:00 | 0 | training |
| 2009-11-02 00:05:00 | 0 | training |

Here we use the recipes library to square-root transform, centre and scale the data, being sure to save the centring mean and scaling standard deviation for later:

rec_obj <- recipe(y ~ ., df) %>% 
  step_sqrt(y) %>% 
  step_center(y) %>% 
  step_scale(y) %>% 
  prep()
df_processed_tbl <- bake(rec_obj,
                           df)

center_history <- rec_obj$steps[[2]]$means["y"]
scale_history <- rec_obj$steps[[3]]$sds["y"]

This code block transforms the data into the 3-D arrays (“tensors”) required by Keras/TensorFlow - this section is a little intense:

n_timesteps <- 60
n_predictions <- n_timesteps
batch_size <- 10

build_matrix <- function(tseries, overall_timesteps) {
  t(sapply(1:(length(tseries) - overall_timesteps + 1), function(x) 
    tseries[x:(x + overall_timesteps - 1)]))
}

reshape_X_3d <- function(X){
  dim(X) <- c(dim(X)[1], dim(X)[2], 1)
  X
}

train_vals <- df_processed_tbl %>% 
  filter(key == "training") %>% 
  select(y) %>% 
  pull()
valid_vals <- df_processed_tbl %>% 
  filter(key == "validation") %>% 
  select(y) %>% 
  pull()
test_vals <- df_processed_tbl %>% 
  filter(key == "testing") %>% 
  select(y) %>%
  pull()


train_matrix <-
  build_matrix(train_vals, n_timesteps + n_predictions)
valid_matrix <-
  build_matrix(valid_vals, n_timesteps + n_predictions)
test_matrix <- build_matrix(test_vals, n_timesteps + n_predictions)


X_train <- train_matrix[, 1:n_timesteps]
y_train <- train_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_train <- X_train[1:(nrow(X_train) %/% batch_size * batch_size), ]
y_train <- y_train[1:(nrow(y_train) %/% batch_size * batch_size), ]

X_valid <- valid_matrix[, 1:n_timesteps]
y_valid <- valid_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_valid <- X_valid[1:(nrow(X_valid) %/% batch_size * batch_size), ]
y_valid <- y_valid[1:(nrow(y_valid) %/% batch_size * batch_size), ]

X_test <- test_matrix[, 1:n_timesteps]
y_test <- test_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_test <- X_test[1:(nrow(X_test) %/% batch_size * batch_size), ]
y_test <- y_test[1:(nrow(y_test) %/% batch_size * batch_size), ]

X_train <- reshape_X_3d(X_train)
X_valid <- reshape_X_3d(X_valid)
X_test <- reshape_X_3d(X_test)

y_train <- reshape_X_3d(y_train)
y_valid <- reshape_X_3d(y_valid)
y_test <- reshape_X_3d(y_test)
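It’s worth double-checking that the reshaping produced the 3-D arrays Keras expects, i.e. (samples, timesteps, features) for the inputs and likewise for the sequence targets:

# both should have three dimensions: samples x timesteps x features
dim(X_train)
dim(y_train)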

Here we specify the model parameters. I’ve left the original author’s comments in place, as the code is less self-explanatory - again, I recommend reading the original here.

library(keras)
library(tfruns)

FLAGS <- flags(
  # There is a so-called "stateful LSTM" in Keras. While LSTM is stateful
  # per se, this adds a further tweak where the hidden states get 
  # initialized with values from the item at same position in the previous
  # batch. This is helpful just under specific circumstances, or if you want
  # to create an "infinite stream" of states, in which case you'd use 1 as 
  # the batch size. Below, we show how the code would have to be changed to
  # use this, but it won't be further discussed here.
  flag_boolean("stateful", FALSE),
  # Should we use several layers of LSTM?
  # Again, just included for completeness, it did not yield any superior 
  # performance on this task.
  # This will actually stack exactly one additional layer of LSTM units.
  flag_boolean("stack_layers", FALSE),
  # number of samples fed to the model in one go
  flag_integer("batch_size", 10),
  # size of the hidden state, equals size of predictions
  flag_integer("n_timesteps", 60),
  # how many epochs to train for
  flag_integer("n_epochs", 100),
  # fraction of the units to drop for the linear transformation of the inputs
  flag_numeric("dropout", 0.2),
  # fraction of the units to drop for the linear transformation of the 
  # recurrent state
  flag_numeric("recurrent_dropout", 0.2),
  # loss function. Found to work better for this specific case than mean
  # squared error
  flag_string("loss", "logcosh"),
  # optimizer = stochastic gradient descent. Seemed to work better than adam 
  # or rmsprop here (as indicated by limited testing)
  flag_string("optimizer_type", "sgd"),
  # size of the LSTM layer
  flag_integer("n_units", 128),
  # learning rate
  flag_numeric("lr", 0.003),
  # momentum, an additional parameter to the SGD optimizer
  flag_numeric("momentum", 0.9),
  # parameter to the early stopping callback
  flag_integer("patience", 10)
)

# the number of predictions we'll make equals the length of the hidden state
n_predictions <- FLAGS$n_timesteps
# how many features = predictors we have
n_features <- 1
# just in case we wanted to try different optimizers, we could add here
optimizer <- switch(FLAGS$optimizer_type,
                    sgd = optimizer_sgd(lr = FLAGS$lr, 
                                        momentum = FLAGS$momentum)
                    )

# callbacks to be passed to the fit() function
# We just use one here: we may stop before n_epochs if the loss on the
# validation set does not decrease (by a configurable amount, over a 
# configurable time)
callbacks <- list(
  callback_early_stopping(patience = FLAGS$patience)
)

Finally, we fit the model:

# create the model
model <- keras_model_sequential()

# add layers
# we have just two, the LSTM and the time_distributed 
model %>%
  layer_lstm(
    units = FLAGS$n_units, 
    # the first layer in a model needs to know the shape of the input data
    batch_input_shape  = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features),
    dropout = FLAGS$dropout,
    recurrent_dropout = FLAGS$recurrent_dropout,
    # by default, an LSTM just returns the final state
    return_sequences = TRUE
  ) %>% 
  time_distributed(layer_dense(units = 1))

model %>%
  compile(
    loss = FLAGS$loss,
    optimizer = optimizer,
    # in addition to the loss, Keras will inform us about current 
    # MSE while training
    metrics = list("mean_absolute_error")
  )

history <- model %>% fit(
  x          = X_train,
  y          = y_train,
  validation_data = list(X_valid, y_valid),
  batch_size = FLAGS$batch_size,
  epochs     = FLAGS$n_epochs,
  callbacks = callbacks
)
plot(history, metrics = "loss")

Model fitting is pretty fast on my laptop, but adding extra LSTM layers and cross-fold validation quickly became prohibitively time-intensive. Initially I was puzzled that training loss exceeded validation loss; a quick Google suggests this indicates the training set is harder to predict than the validation set. I think the zero-footfall minutes during closing hours are causing this, rather than any coding or modelling error on my part, but I could be very wrong.
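One quick, rough way to probe that hunch is to compare the share of zero-footfall minutes across the three splits (df is the combined training/validation/testing tibble built earlier):

# proportion of minutes with zero footfall in each split
df %>% 
  group_by(key) %>% 
  summarise(prop_zero = mean(y == 0),
            n_obs = n())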

This code block creates a wide dataset of rolling predictions from the training set which we’ll use shortly to calculate the model’s error as well as plot predictions:

pred_train <- model %>%
  predict(X_train, batch_size = FLAGS$batch_size) %>%
  .[, , 1]
# Retransform values to original scale
pred_train <- (pred_train * scale_history + center_history) ^2
compare_train <- df %>% filter(key == "training")

# build a dataframe that has both actual and predicted values
for (i in 1:nrow(pred_train)) {
  varname <- paste0("pred_train", i)
  compare_train <-
    mutate(compare_train,!!varname := c(
      rep(NA, FLAGS$n_timesteps + i - 1),
      pred_train[i,],
      rep(NA, nrow(compare_train) - FLAGS$n_timesteps * 2 - i + 1)
    ))
}

Here we calculate the model’s root mean squared error on the training data - note the use of quo() and !!, which are needed to work with the tidyverse’s quasiquotation approach.

coln <- colnames(compare_train)[4:ncol(compare_train)]
cols <- map(coln, quo(sym(.)))
rsme_train <-
  map_dbl(cols, function(col)
    rmse(
      compare_train,
      truth = y,
      estimate = !!col,
      na.rm = TRUE
    )) %>% mean()

rsme_train
[1] 29.35666

I depart from the RStudio blog post substantially here and transform the data into a long format, which makes plotting significantly easier. I’m quite lazy and use separate() rather than gsub() and a regex to strip the unwanted text from the column names.

compare_train_long <- compare_train %>%
  gather(key, val, pred_train1:pred_train1880) %>% 
  separate(key,into =  c("fluff", "train_number"), sep = "pred_train") %>% 
  select(-fluff)
compare_train_long$train_number <- as.numeric(compare_train_long$train_number)
compare_train_long_filtered <- compare_train_long %>% 
  filter(train_number%%120 == 0)

compare_train_long_filtered$train_number <- factor(compare_train_long_filtered$train_number)

p <- ggplot(compare_train_long_filtered,
            aes(x = index, y = y)) +
  geom_line(alpha = 0.1) +
  geom_line(data = compare_train_long_filtered %>% na.omit(),
            aes(x = index, y = val, group = train_number, colour = train_number)) +
  scale_color_discrete("Dark2") +
  guides(color = "none") +
  theme_minimal() +
    theme(legend.position = "none") +
  ggtitle("Training Set Predictions")

p

Visually, it looks like the model does a pretty good job of predicting the next hour’s footfall from current footfall. It’s a little disappointing that the neural net doesn’t manage to correctly predict zero footfall - I imagine a Poisson or negative binomial loss function would really help here.
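To sketch what that might look like - purely illustrative, and not something run for this post - Keras ships a built-in “poisson” loss. Using it sensibly would mean fitting on the raw, untransformed footfall counts and forcing non-negative predictions, for example with a softplus activation on the output layer:

# hypothetical sketch: same LSTM architecture, softplus output, Poisson loss on raw counts
count_model <- keras_model_sequential()
count_model %>%
  layer_lstm(
    units = FLAGS$n_units,
    batch_input_shape = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features),
    dropout = FLAGS$dropout,
    recurrent_dropout = FLAGS$recurrent_dropout,
    return_sequences = TRUE
  ) %>% 
  time_distributed(layer_dense(units = 1, activation = "softplus"))

count_model %>%
  compile(
    loss = "poisson",
    optimizer = optimizer_sgd(lr = FLAGS$lr, momentum = FLAGS$momentum),
    metrics = list("mean_absolute_error")
  )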

Below we do the same as above but with the test data.

pred_test <- model %>%
  predict(X_test, batch_size = FLAGS$batch_size) %>%
  .[, , 1]

# Retransform values to original scale
pred_test <- (pred_test * scale_history + center_history) ^2
compare_test <- df %>% filter(key == "testing")

# build a dataframe that has both actual and predicted values
for (i in 1:nrow(pred_test)) {
  varname <- paste0("pred_test", i)
  compare_test <-
    mutate(compare_test,!!varname := c(
      rep(NA, FLAGS$n_timesteps + i - 1),
      pred_test[i,],
      rep(NA, nrow(compare_test) - FLAGS$n_timesteps * 2 - i + 1)
    ))
}

coln <- colnames(compare_test)[4:ncol(compare_test)]
cols <- map(coln, quo(sym(.)))
rsme_test <-
  map_dbl(cols, function(col)
    rmse(
      compare_test,
      truth = y,
      estimate = !!col,
      na.rm = TRUE
    )) %>% mean()

rsme_test
[1] 31.90933

Testing error doesn’t look wildly different from training error on the subset of predictions we’ve selected to display:

compare_test_long <- compare_test %>%
  gather(key, val, pred_test1:pred_test2040) %>% 
  separate(key,into =  c("fluff", "test_number"), sep = "pred_test") %>% 
  select(-fluff)


compare_test_long$test_number <- as.numeric(compare_test_long$test_number)
compare_test_long_filtered <- compare_test_long %>% 
  filter(test_number%%120 == 0)
compare_test_long_filtered$test_number <- factor(compare_test_long_filtered$test_number)


p <- ggplot(compare_test_long_filtered,
            aes(x = index, y = y)) +
  geom_line(alpha = 0.1) +
  geom_line(data = compare_test_long_filtered %>% na.omit(),
            aes(x = index, y = val, group = test_number, colour = test_number)) +
  scale_color_discrete("Dark2") +
  guides(color = "none") +
  theme_minimal() +
    theme(legend.position = "none") +
  ggtitle("Testing Set Predictions")

p

So the LSTM model did a pretty good job of capturing some of the characteristics of Tube footfall: the bimodal shape of footfall over the day was modelled to some degree, and lower footfall during station closing hours was visible in the predictions.

Prophet

In this section we’ll use Facebook’s Prophet library to model footfall. Prophet’s design philosophy is to make time series forecasting easy for smaller companies and those less experienced with time series analysis, and it’s built on top of Stan. Unfortunately, a trade-off of Prophet’s incredible ease of use is that we can’t fiddle with all of Stan’s nuts and bolts - for instance, using a Poisson likelihood as mentioned earlier.

library(prophet)
options(mc.cores = parallel::detectCores())
prophet_train <- df %>% 
  filter(key != "testing") %>% 
  rename(ds = index) %>% 
  select(-key)
prophet_test <- df %>% 
  filter(key == "testing") %>% 
  rename(ds = index, y_test = y) %>% 
  select(-key)

model <- prophet(prophet_train,
                 daily.seasonality = TRUE,
                 changepoint.prior.scale = 0.01)
Initial log joint probability = -168.746
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
future_df <- make_future_dataframe(model, freq = 60,
                                   periods = 60*24)
forecast <- predict(model, future_df)
forecast_subset <- forecast %>%
  as_tibble() %>% 
  select(ds, yhat)

train_and_prediction <- left_join(forecast_subset, prophet_train, by = "ds")
all_prophet_data <- left_join(train_and_prediction, prophet_test, by = "ds")
all_prophet_data_long <- all_prophet_data %>% 
  gather(type, y, -ds)

ggplot(all_prophet_data_long %>% filter(type != "yhat"), aes(x = ds, y = y, colour = type)) +
  # geom_ribbon(alpha = 0.1) +
  geom_point(alpha = 0.1) +
  theme_minimal() +
  geom_line(data = all_prophet_data_long %>% filter(type == "yhat"), aes(x = ds, y = y), size = 1)

Like the LSTM model, Prophet does an okay job of capturing the twin rush-hour peaks, although the LSTM model seems to get a better handle on their magnitude. Again, the model struggles with zero footfall; adding a dummy regressor for the midnight/early-morning hours would probably help here.
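A hedged sketch of that regressor idea using Prophet’s add_regressor() (illustrative only - the closing-hours window is a guess, and the same closed_hours column would also need adding to future_df before predicting):

# rough "station closed" flag as an extra regressor; regressors must be added before fitting
prophet_train_reg <- prophet_train %>% 
  mutate(closed_hours = as.numeric(hour(ds) %in% 1:4))

model_reg <- prophet(daily.seasonality = TRUE,
                     changepoint.prior.scale = 0.01)
model_reg <- add_regressor(model_reg, "closed_hours")
model_reg <- fit.prophet(model_reg, prophet_train_reg)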

Another feature of Prophet is that we can plot each trend component. The plot below clearly displays the rush hour trend as well as a slight increase in passenger numbers as we move from Monday to Friday:

prophet_plot_components(model, forecast)

Prophet also has built-in cross-validation features. We ape the neural net set-up by feeding in an initial training period of an hour in order to predict the next hour; however, I think Prophet uses a cumulative training set rather than discrete training splits, which renders the already fuzzy comparison between the two models even more otiose.

df_cv <- cross_validation(model, horizon = 60, initial = 60, units = "mins")
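Prophet can also tabulate these errors with performance_metrics(), which is handy if you want the numbers rather than a plot:

cv_metrics <- performance_metrics(df_cv) # RMSE, MAE, etc. by forecast horizon
head(cv_metrics)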

When we plot the cross-validated root mean squared error we find, unsurprisingly, that model error increases as the prediction horizon lengthens:

plot_cross_validation_metric(df_cv, metric = "rmse")

Conclusion

We’ve briefly explored a few aspects of TFL’s Tube data using ggraph and tidygraph network visualisations, as well as toying with Facebook’s Prophet library and Keras. Clearly the models we’ve developed have a great deal of room for improvement - this was far more of a learning experience for me than previous posts - but on the whole I think we managed to capture the idiosyncrasies of the data pretty well.

My next post will probably return to Mankiw, Romer and Weil’s empirics of growth paper, using Bayesian hierarchical models to estimate Solow model parameters as an alternative to the separate-regressions framework we’ve used so far.


  1. This only works because Sunday is recorded as downo equal to 1 and the first week of November 2009 also started on a Sunday; otherwise we’d have to shuffle the downo values to reflect the actual day of the week in November 2009.

  2. For instance, I think there are far better ways to model this data, and multiple improvements we could make, such as a Poisson loss metric/likelihood.