The LATE theorem is a cornerstone of identification in economics, and labour economists in particular have popularised and preached its virtues. Given the theorem's importance, relatively little effort has been expended on ensuring its assumptions hold in the dataset at hand. This essay promotes a practical methodology that can be incorporated into the standard instrumental variables workflow to help researchers verify empirically that necessary, though not sufficient, conditions are met in their sample.

Checking that the first stage “goes the right way” through the use of interaction terms or subsampling is common knowledge amongst applied practitioners but rarely discussed in depth. I formalise this notion and present some novel contributions of potential interest to social science researchers, using saturated first stage regressions and the negative log \(p\)-value transforms popularised by genome-wide association studies (GWAS) to present visually striking evidence for or against monotonicity. Next, I show how advances in causal econometric inference using machine learning techniques, developed by Wager and Athey (2015), can be adapted to estimate the underlying first-stage heterogeneity and provide an alternative to the saturated regression model. I explore test power in simulations under a range of data generating processes and find promising results; test power approaches 100% in large datasets as defiance nears a third of the population.

Finally, I take the tests to the data by applying the methodologies to several seminal papers in labour economics and find scant evidence of defiance. Comparing model results graphically provides some illumination and identifies likely defier characteristics, which could be of interest to policy makers and applied practitioners, but ultimately the null hypothesis of zero defiance cannot be rejected in any of the papers considered.

library(Rfast)
library(margins)
library(dplyr)
library(ggplot2)
library(ggExtra)
library(grf)
library(tibble)
library(AER)
library(stargazer)
library(purrr)
library(sandwich)
library(scales)
library(tidyr)
library(gt)
library(haven)
library(broom)
library(StepwiseTest)
library(modelr)
library(furrr)
set.seed(1234)

## Common transformation used to rescale graphics in the essay
negative_log_trans <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  trans_new(paste0("negative_log-", format(base)), trans, inv, 
            log_breaks(base = base), 
            domain = c(1e-100, Inf))
}
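
A typical use, mirroring the power simulation figures later in the essay, is to place a plot’s \(y\)-axis on a negative log scale (here `p` is an assumed, pre-existing ggplot object):

# Example usage: display p-values on a negative log(10) y-axis
p + scale_y_continuous(trans = negative_log_trans(10))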

The LATE Theorem

The LATE theorem (G. W. Imbens and Angrist 1994) describes four key assumptions under which instrumental variables regression will identify a causal effect for a specific subpopulation, “compliers”, in the presence of heterogeneous effects.

Using the random coefficient notation of Mostly Harmless Econometrics (Angrist and Pischke 2009) there is a potential outcome \(Y_i(d, z)\) corresponding to an individual \(i\) with treatment status \(D_i = d\) and instrument value \(Z_i = z\). Observed treatment status is:

\[ D_i = D_{0i} + (D_{1i} - D_{0i})Z_i = \pi_0 + \pi_{1i}Z_i + \epsilon_i \]

where \(\pi_{1i} = (D_{1i} - D_{0i})\) is the heterogeneous causal effect of \(Z_i\) on \(D_i\). The LATE assumptions are:

  1. Independence. \(\{Y_i(D_{1i}, 1), Y_i(D_{0i}, 0), D_{1i}, D_{0i}\}\ \amalg \ Z_i\).
  2. Exclusion. \(Y_i(d, 0) = Y_i(d, 1) = Y_{di} \ \text{for} \ d = 0,1\).
  3. First-Stage. \(E[D_{1i} - D_{0i}] \neq 0\).
  4. Monotonicity. \(D_{1i} - D_{0i} \geq 0 \ \forall \ i \ \text{or vice versa.}\)1

Under assumptions 1-4:

\[ \frac{E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0]}{E[D_i | Z_i = 1] - E[D_i | Z_i = 0]} = E[Y_{1i} - Y_{0i} | D_{1i} > D_{0i}] = E[\rho_i | \pi_{1i} > 0] \]

The Wald estimator is the Local Average Treatment Effect (LATE) - that is, the average treatment effect for the compliers. The monotonicity assumption is key because it rules out “defiers”: those who would have taken the treatment had they not been assigned the instrument but, upon assignment, are induced not to take it. In the presence of defiers the equation above no longer estimates LATE but some weighted average of the effect for compliers and defiers.
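
To see why, note that under independence and exclusion alone the numerator and denominator of the Wald estimand decompose over compliers and defiers. Writing \(P_c = P(\pi_{1i} > 0)\) and \(P_d = P(\pi_{1i} < 0)\):

\[ \frac{E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0]}{E[D_i | Z_i = 1] - E[D_i | Z_i = 0]} = \frac{E[\rho_i | \pi_{1i} > 0]P_c - E[\rho_i | \pi_{1i} < 0]P_d}{P_c - P_d} \]

which collapses to the LATE only when \(P_d = 0\).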

The Wald estimator under assumptions 1-4 is often criticised for lack of external validity - it’s hard to argue that the effect for compliers can be generalised to the rest of the population unless the treatment effect is the same for everyone (unlikely) or assignment to complier, defier, always-taker etc. is as good as random (i.e. \(corr(\rho_i, \pi_{1i}) = 0\)). However, an advantage of LATE is that the estimate is often exactly what we want to know from a policy perspective - a programme’s effect on those likely to respond to it is more interesting than the effect on the entire population if universal compliance is unlikely.

On the other hand the Wald estimator in the presence of defiers is near useless to us. Not only does it lack external validity but it fails to uncover a meaningful internally valid estimate. This, in part, motivates the importance of checking monotonicity.

Literature and Data

Literature

Two papers in the current literature explore tests of monotonicity in the LATE framework. Kitagawa (2015) derives a joint test of independence, exclusion and monotonicity, which he collectively labels instrument validity, using a “variance weighted Kolmogorov-Smirnov test statistic”. Essentially, Kitagawa’s test recognises that, under instrument validity, the probability density functions of treated and control groups must be nested in certain ways. Unlike the tests presented in this essay, Kitagawa’s test doesn’t rely on covariates; however, calculating the test statistic is considerably harder than for the simple tests presented here.

Another paper considering the monotonicity assumption is Mourifie and Wan (2017). Their approach complements Kitagawa’s but differs slightly: rather than testing empirical distribution functions, Mourifie and Wan frame the problem as a conditional moment inequality, which leads to testable implications using results from Chernozhukov, Lee, and Rosen (2009). Mourifie and Wan’s approach is simple to implement in Stata using appropriate packages. Unfortunately, due to essay word limits and the fact that honest causal forest is yet to be implemented in Stata, comparisons between Mourifie and Wan’s test and my approach aren’t presented in this essay.

Data

I collect data from three well-known papers in the labour literature, each with a slightly different flavour of instrument: Angrist and Evans (1998) use the now famous same-sex binary instrument and corresponding Wald estimator to estimate fertility’s effect on household labour supply; Lavy and Angrist (1999) use Maimonides’ Rule as an instrumental variable in a fuzzy RDD identification strategy to uncover the causal effects of class size on test scores and finally I explore Autor, Dorn, and Hanson (2013)’s “Bartik” or shift-share instrument for Chinese imports and regional labour market outcomes.

Data are first cleaned, following the provided code files as closely as possible, summarised and checked against the original paper, and finally the main results are replicated. Every paper considered replicates successfully, and coefficient and standard error estimates are often identical to their original counterparts. Replication steps and details of each paper’s main result can be found in the appendix.

The Tests

First Stage Interactions

Common practice in applied work involves checking that the regression first stage “goes the same way” for all subgroups present in a dataset. In the simple binary covariate case this involves checking that \(\hat{\pi}_{1im} > 0\) for \(m = 0,1\) where \(m\) indicates which sub-population the sample is drawn from, male or female say. This is equivalent to running the saturated regression model: \[ D_i = \pi_0 + \pi_{1i}Z_i + \pi_{2i}m_i + \pi_{3i}(Z_i \times m_i) + \epsilon_i \]

and testing \(\hat{\pi}_{1i} > 0\) and \(\hat{\pi}_{1i} + \hat{\pi}_{3i} > 0\). Extending this framework to multiple, continuous and binary-valued covariates is relatively straightforward:

\[ D_i = \pi_0 + \pi_{1i}Z_i + \sum_{k=1}^K (\gamma_{ki}x_{ki} + \delta_{ki}(Z_i \times x_{ki})) + \epsilon_i \]

where \(x_{ki}\) is the \(k\)-th of \(K\) observed covariates. Differentiation2 with respect to the instrument \(Z_i\) gives:

\[ \frac{\partial D_i}{\partial Z_i} = \pi_{1i} + \sum_{k=1}^K \delta_{ki}x_{ki} \]

and therefore we wish to test hypotheses of the form:

\[ \begin{aligned} H_{0i}: \frac{\partial D_i}{\partial Z_i} = \pi_{1i} + \sum_{k=1}^K \delta_{ki}x_{ki} &\geq 0 \\ H_{1i}: \frac{\partial D_i}{\partial Z_i} = \pi_{1i} + \sum_{k=1}^K \delta_{ki}x_{ki} &< 0 \end{aligned} \] In the binary covariate case we could simply estimate the saturated model and test each partial derivative separately - one for \(m = 1\) and another for \(m = 0\). Moving to the continuous case is essentially the same, except that we now have as many partial derivatives to test as datapoints - our hypotheses are indexed by \(i\) because, unlike the discrete covariate case, every individual takes a unique \(X_i\) value. Furthermore, LATE theorem assumption 4 requires monotonicity for all \(i\) - the partial derivative \(\frac{\partial D_i}{\partial Z_i}\) must be non-negative for every \(i\).

Therefore, our monotonicity test neatly corresponds with the set, or complete, null (Shaffer 1995):

\[ \begin{aligned} H_0^C: \cap_{i \in N} \ \pi_{1i} + \sum_{k=1}^K \delta_{ki}x_{ki} &\geq 0 \\ H_1^C: \cup_{i \in N} \ \pi_{1i} + \sum_{k=1}^K \delta_{ki}x_{ki} &< 0 \end{aligned} \]

Any test of this nature immediately faces two problems. First, we must consider whether control of the family-wise error rate (FWER) in the face of \(N\) simultaneous statistical inferences is necessary. Second, we must be confident of test power when performing \(N\) inferences based off \(N\) datapoints.

In the binary covariate case outlined above it’s clear that the number of simultaneous statistical inferences is two: one hypothesis test of defiance for males and one for females. Extending this to a setting with \(K\) “factor” variables, each taking \(L_i\) unique, discrete values, leads to the following number of comparisons:

\[ C = \prod^K_{i=1} L_i \]

If the number of levels is the same for each variable we can rewrite this as \(C = L^K\). The number of comparisons we must adjust for therefore grows quickly in the number of levels and exponentially in the number of covariates. In the continuous case, as \(L \rightarrow N\), the testing procedure is practically paralysed by this adjustment, \(C = N^K\), since any multiple comparison adjustment necessarily trades off control of test size for power (Shaffer 1995).
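
For example, with \(K = 4\) covariates each discretised into \(L = 3\) quantile bins - the setting used in the size simulations below - the number of comparisons and the theoretical unadjusted family-wise error rate at \(\alpha = 0.05\) (under independent tests) are:

\[ C = 3^4 = 81, \qquad \alpha^* = 1 - (1 - 0.05)^{81} \approx 0.98 \]

The empirical rejection rate in the simulations is somewhat lower, at just under 90%, presumably because the subgroup-level tests are not independent.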

Therefore, in order to adequately control test size without rendering tests powerless I discretise each continuous variable into \(n\)-quantile bins and create a corresponding dummy variable to interact with the instrument.

This essay faces a constant tradeoff between size and power. An increase in number of quantile bins not only leads to power loss through multiplicity adjustments, but also through weaker identification of interaction effects. Back of the envelope calculations by Gelman (2018) suggest interaction terms require 16 times more observations than corresponding main effects for accurate identification at a given power level.

Moving on to the second concern, test power: even in the continuous case, model parameters are well identified because, whilst I notionally test \(N\) hypotheses, the number of effective parameters to estimate is much lower - only \(1 + K + (K \times L)\), corresponding to the instrument’s main effect, each covariate’s main effect and the \(K \times L\) interaction terms. The problem can be reconceptualised as constructing prediction intervals for the partial derivatives which, whilst heterogeneous, only vary at the subgroup level. This reduction in granularity speaks to the trade-off inherent in heterogeneous causal identification. Whilst we may believe treatment effects vary at the individual level, using conventional methods in the cross-section we can only identify an average treatment effect for observed subgroups - later in the essay I use Wager and Athey (2015)’s causal random forest to address this limitation.

To alleviate any potential concerns surrounding size and power I present an array of simulations varying number of quantiles; number of covariates; adjustment method and data generating process.

In conclusion, the first stage interaction test involves estimating the saturated regression, formed by interacting the instrument with the discretised, available covariates, and estimating the instrument’s partial effect, i.e. the partial derivative \(\frac{\partial D_i}{\partial Z_i}\). Next, we find the most extreme defier - the observation with the most adverse adjusted test statistic/smallest \(p\)-value - and test the one-sided null hypothesis against negative (positive) partial effects.
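
To fix ideas, below is a minimal sketch of the binary-covariate version of the test, assuming a hypothetical data frame `df` with treatment `D`, binary instrument `Z` and dummy `m`; the general implementation used throughout the essay follows.

# Saturated first stage: D ~ Z + m + Z:m
toy_fit <- lm(D ~ Z * m, data = df)
beta <- coef(toy_fit)
vc <- vcov(toy_fit)

# Partial effect of Z for m = 0 and m = 1 (delta method variance for the m = 1 group)
dydx_m0 <- beta["Z"]
dydx_m1 <- beta["Z"] + beta["Z:m"]
se_m0 <- sqrt(vc["Z", "Z"])
se_m1 <- sqrt(vc["Z", "Z"] + vc["Z:m", "Z:m"] + 2 * vc["Z", "Z:m"])

# One-sided p-values against H1: partial effect < 0, Holm-adjusted for the two comparisons
pvals <- pt(c(dydx_m0 / se_m0, dydx_m1 / se_m1), df = toy_fit$df.residual)
p.adjust(pvals, method = "holm")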

# Run Honest Causal Forest on dataframe with VERY specific naming conventions - only use with data from
# create_fake_data etc.
sim_first_stage_forest <- function(dataset){
  forest_sim <- dataset %>% 
    select_at(vars(contains("V"), -contains(":"))) %>% # Dropping interactions
    as.matrix() %>% 
    causal_forest(X = .,
                  Y = dataset$D,
                  W = dataset$Z,
                  num.trees = 4000) %>% 
    predict(., estimate.variance = TRUE) %>% 
    as_tibble() %>% 
    mutate(sigma_hat = sqrt(variance.estimates),
           prediction_lo = predictions - 1.96*sigma_hat,
           prediction_hi = predictions + 1.96*sigma_hat,
           t_stat = predictions / sigma_hat,
           pval_one_neg = pnorm(t_stat),
           pval_one_pos = pnorm(-t_stat),
           row_id = row_number())
  return(forest_sim)
}

# Find the standard errors of the estimated partial effects via the delta method:
# SE(dD/dZ) = sqrt(diag(G %*% vcov %*% t(G))), where each row of G is an observation's gradient
# of dD/dZ with respect to the coefficients (1 for the instrument, x_k for each interaction term)
find_SEs <- function(model_data_no_y, model_vcov, instrument){
  model_data_no_y <- model_data_no_y %>% 
    rename("instrument" = instrument)
  model_formula <- as.formula(paste0("~", "instrument", "*.")) # Creating interaction model formula
  
  model_matrix <- model_data_no_y %>% 
    mutate(instrument = 1) %>%  # i.e. just main effect by itself, not times covariate Z value
    model.matrix(model_formula, data = .)
  
  gradient_matrix_kinda <- model_matrix %>% # Gradients computed by hand as we know only interactions present
    as_tibble() %>% 
    mutate_at(vars(-contains(":")), ~(. = 0)) %>% 
    mutate(instrument = 1) %>% 
    as.matrix()
  # Splitting indices so that vectorised matrix inversion isn't too painful
  split_indices <- seq(from = 1, to = nrow(gradient_matrix_kinda), by = 10000) %>% 
    c(nrow(gradient_matrix_kinda))
  
  if (length(split_indices) < 3){
    vcov_dydx_intermediate <- Rfast::mat.mult(gradient_matrix_kinda, model_vcov)
    vcov_dydx <- Rfast::mat.mult(vcov_dydx_intermediate,
                                 Rfast::transpose(gradient_matrix_kinda))
    SE_dydx <- sqrt(diag(vcov_dydx))
    
  } else {
    baby_SE <- matrix(nrow = nrow(gradient_matrix_kinda), ncol = 1)
    for (i in 1:(length(split_indices)-1)){
      baby_matrix <- gradient_matrix_kinda[split_indices[[i]]:split_indices[[i+1]], ]
      # print(paste0(split_indices[i],"to", split_indices[i+1]))
      baby_vcov <- Rfast::mat.mult(baby_matrix, model_vcov)
      baby_vcov <- Rfast::mat.mult(baby_vcov, Rfast::transpose(baby_matrix))
      baby_SE[split_indices[[i]]:split_indices[[i+1]], ] <- sqrt(diag(baby_vcov))
    }
    SE_dydx <- baby_SE[, 1]
  }
  
  return(SE_dydx)
  
} 

discretise_df <- function(dataset,
                          vars_to_discretise,
                          n_tiles){
  new_df <- vars_to_discretise %>% 
    map(~select(dataset, .) %>% 
          pull()) %>% 
    map_dfc(ntile, n_tiles) %>% 
    unite(unique_subgroup, everything(), sep = "", remove = FALSE) %>% 
    mutate_all(factor) 
  
  n_comparisons <- new_df %>%
                   select(unique_subgroup) %>% 
                   n_distinct()

  joint_df <- dataset %>% 
    select(-vars_to_discretise) %>% 
    bind_cols(new_df) %>% 
    mutate(n_comparisons = n_comparisons)
  
  return(joint_df)
}

run_first_stage_interactions_fast <- function(dataset,
                                              dependent_variable,
                                              instrument,
                                              weights = NULL,
                                              vcov_func = vcov,
                                              n_tiles = 2,
                                              ...){
  # Making names consistent to ease quotation
  dataset <- dataset %>% 
    rename("instrument" = instrument)
  
  # Interaction model formula
  model_formula <- as.formula(paste0(dependent_variable,  "~ ", "instrument", "*."))
  
  vars_to_discretise <- dataset %>% 
    select(-dependent_variable, -instrument) %>% 
    select_if(function(col) n_distinct(col) > 2) %>% 
    colnames()
  
  # discretising dataset
  dataset_discrete <- discretise_df(dataset = dataset,
                                    vars_to_discretise = vars_to_discretise,
                                    n_tiles = n_tiles) 
  
  n_comparisons <- unique(dataset_discrete$n_comparisons)
  
  dataset_discrete <- dataset_discrete %>% 
    select(-n_comparisons)
  dataset$unique_subgroup <- dataset_discrete$unique_subgroup 
  
  # Running conventional OLS
  first_stage_fit <- lm(data = dataset_discrete %>% 
                          select(-unique_subgroup), formula = model_formula, weights = weights)
  degrees_freedom <- first_stage_fit$df.residual
  # Dropping dependent variable - not strictly necessary
  dataset_unique <- dataset_discrete %>% 
    select(-dependent_variable) %>% 
    group_by(unique_subgroup) %>% 
    summarise_all(first)
  
  first_stage_margins <- dydx(model = first_stage_fit, data = dataset_unique, variable = "instrument") 
  df_margins <- bind_cols(first_stage_margins %>% 
                            select(contains("dydx")),
                          dataset_unique) %>% 
    as_tibble()
  df_margins$SE_dydx_instrument <- find_SEs(dataset_unique %>% 
                                              select(-unique_subgroup), vcov_func(first_stage_fit, ...), instrument)
  
  df_margins <- df_margins %>% 
    mutate(dydx_lo = dydx_instrument - qt(0.975, degrees_freedom) * SE_dydx_instrument,
           dydx_hi = dydx_instrument + qt(0.975, degrees_freedom) * SE_dydx_instrument,
           t_stat = dydx_instrument/SE_dydx_instrument,
           pval_one_neg = pt(t_stat, degrees_freedom),
           pval_one_pos = pt(-t_stat, degrees_freedom),
           pval_neg_holm = p.adjust(pval_one_neg,
                                    method = "holm",
                                    n = n_comparisons))
  
  df_all_marginals <- inner_join(
                                dataset,
                                df_margins %>% 
                                  select(unique_subgroup,
                                         dydx_lo,
                                         dydx_hi,
                                         dydx_instrument,
                                         SE_dydx_instrument,
                                         t_stat,
                                         pval_one_neg,
                                         pval_one_pos,
                                         pval_neg_holm),
                                by = "unique_subgroup"
                                ) %>% 
    mutate(n_comparisons = n_comparisons)
  return(df_all_marginals)
}


run_first_stage_interactions_continuous <- function(dataset,
                                              dependent_variable,
                                              instrument,
                                              weights = NULL,
                                              vcov_func = vcov,
                                              ...){
  # Making names consistent to ease quotation
  dataset <- dataset %>% 
    rename("instrument" = instrument)
  
  # Interaction model formula
  model_formula <- as.formula(paste0(dependent_variable,  "~ ", "instrument", "*."))
  
  # Running conventional OLS
  first_stage_fit <- lm(data = dataset, formula = model_formula, weights = weights)
  degrees_freedom <- first_stage_fit$df.residual
  # Dropping dependent variable - not strictly necessary
  dataset_unique <- dataset %>% 
    select(-dependent_variable)
  
  first_stage_margins <- dydx(model = first_stage_fit, data = dataset_unique, variable = "instrument") 
  df_margins <- bind_cols(first_stage_margins %>% 
                            select(contains("dydx")),
                          dataset_unique) %>% 
    as_tibble()
  df_margins$SE_dydx_instrument <- find_SEs(dataset_unique, vcov_func(first_stage_fit, ...), instrument)
  
  df_margins <- df_margins %>% 
    mutate(dydx_lo = dydx_instrument - qt(0.975, degrees_freedom) * SE_dydx_instrument,
           dydx_hi = dydx_instrument + qt(0.975, degrees_freedom) * SE_dydx_instrument,
           t_stat = dydx_instrument/SE_dydx_instrument,
           pval_one_neg = pt(t_stat, degrees_freedom),
           pval_one_pos = pt(-t_stat, degrees_freedom),
           pval_neg_holm = p.adjust(pval_one_neg,
                                    method = "holm",
                                    n = nrow(dataset)^(ncol(dataset)-2)))
  df_margins$dependent_variable <- dataset %>% 
    select(dependent_variable) %>% 
    pull()
  return(df_margins)
}

Honest Causal Forest

Honest causal forest (Athey and Imbens 2016; Athey, Tibshirani, and Wager 2016; Wager and Athey 2015) attempts to estimate heterogeneous treatment effects through the use of random forest methods. In short, an honest regression tree is grown on data split into two samples: one sample is used to decide the tree’s splits, or partitions, and the other to estimate effects within the resulting leaves. This sample splitting is the source of the procedure’s “honesty” and, eventually, its unbiasedness. At the end of the tree the estimation sample is dispersed into separate leaves and heterogeneous causal effects are estimated leaf by leaf.

Individual trees are prone to overfitting; to overcome this, bootstrap aggregation (bagging) is used to fit many trees on bootstrapped samples - hence a random forest. Applied to the first stage, this method gives heterogeneous estimates from regressing treatment status on instrument status.

\[ D_i = \pi_0 + \pi_{1i}(x_{1i}, x_{2i}, ...)Z_i + \sum_{k=1}^K \gamma_{ki}x_{ki} + \epsilon_i \]

Athey, Tibshirani, and Wager (2016) motivate the honest causal forest as a tool for detecting heterogeneous treatment effects that are some unknown function of the covariates - here \(\pi_{1i}(x_{1i}, x_{2i}, ...)\) indicates that an individual’s treatment effect is a function of the \(x_{ki}\)s.

The testing procedure is therefore similar to the saturated first stage method. We estimate the equation above, using the honest causal forest implemented in the grf R library by Athey, Tibshirani, and Wager (2016). Next, we construct a test statistic for the most extreme defier, which will be asymptotically normally distributed, and perform a one-sided hypothesis test of the heterogeneous main effect.
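
A minimal sketch of this procedure is given below, reusing the sim_first_stage_forest helper defined earlier (which expects data following the create_fake_data naming conventions) and assuming the \(N^2\) comparison adjustment motivated in the size simulations later in the essay.

forest_monotonicity_test <- function(dataset){
  # Per-observation first-stage effects, SEs and one-sided p-values from the honest causal forest
  forest_preds <- sim_first_stage_forest(dataset)
  n_obs <- nrow(forest_preds)
  # Most extreme potential defier: smallest one-sided p-value against H1: effect < 0
  p_min <- min(forest_preds$pval_one_neg)
  # For the minimum p-value the Holm adjustment coincides with Bonferroni
  p_adj <- min(p_min * n_obs^2, 1)
  tibble(p_defier = p_adj, reject_monotonicity = p_adj < 0.05)
}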

An advantage of honest causal forest is that it offers a method to estimate heterogeneous causal effects at the individual level, rather than at the subgroup level where the estimated effect is constant across all individuals sharing the same discretised covariate values. However, there’s a tradeoff - all else equal we’d expect the uncertainty of individual causal estimates to be far larger than those estimated at the subgroup level. Indeed, this greater uncertainty appears empirically when we take the test to the data. Furthermore, it’s no longer immediately obvious how many adjustments are required to control test size. In size simulations later in this essay I show that a naive adjustment of \(N\) comparisons for \(N\) estimated heterogeneous effects inadequately controls test size, even when using a Bonferroni adjustment, notorious for its conservative nature.

Another disadvantage of honest causal forest is its relatively poor performance with a small number of covariates. Athey, Tibshirani, and Wager (2016) recommend at least three covariates; the power simulations presented below use 4 covariates and every paper replicated in this essay has at least three covariates, so this constraint isn’t as strict as it may appear at first.

A final issue with honest causal forest - in this essay at least - is its large computational cost. Fitting 24,000 simulation draws of honest causal forest is extremely computationally intensive. Furthermore, fitting Angrist and Evans (1998)’s census data proved prohibitively expensive and a smaller random sample is fit instead. This increased complexity and computational cost may be a stumbling block for applied practitioners seeking to implement honest causal forest in their own work.

Simulations

Size Simulations

Saturated First Stage

Test size, typically denoted by \(\alpha\), is set by the analyst and corresponds to the probability of incorrectly rejecting a true null hypothesis. When performing simultaneous inferences a common concern is that test properties at the individual level, usually an error rate, no longer hold on aggregate. To combat this concern we now examine test size using 2000 simulated datasets of 1000 observations each generated under the null hypothesis - where no defiers are present.

To illustrate the multiple comparison problem’s dual drivers - the number of covariates used and the number of quantile bins - I undertake the simulations twice. In the first simulation I vary the number of covariates considered and fix the number of quantiles at 2 (i.e. a binary dummy variable equal to 1 if an observation is above its median). In the second, the number of quantile bins is varied and the number of covariates is fixed at 4.
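
The size simulation code itself lives in simulations/code/sfs_size_sim.R, sourced below; one simple way to generate a dataset under the complete null is a homogeneous, strictly positive first stage, sketched here with hypothetical parameter values (the sourced script may differ in detail).

# One draw under the complete null: a homogeneous, positive first-stage effect,
# so no individual's partial effect of Z on D is negative
make_null_draw <- function(n_obs = 1000, n_cov = 4){
  X <- matrix(rnorm(n_obs * n_cov), ncol = n_cov)
  colnames(X) <- paste0("V", seq_len(n_cov))
  Z <- rbinom(n_obs, size = 1, prob = 0.5)
  D <- 0.5 * Z + X %*% rep(0.2, n_cov) + rnorm(n_obs, sd = 5)
  tibble(D = as.numeric(D), Z = Z) %>%
    bind_cols(as_tibble(as.data.frame(X)))
}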

p.adjust_df <- function(p, method, n){
  df <- method %>% 
    map_dfc(~p.adjust(p = p,
                      method = .x,
                      n = n))
  colnames(df) <- method
  return(df)
}

source("simulations/code/sfs_size_sim.R")
adjusted_pvals_cov <- new_size_sims_cov %>% 
  nest(-draw) %>% 
  mutate(adjusted_pvals = map(.x = data,
                               ~p.adjust_df(p = .x$pvals,
                                         method = c("none", "fdr","holm"),
                                              n = 2^unique(.x$n_cov))))



tidy_adj_p_cov <- adjusted_pvals_cov %>% 
  unnest() %>% 
  select(-pvals) %>% 
  gather(method, pval, holm:none) %>% 
  mutate(reject_null = ifelse(pval < 0.05, TRUE, FALSE)) %>% 
  group_by(draw, n_cov, n_tiles,method) %>% 
  summarise(null_rejected_in_test = ifelse(sum(reject_null) > 0, TRUE, FALSE),
            n_groups = mean(n_groups)) %>% 
  ungroup() %>% 
  group_by(n_cov,n_tiles, method) %>% 
  summarise(empirical_alpha = sum(null_rejected_in_test) / n(),
            n_groups = first(n_groups)) %>% 
  arrange(n_cov) %>% 
  mutate(n_comp = n_tiles^n_cov,
    theoretical_alpha = 1 - 0.95^n_comp)


 adjusted_pvals_tile <- new_size_sims_tile %>% 
   nest(-draw) %>% 
   mutate(adjusted_pvals = map(.x = data,
                               ~p.adjust_df(p = .x$pvals,
                                            method = c("none", "fdr", "holm"),
                                            n = unique(.x$n_tiles)^unique(.x$n_cov))))
 
tidy_adj_p_tiles <- adjusted_pvals_tile %>% 
   unnest() %>% 
   select(-pvals) %>% 
   gather(method, pval, holm:none) %>% 
   mutate(reject_null = ifelse(pval < 0.05, TRUE, FALSE)) %>% 
   group_by(draw, n_tiles, n_cov, method) %>% 
   summarise(null_rejected_in_test = ifelse(sum(reject_null) > 0, TRUE, FALSE),
             n_groups = mean(n_groups)) %>%
   ungroup() %>% 
   group_by(n_tiles,n_cov,  method) %>% 
   summarise(empirical_alpha = sum(null_rejected_in_test) / n(),
             n_groups = first(n_groups)) %>% 
   arrange(n_tiles) %>% 
   mutate(n_comp = n_tiles^n_cov,
          theoretical_alpha = 1 - 0.95^n_comp)

The plot below shows empirical \(\alpha\) - the proportion of null hypotheses rejected in simulations - at a given number of covariates. The blue line shows unadjusted \(p\)-values closely following the theoretical, unadjusted, test size \(\alpha^* = 1 - (1 - 0.05)^{2^K}\) indicating that test size is inadequately controlled.

The Holm-Bonferroni procedure (Holm 1979) controls the family-wise error rate under the same assumptions as a classical Bonferroni procedure but is uniformly more powerful and therefore used throughout this essay unless otherwise specified. For comparison, I also show control of the false discovery rate, proposed by Benjamini and Hochberg (1995), in orange. Since the monotonicity assumption fails under the presence of just one defier we wish to control the family wise error rate, the probability of at least one false rejection, and not the false discovery rate, which corresponds to the proportion of false rejections out of all null hypotheses rejected.
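
As a small illustration of why Holm is preferred over a plain Bonferroni correction, consider three hypothetical \(p\)-values:

p_toy <- c(0.01, 0.02, 0.03)
p.adjust(p_toy, method = "bonferroni")  # 0.03 0.06 0.09
p.adjust(p_toy, method = "holm")        # 0.03 0.04 0.04 - never larger than Bonferroni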

tidy_adj_p_cov %>% 
    filter(method == "none" | method == "fdr" | method == "holm") %>%
ggplot(aes(y = empirical_alpha,
           x = n_cov,
           colour = method)) +
  geom_line() +
  geom_point() +
  scale_y_log10() +
  theme_minimal() +
  geom_line(aes(y = theoretical_alpha), linetype = "longdash", colour = "black") +
  geom_hline(yintercept = 0.05, linetype = 4)  +
  labs(x = "Number of Covariates",
       y = "Empirical Alpha",
       title = "Saturated First Stage Test Size Simulations",
       caption = "Note: Dotted lines indicate theoretical test size after considering \n number of multiple comparisons and a test size of 0.05 respectively.",
       colour = "Method") +
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = 1:10)

Moving onto the number of quantile bins the multiple comparison problem becomes even more apparent. Using just 3 quantile bins - low, medium and high for instance - with 4 explanatory covariates leads to rejection of the null hypothesis in almost 90% of cases when the null is in fact true.

tidy_adj_p_tiles %>% 
    filter(method == "none" | method == "fdr" | method == "holm") %>%
  ggplot(aes(y = empirical_alpha,
             x = n_tiles,
             colour = method)) +
  geom_line() +
  geom_point() +
  scale_y_log10() +
  theme_minimal() +
  geom_line(aes(y = theoretical_alpha), linetype = "longdash", colour = "black") +
  geom_hline(yintercept = 0.05, linetype = 4) +
  labs(x = "Number of Quantile Bins",
       y = "Empirical Alpha",
       title = "Saturated First Stage Test Size Simulations",
       caption = "Note: Dotted lines indicate theoretical test size after considering \n number of multiple comparisons and a test size of 0.05 respectively.",
       colour = "Method") +
  theme(legend.position = "bottom") +
  scale_x_continuous(breaks = 2:9)

Again, test size is adequately controlled after a Holm-Bonferroni adjustment. Adjustments in both examples appear to err on the side of caution - alternative methodologies not considered here, such as a Romano-Wolf adjustment or Bayesian hierarchical model, could offer potential room for improvement.

In conclusion, the simulation results clearly show that the saturated first stage test’s size exceeds \(\alpha\) without any additional control procedures. In the remainder of the essay \(p\)-values are adjusted for the number of comparisons made.

Honest Causal Forest

Controlling test size in the honest causal forest setting is a little more challenging. A priori it’s not immediately obvious how many comparisons to adjust for. Tentative evidence from simulations suggests that adjusting for \(N^2\) comparisons is adequate, although I offer no formal proof of this.

source("simulations/code/rf_size_sim.R")

The honest causal forest test faces a far greater multiple comparison problem - test size simulations using 2000 simulated datasets with 1000 observations each reject the null hypothesis in almost every dataset with unadjusted \(p\)-values. The family wise error rate is only controlled at a level less than or equal to \(\alpha\) when adjusting for one million (\(1000^2\)) comparisons.3

holm_adjusted_pvals <- seq(from = 1000, to = 1000000, by = 1000) %>% 
  map_df(~p.adjust_df(n = ., method = c("none", "fdr", "holm"), p = rf_size_sims$pval_one_neg) %>% 
           mutate(n = .x))

holm_adjustment_summary <- holm_adjusted_pvals %>% 
  gather(method, pval, -n) %>% 
  mutate(reject_null = ifelse(pval < 0.05, TRUE, FALSE)) %>% 
  group_by(method, n) %>% 
  summarise(empirical_alpha = sum(ifelse(reject_null, 1, 0)) / n())


holm_adjustment_summary %>%  
  ggplot(aes(x = n, y = empirical_alpha, colour = method)) +
  geom_line(size = 1) + 
  geom_hline(yintercept = 0.05, linetype = "longdash") +
  scale_y_log10() +
  theme_minimal() +
  labs(x = "Multiple Comparison Adjustments",
       y = "Empirical Alpha",
       title = "Honest Causal Forest Test Size Simulations",
       caption = "Note: Dotted line indicates a test size of 0.05.",
       colour = "Method") +
  theme(legend.position = "bottom") +
  scale_x_continuous(label = comma)

 

Now that test size has been explored in both tests, I move onto test power. Given the large adjustments that must be made in both tests it’s reasonable to worry that test power is particularly low - fortunately, simulations show that this only seems to be true in smaller datasets of around 1000 observations.

Power Simulations

Test power is the ability of a test to correctly reject a false null hypothesis. Classical frequentist testing in econometrics usually sets a desired test size (control of the type I error rate) and uses the most powerful test available with little thought to the size-power trade-off. To measure test power I generate 2000 draws of a simulated dataset with 1,000, 5,000 and 10,000 observations respectively. To assess power under a range of data generating processes (DGPs) I use three specifications:

Subgroup Heterogeneity

First, a conventional setting where treatment heterogeneity is only generated at the subgroup level through the use of interaction terms:

\[ D_i = \pi_0 + \pi_{1}Z_i + \sum_{k=1}^4 (\gamma_{k}x_{ki} + \delta_{k}(Z_i \times x_{ki})) + \epsilon_i; \ \epsilon_i \sim N(0, 5^2) \\ \pi_1, \gamma_k, \delta_k \sim U(-1, 1) \] Here covariates are drawn from a multivariate normal with mean \(0\) and variance \(\Sigma\).4 Z is a binary instrument drawn from a Bernoulli distribution with probability one half.

Model parameters are drawn from a uniform distribution with support from -1 to 1. Rather than explicitly generating a positive main effect and defiers at given proportions, I let the random draws determine the simulated data’s properties and discretise the draws into 1% defier bins - this gives good coverage of the defier space up to concentrations of 25%. Past 25%, however, the simulations become particularly uninformative - there are very few simulation draws in this region, typically less than 1% of all draws, due to the nature of the DGP used. Furthermore, the distinction between defier and complier becomes murky when approaching an even split in the population - if this situation arises in practice it’s questionable whether LATE really is the estimator of interest to a researcher.
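
The simulation code itself lives in simulations/code/subgroup_sim.R, sourced below; a minimal sketch of one draw from this DGP, under the assumptions just described, might look like the following (MASS is assumed to be available for the multivariate normal draw, and the sourced script may differ in detail).

# One draw of the subgroup-heterogeneity DGP (sketch only)
draw_subgroup_dgp <- function(n_obs = 1000, n_cov = 4, Sigma = diag(n_cov)){
  X <- MASS::mvrnorm(n_obs, mu = rep(0, n_cov), Sigma = Sigma)
  colnames(X) <- paste0("V", seq_len(n_cov))
  Z <- rbinom(n_obs, size = 1, prob = 0.5)
  pi_1  <- runif(1, -1, 1)              # instrument main effect
  gamma <- runif(n_cov, -1, 1)          # covariate main effects
  delta <- runif(n_cov, -1, 1)          # interaction effects
  partial_effect <- pi_1 + X %*% delta  # each individual's dD/dZ; negative => defier
  D <- Z * as.numeric(partial_effect) + X %*% gamma + rnorm(n_obs, sd = 5)
  tibble(D = as.numeric(D), Z = Z,
         pct_defiers_true = mean(partial_effect < 0)) %>%
    bind_cols(as_tibble(as.data.frame(X)))
}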

In this context we’d expect the saturated first stage to perform best since it exactly matches the data generating process whilst the honest causal forest should still manage to infer the underlying heterogeneous treatment effects created by the interaction terms.

Individual Heterogeneity

Second, a truly heterogeneous setting where the same DGP as above is used but now individual treatment effects, \(\pi_{1i}, \gamma_{ki}, \delta_{ki}\) are sampled from a multivariate normal distribution centred around a population treatment effect with means \(\Pi_k, \Gamma_k, \Delta_{k}\) and variance \(I_k\).

\[ D_i = \pi_0 + \pi_{1i}Z_i + \sum_{k=1}^4 (\gamma_{ki}x_{ki} + \delta_{ki}(Z_i \times x_{ki})) + \epsilon_i; \ \epsilon_i \sim N(0, 5^2) \\ \pi_{1i} \sim N(\Pi_1, 1); \ \gamma_{ki} \sim N(\Gamma_{k}, I_k); \ \delta \sim N(\Delta_k, I_k) \\ \Pi_1, \Gamma_k, \Delta_k \sim U(-1, 1) \]

This true heterogeneity should pose a challenge for both tests since the saturated first stage model can only estimate an average treatment effect at the subgroup level and individual level heterogeneity is fully independent of covariates so cannot be accurately inferred by the honest causal forest.
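
Relative to the sketch above, only the coefficient draws change - each individual now receives their own coefficients centred on population-level values (again a sketch, with hypothetical names):

# Individual-level heterogeneity: one coefficient draw per observation, centred on population effects
n_obs <- 1000
Pi_1  <- runif(1, -1, 1)
pi_1i <- rnorm(n_obs, mean = Pi_1, sd = 1)  # instrument effect now varies across individuals
# gamma_ki and delta_ki are drawn analogously, one N(Gamma_k, 1) and N(Delta_k, 1) draw per individual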

Non-Linear Heterogeneity

\[ D_i = \pi_0 + \pi_{1}(x_{1i})Z_i + \sum_{k=1}^4 (\gamma_{k}x_{ki}) + \epsilon_i; \ \epsilon_i \sim N(0, 5^2) \\ \pi_{1}( \cdot) = max(\cdot, -1.5); \ \ \gamma_k \sim N(0, I_k) \]

Finally, I create defiers through a non-linear data generating process similar to the canonical case used by the authors of the honest causal forest R package (Athey, Tibshirani, and Wager 2016). In this setting \(\pi_{1}\), the instrument main effect, is a function of the first covariate, \(\pi_{1}(x_{1i}) = max(x_{1i}, -1.5)\); the remaining covariates enter only through homogeneous main effects drawn from a standard normal - crucially, there are no interaction terms present in the true DGP. In this setting we’d expect the honest causal forest to outperform its conventional cousin, the saturated first stage test.
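
In R, the corresponding first-stage effect of the instrument is a simple function of the first covariate (a sketch, reusing the covariate matrix X from the earlier DGP sketch):

# Non-linear instrument effect: a floored function of the first covariate only
pi_1_fun <- function(x1) pmax(x1, -1.5)
partial_effect <- pi_1_fun(X[, "V1"])  # observations with a negative partial effect are the defiers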

Simulation Results

Since I generate highly non-linear and distinct DGPs it’s hard to maintain uniform defier coverage and defier “strength” - the magnitude of a defier’s treatment effect relative to a complier’s - between contexts. Therefore, comparisons across DGPs should be made sparingly and with caution. Comparing the two tests within a given simulation setting is the focus of this essay.

source("simulations/code/subgroup_sim.R")
simulations_power_subgroup <- bind_rows(
                                       simulations_power_subgroup_small,
                                       simulations_power_subgroup_medium,
                                        simulations_power_subgroup_large
                                       ) %>% 
    mutate(N = factor(N, levels = c("Small", "Medium", "Large")),
           n_comparisons = ifelse(model == "Saturated First Stage", 2^4, NA),
           n_comparisons = ifelse(model == "Forest" & N == "Small", 1000^2, n_comparisons),
           n_comparisons = ifelse(model == "Forest" & N == "Medium",
                                  5000^2,
                                  n_comparisons),
           n_comparisons = ifelse(model == "Forest" & N == "Large",
                                  10000^2,
                                  n_comparisons))



rm(simulations_power_subgroup_small,
   simulations_power_subgroup_medium,
   simulations_power_subgroup_large)


simulations_power_subgroup <- simulations_power_subgroup %>% 
  rowwise() %>% 
  mutate(pval_holm = p.adjust(pval_defier, method = "holm", n = n_comparisons)) %>% 
  ungroup()

Throughout the essay I use an inverted y-axis when displaying \(p\)-values - the reason for this is two-fold. First, when considering thousands of \(p\)-values, as we are in this essay, it’s common to use a negative log transform, popularised by biostatistics and genome studies. Displaying the “most significant” \(p\)-values at the very top of the visualisation makes better use of the space in a figure - we care less about the null hypothesis \(p\)-values we’re least able to reject. Secondly, inverting the y-axis maintains consistency when we move to the negative log transform rather than switching between visualisation paradigms.

Subgroup Heterogeneity

It’s clear from the figure below that at even moderate defier levels both tests are extremely powerful. When considering the saturated first stage test this is hardly surprising - it has long been acknowledged that testing whether a first stage “goes the right way” using sub-samples or interaction terms is a valid test of monotonicity and we’d expect test properties to be similar to standard frequentist testing of OLS.

The random forest method, too, performs surprisingly well - even better than the saturated first stage test, it would seem at first glance. Both tests improve substantially as sample size and defier proportion increase, as we’d expect.

simulations_power_subgroup %>% 
  filter(pct_defiers_true > 0) %>% 
  ggplot(aes(x = pct_defiers_true,
             y = pval_holm,
             colour = model)) +
  geom_point(alpha = 0.15) +
  geom_hline(yintercept = 0.05,
             linetype = "longdash",
             alpha = 0.5) +
  scale_y_reverse() +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  facet_grid(N~model) +
  labs(x = "Defier Percentage",
       y = "P-Value",
       caption = "Note: Y-axis scale inverted. \n Dashed line indicates a p-value of 0.05.",
       title = "P-Values from Power Simulations") +
  theme_minimal() +
  guides(colour = "none") 

The difference between the figure above, on a linear scale, and the figure below, with \(p\)-values displayed on a log scale, clearly motivates the use of the log transform. \(p\)-values from the medium and large datasets are orders of magnitude smaller than those from the smallest dataset, as we’d expect.

simulations_power_subgroup %>% 
  unite(mod_N, model, N, remove = FALSE) %>% 
  filter(pct_defiers_true > 0) %>% 
  ggplot(aes(x = pct_defiers_true,
             y = pval_holm,
             group = mod_N)) +
  geom_point(alpha = 0.15, aes(colour = model)) +
  geom_hline(yintercept = 0.05,
             linetype = "longdash",
             alpha = 0.5) +
  scale_y_continuous(trans = negative_log_trans(10)) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(colour = "none") +
  theme_minimal() +
  facet_grid(N~model, 
             scales = "free_y") +
  labs(x = "Defier Percentage",
       y = "P-Value",
       caption = "Note: Y-Axis on negative log(10) scale. \n Dashed line indicates a p-value of 0.05. \n Line of best fit in purple.",
       title = "P-Values from Power Simulations") +
  geom_smooth(method = lm,
              se = FALSE,
              alpha = 0.2,
              colour = "violetred")

As defier proportion increases test power increases rapidly. For the large and medium datasets the ability to detect defiers at levels even as small as 2% is particularly impressive.

simulations_power_subgroup %>% 
  filter(pct_defiers_true > 0) %>% 
  ggplot(aes(x = pct_defiers_true,
             y = pval_holm,
             colour = model)) +
  geom_point(alpha = 0.15,
             aes(colour = model)) +
  geom_hline(yintercept = 0.05, linetype = "longdash") +
  scale_y_reverse() +
  geom_smooth(se = TRUE,
              colour = "violetred",
              alpha = 0.5) +
  scale_x_continuous(limits = c(0, 0.1),
                     labels = scales::percent_format(accuracy = 1)) +
  theme_minimal() +
  facet_grid(N~model, scales = "free_y") +
  guides(colour = "none") +
  labs(x = "Defier Percentage",
       y = "P-Value",
       title = "Power Simulation Results at Low Defier Proportion",
       caption = "Note: Y-axis scale inverted. \n Dashed line indicates a p-value of 0.05.")

One concern may be that, whilst the tests are capable of detecting defiers, this may be driven by defier activity in the tails. That is, the tests may work only because of extremely “defiant” defiers, since we only consider the most extreme defier in each sample. Fortunately, we can turn to each test’s respective model predictions and compare the estimated number of defiers against the actual number present in the data. The results are reassuring: as the number of true defiers grows, the number of predicted defiers grows too.

simulations_power_subgroup %>% 
  ggplot(aes(x = n_defiers_true,
             y = n_defiers_estimated,
             colour = model)) +
  geom_point(alpha = 0.15) +
  facet_grid(N~model) +
  geom_abline(slope = 1,
              intercept = 0) +
  theme_minimal() +
  guides(colour = "none") +
  labs(x = "N True Defiers",
       y = "N Predicted Defiers",
       title = "Defier Predictions vs Reality",
       caption = "Note: Line of equality in black.")

The simulations show that in medium to large datasets the tests are extremely powerful - even at defier proportions of just 5% both tests have a near 100% chance of correctly rejecting the null. The smaller dataset is clearly noisier but again shows promising results - at defier proportions around 20% the tests pick up the presence of defiers in a sample almost perfectly.

It’s hard to compare the power of the two tests without some measure of underlying uncertainty, such as bootstrapping the simulations themselves; however, with 2000 draws of the data we can be reasonably confident in concluding that the saturated first stage test is more powerful in smaller datasets but is surpassed when moving to the medium and larger datasets. This result, again, is hardly surprising: we’d hope that methods drawing on advances in machine learning have a comparative advantage as \(N\) increases.

Moving from power comparisons to mean \(p\)-value comparisons makes it easier to estimate the underlying uncertainty by calculating the standard error of the mean in each percentile bin - the reduction in uncertainty is stark when moving from the smaller to larger datasets.

sim_bin <- simulations_power_subgroup %>% 
  mutate(gr=cut(pct_defiers_true, labels = FALSE, breaks= seq(0, 1, by = 0.01))/100) %>% 
  group_by(model,
           N,
           gr) %>%
  mutate(n= n()) %>%
  arrange(gr) %>% 
  mutate(n_rejected = ifelse(pval_holm < 0.05, 1, 0)) %>% 
  mutate(pct_rejected = sum(n_rejected)/n) 

sim_bin_summ <- simulations_power_subgroup %>% 
  group_by(model,
           N,
           gr=cut(pct_defiers_true,label = FALSE, breaks= seq(0, 1, by = 0.01))) %>% 
  mutate(n= n()) %>%
  arrange(as.numeric(gr)) %>% 
  mutate(n_rejected = ifelse(pval_defier < 0.05, 1, 0)) %>% 
  summarise(pct_rejected = mean(n_rejected),
            sd_rejected = sd(n_rejected),
            n = unique(n),
            mean_true_defier = mean(pct_defiers_true))


sim_bin %>% 
  na.omit() %>% 
  filter(pct_defiers_true < 0.25) %>% 
  ggplot(aes(x = gr, y = pct_rejected, colour = model, group = model)) +
  geom_point() + 
  geom_line() +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(size = "none")  +
  theme(legend.position = "bottom") +
  labs(y = "Proportion Null Rejected",
       x = "Defier Proportion",
       title = "Power As a Function of Defier Proportion") +
  facet_wrap(~N, ncol = 1) +
  scale_x_continuous(breaks = seq(from = 0, 1, 0.05),
                     labels = scales::percent_format(accuracy = 1))

sim_bin %>% 
  na.omit() %>% 
  filter(pct_defiers_true < 0.25) %>% 
  summarise(m_pval = mean(pval_defier),
         sd_pval = sd(pval_defier)) %>% 
  ggplot(aes(x = gr, y = m_pval, colour = model, group = model)) +
  geom_point() + 
  geom_line() + 
  geom_linerange(aes(x = gr,
                     ymin = m_pval - 1.96*sd_pval,
                     ymax = m_pval + 1.96*sd_pval)) +
  theme_minimal() +
  guides(size = "none",
         colour = "none") +
  facet_grid(N~model) +
    labs(y = "Mean P-Value",
       x = "Defier Proportion",
       caption = "Note: Dashed line indicates p-value of 0.05. \n 95% Confidence Intervals Displayed.",
       title = "Mean P-Value as a Function of Defier Proportion") +
  geom_hline(yintercept = 0.05,
             linetype = "longdash",
             alpha = 0.5) +
  scale_y_reverse() +
  scale_x_continuous(breaks = seq(from = 0,
                                  to = 1,
                                  0.1),
                     labels = scales::percent_format(accuracy = 1))


Individual Heterogeneity

Again, there seems to be large dispersion in estimated \(p\)-values even under similar data generating processes and defier coverage. As expected, \(p\)-values fall (i.e. move up the \(y\)-axis on the negative log scale) as defier coverage increases.

source("simulations/code/hetero_sim.R")

simulations_power_indiv <- bind_rows(
                                    simulations_power_hetero_small,
                                    simulations_power_hetero_medium,
                                    simulations_power_hetero_large
                                    ) %>% 
  as_tibble() %>% 
    mutate(N = factor(N, levels = c("Small", "Medium", "Large")),
           n_comparisons = ifelse(model == "Saturated First Stage", 2^4, NA),
           n_comparisons = ifelse(model == "Forest" & N == "Small",
                                  1000^2,
                                  n_comparisons),
           n_comparisons = ifelse(model == "Forest" & N == "Medium",
                                  5000^2,
                                  n_comparisons),
           n_comparisons = ifelse(model == "Forest" & N == "Large",
                                  10000^2,
                                  n_comparisons))


rm(simulations_power_hetero_small,
   simulations_power_hetero_medium,
   simulations_power_hetero_large)

simulations_power_indiv <- simulations_power_indiv %>% 
  rowwise() %>% 
  mutate(pval_holm = p.adjust(pval_defier, method = "holm", n = n_comparisons)) %>% 
  ungroup()
simulations_power_indiv %>% 
  unite(mod_N, model, N, remove = FALSE) %>% 
  filter(pct_defiers_true > 0) %>% 
  filter(pval_holm > 1e-300) %>% 
  ggplot(aes(x = pct_defiers_true,
             y = pval_holm,
             group = mod_N)) +
  geom_point(alpha = 0.15, aes(colour = model)) +
  geom_hline(yintercept = 0.05,
             linetype = "longdash",
             alpha = 0.5) +
  scale_y_continuous(trans = negative_log_trans(10)) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(colour = "none") +
  theme_minimal() +
  facet_grid(N~model, scales = "free_y") +
  labs(x = "Defier Percentage",
       y = "P-Value",
       caption = "Note: Y-Axis on negative log(10) scale. \n Dashed line indicates a p-value of 0.05. \n Line of best fit in purple.",
       title = "P-Values from Power Simulations") +
  geom_smooth(method = lm,
              se = FALSE,
              alpha = 0.2,
              colour = "violetred")

Similar to the subgroup heterogeneity simulations, \(p\)-values fall swiftly as defier proportion increases, even at very low levels of defiance. The honest causal forest’s superiority in this context is clear - the LOESS line of best fit appears concave rather than convex, increasing at a faster rate than its traditional counterpart’s.

simulations_power_indiv %>% 
  filter(pct_defiers_true > 0) %>% 
  ggplot(aes(x = pct_defiers_true,
             y = pval_holm,
             colour = model)) +
  geom_point(alpha = 0.15,
             aes(colour = model)) +
  geom_hline(yintercept = 0.05, linetype = "longdash") +
  scale_y_reverse() +
  geom_smooth(se = TRUE,
              colour = "violetred",
              alpha = 0.5) +
  scale_x_continuous(limits = c(0, 0.1), labels = scales::percent_format(accuracy = 1)) +
  theme_minimal() +
  facet_grid(N~model, scales = "free_y") +
  guides(colour = "none") +
  labs(x = "Defier Percentage",
       y = "P-Value",
       title = "Power Simulation Results at Low Defier Proportion",
       caption = "Note: Y-axis scale inverted. \n Dashed line indicates a p-value of 0.05.")

A common feature of the individual heterogeneity specification seems to be frequent over-reporting of the true number of defiers - this could, in part, explain some of the discrepancies in test performance when we take the tests to the data, since the saturated first stage seems particularly prone to this issue.

simulations_power_indiv %>% 
  ggplot(aes(x = n_defiers_true,
             y = n_defiers_estimated,
             colour = model)) +
  geom_point(alpha = 0.15) +
  facet_grid(N~model) +
  geom_abline(slope = 1, intercept = 0) +
  theme_minimal() +
  guides(colour = "none") +
  labs(x = "N True Defiers",
       y = "N Predicted Defiers",
       title = "Defier Predictions vs Reality",
       caption = "Note: Line of equality in black.")

Despite misgivings concerning test performance under model misspecification, both tests still perform strongly, with almost perfect detection at 15% defier proportions in the larger datasets. The apparent paradox between a noisier DGP and better test performance highlights why comparisons between tests across DGPs should be undertaken with caution. Power is a function of defier magnitude, defier proportion and the size of the error term, but only defier proportion and error variance are held constant across specifications.

sim_bin_indiv <- simulations_power_indiv %>% 
  mutate(gr=cut(pct_defiers_true, labels = FALSE, breaks= seq(0, 1, by = 0.01))/100) %>% 
  group_by(model,
           N,
           gr) %>%
  mutate(n= n()) %>%
  arrange(gr) %>% 
  mutate(n_rejected = ifelse(pval_holm < 0.05, 1, 0)) %>% 
  mutate(pct_rejected = sum(n_rejected)/n) 

sim_bin_indiv %>% 
  na.omit() %>% 
  ggplot(aes(x = gr,
             y = pct_rejected,
             colour = model,
             group = model)) +
  geom_point() + 
  geom_line() +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(size = "none")  +
  theme(legend.position = "bottom") +
  labs(y = "Proportion Null Rejected",
       x = "Defier Proportion",
       title = "Power As a Function of Defier Proportion") +
  facet_wrap(~N, ncol = 1) +
  scale_x_continuous(breaks = seq(from = 0, 1, 0.05),
                     labels = scales::percent_format(accuracy = 1))

Overall it seems both tests maintain reasonable levels of power under model misspecification, although the saturated first stage suffers a little more - again this could, in part, help explain the divergence between the two tests observed later.

Non-Linear Heterogeneity

source("simulations/code/non_linear_sim.R")
simulations_power_nl <- bind_rows(
                                  simulations_power_non_linear_small,
                                  simulations_power_non_linear_medium,
                                  simulations_power_non_linear_large
                                  ) %>% 
  as_tibble() %>% 
    mutate(N = factor(N, levels = c("Small", "Medium", "Large"))) 

rm(simulations_power_non_linear_small,
   simulations_power_non_linear_medium,
   simulations_power_non_linear_large)

Using the non-linear data generating process the magnitude of \(p\)-values increases significantly, i.e. \(p\)-values are much larger than in the subgroup and individually heterogeneous cases. Now the smallest \(p\)-values are around \(1 \times 10^{-35}\) as opposed to \(1 \times 10^{-250}\) as seen above. Interestingly, the honest causal forest seems to produce far smaller \(p\)-values at given defier proportions than the saturated first stage test. The honest causal forest test enjoys a comparative advantage at picking up non-linearities at even the tiniest sign of defiance.

simulations_power_nl %>% 
  unite(mod_N, model, N, remove = FALSE) %>% 
  filter(pct_defiers_true > 0) %>%
  filter(pval_defier > 1e-300) %>% 
  ggplot(aes(x = pct_defiers_true,
             y = pval_defier,
             group = mod_N)) +
  geom_point(alpha = 0.15, aes(colour = model)) +
  geom_hline(yintercept = 0.05,
             linetype = "longdash",
             alpha = 0.5) +
  scale_y_continuous(trans = negative_log_trans(10)) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(colour = "none") +
  theme_minimal() +
  facet_grid(N~model,
             scales = "free_y") +
  labs(x = "Defier Percentage",
       y = "P-Value",
       caption = "Note: Y-Axis on negative log(10) scale. \n Dashed line indicates a p-value of 0.05. \n Line of best fit in purple.",
       title = "P-Values from Power Simulations") +
  geom_smooth(method = lm,
              se = FALSE,
              alpha = 0.2,
              colour = "violetred")

In a non-linear context the saturated first stage struggles to detect defiance in all three of the datasets considered, whereas the honest causal forest only struggles in the very smallest dataset.

simulations_power_nl %>% 
  filter(pct_defiers_true > 0) %>% 
  ggplot(aes(x = pct_defiers_true,
             y = pval_defier,
             colour = model)) +
  geom_point(alpha = 0.15,
             aes(colour = model)) +
  geom_hline(yintercept = 0.05,
             linetype = "longdash") +
  scale_y_reverse() +
  geom_smooth(se = TRUE,
              colour = "violetred",
              alpha = 0.5) +
  scale_x_continuous(limits = c(0, 0.1),
                     labels = scales::percent_format(accuracy = 1)) +
  theme_minimal() +
  facet_grid(N~model, scales = "free_y") +
  guides(colour = "none") +
  labs(x = "Defier Percentage",
       y = "P-Value",
       title = "Power Simulation Results at Low Defier Proportion",
       caption = "Note: Y-axis scale inverted. \n Dashed line indicates a p-value of 0.05.")

The overestimation of the number of defiers seen under the individually heterogeneous DGP has vanished, and defier estimation using the honest causal forest appears consistent and well behaved throughout. Again, at first glance there’s reassuring evidence that the honest causal forest accurately estimates the number, if not the identity, of defiers in a dataset, and that test performance isn’t driven by “tail defiance”. However, the saturated first stage model now has the opposite issue - it systematically underestimates the number of defiers in a dataset.

simulations_power_nl %>% 
  ggplot(aes(x = n_defiers_true,
             y = n_defiers_estimated,
             colour = model)) +
  geom_point(alpha = 0.15) +
  facet_grid(N~model) +
  geom_abline(slope = 1,
              intercept = 0) +
  theme_minimal() +
  guides(colour = "none") +
  labs(x = "N True Defiers",
       y = "N Predicted Defiers",
       title = "Defier Predictions vs Reality",
       caption = "Note: Line of equality in black.")

The simulation results suggest the non-linear specification causes the most issues for the tests when N is small - this is particularly worrying considering two of the papers used later in the essay have roughly 1,400 and 2,000 datapoints each. However, in even moderately sized datasets both tests perform admirably. Again, the honest causal forest’s comparative advantage appears to be large-N, low-defier situations; however, it also holds an absolute advantage under the non-linear DGP - this is the only DGP considered where the honest causal forest outperformed the saturated first stage test in the smallest dataset.

sim_bin_nl <- simulations_power_nl %>% 
  mutate(gr = cut(pct_defiers_true,
                  labels = FALSE,
                  breaks = seq(0, 1, by = 0.01)) / 100) %>% 
  group_by(model,
           N,
           gr) %>%
  mutate(n = n()) %>%
  arrange(gr) %>% 
  mutate(n_rejected = ifelse(pval_defier < 0.05, 1, 0)) %>% 
  mutate(pct_rejected = sum(n_rejected) / n) 



sim_bin_nl %>% 
  na.omit() %>% 
  filter(pct_defiers_true < 0.35) %>% 
  ggplot(aes(x = gr,
             y = pct_rejected,
             colour = model,
             group = model)) +
  geom_point() + 
  geom_line() +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(size = "none")  +
  theme(legend.position = "bottom") +
  labs(y = "Proportion Null Rejected",
       x = "Defier Proportion",
       title = "Power As a Function of Defier Proportion") +
  facet_wrap(~N, ncol = 1) +
  scale_x_continuous(breaks = seq(from = 0, 1, 0.05),
                     labels = scales::percent_format(accuracy = 1))

Overall, both methods fare well under a range of data generating processes. Each test has individual strengths and weaknesses suggesting a thorough investigation of defiance should include both.

Empirical Results

## Angrist and Evans 1998

#### 1990 Census Data

df_90 <- read_sas("Angrist and Evans 90.sas7bdat")


# Cleaning data according to Angrist and Evans SAS replication codes:

clean_angrist_evans_data <- function(dataset){
  df_clean <- dataset %>% 
    filter((AGEM <= 35) & (AGEM >= 21)) %>% 
    filter(KIDCOUNT >= 2) %>% 
    filter(AGE2NDK >= 1) %>% 
    filter(AAGE == "0" & AAGE2ND == "0" & ASEX == "0" & ASEX2ND == "0") %>% 
    mutate(SEXK = as.numeric(SEXK),
           SEX2NDK = as.numeric(SEX2NDK),
           WEEK89D = as.numeric(WEEK89D),
           WEEK89M = as.numeric(WEEK89M),
           FERTIL = as.numeric(FERTIL)) %>% 
    mutate(fertdif = as.numeric(KIDCOUNT) - (FERTIL - 1),
           agefstm = as.numeric(AGEM) - as.numeric(AGEK)) %>% 
    filter(agefstm >= 15) %>% 
    filter(PWGTM1 > 0) %>% 
    mutate(samesex = (SEXK == SEX2NDK),
           morekids = (KIDCOUNT > 2)) %>% 
    rename_all(tolower)
  return(df_clean)
}
df_90_clean <- df_90 %>%
  clean_angrist_evans_data()
rm(df_90)

#### 1980 Census Data



df_80 <- read_sas("Angrist and Evans 80.sas7bdat")

df_80_clean <- df_80 %>%
    mutate(SEXK = as.numeric(SEXK),
         SEX2ND = as.numeric(SEX2ND),
         WEEKSD = as.numeric(WEEKSD),
         WEEKSM = as.numeric(WEEKSM),
         YOBM = as.numeric(YOBM),
         AGEQK = as.numeric(AGEQK),
         YOBD = 80 - as.numeric(AGED),
         YOBD = ifelse(QTRBTHD == 0, YOBD, YOBD - 1)) %>% 
  mutate(samesex = (SEXK == SEX2ND),
         morekids = (KIDCOUNT > 2),
         ageqm = 4*(80-YOBM) - as.numeric(QTRBTHM) - 1,
         ageqd = 4*(80 - YOBD) - as.numeric(QTRBKID),
         agefstm = round((ageqm-AGEQK)/4),
         agefstd = round((ageqd - AGEQK)/4),
         QTRMAR = as.numeric(QTRMAR),
         QTRBTHM = as.numeric(QTRBTHM),
         AGEMAR = as.numeric(AGEMAR),
         QTRBKID = as.numeric(QTRBKID),
         FERTIL = as.numeric(FERT)
         ) %>% 
  rename_all(tolower)




df_80_filtered <- df_80_clean %>% 
  filter((agem <= 35) & (agem >= 21)) %>% 
  filter(kidcount >= 2) %>%
  filter(ageq2nd > 4) %>% 
  filter(agefstm >= 15) %>%
  filter(agefstd >= 15 | is.na(agefstd)) %>%
  filter(asex == 0 &
         aage == 0 & 
         aqtrbrth == 0 &
         asex2nd == 0 &
         aage2nd == 0 &
         aqtrbrth == 0)



df_80_filtered_two <- df_80_filtered %>% 
  mutate(
         qtrmar = ifelse(qtrmar > 0, qtrmar - 1, qtrmar),
         yom = ifelse(qtrbthm <= qtrmar, yobm + agemar, yobm+agemar+1),
         dom_q = yom + (qtrmar/4),
         do1b_q = yobk + qtrbkid/4,
         illegit = ifelse(dom_q - do1b_q > 0, 1, 0)
         ) %>%  
  mutate(
         msample = ifelse(
           !is.na(aged) &
           timesmar == 1 &
           marital == 0 &
           illegit == 0 &
           agefstd >= 15 &
           agefstm >= 15,
           1, 0)
         ) %>% 
  filter(msample == 1)

######################

df_80_subset <- df_80_filtered_two %>% 
  rename(kid_age = agek,
         m_age = agem,
         d_age = aged) %>% 
  select(-starts_with("a")) %>% 
  mutate_at(c("racek",
              "birthplk",
              "schoolk",
              "state",
              "spanishm",
              "spanishd",
              "poverty"), factor) %>% 
  mutate_if(is.character, as.numeric)



df_80_final <- df_80_subset %>% 
  mutate(boy1st = (sexk == 0),
         boy2nd = (sex2nd == 0),
         boys2 = (sexk == 0) & (sex2nd == 0),
         girls2 = (sexk == 1) & (sex2nd == 1),
         samesex = boys2 | girls2,
         morekids = kidcount > 2,
         black_m = (racem == 2),
         hisp_m = (racem == 12),
         white_m = (racem == 01),
         other_race_m = 1 - black_m - hisp_m - white_m,
         black_d = (raced == 2),
         hisp_d = (raced == 12),
         white_d = (raced == 1),
         other_race_d = 1 - black_d - hisp_d - white_d,
         worked_m = (weeksm > 0),
         worked_d = (weeksd > 0),
         income_m = 2.099173554*(income1m + pmax(0, income2m)),
         income_d = (income1d + pmax(0, income2d)),
         fam_inc = pmax(faminc*2.099173554, 1),
         nonmoi = fam_inc - income1m*2.099173554,
         nonmomil = log(pmax(1, nonmoi)))
rm(df_80,
   df_80_clean,
   df_80_filtered,
   df_80_filtered_two,
   df_80_subset)



## data transformations
transform_clean_angrist_evans <- function(dataset) {
  transformed_df <- dataset %>%
    rename(
      kid_age = agek,
      kid2_age = age2ndk,
      m_age = agem,
      d_age = aged
    ) %>%
    select(-starts_with("a")) %>%
    mutate_at(c(
      "racek",
      "birthplk",
      "schoolk",
      "state",
      "hispm",
      "hispd",
      "poverty",
      "pobm",
      "hispd",
      "pobd",
      "hispk"
    ), factor) %>%
    mutate_if(is.character, as.numeric) %>%
    mutate(
      boy1st = (sexk == 0),
      boy2nd = (sex2ndk == 0),
      boys2 = (sexk == 0) & (sex2ndk == 0),
      girls2 = (sexk == 1) & (sex2ndk == 1),
      samesex = boys2 | girls2,
      morekids = kidcount > 2,
      black_m = (racem == 2),
      hisp_m = (racem == 12),
      white_m = (racem == 01),
      other_race_m = 1 - black_m - hisp_m - white_m,
      black_d = (raced == 2),
      hisp_d = (raced == 12),
      white_d = (raced == 1),
      other_race_d = 1 - black_d - hisp_d - white_d,
      worked_m = (week89m > 0),
      worked_d = (week89d > 0)
    )

  return(transformed_df)
}

df_90_subset <- df_90_clean %>%
  transform_clean_angrist_evans()

  
  
  
df_90_final <- df_90_subset %>% 
  mutate(
         income_m = 1.2883*(incomem1 + pmax(0, incomem2)),
         income_d = (incomed1 + pmax(0, incomed2)),
         fam_inc = pmax(faminc*1.2883, 1),
         nonmoi = fam_inc - incomem1*1.2883,
         nonmomil = log(pmax(1, nonmoi))
  )

rm(df_90_clean, df_90_subset)




## Maimonides Rule


# Mention somewhere HAC instead of moulton factor adjustment - only makes a large difference in the discontinuity sample.

#### Cleaning Grade 5
clean_maimonides_data <- function(dataset){
  clean_df <- dataset %>%
    mutate(
      avgverb = ifelse(avgverb > 100, avgverb - 100, avgverb),
      avgmath = ifelse(avgmath > 100, avgmath - 100, avgmath),
      func1 = c_size/(as.integer((c_size-1)/40)+1),
      func2 = cohsize/(as.integer(cohsize/40)+1),
      avgverb = ifelse(verbsize == 0, NA, avgverb),
      passverb = ifelse(verbsize == 0, NA, passverb),
      avgmath = ifelse(mathsize == 0, NA, avgmath),
      passmath = ifelse(mathsize == 0, NA, passmath),
      disc = (c_size >= 36 & c_size <= 45) | 
             (c_size >= 76 & c_size <= 85) |
             (c_size >= 116 & c_size <= 125),
      all = 1,
      c_size_squared = (c_size^2)/100,
      trend = ifelse(c_size >= 0 & c_size <= 40, c_size, NA),
      trend = ifelse(c_size >= 41 & c_size <= 80, 20 + (c_size/2), trend),
      trend = ifelse(c_size >= 81 & c_size <= 120, (100/3) + (c_size/3), trend),
      trend = ifelse(c_size >= 121 & c_size <= 160, (130/3) + (c_size/4), trend)
    ) %>% 
    filter(
      classize > 1 & classize < 45 & c_size > 5
    ) %>% 
    filter(
      c_leom == 1 & c_pik < 3
    )
  return(clean_df)
  
}

df_final5_cleaned <- read_dta("Angrist and Lavy Grade 5.dta") %>% 
  clean_maimonides_data()





#### Cleaning Grade 4
df_final4_cleaned <- read_dta("Angrist and Lavy Grade 4.dta") %>% 
  clean_maimonides_data()

Saturated First Stage Test

Angrist and Evans

Angrist and Evans (1998) explores the effect of fertility on household labour supply - they instrument fertility, the number of children a household has, with a same-sex indicator that takes the value 1 if a household’s first two children are of the same sex and 0 otherwise. The paper’s findings suggest that having an additional child substantially reduces a mother’s labour supply.

In this context, a complier is a couple that is more likely to have an additional child if their current children are the same sex - there is a well documented parental preference for a mixed sibling sex composition. However, for the LATE estimate to be valid there cannot be any couples who are induced not to have an additional child because their current children are the same sex.

A plausible source of potential defiance is economies of scale in raising same-sex children, which could be lost by having a third child of ex-ante unknown sex. In this case, couples may be induced to stop having children precisely because both of their current offspring are of the same sex.

AE_data_80 <- df_80_final %>% 
  select(
    samesex,
    morekids,
    black_m,
    hisp_m,
    other_race_m,
    black_d,
    hisp_d,
    other_race_d,
    gradem,
    graded,
    m_age,
    d_age
  ) %>% 
  na.omit()

AE_data_90 <- df_90_final %>% 
  select(samesex,
         morekids,
         black_m,
         hisp_m,
         other_race_m,
         black_d,
         hisp_d,
         other_race_d,
         yearschm,
         yearschd, 
         m_age,
         d_age,
         pwgtm1,
         pwgtd1) %>% 
  na.omit()





first_stage_interactions_80 <- AE_data_80 %>% 
  run_first_stage_interactions_fast(dataset = .,
                                    dependent_variable = "morekids",
                                    instrument = "samesex",
                                    vcov_func = vcovHC,
                                    n_tiles = 3) %>% 
  mutate(dataset = "1980") %>% 
  select(dydx_instrument, SE_dydx_instrument, pval_one_neg, dataset, dydx_lo, dydx_hi, t_stat, n_comparisons, pval_neg_holm)
first_stage_interactions_90 <- AE_data_90 %>% 
  run_first_stage_interactions_fast(dataset = .,
                                   dependent_variable = "morekids",
                                   instrument = "samesex",
                                   vcov_func = vcovHC,
                                   n_tiles = 3) %>% 
  mutate(dataset = "1990") %>% 
  select(dydx_instrument, SE_dydx_instrument, pval_one_neg, dataset, dydx_lo, dydx_hi, t_stat, n_comparisons, pval_neg_holm) 

first_stage_interactions_AE <- bind_rows(
  first_stage_interactions_90,
  first_stage_interactions_80) 

## Number of theoretical comparisons = 2^6 * 3^4 = 5184

## Adjusting confidence intervals for multiple comparisons

first_stage_interactions_AE <- first_stage_interactions_AE %>% 
  mutate(dydx_lo_a = dydx_instrument - SE_dydx_instrument*qt(1-(0.05/2)/n_comparisons, df = n()/2),
         dydx_hi_a = dydx_instrument +  SE_dydx_instrument*qt(1-(0.05/2)/n_comparisons, df = n()/2)) 

Angrist and Evans (1998) run a first stage regression of the form: \[ morekids_i = \alpha + \beta samesex_i + \gamma X_i + u_i \] where \(X_i\) is a vector of controls including age, age at first birth, and indicators for Boy 1st, Boy 2nd, Black, Hispanic, and Other race. In addition to this, I include controls for parental years of schooling and, in the 1990 sample, parental census person weights.
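To make the saturated specification concrete, the stripped-down sketch below interacts the instrument with a handful of discretised covariates and recovers observation-level partial effects via the margins package. This is an illustration only - it is not the run_first_stage_interactions_fast() helper used above, which saturates over the full covariate set and additionally computes robust standard errors and Holm-adjusted one-sided \(p\)-values.

## Illustration only: a stripped-down saturated first stage for the 1980 sample.
fs_data_80 <- AE_data_80 %>%
  mutate(samesex  = as.numeric(samesex),
         morekids = as.numeric(morekids),
         m_age_tercile = factor(ntile(m_age, 3)),
         d_age_tercile = factor(ntile(d_age, 3)))

fs_saturated <- lm(morekids ~ samesex * (black_m + hisp_m + other_race_m +
                                           m_age_tercile + d_age_tercile),
                   data = fs_data_80)

## Observation-level partial effects of the instrument implied by the interactions
dydx_samesex <- margins::marginal_effects(fs_saturated, variables = "samesex")$dydx_samesex
summary(dydx_samesex)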

The distribution of estimated partial effects is plotted below. Instrument partial effects appear to be remarkably homogeneous; there’s very little dispersion around the mean partial effects of 0.067 and 0.068 in the two census samples. Promisingly, partial effects below zero appear to be nonexistent.
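These summary figures can be read straight off the interaction output:

## Mean and minimum estimated instrument partial effects by census sample
first_stage_interactions_AE %>% 
  group_by(dataset) %>% 
  summarise(mean_dydx = mean(dydx_instrument),
            min_dydx  = min(dydx_instrument))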

colour_pair <- c("hotpink", "darkorange")

first_stage_interactions_AE %>% 
  ggplot(aes(x = dydx_instrument, fill = dataset)) +
  geom_histogram(colour = "black") +
  facet_wrap(~dataset, scales = "free") +
  theme_minimal() +
  guides(fill = "none") +
  scale_fill_manual(values = colour_pair) +
  labs(x = "Instrument Partial Effect",
       title = "Partial Effects - Same-Sex Instrument")

Ranking instrument partial effects and their corresponding confidence intervals shows two things. First, all the estimated partial effects are positive. Second, effects are estimated with large differences in precision. This makes sense: if one subgroup has less data, or less precisely estimated effects, this shows up in the uncertainty of its partial effect estimate.

first_stage_interactions_AE %>% 
  group_by(dataset) %>% 
  sample_n(10000) %>% 
  arrange(dydx_instrument) %>% 
  mutate(rank = row_number()) %>% 
  ggplot(aes(x = rank, y = dydx_instrument, ymin = dydx_lo_a, ymax = dydx_hi_a)) +
  geom_point(aes(colour = dataset)) +
  geom_ribbon(alpha = 0.2) +
  theme_minimal() +
  facet_wrap(~dataset, scales = "free_x") +
  geom_hline(yintercept = 0, linetype = "longdash") +
  guides(colour = "none") +
  scale_colour_manual(values = colour_pair) +
  labs(y = "Instrument Partial Effect",
       title = "Partial Effects Ranked - Same-Sex Instrument")

The histogram of \(p\)-values from the one-sided test of the null hypothesis of no defiers is rather uninformative and is only displayed here as further motivation for the negative log transform of \(p\)-value quantiles used for the remainder of the essay. Notably, most \(p\)-values are very close to 1 - as we’d expect when the vast majority of partial effects are positive and significant.

first_stage_interactions_AE %>% 
  ggplot(aes(x = pval_one_neg, fill = dataset)) +
  geom_histogram(colour = "black") +
  facet_wrap(~dataset, scales = "free") +
  theme_minimal() +
  scale_fill_manual(values = colour_pair) +
  guides(fill = "none") +
  labs(x = "P-Value Defier",
       title = "Histogram of One-Sided P-Values - Same-Sex Instrument")

first_stage_interactions_AE %>% 
  group_by(dataset) %>% 
  sample_n(10000) %>% 
  ggplot(aes(sample = pval_one_neg, colour = dataset)) +
  stat_qq(distribution = qunif) +
  geom_abline(intercept = 0, slope = 1) +
  theme_minimal() +
  scale_x_continuous(trans=negative_log_trans(10)) +
  scale_y_continuous(trans=negative_log_trans(10)) +
  labs(title = "QQ plot of P-values - Same-Sex Instrument",
       caption = "Note: Both axes use negative log(10) scale.") +
  facet_wrap(~dataset) +
  guides(colour = "none") +
  scale_colour_manual(values = colour_pair)

Above is the quantile-quantile plot of unadjusted \(p\)-values. The plot displays the sample versus theoretical quantiles of the partial effect \(p\)-values against a uniform reference distribution. When the null hypothesis is true, \(p\)-values are uniformly distributed and the empirical CDF is therefore a straight line from 0 to 1 with slope 1.

This method of visualisation is popular in Genome Wide Association Studies (GWAS) when displaying thousands of \(p\)-values. Typically analysts are interested in the hypotheses corresponding to small \(p\)-values (i.e. statistically significant \(p\)-values) and therefore both axes are inverted on a log-log scale. This means the “most significant” \(p\)-values will appear at the top right of the plots.

\(p\)-values hugging the line of equality indicate evidence in favour of the null hypothesis. We’d expect a strong first stage to exhibit near horizontal movement to the right - there are many more large \(p\)-values in the sample than we’d expect theoretically under the null. Therefore, in this essay evidence of defiance would appear as a horizontal line, due to strong first stages in all papers considered, followed by curving back towards or even past the line of equality.
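As a toy illustration of this logic (entirely separate from the empirical analysis), the snippet below compares \(p\)-values drawn under a true null with one-sided \(p\)-values from a strong, defier-free first stage; the latter bunch near 1 and trace out the near-horizontal pattern described above.

## Toy illustration of the GWAS-style QQ plot: null-distributed p-values versus
## one-sided p-values from a strong, defier-free first stage (t-stats around 5).
toy_pvals <- tibble(
  `Null true`          = runif(5000),
  `Strong first stage` = pt(rnorm(5000, mean = 5), df = 1000)
) %>% 
  gather(scenario, pval)

toy_pvals %>% 
  ggplot(aes(sample = pval, colour = scenario)) +
  stat_qq(distribution = qunif) +
  geom_abline(intercept = 0, slope = 1) +
  scale_x_continuous(trans = negative_log_trans(10)) +
  scale_y_continuous(trans = negative_log_trans(10)) +
  theme_minimal() +
  labs(title = "Toy QQ Plot - Strong First Stage vs Null",
       caption = "Note: Both axes use negative log(10) scale.")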

This tool is useful for two reasons - it displays the strength of the first stage but also evidence of defiance in a scale invariant manner. In Angrist and Evans’ paper we see both examples to a degree. In the 1980 sample there’s a strong first stage with no evidence of defiance. The 1990 sample exhibits a little curve back because some partial effects are smaller and closer to 0 than the others, but no evidence of actual defiance - the smallest \(p\)-values are far from reasonable levels of statistical significance.

Maimonides’ Rule

Angrist and Lavy (1999) use Maimonides’ rule, which dictates that maximum class size should be 40 students, as an instrument for actual class size in a fuzzy regression discontinuity design exploring the effect of class size on test scores in English and maths. Their first stage takes the following form:

\[ class\ size_i = \alpha + \beta predicted\ class\ size_i + \gamma_1 enrollment_i + \gamma_2 enrollment^2_i + \gamma_3 pct\ disadvantaged_i + u_i \] where predicted class size is the class size according to Maimonides’ rule. Due to the difficulties of accurately identifying interaction effects with limited data I choose to use the full sample (parametric fuzzy RDD) rather than the discontinuity sample with only approximately 400 observations.
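The rule itself is the func1 variable constructed in the cleaning step above; a few worked values show the sawtooth pattern that generates the instrument’s variation:

## Maimonides' rule: predicted class size given beginning-of-year enrollment,
## exactly as computed in func1 above.
maimonides_rule <- function(c_size) c_size / (as.integer((c_size - 1) / 40) + 1)

maimonides_rule(c(40, 41, 80, 81, 120, 121))
## 40 pupils -> one class of 40; 41 pupils -> two classes averaging 20.5;
## 81 pupils -> three classes of 27; 121 pupils -> four classes of 30.25.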

In the context of Angrist and Lavy’s research design, defiance would require that a fall in predicted class size, driven by Maimonides’ essentially arbitrary cut-off, leads to an increase rather than a decrease in actual class size. Perhaps, upon enrollment exceeding the cut-off, a school principal decides that class sizes are already breaking Maimonides’ rule and therefore more children might as well be added to a class to economise on resources.

It’s hard to argue that defiers are likely to be present in this dataset, particularly with the covariates available. Furthermore, each grade’s sample contains only around 2,000 datapoints, so simulations suggest test power is relatively low.

first_stage_interactions_maim_5 <-  df_final5_cleaned %>% 
    select(classize,
         tipuach,
         c_size,
         func1) %>% 
  run_first_stage_interactions_fast(dataset = .,
                                    dependent_variable = "classize",
                                    instrument = "func1",
                                    vcov_func = vcovCL,
                                    cluster = df_final5_cleaned$classid,
                                    n_tiles = 3) %>% 
  mutate(dataset = "Fifth Grade") %>% 
  select(dydx_instrument, SE_dydx_instrument, pval_one_neg, dataset, dydx_lo, dydx_hi, t_stat, pval_neg_holm, n_comparisons)

first_stage_interactions_maim_4 <- df_final4_cleaned %>% 
  select(classize,
         tipuach,
         c_size,
         func1) %>% 
  run_first_stage_interactions_fast(dataset = .,
                                   dependent_variable = "classize",
                                   instrument = "func1",
                                   vcov_func = sandwich::vcovCL,
                                   cluster = df_final4_cleaned$classid,
                                   n_tiles = 3) %>% 
  mutate(dataset = "Fourth Grade") %>% 
  select(dydx_instrument, SE_dydx_instrument, pval_one_neg, dataset, dydx_lo, dydx_hi, t_stat, pval_neg_holm, n_comparisons)

first_stage_interactions_maimonides <- bind_rows(first_stage_interactions_maim_5, first_stage_interactions_maim_4) 


# N comparisons = 3^2 = 9

first_stage_interactions_maimonides <- first_stage_interactions_maimonides %>% 
    mutate(dydx_lo_a = dydx_instrument - SE_dydx_instrument*qt(1-(0.05/2)/n_comparisons, df = n()/2),
         dydx_hi_a = dydx_instrument +  SE_dydx_instrument*qt(1-(0.05/2)/n_comparisons, df = n()/2)) 

The distribution of instrument partial effects using Maimonides’ rule appears definitive; there’s no evidence of defiers in either fourth or fifth grade classes. Ranking the instrument partial effects leads to very similar conclusions - in fact, there are no estimated defiers in either dataset. There are fewer partial effects estimated using Angrist and Lavy’s dataset since fewer covariates are available, which leads to sparsity of estimated effects and the discrete jumps plotted below.

first_stage_interactions_maimonides %>% 
  ggplot(aes(x = dydx_instrument, fill = dataset)) +
  geom_histogram(colour = "black") +
  facet_wrap(~dataset, scales = "free") +
  theme_minimal() +
  guides(fill = "none") +
  scale_fill_manual(values = colour_pair) +
  labs(x = "Instrument Partial Effects",
       title = "Partial Effects - Maimonides' Instrument")

# first_stage_interactions_maimonides %>% 
#   ggplot(aes(x = pval_one_neg, fill = dataset)) +
#   geom_histogram(colour="black") +
#   facet_wrap(~dataset, scales = "free") +
#   guides(fill = "none") +
#   theme_minimal() +
#   labs(title = "P-Values - Maimonides' Instrument",
#        x = "P-Value") +
#   scale_fill_manual(values = colour_pair)

The quantile-quantile plot is horizontal throughout, again, indicative of a strong first stage with no evidence of defiers.

first_stage_interactions_maimonides %>% 
  group_by(dataset) %>% 
  arrange(dydx_instrument) %>% 
  mutate(rank = row_number()) %>% 
  ggplot(aes(x = rank, y = dydx_instrument, ymin = dydx_lo_a, ymax = dydx_hi_a)) +
  geom_point(aes(colour = dataset)) +
  geom_ribbon(alpha = 0.2) +
  theme_minimal() +
  facet_wrap(~dataset, scales = "free_x") +
  geom_hline(yintercept = 0, linetype = "longdash") +
  labs(y = "Partial Effect",
       title = "Partial Effect Ranked - Maimonides' Instrument",
       caption = "Note: Standard errors clustered by class.") +
  scale_colour_manual(values = colour_pair) +
  guides(colour = "none")

first_stage_interactions_maimonides %>% 
  group_by(dataset) %>% 
  ggplot(aes(sample = pval_one_neg, colour = dataset)) +
  stat_qq(distribution = qunif) +
  geom_abline(intercept = 0, slope = 1) +
  theme_minimal() +
  scale_x_continuous(trans=negative_log_trans(10), limits = c(1, 0.000001)) +
  scale_y_continuous(trans=negative_log_trans(10), limits = c(1, 0.000001)) +
  labs(title = "QQ plot of P-values - Maimonides' Instrument",
       caption = "Note: Both axes negative log(10) scale. 
       \n Standard errors clustered by class.") +
  facet_wrap(~dataset) +
  scale_colour_manual(values = colour_pair) +
  guides(colour = "none")

Autor Dorn Hanson

In Autor, Dorn, and Hanson (2013) the authors investigate the effect of Chinese import competition on US labour markets. They use a set of exogenous shocks, or shifts, to which industries are differentially exposed, to obtain causal estimates. The identification strategy is a so-called “Bartik” or “shift-share” design - instrumenting the change in US imports from China with the corresponding change in Chinese imports in other high-income countries to generate plausibly exogenous variation.

This context is probably the most likely to be susceptible to defiers - it’s plausible that US imports from China won’t react monotonically to changes in third-party Chinese import trends.

df_china <- read_dta("Autor Dorn and Hanson China.dta")
first_stage_interactions_ADH_manu <- df_china %>% 
  select(d_sh_empl_mfg,
         d_tradeusch_pw,
         d_tradeotch_pw_lag,
         l_shind_manuf_cbp,
         starts_with("reg"),
         l_sh_popedu_c,
         l_sh_popfborn,
         l_sh_empl_f,
         l_sh_routine33,
         l_task_outsource,
         t2) %>% 
  run_first_stage_interactions_fast(dataset = .,
                                    dependent_variable = "d_tradeusch_pw",
                                    instrument = "d_tradeotch_pw_lag",
                                    weights = df_china$timepwt48,
                                    vcov_func = vcovCL,
                                    cluster = df_china$statefip,
                                    n_tiles = 3) %>% 
  mutate(dataset = "ADH") %>% 
  select(dydx_instrument, SE_dydx_instrument, pval_one_neg,  dataset, dydx_lo, dydx_hi, t_stat, pval_neg_holm, n_comparisons)

# N_comparisons theoretical: 3^7 * 2 = 4374, empirically 571

first_stage_interactions_ADH_manu <- first_stage_interactions_ADH_manu %>% 
    mutate(dydx_lo_a = dydx_instrument - SE_dydx_instrument*qt(1-(0.05/2)/n_comparisons, df = n()),
         dydx_hi_a = dydx_instrument +  SE_dydx_instrument*qt(1-(0.05/2)/n_comparisons, df = n())) 

Autor, Dorn, and Hanson (2013)’s dataset is by far the smallest considered in the essay, so simulation results suggest test power will be relatively low and the saturated first stage test should have a comparative advantage over the causal forest test used next. Even with this caveat, however, the distribution of instrument partial effects seems to display a clear presence of defiers.

first_stage_interactions_ADH_manu %>% 
  ggplot(aes(x = dydx_instrument, fill = dataset)) +
  geom_histogram(colour = "black") +
  guides(fill = "none") +
  scale_fill_manual(values = colour_pair) +
  theme_minimal() +
  labs(x = "Instrument Partial Effects",
       title = "Partial Effects - Bartik Instrument")

# first_stage_interactions_ADH_manu %>% 
#   ggplot(aes(x = pval_one_neg, fill = dataset)) +
#   geom_histogram(colour = "black") +
#   scale_fill_manual(values = colour_pair) +
#   labs(x = "P-Value",
#        title = "P-Values - Bartik Instrument") +
#   theme_minimal()

This seems to be the most “defiant” dataset yet, with a large number of defiers present in the negative tail. The partial effects are relatively precisely estimated, with little variation in precision across effects. However, there’s still too much uncertainty to definitively conclude that defiers are present in Autor, Dorn and Hanson’s dataset. Therefore, whilst the model identifies potential defiers, we are unable to conclude with statistical certainty that the monotonicity assumption fails.

first_stage_interactions_ADH_manu %>% 
  arrange(dydx_instrument) %>% 
  mutate(rank = row_number()) %>% 
  ggplot(aes(x = rank, y = dydx_instrument, ymin = dydx_lo_a, ymax = dydx_hi_a)) +
  geom_point(aes(colour = dataset)) +
  geom_ribbon(alpha = 0.2) +
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "longdash") +
  scale_colour_manual(values = colour_pair) +
  labs(title = "Partial Effects Ranked  Bartik Instrument",
       y = "Partial Effect",
       caption = "Note: Standard errors clustered by state.") +
  guides(colour = "none")

The quantile-quantile plot shows reversion towards the line of equality, and the smallest \(p\)-value sits on the margin of conventional significance levels before any multiple comparison adjustment. The QQ plot also suggests ADH’s first stage isn’t particularly strong - the horizontal stretch of \(p\)-values makes up a relatively small proportion of the plot and there’s a whip-like curve back indicative of defier presence.

first_stage_interactions_ADH_manu %>% 
  ggplot(aes(sample = pval_one_neg, colour = dataset)) +
  stat_qq(distribution = qunif) +
  geom_abline(intercept = 0, slope = 1) +
  theme_minimal() +
  scale_x_continuous(trans=negative_log_trans(10)) +
  scale_y_continuous(trans=negative_log_trans(10)) +
  labs(title = "QQ plot of P-values - Bartik Instrument",
       caption = "Note: Both axes use negative log(10) scale.  Standard errors clustered by state.") +
  guides(colour = "none") +
  scale_colour_manual(values = colour_pair) 

Results

first_stage_results <- suppressWarnings(bind_rows(
  first_stage_interactions_AE %>% 
    mutate(paper = "Same-Sex"),
  first_stage_interactions_maimonides %>% 
    mutate(paper = "Maimonides' Rule"),
  first_stage_interactions_ADH_manu %>% 
    mutate(paper = "Shift-Share")
))

## TEST
# 
# a <- first_stage_results %>% 
#   group_by(paper,
#            dataset) %>% 
#   filter(pval_one_neg == min(pval_one_neg))
# 
# b <- first_stage_results %>% 
#   group_by(paper,
#            dataset) %>% 
#   filter(t_stat == min(t_stat))
# 
# all_equal(a, b)




first_stage_summary <- first_stage_results %>% 
  group_by(paper,
           dataset) %>% 
  summarise(estimated_defiers = sum(ifelse(dydx_instrument < 0, 1, 0)),
            n = n(),
            pct_defiers = estimated_defiers / n,
         defier_pval = min(pval_one_neg),
         defier_pval_adjusted = min(pval_neg_holm),
         defier_tstat = min(t_stat)) %>% 
  ungroup() %>% 
  mutate(dataset = factor(dataset,
                          levels = c("1980",
                                     "1990",
                                     "Fourth Grade",
                                     "Fifth Grade",
                                     "ADH")))

adjusted_p_vals <- first_stage_summary %>% 
  filter(paper == "Same-Sex") %>%
  mutate(p_adj = p.adjust(defier_pval_adjusted, "holm")) %>%
  select(p_adj) %>% 
  pull() %>% 
  round(., 3)

In conclusion, the saturated first stage test delivers similar results across the three papers. There are no estimated defiers in the first two papers considered, and only around 6% of observations are estimated defiers in Autor, Dorn and Hanson’s paper. Even without any multiple comparison adjustment we cannot reject the null of no defiance in Autor, Dorn and Hanson’s paper, with a \(p\)-value of 0.052. After adjusting for multiple comparisons, the evidence of defiance is clearly statistically insignificant.
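The Holm logic behind the adjusted column is easy to verify by hand: with the 571 interaction cells actually estimated in ADH’s first stage, even the smallest unadjusted \(p\)-value in the table is pushed to 1.

## Holm adjustment of ADH's smallest unadjusted p-value given 571 comparisons
p.adjust(0.052, method = "holm", n = 571)
## [1] 1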

The evidence in favour of the null using Maimonides’ rule and the same-sex instrumental variables strategy is particularly strong - the smallest unadjusted \(p\)-value is 0.611, in the 1990 same-sex sample, and every adjusted \(p\)-value equals 1.

colnames(first_stage_summary) <- c("Paper",
                                   "Dataset",
                                   "Defiers",
                                   "N",
                                   "% Defier",
                                   "P-Value",
                                   "Adjusted P-Value",
                                   "T-Stat")

temp_df <- as_tibble(cbind(nms = names(first_stage_summary), t(first_stage_summary)))


colnames(temp_df) <- temp_df[2, ]

temp_df <- temp_df %>% 
  filter(Dataset != "Paper" & Dataset != "Dataset") %>% 
  mutate_at(vars(-Dataset),
            as.numeric)

temp_df %>% 
  gt(rowname_col = "Dataset")  %>%
  tab_spanner(
    label = "Same-Sex",
    columns = vars(`1980`, `1990`)
  ) %>%
  tab_spanner(
    label = "Maimonides' Rule",
    columns = vars("Fourth Grade", "Fifth Grade")
  ) %>%
  tab_spanner(
    label = "Shift-Share",
    columns = vars("ADH")
  ) %>% 
  fmt_number(
    columns = vars(`1990`,
                   `1980`,
                   `Fourth Grade`,
                   `Fifth Grade`,
                   `ADH`),
    decimals = 3,
    drop_trailing_zeros = TRUE
    ) %>% 
  fmt_percent(
    columns = vars(`1980`,
                   `1990`,
                   `Fourth Grade`,
                   `Fifth Grade`,
                   `ADH`),
    rows = 3
  ) %>% 
  tab_header(
    title = "Saturated First Stage Results"
  ) 
Saturated First Stage Results

                   Maimonides' Rule            Same-Sex              Shift-Share
                   Fifth Grade   Fourth Grade  1980       1990       ADH
Defiers            0             0             0          0          84
N                  2,024         2,053         255,010    302,836    1,444
% Defier           0.00%         0.00%         0.00%      0.00%      5.82%
P-Value            1             1             1          0.611      0.052
Adjusted P-Value   1             1             1          1          1
T-Stat             3.929         4.419         3.507      0.283      -1.627
rm(temp_df)

The \(p\)-values reported in the table can be interpreted as the probability, under the null of no defiers, of observing a partial effect at least as extreme as the most defiant subgroup in the sample (formalised in the expression below). Both Angrist and Lavy’s and Autor, Dorn and Hanson’s datasets are relatively small in comparison to the census data used by Angrist and Evans, where the evidence in favour of the null seems rather definitive. Overall, it would seem that the saturated first stage test is unable to detect the presence of defiers in the datasets considered.
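In the random coefficient notation used earlier, each reported \(p\)-value corresponds to

\[ p = \min_{g} \Pr\left(T \leq t_g \mid H_0\right), \qquad t_g = \frac{\hat{\pi}_{1g}}{\widehat{SE}\left(\hat{\pi}_{1g}\right)} \]

where \(\hat{\pi}_{1g}\) is the estimated instrument partial effect for subgroup \(g\) and \(T\) follows the reference \(t\)-distribution.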

Honest Causal Forest Test

Angrist and Evans

N_obs <- 10000
df_rf_AE_80 <- AE_data_80 %>% 
  sample_n(N_obs)

X_matrix_AE_80 <- df_rf_AE_80 %>% 
  select(-samesex, -morekids) %>% 
  as.matrix()

forest_first_stage_AE_80 <- causal_forest(X = X_matrix_AE_80,
                                    Y = df_rf_AE_80$morekids,
                                    W = df_rf_AE_80$samesex,
                                    num.trees = 4000) 
tau_hat_oob_AE_80 <- predict(forest_first_stage_AE_80, estimate.variance = TRUE) %>%
  as_tibble() %>% 
  mutate(dataset = "1980")

## Now 90s

df_rf_AE_90 <- AE_data_90 %>% 
  sample_n(N_obs)

X_matrix_AE_90 <- df_rf_AE_90 %>% 
  select(-samesex, -morekids) %>% 
  as.matrix()

forest_first_stage_AE_90 <- causal_forest(X = X_matrix_AE_90,
                                          Y = df_rf_AE_90$morekids,
                                          W = df_rf_AE_90$samesex,
                                          num.trees = 4000)
tau_hat_oob_AE_90 <- predict(forest_first_stage_AE_90, estimate.variance = TRUE) %>% 
  as_tibble() %>% 
  mutate(dataset = "1990")

tau_hat_oob_AE <- bind_rows(tau_hat_oob_AE_90,
                            tau_hat_oob_AE_80)

The estimated heterogeneous instrument effects produce very different results from the saturated test using Angrist and Evans (1998)’s same-sex data. Whilst comparing instrument partial effects and heterogeneous treatment effects isn’t quite an apples-to-apples comparison, it’s surprising that the results are so different.

There’s significant histogram mass below 0, with non-negligible mass appearing at -0.1 in the 1990 census data. Moving on to the ranked heterogeneous effects, in the 1990 sample almost half of the observed individuals appear to be defiers. The increase in uncertainty from estimating individual heterogeneous effects is stark. However, it’s still surprising that so many defiers appear to be present when virtually no individuals were identified as defiers by the first test. Perhaps there’s significant individual-level heterogeneity that is averaged away at the subgroup level by the saturated first stage model.
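Before reading too much into these point estimates, one optional sanity check - a suggested diagnostic rather than part of the defier test itself - is grf’s omnibus calibration test, which asks whether the forest’s out-of-bag predictions capture genuine heterogeneity rather than noise.

## Optional diagnostic: grf's "best linear predictor" calibration test.
## A significant differential.forest.prediction coefficient suggests the forest
## is picking up genuine first-stage heterogeneity.
test_calibration(forest_first_stage_AE_80)
test_calibration(forest_first_stage_AE_90)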

ggplot(tau_hat_oob_AE,
       aes(x = predictions,
           fill = dataset)) +
  geom_histogram(colour = "black") +
  facet_wrap(~dataset) +
  scale_fill_manual(values = colour_pair) +
  labs(x = "Heterogeneous Effects",
       title = "Heterogeneous Effects - Same-Sex Instrument") +
  theme_minimal() +
  guides(fill = "none")

tau_hat_results <- tau_hat_oob_AE %>%
  group_by(dataset) %>% 
  arrange(predictions) %>% 
  mutate(sigma_hat = sqrt(variance.estimates),
         prediction_lo = predictions - 1.96*sigma_hat,
         prediction_hi = predictions + 1.96*sigma_hat,
         prediction_lo_a = predictions - sigma_hat*qt(1-(0.05/2)/(n()^2), df = n()),
         prediction_hi_a = predictions + sigma_hat*qt(1-(0.05/2)/(n()^2), df = n()), 
         t_stat = predictions / sigma_hat,
         critical_value = qt(0.95, N_obs - 13, lower.tail = FALSE),
         pval = pt(t_stat, df = N_obs - 13),
         pval_holm = p.adjust(pval, method = "holm", n = n()^2),
         reject_H0 = ifelse(pval_holm < 0.05, 1, 0),
         rank = row_number())

tau_hat_results %>% 
  ggplot(aes(x = rank, y = predictions, ymin = prediction_lo_a, ymax = prediction_hi_a)) + 
  geom_point(aes(colour = dataset)) +
  geom_ribbon(alpha = 0.1) +
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "longdash") +
  facet_wrap(~dataset) +
  scale_colour_manual(values = colour_pair) +
  labs(y = "Heterogeneous Effects",
       title = "Heterogeneous Effects Ranked - Same-Sex Instrument") +
  guides(colour = "none")

The QQ plot of \(p\)-values in the 1990 sample almost perfectly fits the line of equality, followed by movement above the line at around the 0.01 theoretical quantile. The causal forest method produces slightly different results for the 1980 sample. Here there’s little evidence of the horizontal CDF we’ve grown to expect, and only limited evidence of defiers, in stark contrast to the 1990 sample.

Such a large difference in results, both between tests and within tests across samples, is puzzling, particularly after the power simulations displayed such similar results under multiple data generating scenarios. Furthermore, Angrist and Evans use a census dataset with over 100,000 observations - whilst I don’t conduct simulations with this many observations, it’s hard to rationalise such a large departure in test results.

tau_hat_results %>% 
  ggplot(aes(sample = pval, colour = dataset)) +
  stat_qq(distribution = qunif)  +
  geom_abline(intercept = 0, slope = 1) +
  theme_minimal() +
  scale_x_continuous(trans=negative_log_trans(10)) +
  scale_y_continuous(trans=negative_log_trans(10)) +
  labs(title = "QQ plot of P-values",
       caption = "Note: Both axes use negative log(10) scale.") +
  facet_wrap(~dataset) +
  scale_colour_manual(values = colour_pair) +
  guides(colour = "none",
         fill = "none")

Maimonides’ Rule

Estimated instrument effects are diffuse and, again, not strictly positive under the Maimonides’ rule instrumental variable strategy. This is even more puzzling than the Angrist and Evans (1998) case discussed earlier since the saturated first stage test found no evidence of defiance - not only were there no statistically significant defiers, there were no estimated defiers at all.

X_matrix_maim_5 <- df_final5_cleaned %>% 
  select(
         tipuach,
         c_size,
         c_size_squared) %>% 
  as.matrix()

forest_first_stage_maim_5 <- causal_forest(X = X_matrix_maim_5,
                                    Y = df_final5_cleaned$classize,
                                    W = df_final5_cleaned$func1,
                                    num.trees = 4000) 
tau_hat_oob_maim_5 <- predict(forest_first_stage_maim_5, estimate.variance = TRUE) %>%
  as_tibble() %>% 
  mutate(dataset = "Fifth Grade")

## Now fourth grade


X_matrix_maim_4 <- df_final4_cleaned %>% 
  select(
         tipuach,
         c_size,
         c_size_squared) %>% 
  as.matrix()

forest_first_stage_maim_4 <- causal_forest(X = X_matrix_maim_4,
                                    Y = df_final4_cleaned$classize,
                                    W = df_final4_cleaned$func1,
                                    num.trees = 4000) 
tau_hat_oob_maim_4 <- predict(forest_first_stage_maim_4, estimate.variance = TRUE) %>%
  as_tibble() %>% 
  mutate(dataset = "Fourth Grade")


tau_hat_oob_maimonides <- bind_rows(tau_hat_oob_maim_5,
                                    tau_hat_oob_maim_4)
ggplot(tau_hat_oob_maimonides,
       aes(x = predictions,
           fill = dataset)) +
  geom_histogram(colour = "black") +
  facet_wrap(~dataset) +
  scale_fill_manual(values = colour_pair) +
  labs(x = "Heterogeneous Effects",
       title = "Heterogeneous Effects - Maimonides' Instrument") +
  theme_minimal() +
  guides(fill = "none")

N_obs <- nrow(tau_hat_oob_maimonides)
tau_hat_results_maimonides <- tau_hat_oob_maimonides %>%
  group_by(dataset) %>% 
  arrange(predictions) %>% 
  mutate(sigma_hat = sqrt(variance.estimates),
         prediction_lo = predictions - 1.96*sigma_hat,
         prediction_hi = predictions + 1.96*sigma_hat,
         prediction_lo_a = predictions - sigma_hat*qt(1-(0.05/2)/(n()^2), df = n()),
         prediction_hi_a = predictions + sigma_hat*qt(1-(0.05/2)/(n()^2), df = n()),
         t_stat = predictions / sigma_hat,
         critical_value = qt(0.95, N_obs - 4, lower.tail = FALSE),
         pval = pt(t_stat, df = N_obs - 13),
         pval_holm = p.adjust(pval, method = "holm", n = n()^2),
         reject_H0 = ifelse(pval_holm < 0.05, 1, 0),
         rank = row_number())

The relatively small dataset, where we know the saturated first stage test outperforms the causal forest, is reflected in the huge confidence intervals around the ranked heterogeneous effects. Many of the estimated negative effects aren’t significantly different from 0; however, this offers little comfort when there’s still such an inexplicable difference between test results.

tau_hat_results_maimonides %>% 
  ggplot(aes(x = rank,
             y = predictions,
             ymin = prediction_lo_a,
             ymax = prediction_hi_a)) + 
  geom_point(aes(colour = dataset)) +
  geom_ribbon(alpha = 0.1) +
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "longdash") +
  facet_wrap(~dataset) +
  scale_colour_manual(values = colour_pair) +
  labs(title = "Heterogeneous Effects Ranked - Maimonides' Instrument") +
  guides(colour = "none",
         fill = "none")

The Maimonides’ rule \(p\)-value QQ plot shows little evidence of a strong positive effect - there are very few \(p\)-values deep in the null, i.e. close to 1. Evidence from the fifth grade suggests we should be concerned not only about a defiance problem but also a weak instruments problem; the \(p\)-values hug the null hypothesis line of equality relatively closely.

tau_hat_results_maimonides %>% 
  ggplot(aes(sample = pval, colour = dataset)) +
  stat_qq(distribution = qunif)  +
  geom_abline(intercept = 0, slope = 1) +
  theme_minimal() +
  scale_x_continuous(trans=negative_log_trans(10)) +
  scale_y_continuous(trans=negative_log_trans(10)) +
  labs(title = "QQ plot of P-values - Maimonides' Rule",
       caption = "Note: Both axes use negative log(10) scale.") +
  facet_wrap(~dataset) +
  scale_colour_manual(values = colour_pair) +
  guides(colour = "none",
         fill = "none")

Autor Dorn Hanson

ADH_matrix <- df_china %>% 
  select(d_sh_empl_mfg,
         l_shind_manuf_cbp,
         starts_with("reg"),
         l_sh_popedu_c,
         l_sh_popfborn,
         l_sh_empl_f,
         l_sh_routine33,
         l_task_outsource,
         t2) %>% 
  as.matrix()



forest_first_stage_ADH <- causal_forest(X = ADH_matrix,
                                        Y = df_china$d_tradeusch_pw,
                                        W = df_china$d_tradeotch_pw_lag,
                                        num.trees = 4000)

tau_hat_oob_ADH <- predict(forest_first_stage_ADH, estimate.variance = TRUE) %>%
  as_tibble() %>% 
  mutate(dataset = "ADH")

Moving on to Autor, Dorn, and Hanson (2013)’s paper investigating imports and labour market outcomes leads, yet again, to wildly different conclusions from the saturated first stage test. However, in this case we find no evidence of defiers, whereas before this was the most defiant dataset.

The distribution of heterogeneous effects is strictly positive with a long right tail. This is the smallest dataset in the empirical application of our tests so results should be interpreted with caution, especially with such a large divergence between tests.

ggplot(tau_hat_oob_ADH,
       aes(x = predictions, fill = dataset)) +
  geom_histogram(colour = "black") +
  scale_fill_manual(values = colour_pair) +
  labs(x = "Heterogeneous Effects",
       title = "Heterogeneous Effects - Bartik Instrument") +
  theme_minimal() +
  guides(fill = "none")

N_obs <- nrow(tau_hat_oob_ADH)
tau_hat_results_ADH <- tau_hat_oob_ADH %>%
  arrange(predictions) %>% 
  mutate(sigma_hat = sqrt(variance.estimates),
         prediction_lo = predictions - 1.96*sigma_hat,
         prediction_hi = predictions + 1.96*sigma_hat,
         prediction_lo_a = predictions - sigma_hat*qt(1-(0.05/2)/(n()^2), df = n()),
         prediction_hi_a = predictions + sigma_hat*qt(1-(0.05/2)/(n()^2), df = n()),
         t_stat = predictions / sigma_hat,
         critical_value = qt(0.95, N_obs - 12, lower.tail = FALSE),
         pval = pt(t_stat, df = N_obs - 13),
         pval_holm = p.adjust(pval, method = "holm", n = n()^2),
         reject_H0 = ifelse(pval_holm < 0.05, 1, 0),
         rank = row_number())

tau_hat_results_ADH %>% 
  ggplot(aes(x = rank, y = predictions, ymin = prediction_lo_a, ymax = prediction_hi_a)) + 
  geom_point(aes(colour = dataset)) +
  geom_ribbon(alpha = 0.1) +
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "longdash") +
  scale_colour_manual(values = colour_pair) +
  labs(y = "Heterogeneous Effects",
       title = "Heterogeneous Effects Ranked - Bartik Instrument") +
  guides(colour = "none",
         fill = "none")

The quantile-quantile plot shows no evidence of “curve back”, as we’d expect in a dataset with no estimated defiers, and the smallest \(p\)-value is safely insignificant at all reasonable testing levels.

tau_hat_results_ADH %>% 
  ggplot(aes(sample = pval, colour = dataset)) +
  stat_qq(distribution = qunif)  +
  geom_abline(intercept = 0, slope = 1) +
  theme_minimal() +
  scale_x_continuous(trans=negative_log_trans(10)) +
  scale_y_continuous(trans=negative_log_trans(10)) +
  labs(title = "QQ plot of P-values - Bartik Instrument",
       caption = "Note: Both axes use negative log(10) scale.") +
  guides(colour = "none",
         fill = "none") +
  scale_colour_manual(values = colour_pair)

Results

As we’ve seen, the honest causal forest estimates are almost a mirror image of the saturated first stage results. The only dataset with no detected defiers is Autor, Dorn, and Hanson (2013)’s, whereas the forest identifies a large number of defiers in every other dataset.

tau_hat_results_all <- suppressWarnings(bind_rows(
  tau_hat_results %>% 
    mutate(paper = "Same-Sex"),
  tau_hat_results_maimonides %>% 
    mutate(paper = "Maimonides' Rule"),
  tau_hat_results_ADH %>% 
    mutate(paper = "Shift-Share")
))







tau_hat_summary <- tau_hat_results_all %>% 
  group_by(paper,
           dataset) %>% 
  summarise(estimated_defiers = sum(ifelse(predictions < 0, 1, 0)),
            n = n(),
            pct_defiers = estimated_defiers / n,
         defier_pval = min(pval),
         defier_pval_adjusted = min(pval_holm),
         
         defier_tstat = min(t_stat)) %>% 
  ungroup() %>% 
  mutate(dataset = factor(dataset,
                          levels = c("1980",
                                     "1990",
                                     "Fourth Grade",
                                     "Fifth Grade",
                                     "ADH")))
colnames(tau_hat_summary) <- c("Paper",
                                   "Dataset",
                                   "Defiers",
                                   "N",
                                   "% Defier",
                                   "P-Value",
                                    "Adjusted P-Value",
                                   "T-Stat")


temp_df <- as_tibble(cbind(nms = names(tau_hat_summary), t(tau_hat_summary)))


colnames(temp_df) <- temp_df[2, ]

temp_df <- temp_df %>% 
  filter(Dataset != "Paper" & Dataset != "Dataset") %>% 
  mutate_at(vars(-Dataset),
            as.numeric)

temp_df %>% 
  gt(rowname_col = "Dataset")  %>%
  tab_spanner(
    label = "Same-Sex",
    columns = vars(`1980`, `1990`)
  ) %>%
  tab_spanner(
    label = "Maimonides' Rule",
    columns = vars("Fourth Grade", "Fifth Grade")
  ) %>%
  tab_spanner(
    label = "Shift-Share",
    columns = vars("ADH")
  ) %>% 
  fmt_number(
    columns = vars(`1990`,
                   `1980`,
                   `Fourth Grade`,
                   `Fifth Grade`,
                   `ADH`),
    decimals = 3,
    drop_trailing_zeros = TRUE
    ) %>% 
  fmt_percent(
    columns = vars(`1980`,
                   `1990`,
                   `Fourth Grade`,
                   `Fifth Grade`,
                   `ADH`),
    rows = 3
  ) %>% 
  tab_header(
    title = "Honest Causal Forest Results"
  )
Honest Causal Forest Results

                   Maimonides' Rule            Same-Sex              Shift-Share
                   Fifth Grade   Fourth Grade  1980       1990       ADH
Defiers            328           429           1,378      4,966      0
N                  2,024         2,053         10,000     10,000     1,444
% Defier           16.21%        20.90%        13.78%     49.66%     0.00%
P-Value            0.001         0             0.012      0          0.62
Adjusted P-Value   1             1             1          1          1
T-Stat             -3.07         -3.352        -2.267     -4.56      0.307
rm(temp_df)

We know that the honest causal forest has an extreme test size problem when multiple comparisons aren’t adjusted for, so it’s hardly surprising that the model predicts a large number of defiers. Fortunately, after multiple comparison adjustments it’s clear that none of the estimated defiers are precisely enough estimated to reject the null hypothesis of no defiance.

However, the heterogeneous point estimates producing so many defiers is a little troubling. The honest causal forest is a relatively new econometric tool, but it’s still striking how many defiers it predicts. Not only is there sizeable variation between tests within datasets - the saturated first stage found defiers only in ADH’s dataset, whilst the forest finds them everywhere except ADH - there’s also a large divergence within the forest test across samples. It’s hard to believe that roughly 14% of the census population were defiers in 1980 yet within 10 years this figure rose to almost 50%. Furthermore, if defiers truly made up half the population we’d expect Angrist and Evans to face a severe weak instruments problem, which isn’t borne out in reality.

The same-sex 1990 sample demonstrates just how important the adjustment process is - the \(p\)-value is corrected from a value indistinguishable from 0 at four decimal places to 1.

In conclusion, the honest causal forest test also fails to reject the null hypothesis of no defiers. However, model identification of defiers leads to very different conclusions to the saturated first stage test.

Detected Defiers

In a bid to reconcile such a large difference in estimated defier proportions I now turn to some simple direct comparisons of the models’ defier identification. First, I consider defiers detected by the honest causal forest, followed by those detected by the saturated first stage - I use data from Angrist and Lavy as well as Autor, Dorn and Hanson since these papers predominantly condition on continuous rather than discrete covariates. Furthermore, since we only wish to compare the models’ identification of defiers, without performing inference, I avoid discretising the data in order to increase power and improve comparability between the models.

Maimonides’ Defiers

I plot instrument partial effects, from the saturated first stage model, on the \(y\)-axis against covariates on the \(x\)-axis; below this is a plot of heterogeneous treatment effects, estimated using honest causal forest, against covariates. Points in black indicate defiers detected by honest causal forest.

This exercise is useful for multiple reasons. Firstly, we can characterise common defier properties - for instance, the second plot suggests there are no defiers in schools with enrollment greater than 150, although admittedly there are few observations in this region. Overall, it seems that defiers are fairly evenly distributed across covariate values according to the honest causal forest.

Secondly, we can compare those identified as defiers by the causal forest with results from the saturated first stage. Promisingly, defiers estimated by the honest causal forest aren’t wildly different under the saturated first stage. That is, black points, the defiers, appear towards the lower end of the first plot with partial effects not far from 0.

first_stage_interactions_maim_4_continuous <-  df_final4_cleaned %>% 
    select(classize,
         tipuach,
         c_size,
         func1) %>% 
  run_first_stage_interactions_continuous(dataset = .,
                                    dependent_variable = "classize",
                                    instrument = "func1",
                                    vcov_func = vcovCL,
                                    cluster = df_final4_cleaned$classid,
                                    n_tiles = 3) %>% 
  mutate(dataset = "Fourth Grade") %>% 
  select(dydx_instrument, SE_dydx_instrument, pval_one_neg, dataset, dydx_lo, dydx_hi, t_stat, pval_neg_holm)




maim_4_plot <- df_final4_cleaned %>% 
  select(classize,
         tipuach,
         c_size,
         c_size_squared,
         func1) %>% 
  bind_cols(first_stage_interactions_maim_4_continuous) %>% 
  gather(key, val, classize:func1)
maim_4_plot$prediction_rf <- tau_hat_oob_maim_4$predictions
maim_4_plot <- maim_4_plot %>% 
  mutate(colour = ifelse(prediction_rf > 0, key, "Defier"),
         colour = factor(colour)) %>% 
  mutate(colour = forcats::fct_relevel(colour, c("Defier")))

gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
my_cols <- gg_color_hue(5)
my_cols <- c("#000000", my_cols)

p <- maim_4_plot %>% 
  ggplot(aes(x = val, y = dydx_instrument, colour = colour)) +
  geom_point() +
  facet_wrap(~key, scales = "free") +
  theme_minimal() +
  guides(colour = "none") +
  labs(title = "Saturated First Stage - Partial Effects",
       caption = "Note: Defiers detected by honest causal forest in black.") +
  scale_colour_manual(values = my_cols)



q <- maim_4_plot %>% 
  ggplot(aes(x = val, y = prediction_rf, colour = colour)) +
  geom_point() +
  facet_wrap(~key, scales = "free") +
  theme_minimal() +
  guides(colour = "none") +
  labs(title = "Random Forest - Heterogeneous Effects",
       caption = "Note: Defiers detected by honest causal forest in black.") +
  scale_colour_manual(values = my_cols)




p

q
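As a complement to the plots, a simple cross-tabulation of the two models’ defier classifications makes any disagreement explicit. This is a diagnostic sketch using the fourth-grade objects created above, classifying a point estimate below zero as a defier without performing any inference; it is not a result reported elsewhere in the essay.

## Cross-tabulate defier classifications from the two models (fourth grade)
table(
  saturated_defier = first_stage_interactions_maim_4_continuous$dydx_instrument < 0,
  forest_defier    = tau_hat_oob_maim_4$predictions < 0
)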

Making any further comparisons between the plots above is difficult since the two models estimate essentially different features of the data. However, at first glance a big driver of the divergence between the tests appears to be the non-linearity of the estimated heterogeneous causal effects.

Autor Dorn Hanson Defiers

Moving on to Autor, Dorn and Hanson’s first stage results, I instead display the defiers detected by the saturated first stage in purple.

There seems to be a larger disparity between model results in Autor, Dorn and Hanson’s paper - some defiers detected by the saturated first stage have rather large, positive estimated heterogeneous treatment effects. In contrast to Angrist and Lavy’s results there seems to be a systematic pattern to defiance and little evidence of non-linearity. For example, defiers appear concentrated in areas with a low share of manufacturing industry and a college-educated population share of between 30% and 50%.

first_stage_interactions_ADH_manu_continuous <- df_china %>% 
  select(d_sh_empl_mfg,
         d_tradeusch_pw,
         d_tradeotch_pw_lag,
         l_shind_manuf_cbp,
         starts_with("reg"),
         l_sh_popedu_c,
         l_sh_popfborn,
         l_sh_empl_f,
         l_sh_routine33,
         l_task_outsource,
         t2) %>% 
  run_first_stage_interactions_continuous(dataset = .,
                                    dependent_variable = "d_tradeusch_pw",
                                    instrument = "d_tradeotch_pw_lag",
                                    weights = df_china$timepwt48,
                                    vcov_func = vcovCL,
                                    cluster = df_china$statefip,
                                    n_tiles = 3) %>% 
  mutate(dataset = "ADH") %>% 
  select(dydx_instrument, SE_dydx_instrument, pval_one_neg,  dataset, dydx_lo, dydx_hi, t_stat, pval_neg_holm)

my_cols <- gg_color_hue(8)
my_cols <- c("purple", my_cols)  # prepend purple for the "Defier" factor level

ADH_plot <- ADH_matrix %>% 
  as_tibble() %>% 
  select_at(vars(-contains("reg"))) %>% 
  bind_cols(first_stage_interactions_ADH_manu_continuous) %>% 
  ## Attach the forest's out-of-bag predictions before reshaping so that each
  ## observation keeps its own prediction across the gathered facets
  mutate(prediction_rf = tau_hat_oob_ADH$predictions) %>% 
  gather(key,
         val,
         d_sh_empl_mfg:t2) %>% 
  mutate(colour = ifelse(dydx_instrument > 0, key, "Defier"),
         colour = factor(colour),
         colour = forcats::fct_relevel(colour, c("Defier")))

p <- ADH_plot %>% 
  ggplot(aes(x = val, y = dydx_instrument, colour = colour)) +
  facet_wrap(~key, scales = "free") +
  geom_point() +
  guides(colour = "none") +
  scale_colour_manual(values = my_cols) +
  theme_minimal() +
      labs(title = "Saturated First Stage - Partial Effects",
       caption = "Note: Defiers detected by saturated first stage in purple.")

q <- ADH_plot %>% 
  ggplot(aes(x = val, y = prediction_rf, colour = colour)) +
  facet_wrap(~key, scales = "free") +
  geom_point() +
  scale_colour_manual(values = my_cols) + 
  theme_minimal() +
  guides(colour = "none") +
      labs(title = "Random Forest - Heterogeneous Effects",
       caption = "Note: Defiers detected by saturated first stage in purple.")


p 

q 

Whilst not the central focus of this essay, identifying defiers, those most likely to react negatively to a treatment, is a crucial concern for policymakers focused on welfare and equity. Further work, using a blend of traditional econometrics and machine learning techniques, could potentially rationalise the difference in results found above. As it stands, it is difficult to rationalise defier characteristics ex post when the two tests fail to agree on defier identification.

Conclusion

In this essay I’ve presented two methods to detect defiers in datasets of interest to economists. Simulations show that both tests are powerful and correctly sized under a range of adverse data generating processes and in the face of model misspecification, although both tests struggle to detect non-linear, heterogeneous defiance in smaller datasets.

Both tests produce similar results in simulations, although model predictions of defier identity rarely agree. One possible explanation is that Simpson’s paradox is striking with a vengeance: heterogeneous causal effects aggregated at the observed subgroup level are not representative of individual heterogeneous effects, and the causal forest picks this up whereas the saturated first stage model does not. However, I’m cautious of fully accepting this interpretation - the honest causal forest is a relatively new estimation strategy and its properties in real-world, applied settings still aren’t well understood. Occam’s razor would perhaps suggest that the first stage tests should be trusted most; without a doubt further exploration is required.
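
To fix ideas, here is a minimal sketch of the mechanism using a purely hypothetical defier mix, not one of the simulations used earlier in the essay: subgroup-level first stages can look monotone even when a fifth of individuals are defiers.

## Hypothetical illustration only: 20% defiers hidden inside every covariate cell
set.seed(1234)
n    <- 10000
x    <- sample(c("A", "B"), n, replace = TRUE)            # observed covariate cell
type <- sample(c("complier", "defier", "never-taker"), n,
               replace = TRUE, prob = c(0.5, 0.2, 0.3))   # same mix in every cell
d0   <- as.integer(type == "defier")                      # D_0i
d1   <- as.integer(type == "complier")                    # D_1i
z    <- rbinom(n, 1, 0.5)
d    <- ifelse(z == 1, d1, d0)

# Cell-level first stages are positive in both cells...
tapply(d[z == 1], x[z == 1], mean) - tapply(d[z == 0], x[z == 0], mean)
# ...even though a fifth of individuals are defiers:
mean(d1 - d0 < 0)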

Furthermore, two of the labour economics datasets considered are particularly small, and graphing the heterogeneous effects suggests that at least one dataset’s first stage displays non-linear properties; test power may therefore be low.

Throughout the essay I’ve tried to present visualisations as a convenient tool to assist labour economists in detecting the presence of defiers in their own datasets. Stata’s margins command, and its R port in the margins package, make the saturated first stage test particularly easy to implement for future researchers. However, a major drawback of the tests presented here is their reliance on observed covariates: if a study only collects an outcome variable, the instrumented variable of interest and the instrument, neither test will be appropriate.
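
As a minimal sketch of that workflow, the snippet below uses the margins package to recover observation-level partial effects of the instrument from an interacted first stage. The variable names d, z, x1, x2 and my_data are placeholders rather than objects from the analyses above; the delta-method standard errors and Holm adjustment would follow as in run_first_stage_interactions_continuous.

library(margins)

first_stage <- lm(d ~ z * (x1 + x2), data = my_data)   # instrument interacted with covariates
pe <- marginal_effects(first_stage, variables = "z")    # per-observation partial effects of z
summary(pe$dydx_z)                                      # negative point estimates hint at defiance
## One-sided p-values from delta-method SEs would then be Holm-adjusted, e.g.:
## p_holm <- p.adjust(p_one_sided, method = "holm")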

Taking the tests to the data, the evidence is clear-cut. Whilst the honest causal forest test predicts many defiers with a high degree of individual certainty, these defiers are no longer statistically significant on aggregate after adjusting for multiple comparisons. The saturated first stage method fails to reject the null hypothesis of no defiers and rarely predicts partial effect point estimates indicative of defiance. It seems reasonable to conclude that the LATE estimates of Angrist and Evans; Angrist and Lavy; and Autor, Dorn and Hanson are well identified, with no evidence of defiance.

Appendix

Angrist and Evans Replication

Table Two Summary Statistics

Recreating Table 2:

AE_90_summary <- df_90_final %>% 
  summarise(children_ever_born = mean(fertil - 1),
            more_than_2 = mean(morekids),
            mean_boy_first = mean(boy1st),
            boy_2nd = mean(boy2nd),
            two_boys = mean(boys2),
            two_girls = mean(girls2),
            samesex = mean(samesex),
            age = mean(m_age),
            worked = mean(worked_m),
            weeks = mean(week89m),
            hrs_wk = mean(hour89m),
            labour_income_mum = mean(income_m),
            labour_income_dad = mean(income_d, na.rm = TRUE),
            fam_income = mean(fam_inc),
            mean_non_wife = mean(nonmoi)) %>% 
  gather(term, mean_90) 

AE_80_summary <- df_80_final %>% 
    summarise(children_ever_born = mean(fertil - 1),
            more_than_2 = mean(morekids),
            mean_boy_first = mean(boy1st),
            boy_2nd = mean(boy2nd),
            two_boys = mean(boys2),
            two_girls = mean(girls2),
            samesex = mean(samesex),
            age = mean(m_age),
            worked = mean(worked_m),
            weeks = mean(weeksm),
            hrs_wk = mean(hoursm),
            labour_income_mum = mean(income_m),
            labour_income_dad = mean(income_d, na.rm = TRUE),
            fam_income = mean(fam_inc),
            mean_non_wife = mean(nonmoi)) %>% 
  gather(term, mean_80) 

AE_summary <- inner_join(AE_80_summary, AE_90_summary, by = "term") %>% 
  knitr::kable(digits = 3)
AE_summary
term                   mean_80      mean_90
children_ever_born       2.509        2.495
more_than_2              0.381        0.370
mean_boy_first           0.514        0.513
boy_2nd                  0.512        0.511
two_boys                 0.266        0.264
two_girls                0.239        0.240
samesex                  0.505        0.504
age                     30.388       30.434
worked                   0.528        0.667
weeks                   19.020       26.388
hrs_wk                  16.705       22.690
labour_income_mum     6249.913     9870.505
labour_income_dad    18532.177    29197.906
fam_income           47658.291    44317.127
mean_non_wife        41649.272    34882.804

Regression Results

Recreating Table 5

table_5_models_90 <- c("worked_m",
  "week89m",
  "hour89m",
  "income_m",
  "log(fam_inc)") %>%
  map(~(
  paste0(., " ~ morekids | samesex") %>% 
  as.formula() %>% 
  ivreg(., data = df_90_final)))

table_5_models_80 <- c("worked_m",
  "weeksm",
  "hoursm",
  "income_m",
  "log(fam_inc)") %>% 
  map(~(
  paste0(., " ~ morekids | samesex") %>% 
  as.formula() %>% 
  ivreg(., data = df_80_final)))

stargazer(table_5_models_80,
          omit = "Constant",
          title = "WALD ESTIMATES OF LABOR-SUPPLY MODELS 1980 DATA - REPLICATED",header = FALSE,type = "html",
          dep.var.labels = c("Worked",
                            "Weeks",
                            "Hours",
                            "Income",
                            "Family Income"))
WALD ESTIMATES OF LABOR-SUPPLY MODELS 1980 DATA - REPLICATED

                                     Dependent variable:
                                     Worked       Weeks        Hours        Income           Family Income
                                     (1)          (2)          (3)          (4)              (5)
morekids                             -0.134***    -6.149***    -5.499***    -1,601.982***    -0.039
                                     (0.029)      (1.283)      (1.080)      (600.143)        (0.062)
Observations                         255,010      255,010      255,010      255,010          255,010
R2                                   0.012        0.014        0.008        0.011            0.002
Adjusted R2                          0.012        0.014        0.008        0.011            0.002
Residual Std. Error (df = 255008)    0.496        21.713       18.269       10,154.510       1.043
Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(table_5_models_90,
          omit = "Constant",
          title = "WALD ESTIMATES OF LABOR-SUPPLY MODELS 1990 DATA - REPLICATED",header = FALSE,type = "html",
          dep.var.labels = c("Worked",
                            "Weeks",
                            "Hours",
                            "Income",
                            "Family Income"))
WALD ESTIMATES OF LABOR-SUPPLY MODELS 1990 DATA - REPLICATED

                                     Dependent variable:
                                     Worked       Weeks        Hours        Income           Family Income
                                     (1)          (2)          (3)          (4)              (5)
morekids                             -0.087***    -4.947***    -3.757***    -1,736.529**     -0.127*
                                     (0.024)      (1.165)      (0.974)      (688.333)        (0.073)
Observations                         380,007      380,007      380,007      380,007          380,007
R2                                   0.015        0.021        0.016        0.014            0.006
Adjusted R2                          0.015        0.021        0.016        0.014            0.006
Residual Std. Error (df = 380005)    0.468        22.632       18.934       13,377.480       1.428
Note: *p<0.1; **p<0.05; ***p<0.01
rm(table_5_models_80,
   table_5_models_90)

Now recreating Table 6:

## Build a 2SLS formula: morekids instrumented by samesex, with the same
## exogenous demographic controls entering both stages
create_model_formula <- function(dependent_var){
  model <- as.formula(paste0(dependent_var, "~ morekids + m_age + boy1st + boy2nd + black_m + hisp_m + other_race_m | samesex + m_age + boy1st + boy2nd + black_m + hisp_m + other_race_m"))
  return(model)
}

table_6_models_80 <- c("worked_m",
  "weeksm",
  "hoursm",
  "income_m",
  "log(fam_inc)") %>% 
  map(~(create_model_formula(.) %>% 
          ivreg(data = df_80_final)))
  

table_6_models_90 <- c("worked_m",
                    "week89m",
                    "hour89m",
                    "income_m",
                    "log(fam_inc)") %>% 
  map(~(create_model_formula(.) %>% 
          ivreg(data = df_90_final)))

stargazer(table_6_models_80,
          keep = "morekidsTRUE",
          title = "2SLS ESTIMATES OF LABOR-SUPPLY MODELS USING 1980 CENSUS DATA - REPLICATION",
          notes = "Age at first birth omitted since I can't seem to find it.",
          dep.var.labels = c("Worked",
                            "Weeks",
                            "Hours",
                            "Income",
                            "Family Income"),
          type = "html")
2SLS ESTIMATES OF LABOR-SUPPLY MODELS USING 1980 CENSUS DATA - REPLICATION

                                     Dependent variable:
                                     Worked       Weeks        Hours        Income           Family Income
                                     (1)          (2)          (3)          (4)              (5)
morekids                             -0.124***    -5.566***    -5.118***    -1,336.928**     -0.029
                                     (0.029)      (1.247)      (1.053)      (581.528)        (0.060)
Observations                         255,010      255,010      255,010      255,010          255,010
R2                                   0.028        0.044        0.030        0.045            0.031
Adjusted R2                          0.028        0.044        0.030        0.045            0.031
Residual Std. Error (df = 255002)    0.492        21.385       18.064       9,976.483        1.028
Note: *p<0.1; **p<0.05; ***p<0.01
Age at first birth omitted since I can't seem to find it.
stargazer(table_6_models_90, keep = "morekidsTRUE",
          title = "2SLS ESTIMATES OF LABOR-SUPPLY MODELS USING 1990 CENSUS DATA - REPLICATION",
          notes = "Age at first birth omitted since I can't seem to find it.",
          dep.var.labels = c("Worked",
                            "Weeks",
                            "Hours",
                            "Income",
                            "Family Income"),
          type = "html")
2SLS ESTIMATES OF LABOR-SUPPLY MODELS USING 1990 CENSUS DATA - REPLICATION

                                     Dependent variable:
                                     Worked       Weeks        Hours        Income           Family Income
                                     (1)          (2)          (3)          (4)              (5)
morekids                             -0.092***    -5.318***    -3.936***    -1,937.590***    -0.123*
                                     (0.024)      (1.148)      (0.969)      (676.699)        (0.070)
Observations                         380,007      380,007      380,007      380,007          380,007
R2                                   0.030        0.049        0.026        0.047            0.088
Adjusted R2                          0.030        0.049        0.026        0.047            0.088
Residual Std. Error (df = 379999)    0.464        22.311       18.833       13,150.230       1.368
Note: *p<0.1; **p<0.05; ***p<0.01
Age at first birth omitted since I can't seem to find it.
rm(table_6_models_80, table_6_models_90)

Maimonides’ Rule

Replicating OLS

## Grade 5
col_1 <- lm(avgverb ~ classize, data = df_final5_cleaned)
col_2 <- lm(avgverb ~ classize + tipuach, data = df_final5_cleaned)
col_3 <- lm(avgverb ~ classize + tipuach + c_size, data = df_final5_cleaned)
col_4 <- lm(avgmath ~ classize, data = df_final5_cleaned)
col_5 <- lm(avgmath ~ classize + tipuach, data = df_final5_cleaned)
col_6 <- lm(avgmath ~ classize + tipuach + c_size, data = df_final5_cleaned)
## Grade 4
col_7 <- lm(avgverb ~ classize, data = df_final4_cleaned)
col_8 <- lm(avgverb ~ classize + tipuach, data = df_final4_cleaned)
col_9 <- lm(avgverb ~ classize + tipuach + c_size, data = df_final4_cleaned)
col_10 <- lm(avgmath ~ classize, data = df_final4_cleaned)
col_11 <- lm(avgmath ~ classize + tipuach, data = df_final4_cleaned)
col_12 <- lm(avgmath ~ classize + tipuach + c_size, data = df_final4_cleaned)

SEs <- list(col_1,
            col_2,
            col_3,
            col_4,
            col_5,
            col_6,
            col_7,
            col_8,
            col_9,
            col_10,
            col_11,
            col_12) %>% 
  map(~(vcovHAC(.) %>% 
          diag() %>% 
          sqrt))

stargazer(col_1,
          col_2,
          col_3,
          col_4,
          col_5,
          col_6,
          col_7,
          col_8,
          col_9,
          col_10,
          col_11,
          col_12,
          se = SEs,
          omit.table.layout = "sn",
          type = "html",
          omit = "Constant",
          dep.var.labels = c("English Comprehension",
                             "Maths",
                             "English Comprehension",
                             "Maths"),
          column.labels = c("Grade 5",
                            "Grade 4"),
          column.separate = c(6, 6),
          initial.zero = FALSE,
          notes = "HAC standard errors used instead of Moulton factor adjustment.")
Dependent variable: English Comprehension (1)-(3) and Maths (4)-(6) for Grade 5; English Comprehension (7)-(9) and Maths (10)-(12) for Grade 4

          (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)       (10)      (11)      (12)
classize  .221***   -.031     -.025     .322***   .076*     .019      .141***   -.053*    -.040     .221***   .055      .009
          (.035)    (.027)    (.033)    (.044)    (.039)    (.045)    (.037)    (.028)    (.033)    (.041)    (.036)    (.040)
tipuach             -.350***  -.351***            -.340***  -.332***            -.339***  -.341***            -.289***  -.281***
                    (.015)    (.016)              (.021)    (.021)              (.016)    (.017)              (.019)    (.020)
c_size                        -.002                         .017**                        -.004                         .014**
                              (.006)                        (.008)                        (.006)                        (.007)
rm(col_1,
   col_2,
   col_3,
   col_4,
   col_5,
   col_6,
   col_7,
   col_8,
   col_9,
   col_10,
   col_11,
   col_12
   )

Replicating IV

## Grade 5


iv_maim_verb_full_5 <- ivreg(avgverb ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final5_cleaned)
iv_maim_math_full_5 <- ivreg(avgmath ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final5_cleaned)

iv_maim_verb_disc_5 <- ivreg(avgverb ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final5_cleaned %>% filter(disc == 1))
iv_maim_math_disc_5 <- ivreg(avgmath ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final5_cleaned %>% filter(disc == 1))


## Grade 4

iv_maim_verb_full_4 <- ivreg(avgverb ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final4_cleaned)
iv_maim_math_full_4 <- ivreg(avgmath ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final4_cleaned)

iv_maim_verb_disc_4 <- ivreg(avgverb ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final4_cleaned %>% filter(disc == 1))
iv_maim_math_disc_4 <- ivreg(avgmath ~ classize + tipuach + c_size + c_size_squared | func1 + tipuach + c_size + c_size_squared, data = df_final4_cleaned %>% filter(disc == 1))




SEs_IV_5 <- list(
  iv_maim_verb_full_5,
  iv_maim_math_full_5,
  iv_maim_verb_disc_5,
  iv_maim_math_disc_5
) %>%
  map(~(vcovHAC(.) %>% 
          diag() %>% 
          sqrt))
stargazer(iv_maim_verb_full_5,
          iv_maim_math_full_5,
          iv_maim_verb_disc_5,
          iv_maim_math_disc_5,
          se = SEs_IV_5,
          column.labels = c("Full Sample", "Discontinuity Sample"),
          column.separate = c(2, 2),
          type = "html",
          title = "IV Maimonides Rule Fifth Grade",
          notes = "HAC standard errors used instead of Moulton factor adjustment.")
IV Maimonides Rule Fifth Grade

                          Dependent variable:
                          avgverb        avgmath        avgverb        avgmath
                          Full Sample                   Discontinuity Sample
                          (1)            (2)            (3)            (4)
classize                  -0.263***      -0.264**       -0.599***      -0.490**
                          (0.093)        (0.120)        (0.225)        (0.246)
tipuach                   -0.369***      -0.350***      -0.457***      -0.423***
                          (0.017)        (0.023)        (0.051)        (0.066)
c_size                    0.013          0.063*         0.126          0.284**
                          (0.026)        (0.036)        (0.114)        (0.132)
c_size_squared            0.004          -0.010         -0.044         -0.124*
                          (0.010)        (0.014)        (0.060)        (0.075)
Constant                  86.129***      75.993***      92.499***      74.510***
                          (1.755)        (2.334)        (4.624)        (6.698)
Observations              2,019          2,018          471            471
R2                        0.343          0.229          0.232          0.201
Adjusted R2               0.341          0.228          0.225          0.194
Residual Std. Error       6.236          8.435          7.200          9.161
                          (df = 2014)    (df = 2013)    (df = 466)     (df = 466)
Note: *p<0.1; **p<0.05; ***p<0.01
HAC standard errors used instead of Moulton factor adjustment.
SEs_IV_4 <- list(
  iv_maim_verb_full_4,
  iv_maim_math_full_4,
  iv_maim_verb_disc_4,
  iv_maim_math_disc_4
) %>%
  map(~(vcovHAC(.) %>% 
          diag() %>% 
          sqrt))
stargazer(iv_maim_verb_full_4,
          iv_maim_math_full_4,
          iv_maim_verb_disc_4,
          iv_maim_math_disc_4,
          se = SEs_IV_4,
          column.labels = c("Full Sample", "Discontinuity Sample"),
          column.separate = c(2, 2),
          type = "html",
          title = "IV Maimonides Rule Fourth Grade",
          notes = "HAC standard errors used instead of Moulton factor adjustment.")
IV Maimonides Rule Fourth Grade

                          Dependent variable:
                          avgverb        avgmath        avgverb        avgmath
                          Full Sample                   Discontinuity Sample
                          (1)            (2)            (3)            (4)
classize                  -0.074         -0.033         -0.131         0.033
                          (0.069)        (0.085)        (0.131)        (0.167)
tipuach                   -0.346***      -0.284***      -0.350***      -0.291***
                          (0.017)        (0.020)        (0.040)        (0.048)
c_size                    -0.040*        0.007          -0.057         -0.016
                          (0.022)        (0.028)        (0.097)        (0.116)
c_size_squared            0.021**        0.006          0.045          0.023
                          (0.009)        (0.012)        (0.061)        (0.074)
Constant                  81.036***      72.775***      82.406***      70.967***
                          (1.521)        (1.923)        (4.635)        (6.562)
Observations              2,049          2,049          415            415
R2                        0.312          0.206          0.281          0.184
Adjusted R2               0.311          0.205          0.274          0.177
Residual Std. Error       6.635          7.818          6.671          8.244
                          (df = 2044)    (df = 2044)    (df = 410)     (df = 410)
Note: *p<0.1; **p<0.05; ***p<0.01
HAC standard errors used instead of Moulton factor adjustment.
rm(SEs_IV_4,
   SEs_IV_5,
   SEs,
   iv_maim_math_disc_4,
   iv_maim_math_full_4,
   iv_maim_verb_disc_4,
   iv_maim_verb_full_4,
   iv_maim_math_disc_5,
   iv_maim_math_full_5,
   iv_maim_verb_disc_5,
   iv_maim_verb_full_5)
save.image("C:/Users/edjee/Dropbox/Ed/cached_run.RData")

References

Angrist, Joshua D., and William N. Evans. 1998. “Children and Their Parents’ Labor Supply: Evidence from Exogenous Variation in Family Size.” The American Economic Review 88 (3). American Economic Association: 450–77. http://www.jstor.org/stable/116844.

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. 1st ed. Princeton University Press. https://EconPapers.repec.org/RePEc:pup:pbooks:8769.

Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27). National Academy of Sciences: 7353–60. doi:10.1073/pnas.1510489113.

Athey, Susan, Julie Tibshirani, and Stefan Wager. 2016. “Generalized Random Forests.” arXiv E-Prints, October, arXiv:1610.01271.

Autor, David H., David Dorn, and Gordon H. Hanson. 2013. “The China Syndrome: Local Labor Market Effects of Import Competition in the United States.” American Economic Review 103 (6): 2121–68. doi:10.1257/aer.103.6.2121.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological) 57 (1). [Royal Statistical Society, Wiley]: 289–300. http://www.jstor.org/stable/2346101.

Chernozhukov, Victor, Sokbae Lee, and Adam M. Rosen. 2009. “Intersection Bounds: Estimation and Inference.” arXiv E-Prints, July, arXiv:0907.3503.

Gelman, Andrew. 2018. “Statistical Modeling, Causal Inference, and Social Science.” Statistical Modeling, Causal Inference, and Social Science: https://statmodeling.stat.columbia.edu/2018/03/15/need-16-times-sample-size-estimate-interaction-estimate-main-effect/.

Holm, Sture. 1979. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6 (2). [Board of the Foundation of the Scandinavian Journal of Statistics, Wiley]: 65–70. http://www.jstor.org/stable/4615733.

Imbens, Guido W., and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62 (2). [Wiley, Econometric Society]: 467–75. http://www.jstor.org/stable/2951620.

Kitagawa, Toru. 2015. “A Test for Instrument Validity.” Econometrica 83 (5): 2043–63. doi:10.3982/ECTA11974.

Angrist, Joshua D., and Victor Lavy. 1999. “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement.” The Quarterly Journal of Economics 114 (2): 533–75. doi:10.1162/003355399556061.

Mourifie, Ismael, and Yuanyuan Wan. 2017. “Testing Local Average Treatment Effect Assumptions.” The Review of Economics and Statistics 99 (2): 305–13. doi:10.1162/REST_a_00622.

Shaffer, Juliet. 1995. “Multiple Hypothesis Testing.” Annual Review of Psychology 46 (January): 561–84. doi:10.1146/annurev.ps.46.020195.003021.

Wager, Stefan, and Susan Athey. 2015. “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” arXiv E-Prints, October, arXiv:1510.04342.


  1. For simplicity the rest of the essay assumes \(D_{1i} - D_{0i} \geq 0\) unless explicitly stated otherwise.

  2. Or equivalently, taking differences in the binary case.

  3. Which is perhaps approximating \(N \times (N-1)\) adjustments - verifying this rigorously is, again, computationally prohibitive.

  4. I use a Cholesky-style factorisation to generate a p.s.d. \(\Sigma = LL'\), simulating the entries of \(L\) from \(U(-1, 1)\); a short sketch follows below.
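
A minimal sketch of the construction described in footnote 4; the dimension k is arbitrary here and the lower-triangular restriction on L is my reading of the Cholesky-style factor.

k <- 5
L <- matrix(runif(k * k, -1, 1), nrow = k)  # entries drawn from U(-1, 1)
L[upper.tri(L)] <- 0                        # keep a lower-triangular (Cholesky-style) factor
Sigma <- L %*% t(L)                         # positive semi-definite by construction
all(eigen(Sigma)$values > -1e-12)           # numerical check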