In this post we’re going to demonstrate the use of Bayesian hierarchical models to overcome the weak instruments problem. We essentially recreate the work of Chamberlain and Imbens, 2003 which I definitely should have read before embarking on this project and not after - ah well, we live and learn. We’re going to fit a moderately complicated instrumental variables model in stan along the way. The set-up is a little involved with this one, so feel free to skip straight to the results at the bottom - there are some pretty plots.

Setting Up the Problem

There’s a lot of material on weak instruments online so we’ll keep this short - a simple, canonical weak instrument problem looks something like this:

\[ \begin{aligned} \hat{\beta} - \beta &= (Z'X)^{-1}Z'\varepsilon \\ &\rightarrow (0)^{-1}(0) \end{aligned} \]

Even asymptotically, as \(N\) grows, the first stage relationship effectively disappears. We can think of an instrument as an exogenous shock to one part of the system; in the weak instruments problem this shock is too small and we get a nasty limiting distribution.

Rephrasing the above along Mostly Harmless lines using the Wald estimator we have:

\[ \begin{aligned} y_i &= z_i\gamma + u_i \text{ (Reduced Form)}\\ x_i &= z_i\pi + e_i \text{ (First Stage)} \\ \\ \hat{\beta}_{iv} &= \frac{\hat{\gamma}}{\hat{\pi}} \end{aligned} \]
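To make the division-by-(almost)-zero point concrete, here’s a quick R sketch (not code from the original post) of the Wald estimator under a strong and a weak first stage:

```r
# Quick illustration (not the post's code): the Wald estimator gamma_hat / pi_hat
# is well behaved with a strong instrument and wildly unstable with a weak one.
set.seed(42)
wald_estimate <- function(n, pi_true, beta_true = 1) {
  z <- rnorm(n)
  e <- rnorm(n)                        # shared error makes x endogenous
  x <- z * pi_true + e                 # first stage
  y <- x * beta_true + e + rnorm(n)    # outcome
  gamma_hat <- coef(lm(y ~ z))["z"]    # reduced form slope
  pi_hat    <- coef(lm(x ~ z))["z"]    # first stage slope
  unname(gamma_hat / pi_hat)
}

summary(replicate(200, wald_estimate(500, pi_true = 1)))     # clusters near beta = 1
summary(replicate(200, wald_estimate(500, pi_true = 0.01)))  # all over the place
```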

How do we fix this? Well, a good way would be to stop dividing by zero - bad things tend to happen when we divide by zero.

A Simple Bayesian Approach

Let’s embed the above in a Bayesian setting, focusing on the first stage for now: \[ \begin{aligned} \pi &\sim N(\mu, \lambda^{-1} ) \\ X &= Z\pi + e \\ \pi_{\text{posterior}} &= (Z'Z + \lambda)^{-1}(Z'Z\hat{\pi} + \lambda\mu) \end{aligned} \]

The first line lays out our prior over \(\pi\) - we’ve used a precision parametrisation to make the interpretation of \(\lambda\) easier. The second line just switches notation slightly towards a matrix algebra formulation. Finally, we come to the posterior.

The posterior of \(\pi\) is very similar to the frequentist solution but we have this extra \(\lambda\) term - the interpretation here is that the posterior is essentially a precision weighted average of our prior and the data where \(\hat{\pi}\) is the least squares solution familiar to frequentists.
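As a quick sketch (again, not the post’s code, and treating the first stage noise variance as 1), the posterior mean is a one-liner in R - the \(\lambda\mu\) term is what keeps it away from zero when \(Z'Z\hat{\pi}\) carries almost no signal:

```r
# Sketch: posterior mean of pi under a N(mu, 1/lambda) prior, treating the
# first stage noise variance as 1 - a precision-weighted average of prior and data.
pi_posterior_mean <- function(Z, X, mu, lambda) {
  ZtZ    <- crossprod(Z)                    # Z'Z
  pi_hat <- solve(ZtZ, crossprod(Z, X))     # least squares first stage estimate
  solve(ZtZ + lambda, ZtZ %*% pi_hat + lambda * mu)
}

set.seed(1)
Z <- matrix(rnorm(100))
X <- Z * 0.01 + rnorm(100)                     # very weak first stage
pi_posterior_mean(Z, X, mu = -2, lambda = 100) # tight prior: roughly midway between pi_hat and mu
pi_posterior_mean(Z, X, mu = -2, lambda = 1)   # loose prior: close to the near-zero pi_hat
```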

How does this help us?

Well, we require \(\pi_{\text{posterior}} \nrightarrow 0\) whilst \(\hat{\pi} \rightarrow 0\). Therefore, we really, really want \(\mu \nrightarrow 0\).

This gives us two potential solutions:

- Pick a strongly informative prior and fix \(\mu\) away from 0 ourselves - this requires genuine outside knowledge about the first stage.
- Use a Bayesian hierarchical model (BHM) and let groups with stronger instruments inform \(\mu\).

Using a BHM:

\[ \begin{aligned} \mu &\sim N(\hat{\mu}, \hat{V}_{\mu}) \\ \hat{\mu} &= \frac{\sum^J\frac{\hat{\pi}_j}{\sigma^2_j + \lambda^2}}{\sum^J\frac{1}{\sigma^2_j + \lambda^2}} \end{aligned} \]

Where \(\sigma^2_j\) indicates the variance of the model’s “lower level” (\(X_{ij} = Z_{ij}\pi_j + e_{ij};\ e_{ij} \sim N(0, \sigma^2_j)\)).

Essentially, weak instrument groups are identified off the population mean, which produces a sensible measure of underlying uncertainty in weak instrument groups. The population mean is a variance-weighted average of each group’s parameter - if we see a strong signal from one group, it gets more weight in determining the overall population mean.
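In R this is just a weighted mean - a small sketch with made-up numbers, where the weak group 1 estimate contributes very little to \(\hat{\mu}\):

```r
# Sketch with made-up numbers: the population mean as a precision-weighted
# average of group-level first stage estimates.
pi_hat   <- c(0.05, -2.1, -1.8, -2.3, -1.9)  # group 1's instrument is weak
sigma2_j <- c(4.00,  0.2,  0.3,  0.2,  0.25) # group-level sampling variances
lambda2  <- 0.5                              # hierarchical variance

w      <- 1 / (sigma2_j + lambda2)
mu_hat <- sum(w * pi_hat) / sum(w)
mu_hat  # dominated by the strong-instrument groups
```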

Weak instrument identification off the population mean is reflected in larger probability intervals for weaker groups; however, the limiting distribution is well behaved and non-degenerate, unlike its frequentist cousin.

We’re going to focus on building a model around the second bullet point - the BHM - in the rest of this post.

Bayesian Hierarchical Model

The above brushed over what a Bayesian hierarchical model actually does - we’ll talk very briefly about the intuition here.

Suppose we observe multiple natural groups or levels in our data, indexed by \(j\) - maybe a treatment effect from a federal policy that we observe across states or cities. In some states we might face a very weak instrument, in others a much stronger one. We know that what we learn about the treatment effect in one place should probably generalise, to a degree, to another. A Bayesian hierarchical model lets us put structure on this generalisability: we want to separate sampling variation from true heterogeneity in treatment effects, introduce regularisation through hierarchical shrinkage, and overcome the weak instruments problem - a BHM brings all these good things to the table.

Sometimes a BHM is simplest to think about when we draw it; here \(\beta_j\) indexes groups, not regressors:

We model each group’s \(\beta_j\) as being drawn from some underlying population or global parameter, \(\mu\). We’re Bayesians so we put a prior probability distribution over \(\mu\) - we call these hyper-priors, here \(N(0, 10)\).

Typically in frequentist estimation we face two extremes when data is observed across multiple groups - for instance, consider panel data:

- Complete pooling: run one regression on all the data, ignoring the group structure and forcing a single set of coefficients on every group.
- No pooling: run a separate regression for each group, sharing no information between them.

Another alternative is to estimate a fixed effects regression - we estimate a dummy for each individual or time period or both. The fixed effects regression is sort of a halfway house: we let each entity’s intercept vary but still maintain constant slopes across individuals.

The BHM represents a compromise between the two extremes - it can return the pooling or no-pooling estimate, but also some weighted average in between. The solution a BHM returns depends on the hierarchical variance of the data; essentially we ask: how much of the difference in treatment effects is due to sampling variation versus true heterogeneity? As such, Bayesian hierarchical models are often fit with varying slopes and intercepts - not only are there additive differences between group outcomes, but the treatment can have differential effects across groups.1

But we’re not quite done - we want to perform instrumental variables estimation, not simple OLS. The solution - add another dimension:

\[ \begin{aligned} \mathcal{Y}_{j} &= (\textbf{Y}_{j} \quad \textbf{X}_j) \\ \mathcal{Y}_j &\sim MVN( \mathcal{X}_j \mathcal{B}_j, \Omega_j) \quad \forall i, j \end{aligned} \]

where:

\[ \begin{aligned} \text{Y}_{ij} &= \widehat{\text{X}}^n_{ij}\beta_j^{iv} + X_{ij}^e\beta^{e}_j + u_{ij} \\ \text{X}_{ij} &= \text{Z}_{ij}\pi_j + X^e_{ij}\gamma_j + e_{ij} \end{aligned} \]

Stacking into block matrices gives: \[ \begin{aligned} (\textbf{Y}_j \quad \textbf{X}_{j}) &= (\text{Z}_{ij} \quad X^e_{ij}) \begin{pmatrix} \pi_j \beta^{iv}_{j} & \pi_j \\ \gamma_j \beta^{iv}_j + \beta^e_j & \gamma_j \end{pmatrix} \\ &= \mathcal{X}_j (\mathbf{b}^r \quad \mathbf{b}^f) \\ &= \mathcal{X}_j \mathcal{B}_j \end{aligned} \]

We need to use the multivariate normal here to fully account for posterior uncertainty across both relationships. Modelling the problem as two independent normals is the Bayesian analogue to computing two-stage least squares manually and forgetting to adjust our standard errors.

The multivariate normal covariance matrix describes the variances of the reduced form and first stage and the covariance between them:

\[ \Omega_j = \begin{pmatrix} \sigma^{2}_{r,j} & \sigma_{fr, j} \\ \sigma_{fr, j} & \sigma^{2}_{f, j} \end{pmatrix} \]

Priors

We need to put some priors on all of the above; here we use a pretty standard setup:

\[ \begin{aligned} \mathbf{b}^{d}_j &\sim MVN(\mu^{d}, \Sigma^d) & d \in \{r, f\} \\ \mu^{r}, \mu^{f} &\sim N(0, 10) \end{aligned} \]

As is common in Bayesian estimation we split our covariance matrices up into a scale and correlation matrix:

\[ \begin{aligned} \Sigma &= \text{diag}(\tau) \Phi \text{diag}(\tau) \\ \Omega_j &= \text{diag}(\nu_j) \Theta_j \text{diag}(\nu_j) \end{aligned} \]

with priors:

\[ \begin{aligned} \tau, \nu &\sim \text{Cauchy}^+(0, 5) \\ \Phi, \Theta_j &\sim \text{LKJ}(2) \end{aligned} \]

That’s almost all the Greek alphabet I know, so fortunately we can turn to actually estimating this monster now.

Sufficient Statistics

Unfortunately, fitting this model in stan would take a long time. Hierarchical models are notoriously hard to fit and we’re bumping the difficulty up somewhat by estimating a multivariate hierarchical model.

Fortunately, help is at hand - we’re using a multivariate normal likelihood, and since the normal is in the exponential family of distributions the factorisation criterion means we can find sufficient statistics for the likelihood. For the univariate case, suppose we’re estimating the mean of \(n\) draws \(x_i\) from a Gaussian distribution: rather than evaluating the likelihood conditioned on all the data, we can just use \(\sum^n_i{x_i}\) and \(\sum^n_i{x_i^2}\).
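A quick univariate sanity check (a sketch, not the post’s code): the Gaussian log-likelihood evaluated from \(\sum x_i\) and \(\sum x_i^2\) matches the one evaluated from the raw data:

```r
# The Gaussian log-likelihood depends on the data only through n, sum(x) and
# sum(x^2), so we never need to touch the raw observations again.
set.seed(1)
x <- rnorm(1e4, mean = 3, sd = 2)
n <- length(x); sum_x <- sum(x); sum_x2 <- sum(x^2)

loglik_full <- function(mu, sigma) sum(dnorm(x, mu, sigma, log = TRUE))
loglik_suff <- function(mu, sigma) {
  -0.5 * n * log(2 * pi * sigma^2) -
    (sum_x2 - 2 * mu * sum_x + n * mu^2) / (2 * sigma^2)
}

loglik_full(3.1, 1.9)
loglik_suff(3.1, 1.9)  # identical up to floating point error
```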

This is an incredible speed up for stan - rather than looping over \(N\) data points we can loop over \(J\) sets of sufficient statistics.

To move from the univariate to the multivariate case we’re going to reparametrise the problem to create a residual matrix and compute the analogues of \(\bar{\mathbf{x}}\) and \(\hat{\Sigma}\).

Therefore, define a residual matrix: \[ R = (Y_{j} \quad X^n_j) - (Z_{j} \quad X^e_{j}) \begin{pmatrix} \pi_j \beta^{iv}_{j} & \pi_j \\ \gamma_j \beta^{iv}_j + \beta^e_j & \gamma_j \end{pmatrix} \\ \]

For simplicity’s sake let’s rewrite this as:

\[ R = (a - db) \] Therefore we need to factor the likelihood:

\[ L = (2\pi)^{-\frac{np}{2}}|\Sigma|^{-\frac{n}{2}}\exp\left(-\frac{1}{2} \text{trace}(\Sigma^{-1}R'R)\right) \]

where \(p\) is the number of columns of \(R\).

Into something like this: \[ \log L \approx -\frac{n}{2}\log|\Sigma| - \frac{n}{2} \text{trace}(\Sigma^{-1}\bar{R}'\bar{R}) - \frac{1}{2} \text{trace}(\Sigma^{-1}R'R) \]

where \(\bar{R}\) collects the column means of the residual matrix and the \(\approx\) just drops the additive constant.

I’m not sure this is the most elegant formulation but it was the easiest for me to debug in R.

Calculating \(R'R\) (and the analogous \(\bar{R}'\bar{R}\)) is just a matter of expanding out \((a' - b'd')(a-db)\):

\[\begin{aligned} R'R &= a'a - 2a'db + b'd'db \\ a'a &= \begin{pmatrix} Y'Y & Y'X_n \\ X'_nY & X'_nX_n \end{pmatrix} \\ a'd &= \begin{pmatrix} Y'Z & Y'X_e \\ X'_nZ & X'_nX_e \end{pmatrix} \\ d'd &= \begin{pmatrix} Z'Z & Z'X_e \\ X'_eZ & X'_eX_e \end{pmatrix} \end{aligned}\]

That’s the last of the algebra - I promise. It’s worth noting that the elements of each matrix are themselves block matrices; writing the sufficient statistics out in their most general form makes it easier to add multiple instruments and covariates later down the line.
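To make the bookkeeping concrete, here’s one way to compute those per-group cross products in R. This is a sketch, not the post’s hierarchical_iv_sufficient_statistics - in particular I’m assuming the un-barred statistics are centred cross products and the barred ones are outer products of column means, which is the split that makes the log-likelihood above come out right; the original function may differ in detail:

```r
# Sketch: sufficient statistics for one group, with a = (Y, X_n) and d = (Z, X_e).
# Assumption: aa/ad/dd are centred cross products, *_bar are outer products of
# the column means (the mean/deviation decomposition used in the likelihood).
group_suff_stats <- function(Y, X_n, Z, X_e) {
  a <- cbind(Y, X_n)
  d <- cbind(Z, X_e)
  a_bar <- colMeans(a); d_bar <- colMeans(d)
  a_c <- sweep(a, 2, a_bar); d_c <- sweep(d, 2, d_bar)  # deviations from means
  list(
    N      = nrow(a),
    aa     = crossprod(a_c),      ad     = crossprod(a_c, d_c),
    dd     = crossprod(d_c),
    aa_bar = tcrossprod(a_bar),   ad_bar = tcrossprod(a_bar, d_bar),
    dd_bar = tcrossprod(d_bar)
  )
}
```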

Generating Fake Data

First we’re going to define a function to generate our model coefficients. Here we’ve modelled our coefficients of interest, \(\beta^{iv}_j\), as normal with mean 1 and variance 1. The first stage relationship of interest, \(\pi_j\), will be centred at -2 with variance 1 also - however we’ve set \(\pi_1 = 0\) for every draw of the simulation. Our exogenous variables are centred at 0, again with variance 1.

Now we’re going to generate our \(X, Z \text{ and } Y\) data:

In the above function we’ve created dependence across the errors in each equation by setting the covariance on the off diagonal to 2.5.
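The generating code isn’t reproduced here, but a minimal sketch of a data generating process matching that description might look like the following - the function and variable names, the group size of 500, and the error variances of 4 on the diagonal are all my own choices:

```r
# Sketch of the data generating process described above (names and group size
# are my own; only the off-diagonal covariance of 2.5 comes from the post).
library(MASS)

generate_group_data <- function(n, beta_iv, pi_j, beta_e, gamma_e) {
  Sigma_err <- matrix(c(4, 2.5, 2.5, 4), 2, 2)      # correlated equation errors
  errs <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma_err)
  z    <- rnorm(n)
  x_e  <- rnorm(n)
  x    <- z * pi_j + x_e * gamma_e + errs[, 2]      # first stage
  y    <- x * beta_iv + x_e * beta_e + errs[, 1]    # outcome
  data.frame(y = y, x = x, z = z, x_e = x_e)
}

set.seed(2019)
J       <- 5
beta_iv <- rnorm(J, mean = 1)    # coefficients of interest ~ N(1, 1)
pi_j    <- rnorm(J, mean = -2)   # first stage coefficients ~ N(-2, 1)
pi_j[1] <- 0                     # group 1 always gets a useless instrument
sim_data <- lapply(seq_len(J), function(j)
  cbind(j = j, generate_group_data(500, beta_iv[j], pi_j[j],
                                   beta_e = rnorm(1), gamma_e = rnorm(1))))
```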

Now we create a function to actually fit the frequentist and Bayesian models - hierarchical_iv_sufficient_statistics just computes the SSs we derived above. We’ve set the adapt_delta argument of stan’s sampling function extremely high - this model code was originally written for a slightly different data generating process and re-parametrising the model just for this post is something I’d rather avoid. One of the advantages of sufficient statistics is that we can be lazy and often brute-force divergent transitions away with a higher adapt_delta instead of a non-centred parametrisation.
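The call itself is nothing special - a sketch with hypothetical object names, just passing a high adapt_delta through the control list:

```r
# Sketch (hypothetical object names): brute-forcing divergences with a high
# adapt_delta rather than re-parametrising the model.
library(rstan)

fit <- sampling(
  compiled_model,                 # stan_model() object for the model below
  data    = stan_data,            # the per-group sufficient statistics
  chains  = 4, iter = 2000,
  control = list(adapt_delta = 0.999, max_treedepth = 12)
)
```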

Next up is the stan model - one of the drawbacks of sufficient statistics is that stan models, which aren’t particularly literate at the best of times, become a bit harder to follow:

## functions {
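##   // Returns one group's log-likelihood evaluated from its sufficient statistics.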
##   real normal_iv(int N,int K_n, matrix aa, matrix ad, matrix dd, matrix aa_bar, matrix ad_bar, matrix dd_bar, matrix Sigma, matrix beta){
##     real lp;
##     matrix[1 + K_n, 1 + K_n] RR;
##     matrix[K_n + 1, K_n + 1] RR_bar;
##     RR = aa - 2 * ad * beta + beta' * dd * beta;
##     RR_bar = aa_bar - 2 * ad_bar * beta + beta' * dd_bar * beta;
##     lp = -0.5*N*log(determinant(Sigma))  - 0.5 * N * trace(inverse_spd(Sigma) * RR_bar) - 0.5 * trace(inverse_spd(Sigma) * RR);
##     return lp;
##   }
##   
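##   // Stacks the first stage and reduced form coefficients into the block matrix B derived above.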
##   matrix create_beta(int K_n, int K_e, int K_z, vector beta_iv, vector beta_reduced_form, vector Beta_first_stage, vector Pi) {
##     matrix[K_z + K_e, 1 + K_n] Beta_matrix_final;
##     
##     Beta_matrix_final[1:K_z, 1] = to_vector(Pi * beta_iv');
##     Beta_matrix_final[(K_z + 1):, 1] = to_vector(Beta_first_stage * beta_iv') + beta_reduced_form;
##     Beta_matrix_final[1:K_z, 2] = Pi;
##     Beta_matrix_final[(K_z + 1):, 2] = Beta_first_stage;
##     return Beta_matrix_final;
##   }
## }
## data {
##   int<lower=0> J; // number of groups
##   int<lower=0> N[J]; // number of observations
##   int<lower=0> K_n; // number of endogenous predictors
##   int<lower=0> K_e; // number of exogenous predictors
##   int<lower=0> K_z; // number of instruments although this has to equal k_n right now
##   
##   matrix[K_n + 1, K_n + 1] aa[J];
##   matrix[K_n + 1, K_n + 1] aa_bar[J];
##   
##   matrix[1 + K_n, K_z + K_e] ad[J];
##   matrix[1 + K_n, K_z + K_e] ad_bar[J];
##   
##   matrix[K_z + K_e, K_z + K_e] dd[J];
##   matrix[K_z + K_e, K_z + K_e] dd_bar[J];
## }
## parameters {
##   // MVN Sigma
##   vector<lower=0>[1 + K_n] multi_scale;
##   cholesky_factor_corr[1 + K_n] multi_L_Omega;
##   // Covariate Coefs
##   vector[K_z] Pi[J];
##   vector[K_e] Beta_first_stage[J];
##   vector[K_n] beta_iv[J];
##   vector[K_e] beta_reduced_form[J];
##   // RF and FS beta sigmas
##   vector<lower=0>[K_z + K_e] Sigma_FS;
##   vector<lower=0>[K_n + K_e] Sigma_RF;
##   // Hyper priors
##   vector[K_n + K_e] mu_rf_hyper;
##   vector[K_z + K_e] mu_fs_hyper;
## }
## transformed parameters {
##   vector[K_n + K_e] reduced_form_coefs[J];
##   vector[K_z + K_e] first_stage_coefs[J];
##   matrix[K_z + K_e, 1 + K_n] beta[J];
##   matrix[1 + K_n, 1 + K_n] data_Sigma;
##   
##   data_Sigma = diag_pre_multiply(multi_scale, multi_L_Omega)* diag_pre_multiply(multi_scale, multi_L_Omega)';
##   
##   for (j in 1:J){
##   reduced_form_coefs[j] = append_row(beta_iv[j], beta_reduced_form[j]);  
##   first_stage_coefs[j] = append_row(Pi[j], Beta_first_stage[j]);
##   beta[j] = create_beta(K_n, K_e, K_z, beta_iv[j], beta_reduced_form[j], Beta_first_stage[j], Pi[j]);  
##   }
## }
## model {
##   mu_rf_hyper ~ normal(0, 5);
##   mu_fs_hyper ~ normal(0, 5);
## 
##   Sigma_FS ~ normal(0, 1);
##   Sigma_RF ~ normal(0, 1);
##   
##   multi_scale ~ normal(0, 5);
##   multi_L_Omega ~ lkj_corr_cholesky(2);  
## 
##   for (j in 1:J){
##     reduced_form_coefs[j] ~ normal(mu_rf_hyper, Sigma_RF);
##     first_stage_coefs[j] ~ normal(mu_fs_hyper, Sigma_FS);  
##     target += normal_iv(N[j], K_n, aa[j], ad[j], dd[j], aa_bar[j], ad_bar[j], dd_bar[j], data_Sigma, beta[j]);  
##   }
## }
## generated quantities {
##   real pooling_factor_beta_iv;
##   real pooling_factor_pi;
##   
##     pooling_factor_beta_iv = 1 - (Sigma_RF[1] / (Sigma_RF[1] + data_Sigma[1, 1]));
##     pooling_factor_pi = 1 - (Sigma_FS[1] / (Sigma_FS[1] + data_Sigma[2, 2 ]));
## }

I don’t want to spend a huge amount of time on the stan model - the first function returns the log probability using our SSs and the second takes our equation coefficients and stacks them into a single matrix. The remaining lines are similar to a generic stan regression model apart from the data block which accepts sufficient statistics instead of actual data.

Finally, we define an anonymous function to pull all of the above together, record some MCMC diagnostics such as split Rhat, and spit out a dataframe with model estimates. The last few lines perform the simulation.2
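The driver isn’t shown here, but a sketch of that wrapper (the simulate_data / fit_models helpers and the estimates element are hypothetical names) looks something like this:

```r
# Sketch (hypothetical helper names) of the simulation wrapper: fit both models
# to one simulated dataset, keep a couple of MCMC diagnostics, return a data frame.
run_one_draw <- function(draw_id) {
  dat <- simulate_data()                        # fake data as described above
  fit <- fit_models(dat)                        # frequentist IV + the stan model
  data.frame(
    draw        = draw_id,
    max_rhat    = max(summary(fit$stan_fit)$summary[, "Rhat"], na.rm = TRUE),
    divergences = rstan::get_num_divergent(fit$stan_fit),
    fit$estimates                               # point estimates and intervals
  )
}

results <- do.call(rbind, lapply(1:100, run_one_draw))
```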

Results

The results are pretty stark - considering just one draw makes clear the advantage a BHM offers in terms of both point estimates and our measure of uncertainty:

We’re comparing credible intervals with confidence intervals, which is generally frowned upon - in my defence, I feel like everyone does it secretly and both are meant to give a measure of uncertainty. What that uncertainty is will depend on you, I guess.

Comparing the mean squared error, the BHM’s strength becomes even more stark:

| j | BHM | frequentist |
|---|--------|-------------|
| 1 | 0.8558 | 22.8161 |
| 2 | 0.0888 | 0.0840 |
| 3 | 0.0069 | 0.0040 |
| 4 | 0.0532 | 4.6014 |
| 5 | 0.0032 | 0.0024 |

I’m not sure why the fourth group has such a high MSE - it’s possible that we simply got unlucky in one of the draws and a weak first stage cropped up in a group 4 draw.

In this plot we’re trying to create a rough measure of how large the confidence/credible interval is relative to its point estimate - we just divide the interval width by the point estimate’s absolute value.
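In code that’s just (column names hypothetical):

```r
# Relative width: interval width divided by the absolute point estimate
# (column names are hypothetical).
results$rel_width <- (results$upper - results$lower) / abs(results$estimate)
```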

It seems like weak IV’s crazy-large confidence intervals crop up in the no-pooling model, giving us widths 10,000 times larger than their point estimates. The BHM, on the other hand, tends to keep things calm, with only one draw reaching a ratio of 100.

Finally, we can plot the density of the squared errors on log scale:

Conclusion

We’ve seen how Bayesian hierarchical models offer a nice way to overcome weak instrument issues, provided we can exploit information from stronger instruments. I’m a big fan of shrinking IV estimates in general - we know that the IV estimator is biased and has relatively poor asymptotic efficiency compared to its OLS cousin, so adding priors and introducing shrinkage is pretty much a free lunch. We improve finite sample performance and in the limit the prior’s weight tends to 0 anyway.

Chamberlain and Imbens actually had a fully Bayesian solution similar to ours in 1996 - so they only beat us by 23 years! The final, published paper linked in the intro isn’t fully Bayesian, which is a shame - with stan this is pretty easy to do nowadays, as we’ve seen.


  1. For further reading on BHMs, Ben Lambert has a pretty good intro book and Bayesian Data Analysis 3 provides some more depth.

  2. These results actually use 83 draws due to the Rhat checks and accepting only draws with fewer than 10 divergent transitions - this makes me slightly queasy, but still not queasy enough to re-parametrise the model for this post. Also, we only use 100 draws since, whilst the SSs speed up model fitting, Harvard doesn’t let pre-docs use computing clusters for their blog posts on GitHub Pages.