Wanted: datasets & Stan models with many exchangeable observations for the Bayesian infinitesimal jackknife

I am working on the Bayesian infinitesimal jackknife, a technique for quickly approximating the frequentist covariance of Bayesian posterior expectations (see this blog post and this StanCon presentation). I’m looking for publicly available datasets and Stan models to try it out on, and wondered if the Stan forums have any suggestions for where to look.

The ideal applications are models and datasets with the following properties (a rough sketch of this structure follows the list):

  • A lot of exchangeable (conditionally independent) data points (ideally thousands);
  • A handful of global parameters whose posterior means you’re interested in;
  • A lot of parameters that you need to integrate out with MCMC (i.e., you couldn’t just get a good approximate fit with the Laplace approximation or with lme4);
  • Some possibility of model misspecification.
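For concreteness, here is the kind of skeleton I have in mind, written as an rstanarm call. This is purely illustrative; the data frame `my_data` and its columns are made up:

```r
# Purely illustrative sketch: thousands of conditionally independent rows,
# a couple of global coefficients of interest, and many group-level
# intercepts that have to be integrated out with MCMC.
# (`my_data`, `y`, `x1`, `x2`, and `group` are hypothetical names.)
library(rstanarm)

fit <- stan_glmer(
  y ~ x1 + x2 + (1 | group),       # a handful of global parameters + many random intercepts
  data   = my_data,                # thousands of exchangeable observations
  family = binomial(link = "logit")
)

summary(fit, pars = c("x1", "x2")) # posterior summaries of the global parameters
```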

All I need is the data, a model, and a description (in a paper or otherwise) of what the model is useful for. Rstanarm applications with random effects are also welcome.

I have already looked through the relevant datasets in the Stan examples; most are too small for the necessary asymptotics to kick in.


I use the Oregon Medicaid lottery study a lot for homework assignments. You can get the dataset from the Supplementary Information section (which might require that you log in from an MIT IP address). You can simplify things by considering (one of) the “Intent to Treat” models (which estimate the effect of winning the Oregon Medicaid lottery on voting, rather than the effect of subsequently enrolling in / utilizing Medicaid), and you can complicate things by adding predictors.

Thanks! I’m passingly familiar with the dataset from a different project I’m working on. But the published work I know of just uses OLS. Which leads me to an honest question: on this dataset, when regressing outcomes on ITT plus regressors, does Bayesian MCMC really give a practically different answer than OLS, or than a MAP-based Laplace approximation? (cf. my third desideratum)
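The kind of comparison I mean would look something like this (just a sketch; `oregon`, `voted`, `lottery_win`, and `hh_size` are placeholder names, not the actual columns in the data):

```r
# Hypothetical sketch: compare OLS, a MAP / normal-approximation fit,
# and full MCMC for an ITT-style regression with placeholder names.
library(rstanarm)

f <- voted ~ lottery_win + hh_size

fit_ols  <- lm(f, data = oregon)
fit_map  <- stan_glm(f, data = oregon, algorithm = "optimizing")  # mode + normal approximation
fit_mcmc <- stan_glm(f, data = oregon, algorithm = "sampling")

# If these columns are all practically identical, MCMC isn't buying much here.
round(cbind(ols = coef(fit_ols), map = coef(fit_map), mcmc = coef(fit_mcmc)), 4)
```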

I would say that most of the estimation techniques tend to be not very precise but favor slightly positive effects for some outcome variables. Whether an estimate is or is not significant by conventional standards tends to depend on the details of the model.

There is a movement among experimentalists to use OLS or 2SLS because the estimated coefficient on the causal variable is unbiased across datasets for a given vector of treatment assignments (which, to me, seems like an absurd hypothetical in this and most other cases). And they don’t want to use logit / probit / selection models, even when everything is binary, because the inverse link function might be wrong. It would be nice if your method could be used to show when the inverse link function was wrong.

Also, the authors would insist that it is “wrong” to ignore household size, because the lottery was such that if anyone in a household was selected by the lottery then everyone in the household became eligible for Medicaid; thus, larger households had better chances of selection. However, in practice household size is not associated very strongly with voting, which raises the question of whether results that ignore household size are, in practice, actually wrong.

It would be nice if your method could be used to show when the inverse link function was wrong.

Thanks for the careful responses!

That’s an interesting idea. But you would not need to run MCMC simply to detect misspecification in this way: in frequentist models, too, the asymptotic variances under correct specification and under misspecification differ. You would just run the logistic regression in lme4 and compare the sandwich covariance with the inverse Hessian.
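Something like the following, say (a minimal sketch using a plain glm fit rather than lme4, and with made-up column names):

```r
# Minimal misspecification check: if the model-based (inverse-Hessian)
# covariance and the robust sandwich covariance disagree badly, that is
# evidence the likelihood (e.g. the inverse link) is misspecified.
# `oregon`, `voted`, `treatment`, and `hh_size` are placeholder names.
library(sandwich)

fit <- glm(voted ~ treatment + hh_size, data = oregon, family = binomial())

se_model    <- sqrt(diag(vcov(fit)))      # inverse-Hessian standard errors
se_sandwich <- sqrt(diag(sandwich(fit)))  # robust sandwich standard errors

se_sandwich / se_model                    # ratios far from 1 suggest misspecification
```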

I think the best motivation would be cases where you believe your Bayesian model in Stan is misspecified but you still want to use it, because it’s less misspecified, in some practical sense, than what you’d need to do in the corresponding optimization-based procedure. If I understood you right, it sounds like you might be suggesting that probit / logit regression on this dataset could be one such case?