Wanted: datasets & Stan models with many exchangeable observations for the Bayesian infinitesimal jackknife

I am working on the Bayesian infinitesimal jackknife, a technique for quickly approximating the frequentist covariance of Bayesian posterior expectations (see this blog post and this StanCon presentation). I’m looking for publicly available datasets and Stan models to try it out on, and wondered if the Stan forums have any suggestions for where to look.

The ideal applications are models and datasets with the following properties (a rough sketch of this structure follows the list):

  • A lot of exchangeable (conditionally independent) data points (ideally thousands);
  • A handful of global parameters whose posterior means you’re interested in;
  • A lot of parameters that you need to integrate out with MCMC (i.e., you couldn’t just get a good approximate fit with the Laplace approximation or with lme4);
  • Some possibility of model misspecification.
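For concreteness, here is the kind of skeleton I have in mind, written as an rstanarm call. This is purely illustrative; the data frame `my_data` and its columns are made up:

```r
# Purely illustrative sketch: thousands of conditionally independent rows,
# a couple of global coefficients of interest, and many group-level
# intercepts that have to be integrated out with MCMC.
# (`my_data`, `y`, `x1`, `x2`, and `group` are hypothetical names.)
library(rstanarm)

fit <- stan_glmer(
  y ~ x1 + x2 + (1 | group),       # a handful of global parameters + many random intercepts
  data   = my_data,                # thousands of exchangeable observations
  family = binomial(link = "logit")
)

summary(fit, pars = c("x1", "x2")) # posterior summaries of the global parameters
```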

All I need is the data, a model, and a description (in a paper or otherwise) of what the model is useful for. Rstanarm applications with random effects are also welcome.

I have already looked through the relevant datasets in the Stan examples; most are too small for the necessary asymptotics to kick in.


I use the Oregon Medicaid lottery study a lot for homework assignments. You can get the dataset from the Supplementary Information section (which might require that you log in from an MIT IP address). You can simplify things by considering (one of) the “Intent to Treat” models (which estimate the effect of winning the Oregon Medicaid lottery on voting, rather than the effect of subsequently enrolling in / utilizing Medicaid), and you can complicate things by adding predictors.

Thanks! I’m passingly familiar with the dataset from a different project I’m working on. But the published work I know of just uses OLS. Which leads me to an honest question: on this dataset, when regressing outcomes on ITT plus regressors, does Bayesian MCMC really give a practically different answer than OLS, or than a MAP-based Laplace approximation? (cf. my third desideratum)
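The kind of comparison I mean would look something like this (just a sketch; `oregon`, `voted`, `lottery_win`, and `hh_size` are placeholder names, not the actual columns in the data):

```r
# Hypothetical sketch: compare OLS, a MAP / normal-approximation fit,
# and full MCMC for an ITT-style regression with placeholder names.
library(rstanarm)

f <- voted ~ lottery_win + hh_size

fit_ols  <- lm(f, data = oregon)
fit_map  <- stan_glm(f, data = oregon, algorithm = "optimizing")  # mode + normal approximation
fit_mcmc <- stan_glm(f, data = oregon, algorithm = "sampling")

# If these columns are all practically identical, MCMC isn't buying much here.
round(cbind(ols = coef(fit_ols), map = coef(fit_map), mcmc = coef(fit_mcmc)), 4)
```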

I would say that most of the estimation techniques tend to be not very precise but favor slightly positive effects for some outcome variables. Whether an estimate is or is not significant by conventional standards tends to depend on the details of the model.

There is a movement among experimentalists to use OLS or 2SLS because the estimated coefficient on the causal variable is unbiased across datasets for a given vector of treatment assignments (which, to me, seems like an absurd hypothetical in this and most other cases). And they don’t want to use logit / probit / selection models, even when everything is binary, because the inverse link function might be wrong. It would be nice if your method could be used to show when the inverse link function was wrong.

Also, the authors would insist that it is “wrong” to ignore household size, because the lottery was such that if anyone in a household was selected by the lottery then everyone in the household became eligible for Medicaid; thus, larger households had better chances of selection. However, in practice household size is not associated very strongly with voting, which raises the question of whether results that ignore household size are, in practice, actually wrong.

It would be nice if your method could be used to show when the inverse link function was wrong.

Thanks for the careful responses!

That’s an interesting idea. But you would not need to run MCMC simply to detect misspecification in this way: in frequentist models, too, the asymptotic variances under correct specification and under misspecification differ. You would just run the logistic regression in lme4 and compare the sandwich covariance with the inverse Hessian.
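Something like the following, say (a minimal sketch using a plain glm fit rather than lme4, and with made-up column names):

```r
# Minimal misspecification check: if the model-based (inverse-Hessian)
# covariance and the robust sandwich covariance disagree badly, that is
# evidence the likelihood (e.g. the inverse link) is misspecified.
# `oregon`, `voted`, `treatment`, and `hh_size` are placeholder names.
library(sandwich)

fit <- glm(voted ~ treatment + hh_size, data = oregon, family = binomial())

se_model    <- sqrt(diag(vcov(fit)))      # inverse-Hessian standard errors
se_sandwich <- sqrt(diag(sandwich(fit)))  # robust sandwich standard errors

se_sandwich / se_model                    # ratios far from 1 suggest misspecification
```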

I think the best motivation would be cases where you believe your Bayesian model in Stan is misspecified but you still want to use it, because it’s less misspecified, in some practical sense, than what you’d need to do in the corresponding optimization-based procedure. If I understood you right, it sounds like you might be suggesting that probit / logit regression on this dataset could be one such case?