Measurement error model with a large sample?

Hi,

I have a large sample of data – around 360,000 data points. I am trying to fit a simple measurement error model to start. Though ideally would like to fit this measurement error model within a hierarchical model structure.

Given the Bayesian measurement error treats each data point as an unknown parameter this becomes untenable to estimate (i.e. it takes days to compute) given our sample size.

What would you recommend?

Thank you,

Ilan

This is very large indeed. In my experience it is however OK to split such large datasets randomly into smaller chunks that are manageable for Stan, you just need to check that the estimates of the shared parameters are similar across the chunks (which they will be, unless there is some problem with your model).

With the latest Stan additions, you just might be able to push your model to feasible region with all the datapoints, using map_rect (with MPI and/or threading) or offloading some of the computation to GPU (since 2.19, not yet in rstan). I however fear that this might be non-trivial to get working as the documentation is not quite there yet.

Alternatively, you check if you are able to express your model in INLA - it seems this should be possible (a good starting point might be http://www.r-inla.org/examples/case-studies/muff-etal-2014), and you can easily get an order of magnitude speedup. Although INLA does not have such a great support as the Stan ecosystem, it is still a mature software and I can recommend it. The solution is however only approximate, so if you can afford it, it is useful to test your model with Simulation-Based Calibration (I have had one occasion where INLA was moderately mis-calibrated for a hyperparameter)

1 Like

Thanks for this. Very helpful. Anyway to do INLA in R Stan - BRMS interface I wonder?

Not really, INLA is a completely separate piece of software, with its own idiosyncracies and learning curve. Though the R INLA interface is based on formulas and does not require you to learn a new language.

There is a talk about using some of the ideas from INLA to get efficient approximate density computations in Stan, but AFAIK this has always been scheduled to “someday” and no real progress happened in the past year or so.