I’ve updated the case study and put it up on the Stan case studies webpage.
Hi folks, Mitzi recently made me aware of this discussion. Thanks for all your work. We’re quite happy with how well this has progressed.
I’m also pleasantly struck by the similarities between Miguel’s interest in applying Stan to large(ish) areal cancer data sets and our original motivation for reaching out to Andrew for our pediatric pedestrian injury project. (https://projectreporter.nih.gov/project_info_description.cfm?aid=9235300&icde=35379981&ddparam=&ddvalue=&ddsub=&cr=1&csb=default&cs=ASC&pball=) I expect there will be a lot of interest in this project across the spatial analytic community.
On a practical note, we just received updated (to 2014) pedestrian injury data from New York City and hope to report back shortly on how well Stan behaves playing with it. I’d like to write up a full report for peer review and would of course invite Andrew, Mitzi and the development team to join in as co-authors. After NYC, we’ll introduce Stan to state-level and then US-level data.
In addition to the NICHD-funded project that allowed us to contribute to this effort, we’re looking at the city’s recent Vision Zero efforts and would eventually like to move toward a space-time model (https://www.ncbi.nlm.nih.gov/pubmed/25643104), so I’m particularly interested in the discussion of the implications of a constrained vs. unconstrained implementation in that context.
Lastly, on an editorial note, it’s hard to get away from comparisons of the usefulness of Stan vs. INLA. INLA was a response to the inability of MCMC (read WinBUGS/OpenBUGS) to handle reasonably large spatial data. My first attempts at analyzing NYC census tract data with WinBUGS back in 2008 or so took hours and hours and hours. INLA was a welcome alternative, and I think the spatial analytic community owes Havard Rue a debt of gratitude. But it came with its own costs in terms of a sort of black-box approximation that felt less accessible and intuitive than sampling, somewhat convoluted approaches to extracting results, and restrictions on adapting and exploring different models than those cooked into the software. I think Stan now allows researchers to reconsider applying the rich and robust sampling approaches with which they may be more familiar. I’d suggest a model exploration workflow may be GLM to INLA to Stan.
As a dyed-in-the-wool INLA person, I completely agree with this. (With the slight caveat that I’ve never really understood why people think BUGS/JAGS/Stan is less of a black box than INLA is. Most users don’t know how they work either!) But the structural constraints that INLA puts on models are real restrictions, and we never really wanted to be able to fit general models, just to be able to fit a subset well and quickly. But because it’s fast, we have been quietly pushing the idea that you can use INLA to explore complex data (especially in space-time settings) before moving on to a different model-fitting method if necessary.
There’s still a lot of work to get this working properly in Stan. In particular, there isn’t an internal (C++) function like BUGS’ car.normal, which would speed up the computation immensely. There also isn’t any way to fit something like a Leroux model efficiently (not that I like the model, but we should be able to fit it) or a standard CAR of the form Q = tau( D - rho W) because of the way the parameters interact with Q. (Mitzi’s trick only allows parameters to multiply a fixed precision matrix)
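For concreteness, here is a minimal numpy sketch (a toy 4-node adjacency matrix and hypothetical values of tau and rho, not anything from the thread’s models) of the proper CAR precision matrix Q = tau(D - rho W) mentioned above. For 0 <= rho < 1, D - rho W is strictly diagonally dominant with a positive diagonal, hence positive definite:

```python
import numpy as np

# Hypothetical 4-node "path" adjacency 1-2-3-4 (binary, symmetric).
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
D = np.diag(W.sum(axis=1))   # diagonal matrix of neighbor counts

tau, rho = 2.0, 0.9          # precision scale and spatial dependence (toy values)
Q = tau * (D - rho * W)      # proper CAR precision matrix

# For 0 <= rho < 1 each row's diagonal entry exceeds the sum of the
# absolute off-diagonal entries, so Q is positive definite.
eigvals = np.linalg.eigvalsh(Q)
assert np.all(eigvals > 0)
```

The difficulty Dan describes is that rho multiplies only part of the matrix, so Q cannot be written as a scalar parameter times a fixed precision matrix.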
Once that makes its way in, we will need to put some serious thought into how to actually allow for more complex GMRF specifications. This is the same interface challenge that has been hit with the GP models, so hopefully the solution for that will become the solution for GMRFs as well.
The opacity comes from the wall of math in all the descriptions of INLA. BUGS and friends just draw from the posterior for a model you’ve defined explicitly. Sure, the user may not know how the sample is drawn, but they think they know what they’re getting. With INLA, you get whatever that wall of math describes, and frankly, I haven’t penetrated it far enough to understand exactly what I’m getting other than an approximate posterior. To be fair, I haven’t tried very hard, because I’m not very good at linear algebra or exponential families. I finally understand what lme4 was doing (max marginal likelihood), but I missed where Jennifer and Andrew switched estimation strategies in their book on my first pass.
At the risk of drifting very far afield in this topic, the thing that gets me is the magical thinking aspect of MCMC. It’s just not true that BUGS and friends draw from the posterior. Sometimes it does that approximately. We’re better now at telling when that’s happening (especially for HMC/Stan), but the idea that you can write it down and have it work is magical thinking. (Let’s face it - most of this forum is devoted to the fact that even an excellent implementation of MCMC doesn’t just work)
INLA is easy to do the “MCMC style” explanation of (where you leave out all the things that make it work):
- For every hyperparameter theta, approximate pi(x | y, theta) by the Gaussian that matches the location and curvature of the true distribution at the mode.
- Use this in the identity pi(theta | y) \propto pi(y, x, theta) / pi(x | y, theta).
- Compute pi(x | y) = \int pi(x | y, theta) pi(theta | y) dtheta.
From which you can see that the whole thing lives and dies on how well that Gaussian approximation works, and Tierney and Kadane’s paper from the mid 80s covers that in great detail.
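To make the first step concrete, here is a toy sketch (my own scalar example, not INLA’s internals): y ~ Poisson(exp(x)) with x ~ Normal(0, 1/theta), where theta is the prior precision. Newton iterations find the mode of pi(x | y, theta), and the negative inverse curvature there gives the variance of the matching Gaussian:

```python
import numpy as np

def gaussian_approx(y, theta, iters=50):
    """Gaussian approximation to pi(x | y, theta) for the toy model
    y ~ Poisson(exp(x)), x ~ Normal(0, 1/theta).
    Returns (mean, variance) of the Gaussian matched at the mode."""
    x = np.log(y + 0.5)                   # rough starting point
    for _ in range(iters):
        grad = y - np.exp(x) - theta * x  # d/dx log p(x, y | theta)
        hess = -np.exp(x) - theta         # d^2/dx^2 log p(x, y | theta)
        x = x - grad / hess               # Newton step toward the mode
    return x, -1.0 / hess                 # curvature at the mode -> variance

mean, var = gaussian_approx(y=7, theta=1.0)
```

In INLA x is a high-dimensional latent field and theta a small vector of hyperparameters, but the shape of the computation is the same: fix theta, optimise over x, read off the curvature.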
(sorry - this is a pet peeve. If you don’t understand exactly what the program is supposed to have implemented, you’re looking at a black box. By that definition BUGS [where we had to look at the component Pascal and still got it wrong] is a black box to us. As Stan is to most people. That’s not a bad thing. It’s just a thing.)
Might as well just drop the “MCMC” and generalize to every aspect of life! And of course everything having to do with real numbers on a computer is an approximation if they aren’t small integers.
Almost, but still don’t quite understand, but that’s typical for me in stats. I really have tried to work through this stuff, but always get stuck.
To start, I’m not sure what x, theta, and y are supposed to be, but I’m guessing:

x: parameters
theta: hyperparameters
y: observed data
But then I get stuck at the first bullet point. I’m assuming there are multiple theta[k], so when you say “for every hyperparameter,” I can only imagine we’re talking about pi(x | y, theta[k]), but then I don’t know how to make sense of that. Are we dealing with

Normal(x | mu[k], Sigma[k])

as an approximation to

pi(x | y, theta[k])

defined as follows?

mu[k] = argmax_x p(x | y, theta[k])
Sigma[k] = grad_? grad_? p(mu[k] | y, theta[k])

What’s the curvature w.r.t.? I think our only choices are x or theta[k], but the latter’s a scalar and I don’t see how to plug the former in through the argmax.
This is where I give up and go back to my hole to wish for a day when statisticians wrote out algorithms explicitly enough for me to follow. It took me a month to understand GMO for the same reason, and it takes me an hour to refresh my memory again every time I pick it up. There’s something about these algorithms that just throws me. The first explanations I ever understood like this were Alp’s descriptions of ADVI—those were written for computer scientists!
I mean “for every possible value of theta” (which is a vector). One example would be a GP regression: you find the Gaussian approximation to the posterior distribution with the range, smoothness, and variance parameters fixed (those three make up theta). A more “process oriented” way of thinking of it is that you need a machine that takes in a value of theta and spits out the Gaussian approximation to pi(x | y, theta). In most cases this is the result of an optimisation routine (but these are the lower-level details, so just imagine a magical machine).
The curvature is w.r.t. x: it’s the only thing not fixed by the conditioning.
I’m afraid I still have no idea what that means. How can we do a computation for every possible value of theta? Are we breaking the components of theta down somehow? Are we doing more than one normal approximation? Is the approximation multivariate? What exactly is being approximated, and how do we compute the location and covariance of the approximator?
The same way you build a proposal kernel q(theta’ | theta_current): you write a program that, for any possible theta_current, gives you a proposal distribution.
I feel the way I phrased it originally was really weird. But the output from that step could be used as an independence proposal kernel in a Metropolis-within-Gibbs sampler for the full conditional step x’ ~ q(x’ | y, theta).
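A toy sketch of that idea (my own example, not INLA’s code): use a Gaussian fitted to the full conditional as an independence proposal inside a Metropolis step. Here the conditional is itself Gaussian, so the proposal matches it exactly and every proposal is accepted; with a Laplace approximation to a non-Gaussian conditional the acceptance rate would drop below one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy full conditional: x | theta ~ Normal(theta, 1), with theta held fixed.
# The independence proposal q(x') = Normal(m, s^2) comes from a Gaussian
# fit to that conditional (here exact; in general a Laplace approximation).
def mh_step(x, theta, m, s):
    x_new = rng.normal(m, s)
    lp = lambda z: -0.5 * (z - theta) ** 2           # log target, up to a constant
    lq = lambda z: -0.5 * ((z - m) / s) ** 2         # log proposal; ignores current x
    log_accept = (lp(x_new) - lp(x)) - (lq(x_new) - lq(x))
    return x_new if np.log(rng.uniform()) < log_accept else x

theta, m, s = 2.0, 2.0, 1.0   # proposal matches the conditional exactly
x, draws = 0.0, []
for _ in range(5000):
    x = mh_step(x, theta, m, s)
    draws.append(x)
```

The “independence” in the name means the proposal doesn’t depend on the current state, only on the fitted Gaussian; the Metropolis correction is what makes the draws exact rather than approximate.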
I’m just getting more lost. Maybe we should put this off until I see you again—I don’t want to drive you crazy on our lists. My problem is that I need low level pseudocode, not something that takes a proper computational statistician to unpack.
Now I think when you said “for every possible value of theta”, you were just talking about the domain of a function. Is that right? Bringing in proposal distributions has just confused me further.
P.S. I’ve never understood what people meant by “Metropolis-within-Gibbs” or why one is more general (never remember which one people say that is).
Apologies for not answering in a reasonable time; I have been, and still am, in my vacation period. I am glad to know of your project. It seems pretty interesting. On our side, Paqui’s PhD focuses on applying regular and new spatial, spatio-temporal, and multivariate methods to large problems, so our work seems to be closely related.
Regarding your last comment, I agree with you. In my particular opinion, and this is quite personal, INLA is an alternative to MCMC, but it has a cost: less modeling flexibility, retrieving just marginal distributions, and so on. In our experience WinBUGS has been the default tool for years, and we have run it successfully for large problems (I recall spatio-temporal models of roughly 8,000 units × 10 years = 80,000 observations). But we feel that, if making inference in large settings is the primary goal, we should try new, modern tools. This is our goal, and it seems that Stan is starting to show its potential for GMRFs. This is good news. We will keep working in this direction.
Hello Mitzi, Charlie, and all:
I was able to successfully run the BYM2 model for the full 2011-2014 NYC crash dataset, subset to injured pedestrians ages 0-19 years. Will run bicycle injuries separately.
I stepped away during the process but I believe it took about an hour in total; the log output following the fourth and final chain was:
Elapsed Time: 1694.32 seconds (Warm-up)
2257.52 seconds (Sampling)
3951.84 seconds (Total)
I did, however, receive a warning message, but was unable to decipher what impact it had on the results. Here are the warning messages and the model output for the first and last 3 census tracts in the dataset:
Loading required namespace: rstudioapi
Warning messages:
1: There were 25 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: There were 11975 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
3: Examine the pairs() plot to diagnose sampling problems

> print(bym2_nyc_stanfit, digits=3, pars=c("lp__", "beta0", "beta1", "rho", "sigma", "mu", "mu", "mu", "mu", "mu", "mu"), probs=c(0.025, 0.5, 0.975));

Inference for Stan model: bym2_predictor_only.
4 chains, each with iter=6000; warmup=3000; thin=1; post-warmup draws per chain=3000, total post-warmup draws=12000.

            mean  se_mean      sd      2.5%       50%     97.5%  n_eff  Rhat
lp__   15085.423    3.961  55.209 14974.140 15085.492 15192.924    194 1.026
beta0      0.622    0.279   0.408    -0.114     0.740     1.122      2 5.971
beta1      0.070    0.001   0.004     0.061     0.071     0.079     60 1.032
rho        0.976    0.001   0.005     0.965     0.976     0.985     26 1.098
sigma      3.176    0.036   0.193     2.822     3.162     3.580     30 1.096
mu         4.619    0.789   1.261     2.435     4.744     6.817      3 2.431
mu         6.110    1.003   1.595     3.336     6.260     8.817      3 2.573
mu         6.541    1.091   1.705     3.530     6.730     9.387      2 2.783
mu         3.946    0.695   1.103     2.060     4.041     5.982      3 2.342
mu         5.484    0.940   1.496     2.840     5.639     8.125      3 2.372
mu         3.313    0.552   0.878     1.777     3.388     4.902      3 2.365
Any thoughts on the warning? I’m too new with these models to parse it out.
[edit: escaped system output for readability]
That’s a warning about the way HMC worked. It said that in 25 proposals the MCMC scheme detected that there could be something wrong.

Now, this is not something to panic about immediately - the diagnostic throws out a lot of false positives. There is a lot of information in the case study linked below.

The take-away message is that if these warnings seem to be concentrating in a particular part of the parameter space, then there may be some problems. Otherwise they’re probably false positives.

Hope this helps!
Concentration of divergences
More seriously, most of your iterations are not as effective as they could be because the trajectories are hitting the default size limit. You can see this manifest in the terribly small effective sample sizes (notice too that the Rhats are waaaaaay too big). This will only get worse if you increase adapt_delta per the recommendations.
The two links in the warnings do a good job of summarizing the problems and some ways to address them, especially in the treedepth case. In addition to these links and the link @Daniel_Simpson included above, you might also want to check out http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html.
Long story short, this fit is not to be trusted.
I edited the original post so that the n_eff are readable, and this is clearly a problem.
@kwheeler-martin: In the event someone else hasn’t already helped offline, can you attach the Stan program you used? Just in case there’s something obvious there.
Hi all again,
We have kept working on the BYM model in Stan. We were worried about the markedly different times that Stan took to run different datasets on the same region of study (Spain). For the stochastic restricted model this happened for a standard deviation on the mean of 0.001 (see results in the Resul-BYM-StochasticRestriction2.pdf document of the /fisabio/BYM-stan repo at github) but not when that standard deviation was 0.0001 (see results in the Resul-BYM-StochasticRestriction1.pdf document of the same repo). In this latter case all computing times for all datasets were comparable. We do not know why this happens, but we mention it in case it could be of use for you.
We checked that the different computing times were due to some specific chain that took substantially more time to run than the rest of the chains in that simulation. At first, we thought that this could be due to the starting values of the chains, but when we played around with this factor we did not see any difference. Interestingly, we later realized that the bad performance of some of the chains did not hold for the whole simulation period; it suddenly appeared at some point during the simulation. See these two screen captures:
As you can see, the performance of all three chains is basically equal at the beginning of the simulation, but at about 30% of the simulation one of the chains starts to perform much worse than the others. If we could fix this, it would mean a significant computing improvement for these spatial models. So, do you have any idea why this could be happening or, even better, how it could be fixed?
Maybe this issue is not strictly related to spatial models but is instead a more general Stan question. Apologies for this.
Is that a posterior standard deviation for a parameter? If so, rescaling so it’s more on the unit scale can help considerably.
You might want to read Michael Betancourt’s case study on divergences. Are you getting divergence warnings? What’s the target acceptance rate you’re using and have you tried increasing it?
If the problem’s difficult, it may require reparameterizing. What happens is that a centered parameterization is more efficient when the data highly determine the posterior, and a non-centered parameterization is more efficient when they only loosely determine the posterior.
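A minimal sketch of the two parameterizations (toy values for mu and sigma). Both generate the same distribution for theta, but a sampler sees very different geometry when mu and sigma are themselves parameters being estimated, because the non-centered version decouples theta from them:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 5.0   # toy hyperparameter values

# Centered: theta ~ Normal(mu, sigma) directly; theta's scale is tied to
# sigma, which creates funnel-shaped posteriors when the data are weak.
theta_centered = rng.normal(mu, sigma, size=1000)

# Non-centered: sample a standard normal and transform; the sampled space
# is isotropic, and sigma enters only through the deterministic transform.
theta_raw = rng.normal(0.0, 1.0, size=1000)
theta_noncentered = mu + sigma * theta_raw
```

In Stan terms, the non-centered version declares theta_raw as the parameter with a std_normal prior and defines theta as a transformed parameter.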
I should add that if it’s only a 50% increase in time and the adaptation parameters (mass matrix and step size) wind up in the same place and R-hat and n_eff look OK, then there shouldn’t be an issue with the draws.
You probably want to do much more warmup than 400 iterations. If you start with a small step size, set a high target acceptance rate, and warm up longer, you’ll get much more stable sampling times and probably higher n_eff overall per unit time.
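For reference, all the knobs mentioned in this thread are set at the sampler call: in rstan via control = list(adapt_delta = 0.99, max_treedepth = 15) plus the warmup argument. A hypothetical cmdstanpy equivalent (placeholder file names, illustrative values only) would look like:

```python
from cmdstanpy import CmdStanModel

# Placeholder file names; substitute your own model and data.
model = CmdStanModel(stan_file="bym2_predictor_only.stan")
fit = model.sample(
    data="nyc_ped_injuries.json",
    chains=4,
    iter_warmup=2000,    # much more warmup than 400 iterations
    iter_sampling=1000,
    adapt_delta=0.99,    # higher target acceptance rate
    max_treedepth=15,    # relax the default trajectory-size limit
    step_size=0.01,      # start with a small step size
)
```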
Thank you. Apologies for falling off the radar, just returning from vacation. Running some basic diagnostics offline (I am green). As it turns out two regions (NYC census tracts) wound up having no neighbors and this may have something to do with the problematic fit. Will also try increasing treedepth.
Charlie also introduced me to some alternate adjacency methods that might work better now that we’re dealing with NYC as a whole, to account for connectivity across boroughs/waterways where appropriate. If interested, http://www.injuryepi.org/resources/spatialEpiBook.pdf starting page 37.