Case study on spatial models for areal data - Poisson CAR/IAR

Ok I agree with you about needing phi_std, but the original code doesn’t do that.

A fun/tedious fact about Markov models is that this won’t put them on the unit scale (hence the model scaling in the bym2 example). I’m working on an implementation of that scaling to break the tedious INLA dependency.

OK, now I understand the problem (thanks Rob and Imad!) and yes, you’re correct that the BYM model was doing the wrong thing. I changed the model and now Stan and WinBUGS are arriving at the same answer to w/in se_mean on the Scotland lip cancer data.

the statement:

target += -0.5 * dot_self(phi[node1] - phi[node2]);

is computing on the unit scale - so yes, this is an ICAR component with sd 1.
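for context, a minimal sketch of the surrounding declarations this statement assumes (names follow the case study’s edge-list encoding; this is illustrative, not the full model):

data {
  int<lower=1> N;                         // number of areal units
  int<lower=1> N_edges;                   // number of neighbor pairs
  int<lower=1, upper=N> node1[N_edges];   // first member of each pair
  int<lower=1, upper=N> node2[N_edges];   // second member of each pair
}
parameters {
  vector[N] phi;                          // ICAR component, unit scale
}
model {
  // pairwise-difference form of the (improper) ICAR log density, sd 1;
  // phi still needs a sum-to-zero constraint to identify its overall level
  target += -0.5 * dot_self(phi[node1] - phi[node2]);
}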

but this statement is doing the wrong thing:

phi = phi * sigma_phi;

cf. the discussion in the Stan reference manual: the section “Parameterizing Centered Vectors” has an example of a vector beta[K] with a sum-to-zero constraint and K-1 degrees of freedom, derived from a vector beta_raw[K-1], and it states that:

Placing a prior on beta_raw in this parameterization leads to a subtly different posterior than that resulting from the same prior on beta in the original parameterization without the sum-to-zero constraint. Most notably, a simple prior on each component of beta_raw produces different results than putting the same prior on each component of an unconstrained K-vector beta.

the right thing to do is put the non-centered parameterization of both random effects components in the model here:

y ~ poisson_log(log_E + beta0 + beta1 * x + phi * sigma_phi + theta * sigma_theta);

and again in the generated quantities block:

vector[N] mu = exp(log_E + beta0 + beta1 * x + phi * sigma_phi + theta * sigma_theta);

instead of doing this work in the transformed parameters block. a follow-on simplification is that you can just work directly with parameter theta - you don’t need parameter theta_std anymore.
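putting those pieces together, here is a minimal sketch of the whole model (the data names N, N_edges, node1, node2, x, log_E, y are illustrative, following the snippets above; priors on the betas and scales, and the sum-to-zero handling of phi, are omitted):

data {
  int<lower=1> N;
  int<lower=1> N_edges;
  int<lower=1, upper=N> node1[N_edges];
  int<lower=1, upper=N> node2[N_edges];
  vector[N] x;                      // covariate
  vector[N] log_E;                  // log of the expected counts
  int<lower=0> y[N];                // observed counts
}
parameters {
  real beta0;
  real beta1;
  real<lower=0> sigma_phi;          // scale of the spatial (ICAR) component
  real<lower=0> sigma_theta;        // scale of the heterogeneous component
  vector[N] phi;                    // ICAR component, unit scale
  vector[N] theta;                  // heterogeneous component, unit scale
}
model {
  target += -0.5 * dot_self(phi[node1] - phi[node2]);   // ICAR, sd 1
  theta ~ normal(0, 1);                                  // unit-scale heterogeneity
  // non-centered scaling happens only in the linear predictor:
  y ~ poisson_log(log_E + beta0 + beta1 * x + phi * sigma_phi + theta * sigma_theta);
}
generated quantities {
  vector[N] mu = exp(log_E + beta0 + beta1 * x + phi * sigma_phi + theta * sigma_theta);
}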

this is great - I will update the models and the case studies to correct this.

that said, the parameterization proposed in BYM2 seems like the better way to go.

cheers,
Mitzi


Hi Daniel,

some comments on your comments :):

I completely agree: in my opinion, the implementation you are talking about is perfectly right.

I also agree. Setting the “precision” to 1 does not mean setting the standard deviation to 1 in ICAR random effects, in part because each value of the random effect has a different standard deviation. The same applies to the “precision” parameter tau, since that parameter is multiplied on the diagonal of the precision matrix by the number of neighbours, so we do not have a single common precision for all the observations. This is why I have put “precision” in quotes at the beginning of this paragraph. Anyway, when I talk about setting the precision or standard deviation, I mean fixing the scale of the ICAR distribution, which in some sense means fixing either the “precision” or the “standard deviation” (in a broad sense) to a certain value.

As mentioned below, the original implementation did not include the sigma_phi^2 in the target term, so the model is not doing exactly that. Anyway, we thought about the modelling alternative that you are proposing. Just one comment about that: in the expression for log(p(phi)) that you put above, one term is missing, corresponding to the (determinant of the precision matrix)^(1/2). The log-determinant of (D - W)/sigma^2 can be written as -2(N-1)*log(sigma) plus the log-determinant of D - W (* I put two technical asides on this in the next paragraph). The second of these terms is just a constant, but the first of them does matter for the log-likelihood. Thus, implementing this proposal in terms of a “standard deviation” sigma_phi should, in our opinion, be done as:

target += -0.5 * dot_self(phi[node1] - phi[node2]) / square(sigma_phi) - (N - 1) * log(sigma_phi);

About the technical asides: the N-1 in (N-1)*log(sigma_phi) is valid if the region of study has a single connected component. If it were composed of K isolated connected components, the N-1 should be changed to N-K. Besides, D-W is singular, so its log-determinant is (minus) infinity. Nevertheless, the sum-to-zero restriction leaves out the null dimension of that matrix, and the determinant of D-W can be replaced by a generalized determinant, which returns a finite constant value that we do not have to take care of. That is why that term is not included in the last expression.
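For reference, a sketch of where these two terms would sit in a centered Stan implementation (illustrative only, under the single-connected-component assumption and dropping the constant generalized determinant of D - W):

data {
  int<lower=1> N;
  int<lower=1> N_edges;
  int<lower=1, upper=N> node1[N_edges];
  int<lower=1, upper=N> node2[N_edges];
}
parameters {
  vector[N] phi;                  // spatial effect on its natural (centered) scale
  real<lower=0> sigma_phi;        // standard deviation of the ICAR component
}
model {
  // quadratic form with precision (D - W) / sigma_phi^2 ...
  target += -0.5 * dot_self(phi[node1] - phi[node2]) / square(sigma_phi);
  // ... plus the sigma-dependent part of the log generalized determinant
  target += -(N - 1) * log(sigma_phi);
  // (prior on sigma_phi and the sum-to-zero restriction on phi omitted)
}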

Anyway, in the end, we did not implement this proposal, since it would be a centered parameterization of the ICAR random effects in the BYM model, which seems to be an inadvisable option according to the Stan manual. So we opted for other modeling options with explicit non-centered parameterizations.

Ok, I agree that this could be improved. We are on that. Regretfully, INLA is not an option for many of the models that I am handling, which is why we are trying Stan and considering some other MCMC options. Anyway, for now, what amazes us is the competitiveness of WinBUGS, which I expected to be beaten by far by any alternative implementation of the BYM model, for example with more modern MCMC tools. For now, with our still preliminary results, WinBUGS unexpectedly remains a competitive option. Anyway, as you mention, I am also sure that more efficient Stan implementations are possible and that Stan will be able to show all its power in disease mapping problems too. We are on that ;).

Thanks for your answers and your time :).

Hi Mitzi,

Thanks for the update.

Your implementation is now very similar to our original implementation from the summer of 2016 that I talked about with Bob (see the document Resul-BYM-OurOriginal.pdf in the GitHub repo fisabio/BYM-stan). The only major difference between your latest implementation (the one suggested in your previous post) and our original implementation is that, for the heterogeneous term, we set:

target += -0.5 * dot_self(het);

instead of

het ~ normal(0.0, 1.0);

since we thought that the first of these expressions could be more efficient. Thus, the two main differences between our original implementation and your (fixed) previous modelling proposal (file Resul-ModelMitzi-Stan.pdf) are the prior of the heterogeneous random effects mentioned above and the coding of the random effects: in your original proposal the scale was added as a transformed parameter, whereas we proposed to do it in the linear predictor of the Poisson distribution, as you propose now. Regretfully, some of these issues make the model significantly slower, since the computing times of our original implementation (file: Resul-BYM-OurOriginal.pdf) are substantially higher than those of your (fixed) previous proposal (file: Resul-ModelMitzi-Stan.pdf). Note also that the computing results in Resul-BYM-OurOriginal.pdf corresponded to a different (older) machine than the one we are using now, so this could also be an important factor. We’ll try to investigate this as well …
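(As a side note, the two prior statements above add the same quantity to the target up to a constant, since Stan drops the normalizing constant of the sampling statement; a tiny self-contained illustration, with an arbitrary size for het:)

parameters {
  vector[100] het;   // heterogeneous effects (size arbitrary here)
}
model {
  // two equivalent ways to put a standard-normal prior on het
  // (use one or the other; both add -0.5 * het' * het, up to a constant):
  target += -0.5 * dot_self(het);
  // het ~ normal(0, 1);
}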

We are going to run your new modeling proposal on all our datasets over this weekend in order to check whether the inefficiency comes from the coding of the N(0,1) term, the place where the standard deviation of the heterogeneous term is included (transformed parameters or linear predictor), or the machine used. After the weekend we will be able to say something about this.

By the way, because of the possible efficiency gain of putting the scale of the random effects in the transformed parameters, we tried to do the same for the spatial term in (the fixed version of) your previous proposal. That is, we tried the model in Resul-ModelMitzi-Stan.pdf but putting the scale of the spatial term in the transformed parameters section as well (see file BYM-Crash.stan in the fisabio/BYM-stan GitHub repo). Regretfully, as I mentioned in my first post, this model crashed at the compilation stage and we have no idea what the reason could be (we tried several modifications but none of them worked). Does any of you have an explanation for this?

Best wishes.

dear Miguel,

the biggest difference between your implementations and the ones that Bob and I have been proposing is the likelihood function in each: your Poisson model has only an exposure and an intercept term, but no fixed-effects covariates.

  // Likelihood
  log_mu = log(Exp) + m + sd_sp * sp + sd_het * het;
  Obs ~ poisson_log(log_mu);

if you treat the exposure instead as the single covariate, this provides another parameter for the model to fit.
the Brooklyn dataset that I have been using has this characteristic - the data x is in fact a rate-scaled population value, where all tracts are scaled using the same city-wide accident rate.

  y ~ poisson_log(beta0 + beta1 * x + phi * sigma_phi + theta * sigma_theta);

I took this dataset and fit it treating the data as an exposure term; the fit was worse and took 500 seconds, as opposed to 200 seconds using the BYM model (not BYM2) with the data as a predictor only.

cheers,
Mitzi

How are you making the comparisons? MCMC is tricky and it’s easy to make misleading comparisons if you’re not careful. To avoid any subtleties you need to

    1. Verify that the compared samplers are giving good finite-time estimates (formally, verifying that a central limit theorem holds and that concepts like effective sample size are appropriate). This means at the very least running multiple chains (>= 4) from diffuse initial conditions and checking the Gelman-Rubin statistic; in Stan that also means checking all of the HMC-specific diagnostics. Because HMC is typically more robust theoretically than Gibbs or random-walk Metropolis, it should be sufficient to verify Stan and then check that WinBUGS yields the same estimates as Stan.

    2. Once you have verified that both samplers are giving equivalent estimates, then you can compare performance. Here the only relevant measure is effective sample size per unit time, or often more appropriately, time per effective sample (as we often have a target effective sample size in mind and time is the variable quantity). Gibbs and random-walk Metropolis are often much faster per iteration but much, much slower in effective samples per iteration, so HMC will dominate in the relevant measures.

Hi Michael:

We have verified convergence in Stan and WinBUGS following the same procedures (the monitor function of the rstan library). Our results report the same convergence statistics (calculated in the same way) for both procedures. When both packages achieve convergence for the same dataset (which is a bit more frequent in WinBUGS than in Stan), they have always yielded equivalent results up to, let’s say, a couple of decimal places.

As you will find in our reports, both computing times and minimum effective sample sizes (the minimum across all variables retrieved in our models) are reported, and comparisons between WinBUGS and Stan are usually made according to both quantities.

Best wishes.

Hi Mitzi,

I am always referring to the raw BYM model, without covariates, and none of the numerical comparisons that Paqui is performing contain any covariates.

I have more results of the simulations carried out this weekend:

  • First, regarding the machine used, it definitely matters. We have rerun our original analyses on our new machine, where the rest of the simulations have been run, and computing times drop from 14553 seconds on average to 6954 seconds. We have replaced the old document Resul-BYM-OurOriginal.pdf on GitHub with a newer version containing the computing times on the same machine as the rest of the simulations. Thus the times in this document are comparable to those in the rest of the documents.
  • We have simulated from the last of Mitzi’s proposals, with poisson_log(log_E + beta0 + phi * sigma_phi + theta * sigma_theta), for our Spanish mortality datasets (model and results in Resul-BYM-MitziNew.pdf in the GitHub repo Fisabio/BYM-Stan). Computing times are on average 6060 seconds per run. We have also run a second version of Mitzi’s models without priors (model and results in Resul-BYM-MitziNew-NoPriors.pdf) and we have seen no great differences in computing times between this (6315 seconds per run) and the previous model. When convergence is achieved, results for the latter model with no priors coincide with those obtained from WinBUGS (file Resul-BYM-WinBUGS.pdf).
  • The main difference between our initial implementation (which ran in 6954 seconds on average) and Mitzi’s new implementation without priors, which ran in 6315 seconds, is the implementation of the prior distribution of the heterogeneous term. In our initial implementation it was coded as target += -0.5 * dot_self(het), as we thought that this was very efficient, but the implementation in Mitzi’s model with theta ~ normal(0, 1) seems more efficient.
  • So, by now we have two valid implementations of the BYM model with no priors, Resul-BYM-MitziNew-NoPriors.pdf and Resul-BYM-NoPriors-Stan.pdf. The single difference between them is that for the model in MitziNew the linear predictor is log_E + beta0 + phi * sigma_phi + theta * sigma_theta, while for the other one it is log_E + beta0 + phi * sigma_phi + theta and the scale on theta is included as a transformed parameter. Computationally, the first of them takes 6315 seconds on average to run and the second 4937 seconds, which seems a pretty interesting difference (a rough sketch of this second variant follows below). We haven’t been able to implement phi * sigma_phi as a transformed parameter (model BYM-crash.stan), and according to these results it could also yield a significant improvement in computational terms. Would this be possible? Would you know how to do it (without making Stan crash during compilation of the model)? In my opinion this would be a promising improvement of the BYM model.
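A rough sketch of what this second variant looks like, as we understand it (names like theta_scaled are only illustrative; the first variant instead writes theta * sigma_theta directly in the linear predictor and has no transformed parameters block):

data {
  int<lower=1> N;
  int<lower=1> N_edges;
  int<lower=1, upper=N> node1[N_edges];
  int<lower=1, upper=N> node2[N_edges];
  vector[N] log_E;
  int<lower=0> y[N];
}
parameters {
  real beta0;
  real<lower=0> sigma_phi;
  real<lower=0> sigma_theta;
  vector[N] phi;
  vector[N] theta;
}
transformed parameters {
  vector[N] theta_scaled = theta * sigma_theta;   // scaling of theta moved here
}
model {
  target += -0.5 * dot_self(phi[node1] - phi[node2]);
  theta ~ normal(0, 1);
  y ~ poisson_log(log_E + beta0 + phi * sigma_phi + theta_scaled);
}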

I forgot to attach this …

Besides this news, I also find it interesting to report further results on the computational performance of the BYM model in Stan with some other datasets. In the summer of 2016 we ran the BYM model (the one in the file Resul-BYM-OurOriginal.pdf) also for the Scotland lip cancer dataset. For that dataset Stan beat WinBUGS: WinBUGS took 11.3 seconds to get a min_n_eff of 230, while Stan took 11.9 seconds to get a min_n_eff of 775 (computations carried out on our old machine). Thus, Stan’s performance is satisfactory in this case. We also ran our model for several mortality datasets corresponding to the Valencian Region (one of the 17 regions that make up Spain, composed of 540 municipalities). For these datasets (46 mortality causes) WinBUGS took, on average, 85 seconds to get an average min_n_eff of 130. For those same datasets Stan took, on average, 165.5 seconds to get a min_n_eff of 273. Thus, for the Valencian Region, Stan’s and WinBUGS’s performances are pretty equivalent. In summary, Stan is more competitive than WinBUGS on smaller datasets, but this relation reverses for large datasets (the whole of Spain, with about 8000 units), i.e. paradoxically Stan seems to scale worse than WinBUGS. Does this make sense?

Keep working on this …

The scaling is going to depend on the data set, the model, and how well the model is specified for the data. In some cases, the conjugacy inherent in a BUGS model can lead to very efficient computations.

Can you say something about posterior correlations in these different cases? Does someone know whether WinBUGS is using blocked Gibbs for conditional multivariate normals in these specific models (I would guess yes)?

Aki

Hi again

Regarding Bob’s comment:

To our knowledge, no conditionally conjugate prior distribution is used in our WinBUGS BYM implementation.

Regarding Aki’s comment:

The updater used for sampling the heterogeneous and spatial random effects in WinBUGS is UpdaterRejection.loglin. I think that this is a multivariate Metropolis updater with a proposal function that mimics the posterior distribution of these effects, as proposed in Gamerman, D. (1997) Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing, 7, 57-68. Thus, if I am not wrong, WinBUGS does blocked sampling of the random effects in BYM.

By the way, Paqui is performing some simulations and we think that we have promising results on the efficient implementation of the BYM model in Stan. We will send more results on this tomorrow, since we are now preparing the reports.

Best wishes.

If we were looking at the right thing before, BUGS uses the conditional specification of the ICAR model, which reduces to a conjugate normal RNG per dimension plus some normalization for identifiability (sum to zero). That’s why I was saying I’m not surprised it’s very efficient. Other conjugate BUGS models can be very fast, too.

Not sure what it does for the heterogeneous (non-spatial) random effects. Between the heterogeneous and spatial effects, the model’s not identified other than through the prior. That’s why Dan Simpson was proposing a kind of mixture formulation where there’s a balance between the spatial and heterogeneous random effects.

I’m not sure car.normal does single-site updating. It really shouldn’t, because that’s very inefficient (the spectral gap will close as the graph gets bigger).

OK, went back and dug up the reference I found, which was a PyMC (before PyMC3!) implementation:

http://glau.ca/?p=340

It quotes the actual Component Pascal code for car.normal:

the car.normal() distribution is exactly what the manual said:

PROCEDURE (node: Node) Sample (OUT res: SET);
    VAR
        i, len: INTEGER;
        mu, tau, value, wPlus: REAL;
BEGIN
    len := LEN(node.neighs);
    i := 0;
    mu := 0.0;
    wPlus := 0.0;
    WHILE i < len DO
        mu := mu + node.neighs[i].value * node.weights[i];
        wPlus := wPlus + node.weights[i];
        INC(i)
    END;
    mu := mu / wPlus;
    tau := node.tau.Value() * wPlus;
    value := MathRandnum.Normal(mu, tau);
    node.SetValue(value);
    res := {}
END Sample;

I don’t think that’s a multivariate normal since mu and tau both appear to be scalars.

I’m rolling my eyes very hard here.

With weak data, an NCP in Stan should beat the pants off a single-site Gibbs sampler (in fact, you can probably prove it with Gaussian data).

A partial non-centering (ie evaluate the log-Hessian at the MAP and use that for the transformation rather than the prior precision) would be even more efficient.

Hi Bob

I am pretty convinced that WinBUGS does blocked sampling of the random effects in the BYM model. When running the BYM model in WinBUGS, it states (Info > Node Info menu) that the random effects are sampled with the UpdaterRejection algorithm. Although the WinBUGS documentation on updater algorithms is ugly, you can find, for example, in this ppt from Bill Browne an interesting description of the updater rejection sampling used by WinBUGS. As mentioned in the ppt, WinBUGS reproduces the method proposed by Gamerman (1997), which is also described in detail in Rue & Held’s monograph on GMRFs, for sampling from the posterior distribution of a GMRF. As I mentioned previously, this method does a block sampling of the whole vector of random effects, using a proposal function that mimics its posterior distribution.

I am not an expert at all in Component Pascal, but in my opinion the routine attached would be used to sample from the prior of the CAR distribution, not from its posterior distribution, which would use Gamerman’s method.

Best.

Thanks for the clarification. Sounds like it pretty much has to be blocked given the performance and the output.

If anyone’s on the OpenBUGS or WinBUGS mailing lists, they could ask. Or someone could dive into the code and verify. I don’t have time to do it as it doesn’t really matter for Stan how BUGS fits these models.

Hi,

We are back again with the Stan implementation of the BYM model. Paqui and I have been dealing with the sum-to-zero restriction of the spatial random effect and in our opinion we have very good results. First, Paqui has programmed in Stan the BYM model leaving out the intercept and the sum-to-zero restriction on the spatial term. This is equivalent to the BYM model but with the intercept absorbed into the spatial term. We have done this to assess the harmful effect of the sum-to-zero restriction on Stan’s performance. Results for this model are in the file Resul-BYM-UnRestricted.pdf of our GitHub repo at fisabio/BYM-stan. Recall that the Stan results of the original BYM model that would be comparable to this last model are in the file Resul-BYM-NoPriors-Stan.pdf. As can be checked, computing times are much lower for the new (unrestricted) BYM implementation, about 8 times lower. Moreover, the convergence problems for this new implementation seem to have gone. Thus, this new implementation seems much more competitive than the previous one, which took 20.5 seconds per unit of min_n_eff, while this new (unrestricted) implementation needs 3.47 seconds to increase min_n_eff by one unit. This implementation is also more competitive than the one in WinBUGS, since that had a performance similar to the hard-restriction implementation in Stan (20.7 seconds per unit of min_n_eff, file Resul-BYM-WinBUGS.pdf). By the way, we have implemented two versions of the non-restricted model, one with the scale of the random effects included in the linear predictor and another with the scale included as a transformed parameter. We show the results of just the second implementation, as this is a bit faster than the other one, 611 seconds per model (on average) vs 624 seconds.

Although the unrestricted implementation above is a computationally convenient version of the BYM model, sum-to-zero restrictions are necessary for lots of models containing several spatial terms, as for example spatio-temporal or multivariate spatial models. Thus, Paqui has also been working on an efficient implementation of sum-to-zero restrictions for the BYM model in Stan. Specifically, instead of using a hard sum-to-zero restriction like the one in our original proposal of the BYM model, we have used a soft (stochastic) sum-to-zero restriction (I think that this is also the way INLA imposes restrictions …). In more detail, the sum-to-zero restriction is imposed by putting into the model cero ~ normal(phi.mean, epsilon), where cero = 0, phi.mean is the mean of the spatial random effect, and epsilon is a small value for the standard deviation. This ensures that the sum of the spatial random effects is “close” to zero. Obviously, the model with this soft sum-to-zero restriction also contains an intercept term. Results for epsilon = 0.0001 are shown in the file Resul-BYM-StochasticRestriction1.pdf. Once again the convergence problems for this model have vanished and computing times are more satisfactory than for the original hard-restricted implementation, since we now need 8.9 seconds for each unit of min_n_eff. Anyway, computing times are not as satisfactory for the soft-constrained model as for the unrestricted implementation. Note also that the posterior estimates of the sds for all three BYM implementations (hard-restricted, unrestricted and soft-restricted) coincide when convergence is achieved for all three.
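A minimal sketch of the soft restriction as described (names like cero and epsilon follow the description above; this is illustrative, not the exact model in the repo):

data {
  int<lower=1> N;
  int<lower=1> N_edges;
  int<lower=1, upper=N> node1[N_edges];
  int<lower=1, upper=N> node2[N_edges];
  real cero;                      // supplied as 0 in the data
  real<lower=0> epsilon;          // e.g. 0.0001
}
parameters {
  vector[N] phi;                  // spatial random effect
}
model {
  // ICAR prior on phi (pairwise-difference form)
  target += -0.5 * dot_self(phi[node1] - phi[node2]);
  // soft (stochastic) sum-to-zero restriction: keeps mean(phi) close to zero
  cero ~ normal(mean(phi), epsilon);
}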

Finally, we have also played with different values of epsilon in order to assess the sensitivity of our analyses to that value. For epsilon equal to 0.00001 we have found that the model runs substantially slower (results not reported). On the contrary, for epsilon = 0.001 Stan runs approximately as fast as for the unrestricted model and is similarly efficient, since it needs 3.26 seconds per unit of min_n_eff (results can be viewed in the file Resul-BYM-StochasticRestriction2.pdf). Note that the posterior estimates of the sds in this model have not changed as a consequence of increasing epsilon. Regarding the model with epsilon = 0.000001 (results not reported), computing times per dataset were about 10000 seconds, a bit more than twice the time needed for the hard-restricted model. Thus, a hard-restricted model is preferable to a tight soft-restricted alternative.

Thus, in summary, in our opinion the soft-restricted implementation of the BYM model in Stan is far more competitive than the WinBUGS implementation and than the hard-restricted implementation of the model. Anyway, care should be taken in the choice of epsilon, since the performance of the model is quite sensitive to this value. Further work should be done on appropriate choices of epsilon.

Best.

Several things:

  • In hindsight it’s not surprising that this happened - the sum to zero constraint induces an ugly global dependency, so it makes sense that bad things happen in the posterior.

  • Broadly speaking, I like the strategy of not constraining away the intrinsic-ness, but I’m not sure how useful that is for users. In particular, a lot of people will put an intercept in as well. Does this tank the performance?

  • For more complex models (eg Type IV space-time interaction models), the number of constraints you end up with is just ludicrous, so I definitely think this unconstrained method is better (or at least easier to use).

  • As for soft constraints, that’s basically what INLA does. The exact thing it does (if my memory serves) is add a small number to the diagonal of the precision matrix, and then use the formulas in Håvard and Leo’s book to compute the density with the exact constraint.