# Case study on spatial models for areal data - Poisson CAR/IAR

#21

I’ve added a new section. There’s a PR submitted and the branch is here if anyone wants to comment:

#22

@Daniel_Simpson is the scaling that you introduced only applicable when using the poisson likelihood, or does it apply to gaussian/binomial likelihoods too?

random =  sigma *(sqrt(rho)*theta_std + sqrt(1-rho)*scaling_factor*phi);


#23

It’s relevant for any model specified in terms of the precision. The
Riebler paper references Sørbye and Rue where it’s discussed at length.

D

#24

@imadmali The scaling factor is needed to ensure that var(random[i]) = sigma^2. Otherwise it’s difficult to interpret (and hence set a prior) for the standard deviation. This is the same advice that we always give (make sure your model is scaled!), but applied to the second-order structure of a model.

Most people don’t do this, and it can lead to problems!

#25

To check that this model recovers the spatial relationships, we expect to see positive covariance between adjacent regions and low or negative covariance between non-adjacent regions. Furthermore, the covariance between adjacent regions which have a small number of neighbors should be more positive than the covariance between regions which have many neighbors, per the conditional formulation of the ICAR.

Should that latter sentence be more positive with more neighbors and less positive with less neighbors?

#26

no, it is correct as it stands. when a region has only a few neighbors, each of those neighbors contributes more information, so the co-variance be more positive.

#27

Ah, I see. Would you mind adding a sentence or two elaborating on that point so that denser people such as myself aren’t at the risk of confusion?

#28

absolutely will do!

#29

This isn’t actually true! A priori, you can get negative correlation. There are good reasons for this (to do with the sum to zero constraint). For a proper CAR model, you usually don’t get it.

There is a condition in the linear algebra literatures that says that if Q is an “M-Matrix”* then all of the entries of Q^{-1} are postiive and hence all of the correlations are positive. So if you add a small number to the diagonal, then you get positive correlations. One of the standard CAR models (Q = kappa*I + Q_{ICAR}) has this property. A statistical interpretation of this is that if x ~ N(0, Q_{ICAR}^+), then and y|x ~ N(x, \kappa^{-1}I), then y ~ N(0,Q).

*M Matrix: Q = sI - B, where s > largest eigenvalue of B, and the elements of B are all non-negative. A sufficient condition is that Q is strictly diagonally dominant and all the diagonals are strictly positive

#30

does the number of neighbors affect the strength of the correlation between pairs of neighbors?

#31

Honestly, I’m not sure.

What I do know is:

1. If Q is nonsingular, then the correlation decays exponentially in graph distance
2. If Q is singular and a sum-to-zero constraint is used to fix this, then the correlation will not decay (because there is a global constraint). This is also where the negative correlation comes from: if there are large positive components, there also have to be compensating negative components. In particular, the constrained model is no longer Markov.

#32

thanks! I wasn’t taking into account the different possible values for correlations.
removing misleading statements from case study.

#33

@Daniel_Simpson in your revision of the BYM model shouldn’t we be including phi_std_raw ~ normal(0, 1) in the model block?

#34

Good catch @imadmali

#35

Wait… spoke too soon. It’s fine. The prior on that is given in the

block. It shouldn’t be N(0,1).

D

#36

Hi,

First of all, apologizes for the length of this post. We (myself and Paqui Corpas, my PhD student), Mitzi and Bob have been changing e-mails for more than a month about issues related to this example and we feel that now could be a good moment to get more people involved in this conversation. My main goal with all this is applying Stan to large disease mapping problems making it possible to deal with larger datasets than those studied in WinBUGS. I have been a WinBUGS user for disease mapping problems for more than a decade and I feel that I should make room for new tools to apply them to these large datasets. I have some experience also with INLA but it seems a bit restrictive for some of the models that I am using and, sincerely, I prefer MCMC, so Stan has been the first alternative tool that I have considered.

We have read with interest Mitzi’s work on ICAR-based spatial models. First, I would like to note one issue on Mitzi’s current implementation of the BYM model (file: bym_predictor_plus_offset.stan file in the /stan-dev/example-models Github repo at: https://github.com/stan-dev/example-models/tree/master/knitr/car-iar-poisson). Lines 28 to 32 of that model state:

real<lower=0> sigma_phi = inv(sqrt(tau_phi)); // convert precision to sigma
vector[N] phi;
phi[1:(N - 1)] = phi_std_raw;
phi[N] = -sum(phi_std_raw);
phi = phi * sigma_phi;

Thus, if we understand it appropriately, phi is a re-scaled ICAR, with standard deviation sigma_phi. On the other hand, line 37 in the model states:

target += -0.5 * dot_self(phi[node1] - phi[node2]);


Which corresponds to an ICAR distribution of standard deviation equal to 1 (instead of sigma_phi) so this does not agree with the previous lines of code. If we are right (we would appreciate your opinion on this), this could be fixed by making something like

real<lower=0> sigma_phi = inv(sqrt(tau_phi)); // convert precision to sigma
vector[N] phi_std;
phi_std[1:(N - 1)] = phi_std_raw;
phi_std[N] = -sum(phi_std_raw);
vector[N] phi = phi_std * sigma_phi;

and

target += -0.5 * dot_self(phi_std[node1] - phi_std[node2]);

since phi_std (but not phi) follows an ICAR(1) distribution. Regretfully, we haven’t gotten this to work in Stan (it crashes surely because our proficiency is still low with Stan), although we have programmed a slightly modified version of this (phi is not defined as a transformed_parameter but it is computed within the model) that do works, file: Resul-BYM-WithPriors-Stan.pdf in the Github repo fisabio/BYM-stan at https://github.com/fisabio/BYM-stan. The structure of this (and all subsequent) file(s) is the same: 1.- syntax of the model run 2.- syntax used to run the model 3.- Descriptive results.

We have run the original “bym_predictor_plus_offset.stan” model on several of our datasets (same region of study but different causes of death), results are shown in file Resul-ModelMitzi-Stan.pdf in the Github repo above fisabio/BYM-stan, from now on all the documents that we will mention will be available there. We see that posterior summaries of sigma_phi are identical for all datasets where convergence is achieved. Note that the prior mass of the values sampled for sigma_phi, around 0.016, is rather low (prior used for tau_phi is a gamma(1, 1) which places most of the prior mass for sigma_phi above 0.3). So this prior does not explain these identical results and reinforce us to think that the question raised above could really be a concerning issue. On the other hand, we have implemented the changes mentioned above on Mitzi’s current model (model and results are available in Resul-BYM-WithPriors-Stan.pdf). In this case we do not find identical posterior results on sigma_phi, evidencing more sensible results.

Nevertheless, although results in this last model seem sensible most of the causes have considerable part of the posterior mass of sigma_phi below 0.3, despite of the gamma prior used puts low probability there. This is why we have changed those gamma informative priors to uniform priors on the sds, which was the original priors that we wanted to implement originally. We do not want to get into the appropriateness of these or other priors here since our interest is more computational. Nevertheless, we will use these uniform priors on the sds and flat improper priors on the random effects since we have years of extensive experience with them in WinBUGS and we have retrieved in all cases apparently sensible results. Thus, we have implemented the last of the models mentioned in Stan but with flat improper priors for both fixed effects (intercept) and sds of random effects (model and results at Resul-BYM-NoPriors-Stan.pdf). We have also implemented that same model in WinBUGS (model and results at Resul-BYM-WinBUGS.pdf). Results for both models, when convergence is achieved, are mostly identical, at least for the sds so we consider these two implementations of the BYM model in both different packages to be equivalent.

Now we go into the original question that motivated us to get in contact with Bob and Mitzi, Stan’s efficiency in large datasets. The datasets in our example correspond to mortality data in Spain at municipal level (around 8.000 units of analysis). So this seems a large dataset where Stan should show its advantage in comparison to WinBUGS. I apologize for not sharing these datasets but confidentiality issues (small area data) do not allow me, at this moment, to share them. As mentioned, we run the BYM model with flat priors in both WinBUGS and Stan for our datasets (results in files Resul-BYM-NoPriors-Stan.pdf and Resul-BYM-WinBUGS.pdf). As shown, when convergence is achieved, posterior inference on the sds coincides for both implementations. This is good news. Nevertheless we are a bit surprised of Stan’s computational performance as compared to that of WinBUGS. First, we see that for six of our datasets (datasets: 9,10,11,28,30,31) convergence is not achieved in Stan. WinBUGS, on the other hand, just had 3 datasets with (minor in some cases) convergence problems (datasets: 13,17,37): n.eff<100. Moreover, although Stan run for 4400 iterations and WinBUGS run for 11000, WinBUGS runs much faster than Stan. To be fair, the number of effective of iterations of each software should be taken into account in this comparison. But even in that case Stan’s performance is just similar to that of WinBUGS. Specifically, for Stan (results in document Resul-BYM-NoPriors-Stan.pdf), leaving out the 6 datasets with bad convergence, we find that 19.2 seconds are due for increasing the n.eff in unit (per average in all datasets). On the other hand, for WinBUGS (results in document Resul-BYM-WinBUGS.pdf), leaving out the 3 datasets with bad convergence, we find that 18.9 seconds are needed per unit of n.eff. Therefore, WinBUGS seems at least for these datasets as competitive as Stan. Just to let you know min.n.eff in the results mean the minimum n.eff for all the variables retrieved in each model and for comparability the same number of cores (3) have been used for running the Stan and WinBUGS version of the BYM model.

Maybe one could think that priors used could be having some effect on computing times. In our opinion priors are not so important. Results in Resul-BYM-WithPriors-Stan.pdf show computing times for the BYM stan models with the gamma informative priors in Mitzi’s model and Results in Resul-BYM-NoPriors-Stan.pdf show computing times for that model but with improper flat uniform priors for fixed effects and sds of random effects. Results for both models show no significant difference between both models so, this does not seem at a first glance a decisive issue in terms of computational efficiency. A more important issue in our opinion could be differences between the different chains sampled. As shown in the stan implementation of BYM (file Resul-BYM-NoPriors-Stan.pdf), differences between different causes of death can be very important, meanwhile this does not happen in WinBUGS (file Resul-BYM-WinBUGS.pdf). We have seen that when Stan takes longer to run is due to some of the chains sampled, which takes much longer than the rest of them. Is there any way to control this in Stan such as fixing the starting points of the chains as for WinBUGS? Do you have any other suggestion for making Stan more efficient in this context? We are very confident in Stan capabilities for large disease mapping studies but at our first attempt we haven’t really found that hypothetical advantage. Any guidance or suggestion in this sense will be really welcome.

#37

In no particular order:

• Priors are very important for these sorts of models and you will see some prior sensitivity. It’s not very concerning (in that it’s fairly easy to set meaningful priors on a slight generalisation). See the paper by Riebler et al in the references (the idea follows Dean et al and Wakefield, with important modifications). This is implemented in bym2_predictor_plus_offset.stan and stan_INLA_bym2.Rmd shows the correctness of the implementation by comparing with INLA.

• In general, setting the precision=1 in a CAR model will not lead to a standard deviation of 1. See again Riebler.

• I suspect you’re correct about the Besag term (although you really should use the other parameterisation). (@mitzimorris and/or @imadmali: can you check my logic here). I think the prior that’s been implemented is
p(phi) \propto ( sigma_phi^2 \sum_{i~j} ( phi_i - phi_j)^2,
so sigma_phi isn’t the standard deviation but the inverse standard deviation (which would explain your problems). So yes the change you made will fix it, but the other parameterisation is better.

• As for speed, this is a very first implementation. For various technical reasons (mainly the model is built in the Stan language and not implemented in the C++ code) this is slower than it could be. For the moment if you want it to be fast, use INLA.

#38

hi Miguel,

this code as written is correct. the statement

phi = phi * sigma_phi;


does the non-centered parameterization of phi with standard deviation sigma_phi.
see this example here:
http://mc-stan.org/workshops/nyc/lee-5.pdf

cheers,
Mitzi

#39

Hi Mitzi,

I don’t think I understand why this is correct (which is probably me not understanding the logic under priors on transformed parameters).

Because it looks to me that
target += -0.5 * dot_self(phi[node1] - phi[node2]);
should expand to
target += -0.5 * dot_self( {{phi * sigma_phi}}[node1] - {{phi * sigma_phi}}[node2]);
which would be the wrong model.

What’s wrong in my thinking??

Thanks!

D

#40

Dan: I think the issue is that you want to put the dot-self on the standardized phi:

target += -0.5 * dot_self(phi_std[node1] - phi_std[node2]);


That puts the spatial effects prior on a unit scale (and hence unit precision). Then you just multiply like Mitzi said:

phi = sigma_phi * phi_std;


Fair warning—any time there are negations or inversions, I start to get turned around.