Case study on spatial models for areal data - Poisson CAR/IAR

There’s nothing stopping us from coding that. It’s just that I don’t know how to automatically detect the number of connected components / what’s in each connected component. But I’m very bad with graphs!

I usually use the spdep package. Or is it something more complicated?

Which function do you use? (Sorry - I don’t usually do these things manually)

@anon75146577, Here is a little example.
I use the maptools package to make the ‘Sudden Infant Deaths’ (sids) shapefile available, then the rgdal package to read the shapefile from disk with readOGR:

library(maptools)  # ships the sids example shapefile
library(rgdal)     # provides readOGR
sids <- readOGR(system.file("shapes/", package = "maptools"), "sids")

Then I use the spdep package to build the adjacency (neighbours list):

library(spdep)
temp <- poly2nb(sids)  # neighbours list from shared polygon boundaries

If you want a higher-order lag:

temp2 <- nblag(temp, 2)

To build the adjacency matrix, just use the nb2mat function:

matrix <- nb2mat(temp, style="B")

style=“B” is the basic binary coding, “W” is row standardised, “C” is globally standardised, while “S” is the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. (1999). With this matrix you can check who is connected with whom, and who is disconnected.
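
For example, islands show up as all-zero rows in the binary matrix (a minimal sketch, continuing from the code above):

# a region with no neighbours has an all-zero row in the binary matrix
# (if some regions have no neighbours, nb2mat needs zero.policy = TRUE)
islands <- which(rowSums(matrix) == 0)
islands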

Anyway, for a more extensive description you can check https://cran.r-project.org/web/packages/spdep/vignettes/nb.pdf
or the manual of the spdep package (it’s huge)
I hope this will help!

Easy peasy. I missed the matrix classes, but took a lot of algorithms classes.

I could code that as a Stan function, but it may not be worthwhile to do it in Stan directly unless you somehow need the components themselves.

R’s spdep package does this for you, function n.comp.nb

given an nb object nb_nyc_mm (here, for a map of New York City census tracts):

> components <- n.comp.nb(nb_nyc_mm)
> table(components$comp.id)

   1    2    3    4    5    6    7 
1983    2    1    1   22    1  108 

there are 7 components, 3 of which are islands (the singleton components).
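
A minimal sketch of how to pull the islands out of that result (assuming the components object from above; the size-1 components are the islands):

# component sizes; size-1 components are islands
sizes <- table(components$comp.id)
island_comps <- as.integer(names(sizes)[sizes == 1])
# indices of the island tracts
which(components$comp.id %in% island_comps)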

Cool! In that case there’s no reason not to code up Anna and Håvard’s stuff. @imadmali: you should consider this for rstanarm.

are islands a problem here?

Kind of. The null space of the ICAR model on a single connected component is just the constant vector.

When there are multiple connected components, you can add a constant to any of them independently. So there should be 7 constraints on the ICAR over the nb_nyc_mm graph.
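
One way to picture those 7 constraints (a minimal sketch, not code from the case study; phi and comp_id are placeholders): impose one sum-to-zero constraint per connected component, e.g. by centering phi within each component.

# one sum-to-zero constraint per connected component:
# subtract the component-wise mean from phi (comp_id from n.comp.nb)
center_by_component <- function(phi, comp_id) {
  ave(phi, comp_id, FUN = function(x) x - mean(x))
}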

The paper linked above gives good advice about what to do in that situation. (In particular, you should do something specific for the islands: basically take them out of the ICAR and give them a random effect)


This case study is extremely helpful. I have one question/clarification (sorry to resurrect such an old thread). @mitzimorris notes in her description of the bym2 model results that:

n_eff numbers, while low for rho and sigma, are sufficient

I’m wondering how to gauge “sufficient” in this context. I ask because I’ve been running a similar model (albeit binomial rather than Poisson) and am getting pretty low (<100) n_eff values for rho and a number of the phi parameters, despite running 8000 warmup iterations and 10000 iterations total. The Rhat values are all fine (~1); is that an indication that the n_eff values are sufficient?

are you running a single chain with 8000 warmup iterations or are you running multiple chains? how many warmup iterations per chain?

if you’re using R, the bayesplot package might be useful. Also see Michael Betancourt’s RStan workflow case study - there’s a utility script with a function to check n_eff.
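
For instance, a minimal sketch of that kind of check with bayesplot (fit is a placeholder for your stanfit object):

library(bayesplot)
# ratio of effective sample size to total draws, per parameter
ratios <- neff_ratio(fit)
mcmc_neff(ratios)  # highlights parameters with low n_eff / N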

I’m running 4 chains with warmup = 8000 and iter = 10000, which I presume means each chain runs that many warmup and total iterations. I’ll give that utility script a look. Thanks for any additional info.

My model has a varying intercept in addition to the CAR term. I’ve tried running it with the varying intercept included in the convolved_re term and with it maintained in the actual model with an additional random effect (theta) included in the convolved_re term. Perhaps that is causing my problem?

Thanks for your help!

Just ran the check_n_eff function from the utility script, and it returned:

[1] "n_eff / iter looks reasonable for all parameters"

so I guess I’m good. I’ll need to read a bit more to understand why I’m good, but at least it seems safe to proceed. Thanks for your help!

The check_n_eff function simply checks to see if you can trust the effective sample size estimator, not whether the effective sample size is sufficient in your particular application. In practice we have to estimate effective sample size from the empirical autocorrelation in the samples, and if the autocorrelations are too large the estimator can be biased high and strongly overestimate the true effective sample size.

Whether the effective sample size in your chains is sufficient depends on the details of your application. In ideal circumstances (where you don’t see any divergences or E-FMI warnings, see the other functions in stan_utility.R) the effective sample size controls how accurate your Markov chain Monte Carlo estimators are – more effective samples, more precise estimators. How much precision you need depends on the questions you want to answer.

The case study at https://betanalpha.github.io/assets/case_studies/rstan_workflow.html goes through these definitions more carefully. See also Section 2 of https://arxiv.org/abs/1701.02434 for a more elaborate discussion.

Thanks @betanalpha. The other utility checks don’t report any additional pathologies. For my application rho and phi are not actually the parameters of interest (I’m interested in the beta values after conditioning on the potential for spatial dependence in the data). Which leads me back to trying to understand what “sufficient” means in this context:

Am I to assume that estimates for rho and phi are sufficient if the model appears to be mixing and is free of other pathologies, given that my interest is in the beta estimates conditional on the existence of spatial structure? The means and SDs for the beta parameters seem reasonable given previous iterations of the model without spatial structure. Or am I missing some key element that determines “sufficiency”?

Thanks to both you and @mitzimorris for your help on this.

No other pathologies is great, so we can focus on effective sample size.

The worry in a model like this is that the latent parameters \rho and \sigma freeze and you end up sampling from the conditional distribution \pi(\theta \mid y, \rho, \sigma) instead of the joint distribution \pi(\theta, \rho, \sigma \mid y). But given all of the diagnostic information that doesn’t seem to be the case.

If you care about the \beta then what you have to figure out is how accurately you need to know the mean of each \beta. Remember that MCMC only estimates that mean, and the error in the estimate depends on the effective sample size. In RStan you see this as the mcse, the second column when you call print(fit). If the mcse is small enough for your inferences then great, you’re done. If it’s not precise enough you’ll have to run longer or try to implement your model in a way that samples the latent parameters more efficiently (not that a superior implementation necessarily exists).
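
Concretely, a minimal sketch of how to read those numbers off in RStan (fit and the parameter name beta are placeholders):

# se_mean is the Monte Carlo standard error of the posterior mean;
# it shrinks as the effective sample size grows (roughly sd / sqrt(n_eff))
s <- summary(fit, pars = "beta")$summary
s[, c("mean", "se_mean", "sd", "n_eff")]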

In particular, the threshold for “precise enough” depends on your application. We can’t tell you what is sufficient, you have to figure that out within the context of your domain expertise. That said, as a rule of thumb O(100) effective samples is typically sufficient for most applications.

That should run 8000 warmup iterations and only 2000 sampling iterations per chain - in rstan, iter counts warmup plus sampling iterations.
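
That is, a minimal sketch of the call in question (model and data_list are placeholders):

# iter - warmup = 10000 - 8000 = 2000 post-warmup draws per chain
fit <- rstan::sampling(model, data = data_list, chains = 4,
                       warmup = 8000, iter = 10000)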