There’s nothing stopping us from coding that. It’s just that I don’t know how to automatically detect the number of connected components / what’s in each connected component. But I’m very bad with graphs!
I usually use the spdep package. Or is it something more complicated?
Which function do you use? (Sorry - I don’t usually do these things manually)
@anon75146577, Here is a little example.
I use the package maptools to make the ‘Sudden Infant Deaths’ shapefile available, then the package rgdal to read the shapefile with readOGR:
library(maptools)
library(rgdal)
sids <- readOGR(system.file("shapes/", package="maptools"), "sids")
Then I use the package spdep to build the adjacency structure:
library(spdep)
temp <- poly2nb(sids)
If you want a higher-order lag:
temp2 <- nblag(temp, 2)
To build the adjacency matrix, just use the nb2mat function:
adj_matrix <- nb2mat(temp, style = "B")
style = "B" is the basic binary coding, "W" is row standardised, "C" is globally standardised, while "S" is the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. (1999). With this matrix you can check who is connected with whom, and who is disconnected.
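For example, a quick check for disconnected regions (a small sketch reusing adj_matrix from above): with the binary coding a region’s row sum is its number of neighbours, so a zero flags an island.
n_neighbours <- rowSums(adj_matrix)  # neighbours per region under style = "B"
which(n_neighbours == 0)             # indices of disconnected regions, if any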
Anyway, for a more extensive description you can check https://cran.r-project.org/web/packages/spdep/vignettes/nb.pdf
or the manual of the spdep package (it’s huge)
I hope this will help!
Easy peasy. I missed the matrix classes, but took a lot of algorithms classes.
I could code that as a Stan function, but it may not be worthwhile to do it in Stan directly unless you somehow need the components themselves.
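For what it’s worth, the component-finding itself is short in plain R; here is a minimal sketch of the standard breadth-first search over a binary adjacency matrix (function and variable names are mine):
# Label the connected components of a symmetric binary adjacency matrix.
# Returns one integer component id per node.
connected_components <- function(adj) {
  n <- nrow(adj)
  comp <- rep(0L, n)   # 0 = not yet assigned to a component
  current <- 0L
  for (s in seq_len(n)) {
    if (comp[s] > 0L) next
    current <- current + 1L
    queue <- s
    comp[s] <- current
    while (length(queue) > 0L) {
      v <- queue[1]
      queue <- queue[-1]
      nbrs <- which(adj[v, ] > 0 & comp == 0L)  # unvisited neighbours of v
      comp[nbrs] <- current
      queue <- c(queue, nbrs)
    }
  }
  comp
}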
R’s spdep package does this for you with the function n.comp.nb. Given an nb object nb_nyc_mm, built from a map of New York City census tracts:
> components <- n.comp.nb(nb_nyc_mm)
> table(components$comp.id)
1 2 3 4 5 6 7
1983 2 1 1 22 1 108
There are 7 components, 3 of which are islands (components 3, 4, and 6 each contain a single tract).
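If you need the members of each component, say to pull out the islands, something like this works on the objects above:
sizes <- table(components$comp.id)                    # component sizes
island_comps <- as.integer(names(sizes)[sizes == 1])  # singleton components
which(components$comp.id %in% island_comps)           # row indices of the island tracts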
Cool! In that case there’s no reason not to code up Anna and Håvard’s stuff. @imadmali: you should consider this for rstanarm.
are islands a problem here?
Kind of. The nullspace of the model with one connected component is just a constant.
When there are multiple connected components, you can add a constant to any of them independently. So there should be 7 constraints on the ICAR over the nb_nyc_mm graph.
The paper linked above gives good advice about what to do in that situation. (In particular, you should do something specific for the islands: basically take them out of the ICAR and give them a random effect)
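In spdep you can act on this directly; a rough sketch (reusing components and nb_nyc_mm from above) that drops the islands before building the ICAR graph, so they can get an ordinary random effect instead:
sizes <- table(components$comp.id)
on_mainland <- components$comp.id %in% as.integer(names(sizes)[sizes > 1])
nb_mainland <- subset(nb_nyc_mm, on_mainland)  # subset.nb renumbers the neighbours consistently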
This case study is extremely helpful. I have one question/clarification (sorry to resurrect such an old thread). @mitzimorris notes in her description of the bym2 model results that:
n_eff numbers, while low for rho and sigma, are sufficient
I’m wondering how to gauge “sufficient” in this context. I ask because I’ve been running a similar model (albeit binomial rather than Poisson) and am getting pretty low (<100) n_eff values for rho and a number of the phi’s (despite running 8000 warmup iterations and 10000 iterations total). The Rhat values are all ok (~1); is that an indication that the n_eff values are sufficient?
are you running a single chain with 8000 warmup iterations or are you running multiple chains? how many warmup iterations per chain?
if you’re using R, the bayesplot package might be useful. Also see Michael Betancourt’s RStan workflow case study; there’s a utility script with a function to check n_eff.
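for example, something like this (the fit object name here is a placeholder):
library(bayesplot)
ratios <- neff_ratio(fit)  # n_eff / total draws for each parameter
mcmc_neff(ratios)          # plot, flagging ratios below 0.5 and 0.1
ratios[ratios < 0.1]       # parameters with worryingly low ratios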
I’m running 4 chains with warmup = 8000 and iter = 10000, which I presume means each chain runs that many warmup and total iterations. I’ll give that utility script a look. Thanks for any additional info.
My model has a varying intercept in addition to the CAR term. I’ve tried running it with the varying intercept included in the convolved_re term, and with it kept in the actual model with an additional random effect (theta) included in the convolved_re term. Perhaps that is causing my problem?
Thanks for your help!
Just ran the utility script’s check_n_eff and it returned:
[1] “n_eff / iter looks reasonable for all parameters”
so I guess I’m good. I’ll need to read a bit more to understand why I’m good, but at least it seems safe to proceed. Thanks for your help!
The check_n_eff function simply checks to see if you can trust the effective sample size estimator, not whether the effective sample size is sufficient in your particular application. In practice we have to estimate effective sample size from the empirical autocorrelation in the samples, and if the autocorrelations are too large the estimator can be biased high and strongly overestimate the true effective sample size.
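Schematically, the estimator deflates the total number of draws by the summed autocorrelations. A toy single-chain version in R (Stan’s actual estimator truncates more carefully and pools multiple chains):
# Toy effective sample size: N / (1 + 2 * sum of autocorrelations
# up to the first non-positive lag).
ess_naive <- function(x, max_lag = 100) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  cut <- which(rho <= 0)[1]                               # crude truncation point
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  length(x) / (1 + 2 * sum(rho))
}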
Whether the effective sample size in your chains is sufficient depends on the details of your application. In ideal circumstances (where you don’t see any divergences or E-FMI warnings; see the other functions in stan_utility.R) the effective sample size controls how accurate your Markov chain Monte Carlo estimators are: more effective samples, more precise estimators. How much precision you need depends on the questions you want to answer.
The case study at https://betanalpha.github.io/assets/case_studies/rstan_workflow.html more carefully goes through these definitions. See also https://arxiv.org/abs/1701.02434, Section 2 for a more elaborate discussion.
Thanks @betanalpha. The other utility checks don’t report any additional pathologies. For my application rho and phi are not actually the parameters of interest (I’m interested in the beta values after conditioning on the potential for spatial dependence in the data). Which leads me back to trying to understand what “sufficient” means in this context:
Am I to assume that estimates for rho and phi are sufficient if the model appears to be mixing and free of other pathologies, given that my interest is in the beta estimates conditional on the existence of spatial structure? The means and sds for the beta parameters seem reasonable given previous iterations of the model without spatial structure. Or am I missing some key element that determines “sufficiency”?
Thanks to both you and @mitzimorris for your help on this.
No other pathologies is great, so we can focus on effective sample size.
The worry in a model like this is that the latent parameters \rho and \sigma freeze and you end up sampling from the conditional distribution \pi(\theta \mid y, \rho, \sigma) instead of the joint distribution \pi(\theta, \rho, \sigma \mid y). But given all of the diagnostic information that doesn’t seem to be the case.
If you care about the \beta then what you have to figure out is how accurately you need to know the mean of each \beta. Remember that MCMC only estimates that mean, and the error in the estimate depends on the effective sample size. In RStan you see this as the mcse, the second column when you call print(fit). If the mcse is small enough for your inferences then great, you’re done. If it’s not precise enough you’ll have to run longer or try to implement your model in a way that samples the latent parameters more efficiently (not that a superior implementation necessarily exists).
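Concretely, the mcse is roughly sd / sqrt(n_eff), so you can read off how much extra sampling would buy you; for example (the fit object name and the beta parameter names follow the discussion above):
s <- summary(fit)$summary  # se_mean is the MCSE, approximately sd / sqrt(n_eff)
s[grep("^beta", rownames(s)), c("mean", "se_mean", "sd", "n_eff")]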
In particular, the threshold for “precise enough” depends on your application. We can’t tell you what is sufficient, you have to figure that out within the context of your domain expertise. That said, as a rule of thumb O(100) effective samples is typically sufficient for most applications.
That should run 8000 warmup iterations and 2000 sampling iterations per chain.
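Concretely (the model file and data names here are placeholders):
library(rstan)
# warmup and iter are counted per chain: each of the 4 chains runs
# 8000 warmup iterations and keeps iter - warmup = 2000 post-warmup
# draws, for 8000 post-warmup draws in total.
fit <- stan("bym2.stan", data = stan_data,
            chains = 4, warmup = 8000, iter = 10000)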