BYM2 with unstructured country random effect

@RobertoCerina

More generally - is it correct to think of the theta for the islands are state-fixed effects ?

Superficially they look and act that way, but what they’re doing is different. The prior distribution for theta is a model, and one way to describe the model is to say it conducts partial pooling of the data (or learning across observations); that means (not to overexplain but you’re not the only reader), it first infers the total amount of dispersion around the mean \tau based on all the data and then uses that result to figure out what an outlying value is. That prior for theta places low probability on outliers.

g(\mu_i) = \alpha + \theta_i \\ \theta_i \sim Gauss(0, \tau) \\ \tau \sim Guass^{+}(0, ...)

So if data from Hawaii looks like an outlier, the prior model will tend to pull the posterior towards the mean \alpha. If there is a lot of data for Hawaii, then the likelihood will overwhelm the prior on theta, hence the posterior for that theta will not be pulled towards the mean much.

Thanks Connor - based on this explenation of theta, which acts essentially like a traditional random effect, I would then have expected that

    if (K_size[k] == 1) {
                 gamma[K_members[k,1]] = theta_sp[K_members[k,1]];
    } ...

be actually specified as

    if (K_size[k] == 1) {
                 gamma[K_members[k,1]] = theta_sp[K_members[k,1]] * sigma_sp;
    } ...

Because otherwise the Hawaii and Alaska effects do not share the scale parameter and hence face no pooling - what is your sense ? if this is not the case, what in the model is allowing infromation pooling across the islands and the mainland ?

Many thanks once again.

1 Like

Yep, that’s right

1 Like

Thanks, I just updated that file

1 Like

Still no luck with this btw - even after following your recommendations above. The spatial effects are just not willing to converge. See below for the attached Rhat. I copy below the latest code as well for reference, in case you spot something strange.

Worth noting that convergence improves if you really jack-up the rho_sp - e.g. with a beta(10,10) prior - but the data is doing far less work then and the posterior mean is ~ 0.45, which is close enough to 0.5 that makes me think the prior is just overwhelming the data.

functions {
  real standard_icar_disconnected_lpdf(vector phi,
				       int[ , ] adjacency,
				       int[ ] comp_size,
				       int[ , ] comp_members) {
				       
    if (size(adjacency) != 2) {
      reject("require 2 rows for adjacency array;"," found rows = ", size(adjacency));
    }
    if (size(comp_size) != size(comp_members)) {
      reject("require ", size(comp_size), " rows for members matrix;", " found rows = ", size(comp_members));
    }
    if (size(phi) != dims(comp_members)[2]) {
      reject("require ", size(phi), " columns for members matrix;", " found columns = ", dims(comp_members)[2]);
    }

    real total = 0;
    for (n in 1:size(comp_size)) {
      if (comp_size[n] > 1)
	      total += -0.5 * dot_self(phi[adjacency[1, comp_members[n, 1:comp_size[n]]]] - phi[adjacency[2, comp_members[n, 1:comp_size[n]]]]) + normal_lpdf(sum(phi[comp_members[n, 1:comp_size[n]]]) | 0, 0.001 * comp_size[n]);
    }
    return total;
    
  }
}
data {
  int<lower = 0> n; // number of observations
  int<lower = 0, upper = 1> Y[n]; // outcome

  // spatial structure
  int<lower = 0> I;  // number of nodes
  int<lower = 0> J;  // number of edges
  int<lower = 1, upper = I> edges[2, J];  // node[1, j] adjacent to node[2, j]

  int<lower=0, upper=I> K;  // number of components in spatial graph
  int<lower=0, upper=I> K_size[K];   // component sizes
  int<lower=0, upper=I> K_members[K, I];  // rows contain per-component areal region indices
  int<lower=0, upper = K> group_id[n]; // small-area id

  vector<lower=0>[K] tau_sp; // per-component scaling factor, 0 for singletons
  
  // random effects idx
  int<lower = 1,upper = I> state_id[n]; // small-area id

}
parameters {
           // intercept 
           real alpha;  
           // spatial effects
           real<lower = 0> sigma_sp;  // scale of spatial effects
  real<lower = 0, upper = 1> rho_sp;  // proportion of spatial effect thats spatially smoothed
                 vector[I] theta_sp;  // standardized heterogeneous spatial effects
                 vector[I] phi_sp;  // standardized spatially smoothed spatial effects
}
transformed parameters {
  vector[I] gamma;
  vector[n] mu;

  // each component has its own spatial smoothing
  for (k in 1:K) {
    if (K_size[k] == 1) {
                 gamma[K_members[k,1]] = theta_sp[K_members[k,1]]*sigma_sp;
    } else {
      gamma[K_members[k, 1:K_size[k]]] = ( sqrt(1 - rho_sp) * theta_sp[K_members[k, 1:K_size[k]]] + sqrt(rho_sp/tau_sp[k]) * phi_sp[K_members[k, 1:K_size[k]]] ) * sigma_sp;
      
    }
  }

  mu = alpha + gamma[state_id];

}
model {
  // global intercept prior
  alpha ~ normal(0,1);
  
  // spatial hyperpriors and priors
  sigma_sp ~ normal(0, 1);
  rho_sp ~ beta(5, 5);
  theta_sp ~ normal(0, 1);
  phi_sp ~ standard_icar_disconnected(edges, K_size, K_members);

  // likelihood
  Y ~ bernoulli_logit(mu);
}

convergence_total.pdf (8.6 KB)

@RobertoCerina I ran this model (and a more complicated version of this model) successfully with Mitzi’s BYM2 model without the islands (essentially dropping Alaska and Hawaii from my dataset). That model runs very smoothly and rather quickly converges …

Did you use the “original” ICAR code? Or did you use this code but left off the islands from the data?

I haven’t yet had a chance to work with that implementation of the model so I’m less able to help with it, though I’m not noticing anything that looks wrong . Have you tried the BYM2.stan file in the same git repo?

Ok - I ran the above model for the no islands case and it does not converge; I ran this code below, inspired largely by the `original’ ICAR prior from Mitzi’s code and it worked jus fine:

data {
    int<lower = 1> n; // total number of observations
    int<lower = 0> Y[n]; // vector of labels

    int<lower = 1> I; // number of small-areas
    int<lower = 1> state_id[n]; // small-area id
                          
    int<lower=0> n_edges; // Data for improper, efficient spatial prior:
    int<lower=1, upper=I> node1[n_edges]; // node1[i] adjacent to node2[i]
    int<lower=1, upper=I> node2[n_edges]; // and node1[i] < node2[i]
        
    int<lower=0> k; // 
    real tau_sp[k]; // scaling factor derived from the adjacency matrix
}

parameters {
    real alpha; // baseline
    real<lower = 0,upper = 1> rho_sp; // mixing weight
    real<lower = 0> sigma_sp; //
    vector[I] theta_sp; // small-area random effect - main
    vector[I] phi_sp; // small-area random effect - spatial
}

transformed parameters{
   vector[I] gamma; // convoluted small-area effect
   vector[n] mu; // logit-scale propensity to vote R

  // variance of each component should be approximately equal to 1
   gamma =  sqrt(1-rho_sp) * theta_sp + sqrt(rho_sp / tau_sp[1]) * phi_sp ; 
  
  // linear function of the logit-scale propensity to be a recruit
  mu = alpha + gamma[state_id]*sigma_sp;  
}

model {

  // // // Fixed Effects
  alpha ~ cauchy(0,10);

  // // // Random Effect on Small-Area
  theta_sp ~ normal(0,1);
  // // // ICAR priors
  target += -0.5 * dot_self(phi_sp[node1] - phi_sp[node2]);
  // soft sum-to-zero constraint on phi_sp,
  // equivalent to mean(phi_sp) ~ normal(0,0.01)
  sum(phi_sp) ~ normal(0, 0.01 * I);
  
  // // //  Convolution Parameters
   rho_sp ~ beta(0.5,0.5);
   sigma_sp ~ normal(0,1);
       
  // // // Likelihood
  Y ~ bernoulli_logit(mu);
}

I will try now the other BYM2-a code on the repository and get back. If you have a chance, see if you spot anything dodgy in the model above.

I have it working no problem with the BYM2-a model from @cmcd - there must be something not working in the BYM2-b model, perhaps is application specific, but it’s just a fairly basic logit-BYM2 model for the 51 US areas… for the life of me I do not know what is wrong.

Another area where I might have something wrong is the data input - see if these make sense, they seem sensible to me but I could be wrong:

int<lower = 0> n; // number of observations
data.list.total$n
[1] 4827

int<lower = 0, upper = 1> Y[n]; // outcome
data.list.total$Y
[1] 1 0 0 0 1 0 0 0 0 1 0 1 0 0 …

// spatial structure
int<lower = 0> I; // number of nodes
data.list.total$I
[1] 51

int<lower = 0> J; // number of edges
data.list.total$J
[1] 109

int<lower = 1, upper = I> edges[2, J]; // node[1, j] adjacent to node[2, j]
data.list.total$edges
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] …
[1,] 1 1 1 1 3 3 3 3 3 4 …
[2,] 10 11 25 43 5 6 29 32 45 …

int<lower=0, upper=I> K; // number of components in spatial graph
data.list.total$K
[1] 3

int<lower=0, upper=I> K_size[K]; // component sizes
data.list.total$K_size
[1] 49 1 1

int<lower=0, upper=I> K_members[K, I]; // rows contain per-component areal region indices
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] …
[1,] 1 3 4 5 6 7 8 9 10 11 …
[2,] 2 0 0 0 0 0 0 0 0 0 …
[3,] 12 0 0 0 0 0 0 0 0 0 …

int<lower=0, upper = K> group_id[n]; // connected-chunk id
group_id.temp.empty
[1] 1 1 1 1 1 1 1 1 1 1 …

vector<lower=0>[K] tau_sp; // per-component scaling factor, 0 for singletons
data.list.total$tau_sp
[1] 0.5250207 0.0000000 0.0000000

int<lower = 1,upper = I> state_id[n]; // small-area id
data.list.total$state_id
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 …

I’m going to use BYM2-a for now - I find the indexing in BYM2-b more intuitive somehow, if you get to the bottom of it at some point please do drop a line here. Code below.

functions{
/**
 * Log probability of the intrinsic conditional autoregressive (ICAR) prior,
 * excluding additive constants. 
 *
 * @param phi Vector of parameters for spatial smoothing (on unit scale, approximately)
 * @param node1 
 * @param node2
 * @param k number of groups
 * @param group_size number of observational units in each group
 * @param group_idx index of observations in order of their group membership
 * @param has_theta If the model contains an independent partial pooling term, phi for singletons can be zeroed out; otherwise, they require a standard normal prior. Both BYM and BYM2 have theta.
 *
 * @return Log probability density of ICAR prior up to additive constant
 **/
real icar_normal_lpdf(vector phi, 
                      int[] node1, 
                      int[] node2, 
                      int k, 
                      int[] group_size, 
                      int[] group_idx,
                      int has_theta) {
  real lp;
  int pos=1;
  lp = -0.5 * dot_self(phi[node1] - phi[node2]);
  if (has_theta) {
    for (j in 1:k) {
      /* sum to zero constraint, for each connected group; singletons zero out */
      lp += normal_lpdf(sum(phi[segment(group_idx, pos, group_size[j])]) | 0, 0.001 * group_size[j]);
      pos += group_size[j];
    }
  } else {
    /* has no theta */
    for (j in 1:k) {
      if (group_size[j] > 1) {
    /* same as above for non-singletons: sum to zero constraint */
    lp += normal_lpdf(sum(phi[segment(group_idx, pos, group_size[j])]) | 0, 0.001 * group_size[j]);
      } else {
    /* its a singleton: independent */
    lp += std_normal_lpdf(phi[segment(group_idx, pos, group_size[j])]);
      }      
      pos += group_size[j];
    }
  }
  return lp;
                      }


/**
 * Combine local and global partial-pooling components into the convolved BYM2 term.
 *
 * @param phi_tilde local (spatially autocorrelated) component
 * @param theta_tilde global component
 * @param spatial_scale scale parameter for the convolution term
 * @param n number of spatial units
 * @param k number of connected groups
 * @param group_size number of observational units in each group
 * @param group_idx index of observations in order of their group membership
 * @param logit_rho proportion of convolution that is spatially autocorrelated, logit transformed
 * @param scale_factor The scaling factor for the BYM2 model. See scale_c R function, using R-INLA.
 *
 * @return BYM2 convolution vector
 */
vector convolve_bym2(vector phi_tilde, 
                     vector theta_tilde,
                     real spatial_scale,
                     int n, 
                     int k,
                     int[] group_size, 
                     int[] group_idx,
                     real rho, 
                     vector scale_factor
              ) {
  vector[n] convolution;
  int pos=1;
  for (j in 1:k) {
    if (group_size[j] == 1) {
        convolution[ segment(group_idx, pos, group_size[j]) ] = spatial_scale * theta_tilde[ segment(group_idx, pos, group_size[j]) ];
    } else {
    convolution[ segment(group_idx, pos, group_size[j]) ] = spatial_scale * (
     sqrt(rho / scale_factor[j]) * phi_tilde[ segment(group_idx, pos, group_size[j]) ] +
     sqrt(1 - rho) * theta_tilde[ segment(group_idx, pos, group_size[j]) ]
      );
  }
  pos += group_size[j];
  }
  return convolution;
 }

}

data {
  int<lower = 1> n; // total number of observations
  int<lower = 0> Y[n]; // vector of labels

  int<lower=1> I; // no. of spatial units
  int<lower=1> k; // no. of groups
  int group_size[k]; // observational units per group
  int group_idx[I]; // index of observations, ordered by group
  int state_id[n]; // index of states in the data
  int<lower=1> n_edges; 
  int<lower=1, upper=n> node1[n_edges];
  int<lower=1, upper=n> node2[n_edges];
  int<lower=1, upper=k> comp_id[I]; 
  vector[k] scale_factor; // BYM2 scale factor, with singletons represented by 1
}

transformed data {
  int<lower=0,upper=1> has_theta=1;
}

parameters {
  real alpha;
  vector[I] phi_tilde;
  vector[I] theta_tilde;  
  real<lower=0> spatial_scale;
  real<lower=0,upper=1> rho;
}  
  
transformed parameters{
  vector[I] convolution;
  vector[n] mu;

  convolution = convolve_bym2(phi_tilde, theta_tilde, spatial_scale, I, k, group_size, group_idx, rho, scale_factor);

  // linear function of the logit-scale propensity to be a recruit
  mu = alpha + convolution[state_id];  
}

model {

   phi_tilde ~ icar_normal(node1, node2, k, group_size, group_idx, has_theta);
   theta_tilde ~ std_normal();
   spatial_scale ~ std_normal();
   rho ~ beta(1,1);
   alpha ~ std_normal();
       
  // // // Likelihood
  Y ~ bernoulli_logit(mu);
}
2 Likes

Just to confirm, there’s a big ol’ mistake in the way that the BYM2 with islands model is coded.
I’m very sorry to have put untested code into the wild.
I see the error of my ways and am working through the paper by Sterrantino et al - [1705.04854] A note on intrinsic Conditional Autoregressive models for disconnected graphs, and testing on the Scotland dataset (R library SpatialEpi) - something I’ve been meaning to do since summer 2018. hope to have notebooks for corrected version soon.

1 Like

the mistake is an indexing mistake - the ICAR is computed from the edgelist, but the indexes are in terms of the nodelist. the missing step is a data prep step which computes the per-component index masks into the edgelist.

in addition, closer reading of Sterrantino et al indicates that the distribution in the spatial component for a singleton needs to be a normal(0, 1/sqrt(K)) where K is the number of components - will run this by Dan Simpson to make sure I get it right next time.

2 Likes

Thanks for checking up on this Mitzi - I look forward to seeing the new version of the code once it’s out !

1 Like

I put together a notebook, scripts, and programs - needs further review and testing - BYM2 island implementation flawed; fix and writeup · Issue #3 · ConnorDonegan/Stan-IAR · GitHub

1 Like

OK spatial statisticians, need your insights here.

Fixed all indexing problems with the BYM2_islands.stan program, and have put together an Python notebook (running CmdStanPy) to fit the models and data, along with a bunch of R helper functions and scripts to do all the indexing for the spatial data, plus use INLA’s secret sauce for computing the scaling factors.

I am unable to get the same estimates as INLA on the Scotland data, per the Freni-Sterrantino et al paper - [1705.04854] A note on intrinsic Conditional Autoregressive models for disconnected graphs. - the model does fit the data quickly, without any diagnostic errors; the estimates are close, but as they say in Scotland, a wee bit off. This is a dataset with one connected component and 3 islands. I remain mystified the recommendations in section 4 of Freni-Sterrantino - I tried writing it up in my own words as part of the writeup, and that let to a bunch of experiments but no real insights.

The scaling factor for the singleton comes in as data and is computed in file bym2_helpers.R.
The ICAR prior is computed in the Stan program, according to Dan Simpson, it is doing the right thing.

This should work no problem for areal data where you’ve got two different components - it’s singletons that are problematic.

1 Like

Mitzi - I can try to look at this in some depth sometime next week, without possibly being able to add anything to the amazing work you and Connor have been doing.

If you’re worried about STAN not matching INLA perfectly, my first gut instinct response is that that’s not totally uncommon - I’m sure you’ve looked at INLA’s FAQ, but see below in case you (or other people reading) haven’t. Might be worth posting on INLA’s forum to ensure the models match perfectly ? Sorry if this is not much help for now.

from: R-INLA Project - FAQ

2. I have compared the INLA-results with those obtained using my own MCMC-code, but they do not match so well. What is wrong?

Although it might happen you have run into one of those cases where INLA is know to fail, it is far more likely it is "the devil is in the details’-effect. It is surprisingly difficult to get it all correct so that the model, in INLA and in your MCMC code is exactly the same. If there is one tiny difference it will show up in the results, somewhere. This ‘difference’ can be using not the same parameters of the prior, using a conditional not a marginal ‘initial condition’ for the AR(1) model, etc. Especially, this is awkward for intrinsic models where there are sum-to-zero constraints.

If you run into these problems, please contact us to get the implied model in INLA specified in detail. In this way, you are comparing the results obtained from the same model. For smaller models, INLA provide its own MCMC-sampler

which you can use; see demo(“Tokyo-compare”).

3 Likes

Hi STAN developer,

Just want to know that whether the model mentioned in the article ( Rodrigues, E.C. and Assunção, R., 2012. Bayesian spatial models with a mixture neighborhood structure. Journal of Multivariate Analysis , 109 , pp.88-102.) can be fitted using STAN? It is an extension of the Leroux CAR model.

Thanks in advance

yes, you can fit CAR models in Stan. here’s a great case study: Exact sparse CAR models in Stan

Hi Morris,
Thank you for your reply. To be more specific, I would like to know, is it possible to fit Leroux CAR models by STAN. The model reads

y= X\beta + \theta
\theta \sim MVN(0, Q^-1)
Q=\tau ((1-\lambda)I+\lambda R)
with \lambda \sim Unif(0,1) and I is an identity matrix. R is a variant of spatial adjacent matrices,
\mathbf{R}_{i j}=\left\{\begin{array}{cc}n_{i} & \text { if } i=j \\ -1 & \text { if } j \in \partial_{i} \\ 0 & \text { otherwise }\end{array}\right.

I search for the old topics in the forum, it seems that STAN is rarely used to fit the Leroux model, maybe this model requires more computing time?

Furthermore, it is possible to extend the above mentioned model to a more general formulation? such as
y= X\beta + \theta
\theta \sim MVN(0, Q^-1)
Q=\tau (\lambda_1 I+\lambda_2 R^{(1)} + \lambda_3 R^{(2)} + \cdots + \lambda_{k+1} R^{(k)} )
with \lambda_1+ \lambda_2+\cdots+ \lambda_{k+1}=1 and \lambda_l>0

Many thanks in advance,
Charles

this paper puts the Leroux model in context of all the variants of the CAR and ICAR models:
https://arxiv.org/pdf/1601.01180.pdf
see section 3.1.3

The paper you mention says this:

In Bayesian disease mapping, one needs to specify a neighborhood structure to make inference about the underlying geographical relative risks. We propose a model in which the neighborhood structure is part of the parameter space. We retain the Markov property of the typical Bayesian spatial models: given the neighborhood graph, disease rates follow a conditional autoregressive model.

An essential aspect of the BYM model and its extensions is the specification of the neighborhood structure for the areas. Although this is quite flexible and can be arbitrarily defined, in practice it is typically based only on adjacency relationships. There are few justifications for this practice other than its easy calculation by means of GIS (Geographic information system) routines. A related problem with the BYM model is that the neighborhood structure determines the smoothing degree used in relative risk estimation. Some authors noticed its tendency to oversmooth the risks when the usual adjacency neighborhood structure is used. Therefore, it would be very useful to have a model that allows for multiple neighborhood structure and automatically adapts itself according to the observed data.

The problem is that estimating these covariance matrices as a parameter requires a lot of matrix determinant calculations which are extremely expensive. The CAR case study goes into details.

1 Like

For the sake of avoiding confusion on a popular topic (and false rumors about what Stan can’t do), I’ll just clarify that it is certainly possible to use the Leroux model in Stan. It can be specified in straightforward terms using the built-in multivariate normal (multi_normal_prec) function. The problem enters when you wish to speed things up with a custom pdf (unlike with proper CAR models, which can be fit really fast using custom Stan functions).

2 Likes

Take a look at this preprint by Connor Donegan. https://osf.io/3ey65/ He has proposed a way of formulating the pCAR model in Stan that is very efficient. I believe the pCAR can be altered to resemble the Leroux prior.