BYM2 with unstructured country random effect

Hi everyone - I have a question about a BYM2-like model.

I have observations at the individual - district - country level.

I use the standard BYM2 mixed prior on the unstructured and spatial effects at the district level. I still think there is unstructured variance to be captured at the country-level, so I also throw in a country-level random effect:

\mu = X\beta + \theta + \gamma;
\theta = \sigma( \phi \sqrt{w/s}+ \psi \sqrt{1-w} );
\phi \sim \mbox{ICAR}(0,1) ;
\psi \sim N(0,1);
\gamma \sim N(0,\sigma_\gamma);
w \sim \mbox{Beta}(0.5,0.5)

Where \phi is the spatial component of the district-level effect; \psi is the unstructured component; \gamma is the country-level random effect; s is the scaling factor for the spatial effect.

The challenge I’m facing is that \gamma washes out - the entirety of the country-level variance is sucked into the spatial effect - see the image below which shows the spatial effect \phi over districts in the middle east - it clearly shows it is capturing the country-level dynamics, as the countries are absolutely distinct in the plot (see Tunisia, Morocco, Egypt et. ).

dist_spat_effect_outcome.pdf (532.9 KB)

I take this as a sign the country effect is being `sucked into’ the spatial effect as I would have expected the latter to be smoother over districts, and the harsh jumps along the country borders suggest to me there is an identifiability issue.

I’m taking this to mean I should structure the prior variance to identify the effects. Specifically, inspired by [1], I thought I could simply extend the BYM2 model as follows:

\mu = X\beta + \theta;
\theta = \sigma( \phi \sqrt{w_1/s}+ \psi \sqrt{w_2} + \gamma \sqrt{w_3} );
w \sim \mbox{Dirichlet}(1,1,1)
\psi \sim N(0,1);
\gamma \sim N(0,1);

My questions are:
a) is the above valid ?
b) I worry about a scaling factor - in principle I think I could re-express the dependency implied by the random-effect model at the country-level with a neighbourhood matrix, which would carry its own scaling factor in the equation above.

If this approach is not valid, can anyone suggest a better way ?

Many thanks in advance,
Roberto

[1] Rodrigues, E.C. and Assunção, R., 2012. Bayesian spatial models with a mixture neighborhood structure. Journal of Multivariate Analysis , 109 , pp.88-102.

1 Like

Maybe @mitzimorris has answers or thoughts on your questions.

The map of \phi doesn’t look like a problem to me. It reminds me of an analysis in Cressie and Wikle’s Statistics for Spatio-Temporal Data that shows a pretty strong break at an administrative boundary, and they note how they didn’t include those boundaries in the model. That makes the finding more compelling than if they had included the boundaries (like \gamma) because it speaks to the strength of the pattern in the data (the spatial discontinuities in the outcome overpowered your prior probability model at those borders). I don’t know your research question and only see a map of \phi but you may have what you need in the BYM2 model.

Maybe you’re already using this in your Stan code or you’re connecting Yemen to Egypt etc, if not then be sure to adjust for the disconnected graph structure in the IAR model

https://athowes.github.io/2020/11/10/fast-disconnected-car/

1 Like

I’m far from an expert, alas. here’s a paper: https://www.sciencedirect.com/science/article/pii/S0047259X12000589

We propose a model that expands the BYM and Leroux models beyond single-neighbor dependence of BYM and Leroux models to a larger class that has geographically increasing orders of neighborhood extension

They extend the Leroux model along the lines of what you’re suggesting. (The difference between the Leroux and BYM2 model is introduction of factor s calculated from the spatial structure of the data - cf. Riebler et al.)

1 Like

Thank you both !

in the end, it was not necessary to use the modified Leroux model.

It turns out the problem was the unconnected graph. Because the shapefile was a Frankenstein of multiple shape files, it had all sorts of inconsistencies, the biggest of which was none of the countries’ inner areas were neighbours to areas of different countries.

After I fully-connected the graph connected.pdf (617.0 KB) by including countries for which we have no observations, but are useful to connect areas where we do (I added Saudi Arabia and Israel, for instance), the model converges nicely Rhat.pdf (68.3 KB) (lp in blue) and the plots make more sense - the Governorate effects are much smoother dist_spat_effect_outcome.pdf (681.9 KB) , and the Country random effects are well identified govunst_effect_outcome.pdf (380.4 KB) . I worry a little bit about that lambda value lambda.pdf (4.9 KB) - is it usual to see that sort of exponential-like density within the range ? The scale factor for this graph is 1.2, which is a bit high but much more reasonable than what I was getting before.

I guess the question remains as to whether the `artificially’ connected graph creates some potential problems in inference compared to a model that deals with the islands explicitly. I don’t suspect that to be the case, given also the fact that there are very few islands.

Any and all thoughts are appreciated !

1 Like

I don’t think I follow exactly what you did here, looks like you added observations (with zero counts?) and basically imputed the values to connect it all? Generally I think its worth treating a connectivity matrix just as carefully as you would treat any other data or model specification. Preparing the model for islands doesn’t look like a lot of fun to me, but in my opinion at least it would be worth it for your data. The inferential problem is just that you’ll force one or more observations to (mutually) influence your estimate of another (in addition to their influence in \psi) without good reason or against your better judgement, which can ripple through the model rather than be just a localized problem. You can’t really figure out the impact until you try it.

Hi Connor - thanks for your reply.

Basically I just added countries to the shapefile so that the would be connectivity across the whole graph. I didn’t add any extra observations. These countries have no observations, but that’s not unlike many of the sub-regions of countries for which we do have observations; if you look at all those sub-regions in egypt, for most of them we have no observations.

My understanding is that the sub-regions effect of the new countries are assumed to be exchangeable at the unstructured-level, and so get estimated with unstructured effects \psi around zero with the common variance.

The spatial effects \phi are essentially interpolated, so if we have no observations for Saudi, but we do for Yemen (and they are low) and Egypt (and they are high) the Saudi effect will be imputed as the average of the neighbouring countries - so the average of Egypt and Yemen - which seems plausible if we have no observations for it.

I should note the outcome variable - this is an individual-level model of ISIS recruits, so it’s a rare-event, case-control model, other than being a spatial model:

y \sim \mbox{Bernoulli}(\rho)

\mbox{logit}(\rho) = \mbox{offset} + X\beta + \theta + \gamma

The objective is to account for spatial auto-correlation to de-bias the individual-level fixed-effects, more than to make area-level inference.

Does the approach still seem fishy to you ? I’ll check out the Islands model anyways.

Many thanks again for the feedback,
R

sounds interesting!

Are you distinguishing between ‘observing zero’ (zeros) and not observing at all (unknowns)? It sounds like following through with what you describe would require a model of missing observations. But if you’re not interested in using your data to make inferences about places per se, I’d just drop any locations on the map that don’t have data, and model the (remaining) geography of the observations.

Hey @RobertoCerina, if you end up working with the ‘island’ structure this might be useful, I’ve been meaning to work on this for a while now. Builds on the work I linked to earlier from @mitzimorris and Adam Howes but may be a little more streamlined

2 Likes

Many thanks @cmcd , this is immensely useful and looks very reasonable to implement (the US example is also very useful for some other work I’m doing on MrP, so brilliant, many thanks ! ). I will post something when I get a chance to implement. Again thanks to you, and Mitzi for the help.

1 Like

Hi @cmcd - I’m getting back in touch as I’ve finally found some time to implement the `islands’ version of the BYM2 model. I am using the code from Mitzi that is available on the github repository you linked to above (BYM2-b). I modified the code slightly for a bernoulli distributed outcome which is indexed by US states (e.g. the responses to a voter-intention survey).

I am facing a problem with the indexing though - this is the error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for: 

  vector[ ] - vector[ ]

Available argument signatures for operator-:

  vector - vector
  row_vector - row_vector
  matrix - matrix
  vector - real
  row_vector - real
  matrix - real
  real - vector
  real - row_vector
  real - matrix

Expression is ill formed.
 error in 'modele2692cc0db87_ea907ebc718fa2d847fe19598a53363c' at line 58, column 83
  -------------------------------------------------
    56:       if (comp_size[n] > 1)
    57:         total += -0.5 * dot_self(phi[adjacency[1, comp_members[n, 1:comp_size[n]]]] - 
    58:                                  phi[adjacency[2, comp_members[n, 1:comp_size[n]]]]) + 
                                                                                          ^
    59:                                  normal_lpdf(sum(phi[comp_members[n, 1:comp_size[n]]]) | 0, 0.001 * comp_size[n]);
  -------------------------------------------------

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'ea907ebc718fa2d847fe19598a53363c' due to the above error.

The code is below - again mostly the same as the BYM2-b from @mitzimorris. I am not experienced enough in writing functions in STAN to figure out exactly what is going on - obviously I understand that there is an indexing issue, but the integer arrays that index the spatial-effects seem like they should work to me. Sorry if this is a basic question - looking forward to any and all thoughts.

Many thanks !

functions {
  /**
   * Return the log probability density of the specified vector of
   * coefficients under the ICAR model with unit variance, where
   * adjacency is determined by the adjacency array and the spatial
   * structure is a disconnected graph which has at least one
   * connected component.  The spatial structure is described by
   * an array of component sizes and a corresponding 2-D array
   * where each row contains the indices of the nodes in that
   * component.  The adjacency array contains two parallel arrays
   * of adjacent element indexes (i.e. edges in the component graph).
   *
   * For example, a series of four coefficients phi[1:4] for a
   * disconnected graph containing 1 singleton would have
   * adjacency array {{1, 2}, {2, 3}}, signaling that coefficient 1
   * is adjacent to coefficient 2, and 2 is adjacent to 3,
   * component size array {3, 1}, and (zero-padded) component members
   * array of arrays { { 1, 2, 3, 0}, {4, 0, 0, 0} }.
   *
   * Each connected component has a soft sum-to-zero constraint.
   * Singleton components dont contribute to the ICAR model.
   *
   * @param phi vector of varying effects
   * @param adjacency parallel arrays of indexes of adjacent elements of phi
   * @param comp_size array of sizes of components in spatial structure graph
   * @param comp_members array of arrays of per-components coefficients.
   *
   * @return ICAR log probability density
   *
   * @reject if the the adjacency matrix does not have two rows
   * @reject if size mismatch between comp_size and comp_members
   * @reject if size mismatch between phi and dimension 2 of comp_members
   *
   * author: Mitzi Morris
   */
   
  real standard_icar_disconnected_lpdf(vector[ ] phi,
                                       int[ , ] adjacency,
                                       int[ ] comp_size,
                                       int[ , ] comp_members
                                       ) {
    real total = 0;
                                       
    if (size(adjacency) != 2){
    reject("require 2 rows for adjacency array;"," found rows = ", size(adjacency));
    }
    if (size(comp_size) != size(comp_members)) {
      reject("require ", size(comp_size), " rows for members matrix;"," found rows = ", size(comp_members));
    }
    if (size(phi) != dims(comp_members)[2]){
      reject("require ", size(phi), " columns for members matrix;"," found columns = ", dims(comp_members)[2]);
    }
    
    for (n in 1:size(comp_size)) {
      if (comp_size[n] > 1)
        total += -0.5 * dot_self(phi[adjacency[1, comp_members[n, 1:comp_size[n]]]] - 
                                 phi[adjacency[2, comp_members[n, 1:comp_size[n]]]]) + 
                                 normal_lpdf(sum(phi[comp_members[n, 1:comp_size[n]]]) | 0, 0.001 * comp_size[n]);
    }
    return total;
  }
}
data {
  // spatial structure
  int<lower = 0> N_state;  // number of nodes
  int<lower = 0> n_edges;  // number of edges
  int<lower = 1, upper = N_state> edges[2, n_edges];  // node[1, j] adjacent to node[2, j]
  
  int<lower=0, upper=N_state> k;  // number of components in spatial graph
  int<lower=0, upper=k> group_size[k];   // component sizes
  int<lower=0, upper=N_state> comp_id[k, N_state];  // rows contain per-component areal region indices
  
  vector<lower=0>[k] scale_factor; // per-component scaling factor, 0 for singletons
}
parameters {
  // spatial effects
  real<lower = 0> sigma_sp;  // scale of spatial effects
  real<lower = 0, upper = 1> rho_sp;  // proportion of spatial effect thats spatially smoothed
  vector[N_state] theta_sp;  // standardized heterogeneous spatial effects
  vector[N_state] phi_sp;  // standardized spatially smoothed spatial effects

}
transformed parameters {
  vector[N_state] gamma;
  vector[n] mu; // logit-scale propensity to vote R
  
  // each component has its own spatial smoothing
  for (i in 1:k) {
    if (group_size[i] == 1) {
      gamma[comp_id[i,1]] = theta_sp[comp_id[i,1]];
    } else {
      gamma[comp_id[i, 1:group_size[i]]] = (sqrt(1 - rho_sp) * theta_sp[comp_id[i, 1:group_size[i]]] + 
                                           (sqrt(rho_sp) * sqrt(1 / tau_sp[i]) * phi_sp[comp_id[i, 1:group_size[i]]]) * sigma_sp);
    }
  }
  
  // linear function of the logit-scale propensity to be a recruit
  mu = alpha + 
       gamma[state_id];  
  
}
model {

  // spatial hyperpriors and priors
  sigma_sp ~ normal(0, 1);
  rho_sp ~ beta(0.5, 0.5);
  theta_sp ~ normal(0, 1);
  phi_sp ~ standard_icar_disconnected(edges, K_size, K_members);

  // // // fixed effects priors
  alpha ~ cauchy(0,10);
      
  // // // likelihood
  Y ~ bernoulli_logit(mu);
  
}

in the function decl, vector[] should be just plain vector

see: 9.4 Argument types and qualifiers | Stan Reference Manual

Thanks for your quick reply Mitzi - changing that I get this error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for: 

  size(vector)

Available argument signatures for size:

  size(int[ ])
  size(real[ ])
  size(vector[ ])
  size(row_vector[ ])
  size(matrix[ ])
  size(int[ , ])
  size(real[ , ])
  size(vector[ , ])
  size(row_vector[ , ])
  size(matrix[ , ])
  size(int[ , , ])
  size(real[ , , ])
  size(vector[ , , ])
  size(row_vector[ , , ])
  size(matrix[ , , ])
  size(int[ , , , ])
  size(real[ , , , ])
  size(vector[ , , , ])
  size(row_vector[ , , , ])
  size(matrix[ , , , ])
  size(int[ , , , , ])
  size(real[ , , , , ])
  size(vector[ , , , , ])
  size(row_vector[ , , , , ])
  size(matrix[ , , , , ])
  size(int[ , , , , , ])
  size(real[ , , , , , ])
  size(vector[ , , , , , ])
  size(row_vector[ , , , , , ])
  size(matrix[ , , , , , ])
  size(int[ , , , , , , ])
  size(real[ , , , , , , ])
  size(vector[ , , , , , , ])
  size(row_vector[ , , , , , , ])
  size(matrix[ , , , , , , ])

 error in 'modele26942e80896_7172170a1108e80dd84fca38359ad7ea' at line 51, column 17
  -------------------------------------------------
    49:       reject("require ", size(comp_size), " rows for members matrix;"," found rows = ", size(comp_members));
    50:     }
    51:     if (size(phi) != dims(comp_members)[2]){
                        ^
    52:       reject("require ", size(phi), " columns for members matrix;"," found columns = ", dims(comp_members)[2]);
  -------------------------------------------------

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model '7172170a1108e80dd84fca38359ad7ea' due to the above error.

what version of Stan are you using?

compiler for Stan 2.26 allows vector args to the size function - 5.1 Integer-valued matrix size functions | Stan Functions Reference

older versions of the language, perhaps not - checking when this changed.
(use CmdStanR instead of RStan to get latest Stan goodness).

1 Like

Ah - indeed I had an old version of STAN ! thanks for the pointer - I was running around with STAN 2.19 it turns out ! Runs smoothly now - many thanks.

I have a 2 follow-up questions:

  1. when I look at the original code from here: Stan-IAR/BYM2-b.stan at main · ConnorDonegan/Stan-IAR · GitHub , if I look at the data structure it is as follows:

    data {
      // spatial structure
      int<lower = 0> I;  // number of nodes
      int<lower = 0> J;  // number of edges
      int<lower = 1, upper = I> edges[2, J];  // node[1, j] adjacent to node[2, j]
    
      int<lower=0, upper=I> K;  // number of components in spatial graph
      int<lower=0, upper=K> K_size[K];   // component sizes
      int<lower=0, upper=I> K_members[K, I];  // rows contain per-component areal region indices
    
      vector<lower=0>[K] tau_sp; // per-component scaling factor, 0 for singletons
    }
    

I have some trouble understanding int<lower=0, upper=K> K_size[K] : if K_size is the size of each group of connected spatial units, shouldn’t it be declared as int<lower=0, upper=I> K_size[K] ? If I think of US states, I = 51, K = 3 (alaska, hawaii and mainland) and K_size = c(49,1,1). Am I thinking about this correctly ? Apologies in advance if I’m sticking my foot in my mouth…

  1. How should I think about convergence in this model ? I understand convergence is a global phenomenon, so I should monitor lp_ and ensure that converges as well as the other parameters I’m monitoring, but what if I have a converging lp_ and gamma, but a non-convergent phi, for example - should that be cause of concern ? The idea being that I’m using the model to make predictions, so I only want to use gamma in my predictions on a test-set and care little for the values of phi and theta that generate it.

Many thanks in advance ! R

As to 2, sounds like what you’re describing is that once theta is joined to phi whatever issue that was detected in phi is no longer evident from the test statistic—but its still present, so I’d heed the warning. Sampling for that parameter is usually pretty smooth, so if there seems to be trouble it might an indication that something’s not parameterized quite right, or just needs to run for more iterations.

One thing to consider, if you have multiple connected components (rather than one connected set of observations plus some islands), its a good idea to have a separate intercept for each connected component (the first component uses the global intercept). I’m working on updating the code in that repo for this. But you can just add a dummy variable for the second component, for example.

Freni-Sterrantino et al.'s “A note on intrinsic conditional autoregressive models for
disconnected graphs”, Spatial and Spatio-Temporal Epidemiology

3 Likes

Thanks @cmcd - as always that is very useful advice and thanks for the paper link too. It’s still unclear to me what to do to improve the situation there.

I have this model running for the US case; the challenge I’m facing is that my Phis don’t converge (as I was detailing above). It is indeed the case that the states for which I have the least observations struggle more (e.g. North Dakota, I only have 5 respondents out of ~ 4800). But generally the Phis are non-convergent. Not sure what recommendations you have here, would love to hear your thoughts.

functions {
  real standard_icar_disconnected_lpdf(vector phi,
				       int[ , ] adjacency,
				       int[ ] comp_size,
				       int[ , ] comp_members) {
				       
    if (size(adjacency) != 2) {
      reject("require 2 rows for adjacency array;"," found rows = ", size(adjacency));
    }
    if (size(comp_size) != size(comp_members)) {
      reject("require ", size(comp_size), " rows for members matrix;", " found rows = ", size(comp_members));
    }
    if (size(phi) != dims(comp_members)[2]) {
      reject("require ", size(phi), " columns for members matrix;", " found columns = ", dims(comp_members)[2]);
    }

    real total = 0;
    for (n in 1:size(comp_size)) {
      if (comp_size[n] > 1)
	      total += -0.5 * dot_self(phi[adjacency[1, comp_members[n, 1:comp_size[n]]]] - phi[adjacency[2, comp_members[n, 1:comp_size[n]]]]) + normal_lpdf(sum(phi[comp_members[n, 1:comp_size[n]]]) | 0, 0.001 * comp_size[n]);
    }
    return total;
    
  }
}
data {
  int<lower = 0> n; // number of observations
  int<lower = 0, upper = 1> Y[n]; // outcome

  // spatial structure
  int<lower = 0> I;  // number of nodes
  int<lower = 0> J;  // number of edges
  int<lower = 1, upper = I> edges[2, J];  // node[1, j] adjacent to node[2, j]

  int<lower=0, upper=I> K;  // number of components in spatial graph
  int<lower=0, upper=I> K_size[K];   // component sizes
  int<lower=0, upper=I> K_members[K, I];  // rows contain per-component areal region indices

  vector<lower=0>[K] tau_sp; // per-component scaling factor, 0 for singletons
  
  // random effects idx
  int<lower = 1,upper = I> state_id[n]; 
}
parameters {
           // fixed effects 
                         real alpha;  // intercept
           // spatial effects
           real<lower = 0> sigma_sp;  // scale of spatial effects
  real<lower = 0, upper = 1> rho_sp;  // proportion of spatial effect thats spatially smoothed
                 vector[I] theta_sp;  // standardized heterogeneous spatial effects
                   vector[I] phi_sp;  // standardized spatially smoothed spatial effects
}
transformed parameters {
  vector[I] gamma;
  vector[n] mu;
  
  // each component has its own spatial smoothing
  for (k in 1:K) {
    if (K_size[k] == 1) {
      gamma[K_members[k,1]] = theta_sp[K_members[k,1]];
    } else {
      gamma[K_members[k, 1:K_size[k]]] = sqrt(1 - rho_sp) * theta_sp[K_members[k, 1:K_size[k]]]
                                         + (sqrt(rho_sp) * sqrt(1 / tau_sp[k])
                                         * phi_sp[K_members[k, 1:K_size[k]]]) * sigma_sp;
    }
  }
  
  // linear predictor
  mu = alpha + gamma[state_id];
}
model {
  // global intercept prior
  alpha ~ cauchy(0,10);
  
  // spatial hyperpriors and priors
  sigma_sp ~ normal(0, 1);
  rho_sp ~ beta(0.5, 0.5);
  theta_sp ~ normal(0, 1);
  phi_sp ~ standard_icar_disconnected(edges, K_size, K_members);

  // likelihood
  Y ~ bernoulli_logit(mu);
}

I believe you’re correct - that should be

int<lower=0, upper=I> K_size[K];   // component sizes

that’s the problem with toy examples - so small that everything works.

1 Like

I think this:

 sqrt(1 - rho_sp) * theta_sp[K_members[k, 1:K_size[k]]] + 
( sqrt(rho_sp) * sqrt(1 / tau_sp[k]) * phi_sp[K_members[k, 1:K_size[k]]] ) 
 * sigma_sp;

should be like this:

( sqrt(1 - rho_sp) * theta_sp[K_members[k, 1:K_size[k]]] + 
 sqrt(rho_sp) * sqrt(1 / tau_sp[k]) * phi_sp[K_members[k, 1:K_size[k]]] ) 
 * sigma_sp;

They share the scale parameter.

The beta prior on rho_sp also places quite a bit of density on 0 and 1; 1 would be an ICAR term with no theta. I’d personally go with something flatter or else that looks like a slightly ‘frowning’ rather than ‘smiling’ face (convex, concave, whatever).

hist(rbeta(n=30e3, 0.5, 0.5))

Have you already tried this same model without the spatial part—drop phi and all the BYM stuff and just use theta? I’d try that to troubleshoot also.

1 Like

Thanks @cmcd - I ran this model (and a more complicated version of this model) successfully with Mitzi’s BYM2 model without the islands (essentially dropping Alaska and Hawaii from my dataset). That model runs very smoothly and rather quickly converges, which is what made me think I could quickly extend it. A model with simple state random effects also converges without any issues very quickly.

I see your point about the rho_sp and tried to increase it to a more favourable (if unrealistic) beta(5,5), as well as increasing the tree-depth to 12 (over 500 iterations, 250 warmup, 4 chains and thin = 4). I also changed the gamma definition as you suggested above - to reflect the shared variance component.

This improves the situation some - see attached for the traceplots on phi - the 2nd and 12th are respectively Alaska and Hawaii, so their Phi can be ignored as it doesn’t actually enter the model at any point. Still, some of the other states are rather concerning. Attached are also the gamma and rho traceplots - there are some residual problems on gamma but all due to the weird acting phis. convergence_total.pdf (8.5 KB) traceplot.rho_sp.pdf (6.1 KB) traceplot.gamma_sp.pdf (72.2 KB) traceplot.phi_sp.pdf (72.1 KB)

The global intercept alpha also fails to converge which is disappointing. I’m wondering if this is an identifiability issue perhaps.

More generally - is it correct to think of the theta for the islands are state-fixed effects ?

Many thanks for all your help.

I was just thinking that these models can require more iterations than you might expect, they tend to sample quickly but still require more iterations than something like your theta will to converge to a stable distribution. I’d increase to something like chains = 5, iter = 3e3, thin = 0, warmup = 1500 to start.

1 Like