Spatial modelling of individual data measured at areal level

Hello everyone,

I’m new to spatial modelling, and I’m having trouble finding the right modelling method for my problem.

My data relates to accidents. For each individual, I have a clinical score measuring the severity of the injuries, as well as the municipality in which the accident occurred. I am trying to establish whether the injury severity score is associated with population density in the commune, given that the effect of one commune is suspected to be influenced by adjacent communes.
I am therefore looking to carry out modelling that takes into account the spatial correlation between communes.

However, the data are spatially associated with areas, and the only modelling options I can find for this type of spatial data are models in which the individual data are aggregated.

Do you know of a way to model my outcome on an individual scale while taking into account the fact that it is measured at an area level?

Thank you in advance for your advice.

Hi,

I would recommend checking out the documentation of the geostan R package. It implements the most widely used spatial models and most of the examples are areal based. Even if you don’t use the package directly its a good resource for becoming more familiar with spatial models in Stan.

3 Likes

@mitzimorris wrote a tutorial on spatial modeling, and one of the examples is an areal model of pedestrian accident data in NYC by census tract as data:

I think she’s been working on updates to it.

2 Likes

Following up on @Jesse’s suggestion, here’s some R and Stan code for getting started with models for observations nested in areal units.

This creates some data with K observations per areal unit:

library(geostan)

data(georgia)
N_geo <- nrow(georgia)
W <- shape2mat(georgia, "W")
mu_geo <- sim_sar(w=W, rho = 0.7)
K <- 15
mu <- rep(mu_geo, each = K)
y <- rnorm(mean = mu,
           sd = 0.2,
           n = length(mu))
dat <- data.frame(
    GEOID = rep(georgia$GEOID, each = K),
    y = y)

# converting GEOID to an index
id <- georgia$GEOID
dx <- data.frame(
    GEOID = id,
    geo_index = 1:N_geo
)
dat <- merge(dat, dx, by = "GEOID")

Here’s a Stan model example with a SAR model for the areal units. The user-defined function is taken from the documentation that Jesse linked to (especially this vignette: Custom spatial models with RStan and geostan • geostan). I added an option to center the spatial model on zero or not (zmp) (i.e., choose non-centered or centered parameterization).

So you want to have a mean parameter alpha (and you could add covariates or whatever to that) and then add another term for the area, which I’ve coded as geo. Using the index:

\mu_i = \alpha + geo_i

where geo is mean-zero (geo ~ mvnormal(0, Sigma)). Instead of doing a loop like for (i in 1:n) mu[i] = alpha + geo[index[i]] I used the index like mu = alpha + geo[index].

All of the data inputs for the spatial model are provided by geostan, so its less of a hassle than it may look at first:

[edit: applied Stan model formatting]

functions {
/**
 * Log probability density of the simultaneous autoregressive (SAR) model (spatial error model)
 *
 * @param y Process to model
 * @param mu Mean vector
 * @param sigma Scale parameter
 * @param rho Spatial dependence parameter
 * @param W Sparse representation of W (its non-zero values)
 * @param W_v Column indices for values in W
 * @param W_u Row starting indices for values in W
 * @param lambda Eigenvalues of W
 * @param n Length of y
 *
 * @return Log probability density of SAR model up to additive constant
*/
  real  sar_normal_lpdf(vector y,
              vector mu,
              real sigma,
              real rho,
              vector W_w,
              array[] int W_v,
              array[] int W_u,
              vector lambda,
              int n) {
    vector[n] z = y - mu;
    real tau = 1 / sigma^2;
    vector[n] ImrhoWz = z - csr_matrix_times_vector(n, n, rho * W_w, W_v , W_u , z);
    real zVz = tau * dot_self(ImrhoWz);
    real ldet_V = 2 * sum(log1m(rho * lambda)) - 2 * n * log(sigma);
    return  0.5 * ( -n * log(2 * pi()) + ldet_V - zVz );         
 }    
}

data {
  int<lower=0,upper=1> zmp; // zero-mean CAR model? (non-centered param.)
  int<lower=1> N;
  vector[N] y;

  // no. geog areas
  int N_geo;

  // geo identifier
  array[N] int geo_index;    
  
// SAR
  int nW_w;
  vector[nW_w] W_w;
  array[nW_w] int W_v;
  array[N_geo + 1] int W_u;
  vector[N_geo] eigenvalues_w;
}

parameters {
  real alpha;
  real<lower=0> sigma;
  
  vector[N_geo] geo;  
  real<lower=0> sigma_geo;
  real<lower=1/min(eigenvalues_w), upper=1/max(eigenvalues_w)> rho_geo;  
}

transformed parameters {
  vector[N] mu;
  vector[N_geo] mu_geo;
  
  if (zmp) {
    mu = alpha + geo[geo_index];
    mu_geo = rep_vector(0, N_geo);
  } else {
    mu = geo[geo_index];
    mu_geo = rep_vector(alpha, N_geo);    
  }
}

model {
 
  target += normal_lpdf(y | mu, sigma);

 // adjust these to have sensible priors for your project
  target += normal_lpdf(sigma | 0, 2);
  target += normal_lpdf(alpha | 8, 5);
  target += student_t_lpdf(sigma_geo | 10, 0, 5);
  
  // spatial model
   target += sar_normal_lpdf(geo | 
                   mu_geo, 
   		  sigma_geo, 
   		  rho_geo, 
                   W_w, 
                   W_v, 
                   W_u, 
                  eigenvalues_w,
                  N_geo);
}

This shows how to convert the data to the required format:

# start creating list of data for Stan
dl <- prep_sar_data(W)
dl$W <- NULL # might need to remove this W item or cmdstan will error

# add other data
dl$N <- nrow(dat)
dl$N_geo <- N_geo
dl$y <- dat$y
dl$geo_index <- dat$geo_index
dl$zmp <- 0 

And draw samples:

library(cmdstanr)
mod <- cmdstan_model("nested_sar.stan") 
S <- mod$sample(data = dl,
                iter_sampling = 1e3,
                parallel_chains = 4)
print(S, c('alpha', 'sigma', 'rho_geo', 'sigma_geo'))

returns:

  variable mean median   sd  mad    q5  q95 rhat ess_bulk ess_tail
 alpha     0.05   0.06 0.28 0.26 -0.40 0.48 1.00     5832     2420
 sigma     0.20   0.20 0.00 0.00  0.20 0.21 1.00     6148     2794
 rho_geo   0.65   0.66 0.07 0.07  0.54 0.76 1.00     5091     2863
 sigma_geo 1.09   1.09 0.06 0.06  0.99 1.20 1.00     7401     2727
1 Like

if you want to code things yourself in Stan, I recommend the set of notebooks that I put together for GeoMED_2024. GitHub - mitzimorris/geomed_2024: Spatial Models in Stan

1 Like

You can also check out the Stan code:

But be careful—it’s distributed under copyleft, specifically GPLv3.

Overall, the Stan code here looks solid other than the spacing, which is messed up throughout. There are a bunch of things that could be made a bit more efficient by pulling operations out of loops and caching them. For example,

vector[n] phi;
int pos=1;
for (j in 1:k) {
  phi[ segment(group_idx, pos, group_size[j]) ] = phi_scale * sqrt(rho) * inv_sqrt_scale_factor[j] * phi_tilde[ segment(group_idx, pos, group_size[j]) ];
  pos += group_size[j];
}
return phi;

can become

vector[n] phi;
int pos=1;
for (j in 1:k) {
  vector[n] slice = segment(group_idx, pos, group_size[j]);
  phi[slice] = inv_sqrt_scale_factor[j] * phi_tilde[slice];
  pos += group_size[j];
}
phi *= phi_scale * sqrt(rho);
return phi;

This saves computing the segment twice, computes sqrt(rho) only once and uses efficient vector-scalar products to multiple the whole vector by phi_scale * sqrt(rho) at once. I think it’s also a bit easier to read pulling out the segment and giving it a name.

More meaningfully, I think our new sum_to_zero_vector is more efficient than the soft centering that’s being used in this repo—@mitzimorris can verify.

the sum to zero vector is far more efficient than soft centering.
see: The Sum-to-Zero Constraint in Stan

1 Like

Overall, the Stan code here looks solid other than the spacing

Thanks for reading through the code! When I have some time to work on the package again I’ll incorporate these suggestions.