How to best model proportion data within a spatial hierarchical structure that mixes random groups nested in fixed groups?

This is a long post, but I thought that it would be much better to be thorough with the question and example…

Let’s fabricate some data that minimally mimics the scenario motivating this post. The data represent a spatial hierarchical structure in which six random levels (the spatial lower level, LL, in the figure below) are nested within three fixed levels (the spatial upper level, UL), so there is an unequal number of LL levels per UL level. Additionally, let’s assume that each LL level was visited on 10 occasions (i.e. a time category), which we will call time_stamp, and on each occasion we measured a variable quantified as a proportion between 0 and 1, which we will call response. The figure below illustrates the spatial hierarchy.

[Figure: reprex_fig1]

[Figure: reprex_fig2]

As we can see from the figure, LLs are random because they do not fill the space contained within each UL. On the other hand, ULs, when put together, represent the total existing space (e.g. the globe). I recognise that in this example there are not enough LLs to characterise a within-UL variance, but that was done deliberately to keep the example minimal and reproducible. All I would like to learn here is the correct approach to implement on a spatially similar (though much larger) dataset.

Another design feature that emerges from the figure is that LLs within spatio-temporal designs rarely represent the same spatial extent (i.e. the size of each LL relative to its enclosing UL). Whilst some ULs are well represented by LLs (e.g. UL1), in other ULs sampling might be relatively sparse (e.g. UL3). Consequently, we might wish to weight LLs within ULs according to the relative area they represent when estimating how response changes across time_stamps. Similarly, we may also want to weight the ULs according to their relative areas when calculating a total, global change in response across time_stamps. In short, this is a case where we want to incorporate survey design weights.

The code below creates and plots the data:

library(tidyverse)
set.seed(24)

betaparms  <-  function (mu, v) {
    a  <-  ((1 - mu) / v - 1 / mu) * mu^2
    b  <-  a * ((1 / mu) - 1)
    list(a, b)
}
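
## Quick sanity check: betaparms() should recover the requested mean
## (illustrative only, not part of the simulation itself)
chk  <-  betaparms(0.2, 0.001)
chk[[1]] / (chk[[1]] + chk[[2]])  # 0.2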

time_stamp   <-  1:10       # effectively the sequence of categorical occasions in the temporal trend
n            <-  c(2, 1, 3) # number of LLs within each UL

prob_change      <-  ifelse(time_stamp != 8, 0.2 + 0.01 * time_stamp, (0.2 + 0.01 * time_stamp) * 2) # probability of a change at each time_stamp, doubled at time_stamp 8
total_response   <-  0.2                                  # total unweighted response
UL_response_eff  <-  c(0.2, 0, -0.1)                      # UL deviations from total response
LL_response_eff  <-  c(0.05, -0.05, 0, 0.1, -0.05, -0.05) # LL deviations from UL response
UL_response      <-  total_response + UL_response_eff  # UL-level mean responses

## Spatial hierarchy
dat  <-  data.frame(UL = factor(unlist(lapply(seq_along(n), function (i, n) rep(paste0('UL', i), n[i]), n = n))),
                    LL = factor(unlist(lapply(n, function(x) paste0('LL', 1:x))))
                   )

## Starting response for each LL: its UL-level mean plus its LL deviation, drawn from a Beta distribution
dat  <-  dat %>% 
			mutate(mu_start     = UL_response[as.numeric(UL)] + LL_response_eff,
			       tmp_response = rbeta(n(), betaparms(mu_start, 0.001)[[1]], betaparms(mu_start, 0.001)[[2]])) %>%
			select(-mu_start)

## Add occasion and noise
dat  <-  dat %>% 
			group_by(UL, LL) %>%
		    do({
		        x           <-  .
		        resp_1      <-  x$tmp_response
		        did_change  <-  FALSE
		        for (i in time_stamp[-1]) {
		            did_change  <-  cbind(did_change, prob_change[i] > runif(1, 0, 1))
		            resp_2      <-  ifelse(did_change[i], resp_1[i-1] * rnorm(1, 0.7, 0.2), resp_1[i-1] * rnorm(1, 1.05, 0.1))
		            resp_1      <-  c(resp_1, resp_2)
		        }
		        data.frame(x, time_stamp = time_stamp, response = resp_1)
		    }) %>% 
		    ungroup %>%
		    arrange(UL, LL, time_stamp) %>%
		    mutate(time_stamp = factor(as.character(time_stamp), levels = unique(time_stamp)),
		           UL_LL = factor(paste(UL, LL))
		          ) %>%
		    select(-tmp_response)

g1  <-  ggplot(dat) +
			geom_blank(aes(y = response, x = time_stamp)) +
			geom_line(aes(y = response, x = as.numeric(time_stamp), color = UL, linetype = UL_LL)) +
			geom_point(aes(y = response, x = time_stamp, color = UL)) +
			ggtitle('a) Raw data')
g1

The weights are constructed so that, within a given UL, the LL weights sum to 1 across LLs, and at the total level the rescaled UL weights also sum to 1:

LL_weights  <-  c(0.75, 0.25, 1, 0.25, 0.5, 0.25) # LL to UL weights
UL_weights  <-  c(1, 5, 10)                       # UL to total (global) weights

weights_data <- data.frame(UL = factor(unlist(lapply(seq_along(n), function(i, n) rep(paste0('UL', i), n[i]), n = n))),
                           LL = factor(unlist(lapply(n, function(x) paste0('LL', 1:x))))
                          ) %>%
                    mutate(Weights_LL = LL_weights,
                           Weights_UL = rep(UL_weights / sum(UL_weights), n),
                           UL_LL = factor(paste(UL, LL))) %>%
                    select(UL, Weights_UL, LL, Weights_LL, UL_LL)

> weights_data
   UL Weights_UL  LL Weights_LL   UL_LL
1 UL1     0.0625 LL1       0.75 UL1 LL1
2 UL1     0.0625 LL2       0.25 UL1 LL2
3 UL2     0.3125 LL1       1.00 UL2 LL1
4 UL3     0.6250 LL1       0.25 UL3 LL1
5 UL3     0.6250 LL2       0.50 UL3 LL2
6 UL3     0.6250 LL3       0.25 UL3 LL3
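
Just to confirm the weighting scheme described above, the LL weights should sum to 1 within each UL, and the rescaled UL weights should sum to 1 overall:

weights_data %>% group_by(UL) %>% summarise(sum_LL = sum(Weights_LL))    # each UL sums to 1
weights_data %>% distinct(UL, Weights_UL) %>% summarise(sum_UL = sum(Weights_UL)) # sums to 1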

So, now let’s try to piece this all together. To recap, the question is: how do I best use the survey design weights to take a weighted mean of the posterior distributions, first to obtain UL-level effects for each value of time_stamp, and then a second weighted average of the UL-level effects to obtain total-level effects? One approach I have tried so far is:

library(rstan)
library(broom)
library(gridExtra)
rstan::rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

stanmodel  <-  '
data {
  int<lower=0> N;               // number of observations
  int<lower=1> nX;              // number of fixed effects
  int<lower=0> nLL;             // number of LL groups
  int<lower=1> nZ;              // number of random effects for LL groups
  int<lower=1,upper=nLL> LL[N]; // LL group indicator for each observation
  row_vector[nX] X[N];          // fixed effects design matrix
  row_vector[nZ] Z[N];          // LL random effect design matrix
  vector[N] y;                  // response
}
parameters {
  real<lower=0> phi;
  vector[nX] beta;              // fixed effects coefficients
  cholesky_factor_corr[nZ] L_u; // cholesky factor of LL random effect correlation matrix
  vector<lower=0>[nZ] sigma_u;  // LL random effect standard deviation
  vector[nZ] z_u[nLL];          // LL random effect
}
transformed parameters {
  vector[N] mu;
  vector[nZ] u[nLL];      // LL random effects
  matrix[nZ, nZ] Sigma_u; // Cholesky factor of LL random effect covariance matrix

  Sigma_u = diag_pre_multiply(sigma_u, L_u);
  for(j in 1:nLL) {
    u[j] = Sigma_u * z_u[j];
  }
  for (i in 1:N) {
    mu[i] = inv_logit(X[i] * beta + Z[i] * u[LL[i]]);
  }
}
model {
  // priors
  target += lkj_corr_cholesky_lpdf(L_u | 1);
  target += gamma_lpdf(phi | 0.01, 0.01);
  target += student_t_lpdf(sigma_u | 3, 0, 10)
     - nZ * student_t_lccdf(0 | 3, 0, 10);
  for (j in 1:nLL) {
    target += normal_lpdf(z_u[j] | 0, 1);
  }
  target += normal_lpdf(beta | 0, 5);
  // likelihood
  target += beta_lpdf(y | mu * phi, (1 - mu) * phi);
}
'
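
## Note on the model above: the likelihood uses the mean-precision parameterisation
## of the Beta distribution, Beta(mu * phi, (1 - mu) * phi), so E[y] = mu and
## Var[y] = mu * (1 - mu) / (1 + phi). mu gets the logit link, and the LL random
## effects use the non-centred (Cholesky) parameterisation via z_u and Sigma_u.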

X         <-  model.matrix(~0 + time_stamp, data = dat)
standata  <-  with(dat, list(
		                     N   = length(response),
		                     nX  = ncol(X),
		                     nLL = length(unique(UL_LL)),
		                     nZ  = ncol(X),
		                     LL  = as.numeric(factor(UL_LL)),
		                     X   = X,
		                     Z   = X,
		                     y   = response
		                    )
		          )
my_model  <-  stan(model_code = stanmodel, data = standata, iter = 1000, warmup = 500, chains = 3, seed = '1234', cores = 3, control = list(adapt_delta = 0.99, max_treedepth = 20))

new_d  <-  with(dat, expand.grid(time_stamp = sort(unique(time_stamp)), UL_LL = unique(UL_LL)))
Fmat   <-  model.matrix(~ 0 + time_stamp, data = new_d) # fixed effects
Zmat   <-  model.matrix(~ 0 + UL_LL:time_stamp, data = new_d) # random effects
Mmat   <-  cbind(Fmat, Zmat)
coefs  <-  as.matrix(my_model)
wch    <-  grep('^beta.*|^u.*', colnames(coefs))

# cbind(colnames(Mmat), colnames(coefs[, wch])) # checks out

## LL trends
new_d  <-  new_d %>% 
              cbind(plogis(coefs[, wch] %*% t(Mmat)) %>%
              tidyMCMC(conf.int = TRUE, conf.method = 'HPDinterval')) %>%
              left_join(dat %>% dplyr::select(UL, UL_LL) %>% distinct)

g2  <-  ggplot(new_d) +
			geom_pointrange(aes(y = estimate, x = time_stamp, ymin = conf.low, ymax = conf.high, color = UL)) +
			geom_line(aes(y = estimate, x = as.numeric(time_stamp), color = UL, linetype = UL_LL)) +
			ggtitle('b) Modelled LL (by time_stamp) marginal means')

## UL trends (weighted mean of LLs)
new_d_UL  <-  with(dat, expand.grid(time_stamp = sort(unique(time_stamp)), UL_LL = unique(UL_LL))) %>%
			      left_join(dat %>% dplyr::select(UL, UL_LL) %>% distinct) %>%
    		      cbind(t(coefs[, wch] %*% t(Mmat))) %>%
    		      gather(key = Iter, value = Value, -time_stamp, -UL, -UL_LL) %>%
    		      left_join(weights_data %>% dplyr::select(UL, UL_LL, Weights_LL) %>% distinct) %>%
    		      mutate(Fit = Value * Weights_LL) %>%
    		      group_by(time_stamp, UL, Iter) %>%
    		      summarise(UL_mean = sum(Fit)) %>%
    		      ungroup

g3  <-  new_d_UL %>%
			mutate(UL_mean = plogis(UL_mean)) %>%
			group_by(time_stamp, UL) %>% 
			tidybayes::mean_hdci(UL_mean) %>%
			ggplot() +
			geom_pointrange(aes(y = UL_mean, x = time_stamp, ymin = .lower, ymax = .upper, color = UL)) +
			geom_line(aes(y = UL_mean, x = as.numeric(time_stamp), color = UL)) +
			scale_y_continuous(limits = c(0, 0.5)) +
			ggtitle('c) Derived UL (by time_stamp) means')

## Total (global) trends (weighted mean of ULs)
new_d_total  <-  new_d_UL %>%
					 group_by(time_stamp) %>%
					 left_join(weights_data %>% dplyr::select(UL, Weights_UL) %>% distinct) %>%
					 mutate(Fit = UL_mean * Weights_UL) %>%
					 group_by(time_stamp, Iter) %>%
					 summarise(total_mean = sum(Fit)) %>%
					 ungroup
    
g4  <-  new_d_total %>%
			mutate(total_mean = plogis(total_mean)) %>%
			group_by(time_stamp) %>% 
			tidybayes::mean_hdci(total_mean) %>%
			ggplot() +
			geom_pointrange(aes(y = total_mean, x = time_stamp, ymin = .lower, ymax = .upper)) +
			geom_line(aes(y = total_mean, x = as.numeric(time_stamp))) +
			scale_y_continuous(limits = c(0, 0.5)) +
			ggtitle('d) Derived Total (by time_stamp) means')

dev.new(width = 9.5, height = 8.3)
grid.arrange(g1, g2, g3, g4, nrow = 2)

In the above, I ran the model disregarding the UL fixed effects and only included LL as a random effect. So the model only estimates LL-by-time_stamp trends at the moment (i.e. it is naive to UL and Total). Those estimates, as we can see from the figure above, look quite reasonable. To aggregate up from LL to UL, I simply took a weighted mean of the LL-by-time_stamp trends within each UL.
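
To spell out what the aggregation above does: for each posterior draw and each time_stamp, the pipelines take a weighted sum of the LL linear predictors on the logit scale (using Weights_LL) to get a UL-level value, then a weighted sum of those UL-level values (using Weights_UL) to get a total-level value, and only apply the inverse logit (plogis) at the end for plotting.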

Does this essentially assume that the LLs are fixed? Or, because the LL-level variance is likely larger (LL being a random group), would I be correct in assuming that the uncertainty is properly propagated to all levels (UL and total)? Were LL fixed, each LL-level uncertainty would be lower, correct?

On the other hand, the aggregation from UL to Total looks OK, because the ULs are fixed.

Is there another way to include both UL- and LL-level weights in the likelihood calculation right off the bat within the Stan code? Perhaps I could implement something similar to what is described in Si et al. 2017 (also see the post here, although I have not been able to adapt that approach to this scenario).

Thanks for your patience in reading this post, and for your guidance.


Hi, sorry, it looks like your post slipped through despite being a very well-written question. Did you manage to move forward in the meantime?

There are two general approaches I can think of off the top of my head:

  • Add a random effect for both UL and the interaction of UL and LL. Since both are centered at zero, the random effect for UL_LL effectively centers around the UL random effect (a minimal sketch follows after this list).
  • If what you are dealing with is truly a spatial dataset, maybe the UL grouping is artificial - what does it actually represent? If all the UL is meant to capture is that samples from different ULs are more “distant”, and thus more likely to differ, than samples within the same UL, and you can in fact specify either a “neighbour” relationship or pairwise distances, it might make sense to model the full spatial structure with an ICAR model or a Gaussian process.
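
For the first option, a minimal sketch with brms could look something like this (untested, purely illustrative, and assuming the simulated dat from above - brms is not used elsewhere in this thread):

library(brms)

## Sketch only: nested random intercepts for UL and the UL:LL interaction,
## with a Beta likelihood on the logit scale, analogous to the hand-coded Stan model above
fit_nested  <-  brm(response ~ 0 + time_stamp + (1 | UL) + (1 | UL:LL),
                    family = Beta(link = 'logit'),
                    data   = dat)

The (1 | UL) and (1 | UL:LL) terms could be swapped for (0 + time_stamp | UL) and (0 + time_stamp | UL:LL) to mirror the per-time_stamp random effects in your Stan model, although with only three ULs and six LLs those variance components would be weakly identified.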

Best of luck with your model!

Thanks @martinmodrak! I’m having a look at the suggestions you mentioned. Just one quick question. Would either of those eliminate the need to include LL weights?

I think both could, but I don’t think I completely understand what you are after. In the nested random effects case, the UL level is represented directly; with an explicitly spatial model you should be able to make predictions for all locations across a UL (even those not sampled by any LL) and average over those. You could also do some sort of MRP (multilevel regression with poststratification) if you believe there is a noticeable systematic bias in which regions are represented (I do not do surveys myself, but my impression from the forums is that MRP is one of the best readily available methods for this).