Combining spatial and within variability, and biasing representative latent variable

Hi all,

I am still learning how to formulate models in brms (and lme4 syntax generally), so any help would be greatly appreciated. I am trying to model the following relationship:

q_z ~ N(q_CPT, sig_within)                 # within-CPT variability
q_CPT ~ N(q_site, sig_between)             # spatial variability across the site
q_site ~ N(q_site_lowest, standard_error)  # bias toward the site-representative qc

I am trying to predict occurrences of liquefaction at a site based on various predictor variables. One of the predictor variables is normalized tip resistance, qc1ncs, which is measured (rather, calculated) from CPT tests performed within a site. A CPT test is performed by drilling into the ground and collecting data with depth, so each CPT yields an array of q_z data, and multiple CPT tests may be performed at a site. I am therefore trying to obtain a latent variable for qc1ncs, q_site, that accounts for both the within-CPT and the spatial variability of qc1ncs. Additionally, I believe that q_site is likely closer to the lowest qc1ncs across all CPTs (i.e., q_site_lowest), and the lowest qc1ncs may even be missing from the data entirely, so I assign epistemic uncertainty to its estimate, defined by standard_error.

I have the grouping factor case_id, which is assigned to all q_z that belong to the same site. So far, I have something like the following:

data = data.frame(case_id,A_est,sig_A,q_z,q_site_lowest,standard_error)

formula <- bf(y ~ (A - (b0 + (1/b1))) / C,
b0 ~ 1,
b1 ~ 0 + I(1/me(q_site, standard_error)),
A~ 0 + me(A_est,sig_A),
C ~ 1,
q_site~ 0 + offset(q_site_lowest),
q_z ~ 0 + q_site+ (1 | case_id),
nl = TRUE)

prior <- c(prior(normal(-2.5, 0.5), nlpar = "b0"),
           prior(normal(110.0, 10.0), nlpar = "b1", lb = 0),
           prior(cauchy(0, 1), nlpar = "C", lb = 0),
           prior(constant(1.0), nlpar = "A"),
           prior(exponential(1), class = "sigma", resp = "q_z"),
           prior(normal(0, 0.15), class = "sd", group = "case_id", resp = "q_z"),
           prior(normal(0, 0.3), class = "b", nlpar = "q_site")
           )

fit <- brm(formula, data = data, prior = prior, family = bernoulli(link = "probit"), iter = 4000, seed = 123)

Evidently, I am getting an error saying that q_z is not a valid distributional or non-linear parameter because of the nl = TRUE term. Is there a right way to achieve the aforementioned relationship?

  • Operating System: Windows 11
  • brms Version: 2.23

Again, any help would be greatly appreciated. Thanks!

Hi Tat,

At first glance, your hierarchical structure as written may cause some issues. To me, it looks like q_{site} \sim \mathcal{N}(q_{site,min}, SE) imposes a tight, data-dependent prior on the hyperparameter governing the site-average q_{c1ncs}. But the site-average q_{c1ncs} is unlikely to be close to the site minimum, which could result in prior-data conflict. That said, I don’t think you want the site average to represent the latent variable of interest anyway.

I think what you’re going for (correct me if I’m wrong) is that the binary site response y is related to some latent “representative” q_{c1ncs} value (call it q^*), on which you want to impose a prior representing some compromise between the site-average and site-minimum values.

Give me a day or so to dig into this and I can come up with a more thorough answer. This sort of model might be easier to write in Stan directly, though I’m no expert on brms.

Possibly some errors here, but I think the trick would be to keep your formulation for q_z, q_{CPT}, and q_{site}, but instead use something like:
q^* \sim \mathcal{N}(\gamma \, q_{site} + (1-\gamma) \, q_{site,min}, \tau)
and
y \sim \text{Bern}(\Phi(f(q^*, \ldots)))
That is, the latent value is given a prior whose mean is a convex combination of the site average and site minimum, controlled by \gamma, which you also estimate as a parameter (and can make site-specific). \tau controls how tight you want the prior on the latent value to be. You can probably come up with other ways of “combining” the site average/min in a data-driven fashion too.

Here is some untested Stan code to that effect:

data {
  int<lower=0> N_total; //total number of CPT data points (not traces)
  vector[N_total] q; //all the cpt data appended together
  
  array[N_total] int cpt_trace_id; //trace index for each CPT data point
  array[N_total] int cpt_site_id; //site index for each CPT data point
  
  int<lower=0> N_sites; //number of sites
  int<lower=0> N_traces; //number of traces
  
  vector[N_sites] min_q_site; //pre-computed minimum qc1ncs per site
  
  array[N_sites] int<lower=0, upper=1> y_site_obs; //site observed binary outcome
}

parameters {
  real<lower=0> sigma_within_trace;

  real mu_q;
  vector[N_traces] delta_trace; //trace average qc1ncs
  vector[N_sites] delta_site; //site average qc1ncs
  
  real<lower=0> tau_delta_trace;
  real<lower=0> tau_delta_site;
  
  vector[N_sites] q_star; //latent value for binary regression
  
  real<lower=0,upper=1> gamma; //for convex comb site avg and site min, could be made site specific
}


model {
  //hierarchical model for site average qc1ncs
  vector[N_total] qbar = mu_q + delta_trace[cpt_trace_id] + delta_site[cpt_site_id]; //vectorized
  q ~ normal(qbar, sigma_within_trace);
  
  mu_q ~ ... //add global prior here
  delta_trace ~ normal(0, tau_delta_trace);
  delta_site ~ normal(0, tau_delta_site);
  
  //latent variable model that links site average and site min
  vector[N_sites] mu_q_site = mu_q + delta_site;
  q_star ~ normal(gamma*mu_q_site + (1-gamma)*min_q_site, ...); //need to add a standard deviation to this to control how tight prior constraint on latent variable is
  
  y_site_obs ~ bernoulli_logit( ... q_star); //or use probit formulation

}

though note that I have also partially pooled the CPT data across sites in this setup.
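For what it’s worth, here is a rough sketch of how the data list for that model could be assembled in R. The column names (case_id, trace_id, q_z) and the per-site outcome vector y_site are hypothetical, assuming a long-format data frame cpt with one row per CPT reading:

```r
# Hypothetical long-format data frame `cpt` with columns:
#   case_id  - site identifier
#   trace_id - CPT trace identifier
#   q_z      - qc1ncs reading at a given depth
# plus a binary outcome vector `y_site`, one entry per site, ordered by site index.
site_idx  <- as.integer(factor(cpt$case_id))
trace_idx <- as.integer(factor(cpt$trace_id))

stan_data <- list(
  N_total      = nrow(cpt),
  q            = cpt$q_z,
  cpt_trace_id = trace_idx,
  cpt_site_id  = site_idx,
  N_sites      = max(site_idx),
  N_traces     = max(trace_idx),
  min_q_site   = as.vector(tapply(cpt$q_z, site_idx, min)), # pre-computed site minima
  y_site_obs   = y_site
)
```

The `factor()` conversion just guarantees contiguous 1-based indices, which is what the Stan index arrays expect.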

Also, if you want to capture uncertainty in q_{site,min}, have you considered modeling the minimum of each trace within a site with a Gumbel or similar GEV distribution?
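A minimal (and again untested) sketch of that idea, assuming the per-trace minima min_q_trace and a trace-to-site index trace_site_id are pre-computed and passed as data:

```stan
data {
  int<lower=0> N_traces;
  int<lower=0> N_sites;
  vector[N_traces] min_q_trace;               //pre-computed minimum qc1ncs per trace
  array[N_traces] int<lower=1> trace_site_id; //site index for each trace
}
parameters {
  vector[N_sites] mu_min_site; //site-level location of the trace minima
  real<lower=0> beta_min;      //Gumbel scale
}
model {
  //Gumbel likelihood for the per-trace minima; mu_min_site then plays the
  //role of an uncertain q_{site,min} in the latent-variable prior.
  //Strictly, the Gumbel describes block maxima, so one could instead model
  //the negated minima (-min_q_trace) to respect the extreme-value convention.
  min_q_trace ~ gumbel(mu_min_site[trace_site_id], beta_min);
}
```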

Jon,

Thanks for the great insights. The use of a combined prior to achieve a representative q^* closer to the min would be something to dig into, and also the use of a Gumbel distribution for q_{site,min}! And yes, I had a feeling that doing all this directly in Stan would be the best option, though I wonder if it is at all possible in brms.

Much appreciated,
Tat

Happy to help; maybe someone else can chime in on whether this two-stage type of model is something brms will work for. As a final note, there’s also nothing wrong with doubling up on priors in Stan:
q^{*} \sim \mathcal{N}(\mu_{q_{site}}, \ldots)
q^{*} \sim \mathcal{N}(q_{site,min}, \ldots)
This just has the effect of giving q^* a prior that’s proportional to the product of two normal densities (which has a closed-form expression for the implied mean/sd).
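For reference, that closed form is the standard precision-weighted combination: writing the two priors as \mathcal{N}(\mu_1, \sigma_1) and \mathcal{N}(\mu_2, \sigma_2) (here \mu_1 = \mu_{q_{site}} and \mu_2 = q_{site,min}), the product is proportional to a normal density with

```latex
\sigma_c^2 = \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right)^{-1},
\qquad
\mu_c = \sigma_c^2 \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right)
```

so the implied mean is a precision-weighted average of the site average and site minimum, which connects nicely back to the convex-combination formulation with \gamma = \sigma_c^2 / \sigma_1^2.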