Convergence issues with heteroskedastic hierarchical model

Hello all,

Thank you to any and all who take the time to provide some assistance as I attempt to improve a hierarchical heteroskedastic model that I have been working on.


Model description:

I have developed a hierarchical mixture model for the analysis of a particular type of high-throughput data. The details (other than it being high-throughput) don’t matter as this post will present a simplified (non-mixture model) distillation of the partial pooling performed in the model. Data is modeled as being drawn from a heteroskedastic Normal distribution, with different standard deviations for each individual. I am using a linear model to describe how the standard deviation varies with a covariate called “Reads”. Higher values of Reads are correlated with lower amounts of replicate to replicate variability (abbreviated RV hereafter). I am assuming that there is a single Reads vs. RV trend for all individuals. The model can thus be formalized as follows:

\text{Data}_{i,j} \sim \text{Normal}(\mu_{i}, \sigma_{i} )

\sigma_{i} \sim \text{Log-Normal}(-a*\text{Reads} + b, \tau_{\sigma})


Explanation of parameters:

\text{Data}_{i,j} is the jth replicate of data collected for individual i. The number of replicates is typically small (2 to 5) but the number of individuals is always large (> 100, usually > 1000).

\sigma_{i} is what I have referred to as the RV.

a is the Reads vs. log(RV) trend slope multiplied by -1. It is constrained to be positive as Reads should always negatively correlate with the RV (domain expertise, you’ll have to trust me this is true).


The challenge:

I have simulated data according to the above generative model and have fit two incarnations of a stan model using rstan. In one of the models, I used a centered parameterization for the \sigma_{i} and in the other I used a non-centered parameterization. While the centered parameterization led to serious convergence issues (poor mixing of chains, low effective sample sizes, some divergences), the non-centered parameterization yielded no warnings upon sampling. That being said, closer inspection of the pairs plot revealed some concerning funneling behavior (analysis of simulated data with 500 individuals and 2 replicates of data):

sig-sig corresponds to \tau_{\sigma} and sig[1] corresponds to \sigma_{1} .

Increasing the number of replicates from 2 to 4 seems to have made it (unexpectedly to me) worse, with divergent transitions becoming a frequent occurrence.

In the more complicated model that inspired this post, the seriousness of these problems is often amplified. Therefore, I wanted to ask if anyone in this awesome community had suggestions for how my stan models could be improved to address these problems. Below I have included code to simulate data and fit the stan model I am currently using.


Code to reproduce:

R script for simulating data and fitting stan model that follows:
Stanforum_Heteroskedastic.R (3.3 KB)

data {
  int NE; // total number of data points
  int NF; // total number of features (individuals; feature is field-specific jargon)
  int FE[NE]; //  feature ID for each data point
  real Data[NE]; // replicates of data modeled as normally distributed
  real Reads[NF]; // standardized covariate that influences amount of replicate variability (RV)
}

parameters {
  real<lower=0> a; // heteroskedastic slope
  real b; // heteroskedastic intercept
  real<lower=0> sig_sig; // heteroskedastic variance of variance
  vector[NF] z_sig; // z-score for non-centered param
  vector[NF] mu; // Data mean for each feature
}

transformed parameters {
  vector<lower=0>[NF] sig; // RV (non-centered)
  vector[NE] mu_vect; // for vectorization purposes
  vector<lower=0>[NE] sig_vect; // for vectorization purposes

    for (i in 1:NF) {
        sig[i] = exp(-a*Reads[i] + b + sig_sig*z_sig[i]); // non-centered RV trend 
    }
    
    // To aid in vectorization in model block
    for(i in 1:NE){
      mu_vect[i] = mu[FE[i]];
      sig_vect[i] = sig[FE[i]];
    }
    
}

model {  

  // Heteroskedasticity hyperparameters
  a ~ normal(0.3, 0.2); 
  b ~ normal(-1.5, 0.35); 
  sig_sig ~ exponential(0.25); 
  
  // For non-centered parameterization of RV
  z_sig ~ normal(0,1); 
  
  mu ~ normal(0, 1);

  // Generative model
  Data ~ normal(mu_vect, sig_vect);
    
 }

 generated quantities {

  // Array of draws from posterior predictive distribution
  real Data_predict[NE];
  
  for (i in 1:NE) {
    Data_predict[i]  = normal_rng( mu_vect[i] , sig_vect[i]);
  }
  

}

TLDR:

I am attempting to estimate a linear trend between the log(standard deviation) of normally distributed data and a covariate. I want to estimate a single trend for all individuals analyzed (and a standard deviation should be estimated for each individual), thus the slope, intercept, and variance parameters are hyperparameters, making this a hierarchical model. Using a non-centered parameterization for the standard deviation parameter improves convergence relative to a centered parameterization, but does not eliminate problematic correlations in the posterior. Is there anything in the above stan model that sticks out as being incorrect and a likely cause of these problems, or am I just limited by the extent to which the likelihood informs my parameter estimates when there are only a couple replicates of data for each individual? Might I have to resort to a partially centered parameterization (and if so, suggestions on how to choose the weighting parameter and whether a single weight is expected to be nearly optimal across multiple different datasets)?

Thank you again,
Isaac

P.S.

I would’ve assumed (e.g., from the incredibly helpful @betanalpha hierarchical model case study) that a centered parameterization for the RV would have outperformed a non-centered parameterization, since the number of individuals across which partial pooling is performed is high. If anyone knows why this is not the case, I would be very interested in hearing an explanation! Not of great importance though, just trying to learn more about how these things work.

In the non-centered parameter the \sigma_{i} are reconstructed – is \log \sigma_{i} here the reconstructed value in the transformed parameters block or the latent “z-score” as yo refer to it in the comments? Funnels in the transformed parameters are not a concern – it’s funnels in the parameters that are problematic.

How uniform are those individual/feature occupancies? Do they all have lots of observations (and hence a narrow likelihood function) or is there imbalance with some enjoying more data and some enjoying less?

Thank you very much for taking the time to respond!

Ah that’s good to know, it is the reconstructed value, here is a more useful pairs plot with the latent z-score (results from fitting simulated data using a seed of 321 and a seed of 123 in call to rstan::stan). There were 15 divergences and warnings about rhats and low effective sample sizes:

The traceplots also show some problematic sample autocorrelations:

They all have the same number of observations, and the number of observations for each feature is small (usually 2 or 3, no more than 5). So I guess it’s the fact that they each have so few observations that makes the non-centered parameterization optimal? But when partial pooling the \mu_{i} (not done in the above model but done in the actual model), I have found a centered parameterization to be optimal, which I took to be due to the fact that the number of \mu_{i} being partially pooled is large. Perhaps I just need to reread your case study.