Convergence issues with heteroskedastic hierarchical model

Hello all,

Thank you to any and all who take the time to provide some assistance as I attempt to improve a hierarchical heteroskedastic model that I have been working on.


Model description:

I have developed a hierarchical mixture model for the analysis of a particular type of high-throughput data. The details (other than it being high-throughput) don’t matter, as this post presents a simplified (non-mixture) distillation of the partial pooling performed in the model. Data are modeled as being drawn from a heteroskedastic Normal distribution, with a different standard deviation for each individual. I am using a linear model to describe how the standard deviation varies with a covariate called “Reads”. Higher values of Reads are correlated with lower amounts of replicate-to-replicate variability (abbreviated RV hereafter). I am assuming that there is a single Reads vs. RV trend for all individuals. The model can thus be formalized as follows:

\text{Data}_{i,j} \sim \text{Normal}(\mu_{i}, \sigma_{i} )

\sigma_{i} \sim \text{Log-Normal}(-a \cdot \text{Reads}_{i} + b, \tau_{\sigma})


Explanation of parameters:

\text{Data}_{i,j} is the jth replicate of data collected for individual i. The number of replicates is typically small (2 to 5) but the number of individuals is always large (> 100, usually > 1000).

\sigma_{i} is what I have referred to as the RV.

a is the Reads vs. log(RV) trend slope multiplied by -1. It is constrained to be positive as Reads should always negatively correlate with the RV (domain expertise, you’ll have to trust me this is true).


The challenge:

I have simulated data according to the above generative model and have fit two incarnations of a Stan model using rstan. In one model I used a centered parameterization for the \sigma_{i}, and in the other a non-centered parameterization. While the centered parameterization led to serious convergence issues (poor mixing of chains, low effective sample sizes, some divergences), the non-centered parameterization yielded no warnings upon sampling. That said, closer inspection of the pairs plot revealed some concerning funneling behavior (analysis of simulated data with 500 individuals and 2 replicates of data):

In the pairs plot, sig_sig corresponds to \tau_{\sigma} and sig[1] corresponds to \sigma_{1}.

Increasing the number of replicates from 2 to 4 seems to have made it (unexpectedly to me) worse, with divergent transitions becoming a frequent occurrence.

In the more complicated model that inspired this post, these problems are often amplified. I therefore wanted to ask whether anyone in this awesome community has suggestions for how my Stan models could be improved to address them. Below I have included code to simulate data and fit the Stan model I am currently using.


Code to reproduce:

R script for simulating data and fitting the Stan model that follows:
Stanforum_Heteroskedastic.R (3.3 KB)

data {
  int NE; // total number of data points
  int NF; // total number of features (individuals; feature is field-specific jargon)
  int FE[NE]; //  feature ID for each data point
  real Data[NE]; // replicates of data modeled as normally distributed
  real Reads[NF]; // standardized covariate that influences amount of replicate variability (RV)
}

parameters {
  real<lower=0> a; // heteroskedastic slope
  real b; // heteroskedastic intercept
  real<lower=0> sig_sig; // sd of log(RV) around the Reads trend (variability of variability)
  vector[NF] z_sig; // z-score for non-centered param
  vector[NF] mu; // Data mean for each feature
}

transformed parameters {
  vector<lower=0>[NF] sig; // RV (non-centered)
  vector[NE] mu_vect; // for vectorization purposes
  vector<lower=0>[NE] sig_vect; // for vectorization purposes

  for (i in 1:NF) {
    sig[i] = exp(-a*Reads[i] + b + sig_sig*z_sig[i]); // non-centered RV trend
  }

  // To aid in vectorization in model block
  for (i in 1:NE) {
    mu_vect[i] = mu[FE[i]];
    sig_vect[i] = sig[FE[i]];
  }
}

model {  

  // Heteroskedasticity hyperparameters
  a ~ normal(0.3, 0.2); 
  b ~ normal(-1.5, 0.35); 
  sig_sig ~ exponential(0.25); 
  
  // For non-centered parameterization of RV
  z_sig ~ normal(0,1); 
  
  mu ~ normal(0, 1);

  // Generative model
  Data ~ normal(mu_vect, sig_vect);
    
}

generated quantities {

  // Array of draws from posterior predictive distribution
  real Data_predict[NE];

  for (i in 1:NE) {
    Data_predict[i] = normal_rng(mu_vect[i], sig_vect[i]);
  }
}
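For comparison, the centered parameterization I mentioned above (the one that produced the convergence warnings) swapped the z_sig construction for something roughly like the following sketch (not the exact code I ran; everything not shown stays the same):

parameters {
  // ... a, b, sig_sig, mu as before; no z_sig ...
  vector<lower=0>[NF] sig; // RV, sampled directly (centered)
}

model {
  // Centered hierarchical prior: log(sig) scattered around the Reads trend
  for (i in 1:NF) {
    sig[i] ~ lognormal(-a*Reads[i] + b, sig_sig);
  }
  // likelihood as before: Data ~ normal(mu_vect, sig_vect);
}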

TLDR:

I am attempting to estimate a linear trend between the log(standard deviation) of normally distributed data and a covariate. I want to estimate a single trend for all individuals analyzed (with a standard deviation estimated for each individual), so the slope, intercept, and variance parameters are hyperparameters, making this a hierarchical model. Using a non-centered parameterization for the standard deviation parameter improves convergence relative to a centered parameterization, but does not eliminate problematic correlations in the posterior. Is there anything in the above Stan model that sticks out as being incorrect and a likely cause of these problems, or am I just limited by the extent to which the likelihood informs my parameter estimates when there are only a couple of replicates per individual? Might I have to resort to a partially centered parameterization (and if so, do you have suggestions on how to choose the weighting parameter, and is a single weight expected to be nearly optimal across multiple different datasets)?
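To make that last question concrete, the kind of weight-based partial centering I have in mind is sketched below. Here w would be a fixed tuning constant passed in as data (0 = fully non-centered, 1 = fully centered) and q_sig is a new latent vector; both names are placeholders, and the rest of the model stays the same:

data {
  // ... existing data block unchanged ...
  real<lower=0, upper=1> w; // centering weight: 0 = non-centered, 1 = centered
}

parameters {
  // ... a, b, sig_sig, mu as before; z_sig replaced by q_sig ...
  vector[NF] q_sig; // partially centered latent variable
}

transformed parameters {
  vector<lower=0>[NF] sig; // RV
  for (i in 1:NF) {
    real m = -a*Reads[i] + b; // location of log(RV) from the Reads trend
    sig[i] = exp(m + pow(sig_sig, 1 - w) * (q_sig[i] - w*m));
  }
}

model {
  for (i in 1:NF) {
    q_sig[i] ~ normal(w*(-a*Reads[i] + b), pow(sig_sig, w));
  }
  // priors on a, b, sig_sig, mu and the likelihood as before
}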

Thank you again,
Isaac

P.S.

I would’ve assumed (e.g., from the incredibly helpful @betanalpha hierarchical model case study) that a centered parameterization for the RV would have outperformed a non-centered parameterization, since the number of individuals across which partial pooling is performed is high. If anyone knows why this is not the case, I would be very interested in hearing an explanation! Not of great importance though, just trying to learn more about how these things work.

In the non-centered parameterization the \sigma_{i} are reconstructed. Is the \log \sigma_{i} in your pairs plot the reconstructed value from the transformed parameters block, or the latent “z-score” as you refer to it in the comments? Funnels in the transformed parameters are not a concern; it’s funnels in the parameters that are problematic.

How uniform are those individual/feature occupancies? Do they all have lots of observations (and hence a narrow likelihood function) or is there imbalance with some enjoying more data and some enjoying less?

Thank you very much for taking the time to respond!

Ah, that’s good to know; it is the reconstructed value. Here is a more useful pairs plot with the latent z-score (results from fitting the simulated data using a seed of 321 and a seed of 123 in the call to rstan::stan). There were 15 divergences, along with warnings about Rhats and low effective sample sizes:

The traceplots also show some problematic sample autocorrelations:

They all have the same number of observations, and the number of observations for each feature is small (usually 2 or 3, no more than 5). So I guess it’s the fact that they each have so few observations that makes the non-centered parameterization optimal? But when partially pooling the \mu_{i} (not done in the above model but done in the actual model), I have found a centered parameterization to be optimal, which I took to be due to the fact that the number of \mu_{i} being partially pooled is large. Perhaps I just need to reread your case study.

Apologies for the delayed response!

The z_sig[1]-log_sig_sig plot suggests there might be an inverted funnel in the posterior distribution, which would argue for trying a centered parameterization (at least for z_sig[1] and any other z_sig[n] that show the same correlation).

On the other hand, there’s also some interesting behavior when plotting log_sig_sig against log_a and b. One possibility is that a * Reads[i], b, and sig_sig * z_sig[i] are all degenerate with each other. In other words, if sig_sig is large enough, then sig_sig * z_sig[i] might be able to mimic the behavior of both b and a * Reads[i], at least partially. You could verify this by plotting the intermediate variables a * Reads[i] and sig_sig * z_sig[i] against each other.
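For example, one way to expose those intermediate terms is to save them in generated quantities and then pairs-plot them against each other; the names reads_term and hetero_term here are just placeholders:

generated quantities {
  // Separate contributions to log(sig), saved so they can be plotted against each other
  vector[NF] reads_term;
  vector[NF] hetero_term;
  for (i in 1:NF) {
    reads_term[i] = -a * Reads[i];
    hetero_term[i] = sig_sig * z_sig[i];
  }
}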

If this is the case then a tighter prior on sig_sig could help separate out the more systematic trends from the unstructured heterogeneity.
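For example, something like

  sig_sig ~ exponential(4); // concentrates prior mass much closer to zero than exponential(0.25)

or a half-normal with a small scale; the particular rate is just an illustration, and the value you settle on should be checked against your domain expertise, e.g. with prior predictive checks.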

Technically it’s the likelihood function that determines which parameterization will give a nicer geometry, but for relatively simple models the shape of the likelihood function pretty cleanly correlates with the number of observations. In other words looking at the number of observations can be an okay heuristic but one has to be careful not to rely on it too strongly.

The benefit of partial pooling is that it constrains the population scale, which effectively cuts off the top and bottom of the hierarchical funnel. This can make the geometry resulting from both the centered and non-centered parameterizations much less degenerate. That said, if you look at, for example, ESS[f] / N_leapfrog, then you will typically still see that one parameterization enjoys a slightly nicer geometry that improves computational performance.
