Hello all,
Thank you to any and all who take the time to provide some assistance as I attempt to improve a hierarchical heteroskedastic model that I have been working on.
Model description:
I have developed a hierarchical mixture model for the analysis of a particular type of high-throughput data. The details (other than it being high-throughput) don’t matter, as this post presents a simplified (non-mixture) distillation of the partial pooling performed in the model. Data are modeled as draws from a heteroskedastic Normal distribution, with a different standard deviation for each individual. I use a linear model to describe how the standard deviation varies with a covariate called “Reads”: higher values of Reads are correlated with lower replicate-to-replicate variability (abbreviated RV hereafter). I assume there is a single Reads vs. RV trend for all individuals. The model can thus be formalized as follows:
\text{Data}_{i,j} \sim \text{Normal}(\mu_{i}, \sigma_{i})
\sigma_{i} \sim \text{Log-Normal}(-a \cdot \text{Reads}_{i} + b, \tau_{\sigma})
Explanation of parameters:
\text{Data}_{i,j} is the jth replicate of data collected for individual i. The number of replicates is typically small (2 to 5) but the number of individuals is always large (> 100, usually > 1000).
\sigma_{i} is what I have referred to as the RV.
a is the Reads vs. log(RV) trend slope multiplied by -1. It is constrained to be positive as Reads should always negatively correlate with the RV (domain expertise, you’ll have to trust me this is true).
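For concreteness, here is a minimal Python/NumPy sketch of this generative model (the dimensions and hyperparameter values below are illustrative stand-ins of my own choosing, not necessarily those used in the attached R script):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative dimensions and hyperparameters (stand-ins, not from the R script)
n_features = 500     # number of individuals/features
n_reps = 2           # replicates per feature
a, b, tau_sigma = 0.3, -1.5, 0.4

# Standardized Reads covariate, one value per feature
reads = rng.normal(0.0, 1.0, n_features)

# Per-feature RV: sigma_i ~ LogNormal(-a*Reads_i + b, tau_sigma)
sigma = np.exp(rng.normal(-a * reads + b, tau_sigma))

# Per-feature means, then replicated observations
mu = rng.normal(0.0, 1.0, n_features)
data = rng.normal(mu[:, None], sigma[:, None], size=(n_features, n_reps))

print(data.shape)  # (500, 2)
```

Note that because the trend enters with a negative sign and a > 0, larger Reads values produce smaller sigma_i, matching the domain expectation above.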
The challenge:
I have simulated data according to the above generative model and have fit two incarnations of a Stan model using rstan. In one model I used a centered parameterization for the \sigma_{i}; in the other, a non-centered parameterization. While the centered parameterization led to serious convergence issues (poor mixing of chains, low effective sample sizes, some divergences), the non-centered parameterization yielded no warnings upon sampling. That said, closer inspection of the pairs plot revealed some concerning funneling behavior (analysis of simulated data with 500 individuals and 2 replicates each):
sig_sig corresponds to \tau_{\sigma} and sig[1] corresponds to \sigma_{1}.
Increasing the number of replicates from 2 to 4 seems to have made it (unexpectedly to me) worse, with divergent transitions becoming a frequent occurrence.
In the more complicated model that inspired this post, these problems are often amplified. I therefore wanted to ask whether anyone in this awesome community has suggestions for how my Stan models could be improved to address them. Below I have included code to simulate data and fit the Stan model I am currently using.
Code to reproduce:
R script for simulating data and fitting the Stan model that follows:
Stanforum_Heteroskedastic.R (3.3 KB)
data {
  int NE; // total number of data points
  int NF; // total number of features (individuals; feature is field-specific jargon)
  int FE[NE]; // feature ID for each data point
  real Data[NE]; // replicates of data modeled as normally distributed
  real Reads[NF]; // standardized covariate that influences amount of replicate variability (RV)
}
parameters {
  real<lower=0> a; // heteroskedastic slope
  real b; // heteroskedastic intercept
  real<lower=0> sig_sig; // heteroskedastic scale (log-scale sd of the RV)
  vector[NF] z_sig; // z-score for non-centered param
  vector[NF] mu; // Data mean for each feature
}
transformed parameters {
  vector<lower=0>[NF] sig; // RV (non-centered)
  vector[NE] mu_vect; // for vectorization purposes
  vector<lower=0>[NE] sig_vect; // for vectorization purposes
  for (i in 1:NF) {
    sig[i] = exp(-a*Reads[i] + b + sig_sig*z_sig[i]); // non-centered RV trend
  }
  // To aid in vectorization in model block
  for (i in 1:NE) {
    mu_vect[i] = mu[FE[i]];
    sig_vect[i] = sig[FE[i]];
  }
}
model {
  // Heteroskedasticity hyperparameters
  a ~ normal(0.3, 0.2);
  b ~ normal(-1.5, 0.35);
  sig_sig ~ exponential(0.25);
  // For non-centered parameterization of RV
  z_sig ~ normal(0, 1);
  mu ~ normal(0, 1);
  // Generative model
  Data ~ normal(mu_vect, sig_vect);
}
generated quantities {
  // Array of draws from posterior predictive distribution
  real Data_predict[NE];
  for (i in 1:NE) {
    Data_predict[i] = normal_rng(mu_vect[i], sig_vect[i]);
  }
}
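As a sanity check on the non-centered construction in the transformed parameters block, the following Python/NumPy sketch (with a hypothetical location and scale standing in for -a*Reads[i] + b and sig_sig) confirms that exp(loc + scale * z) with z ~ Normal(0, 1) has the same distribution as a direct Log-Normal(loc, scale) draw:

```python
import numpy as np

rng = np.random.default_rng(1)
loc, scale = -1.5, 0.4   # hypothetical stand-ins for (-a*Reads[i] + b) and sig_sig
n = 200_000

# Non-centered construction, as in the transformed parameters block
z = rng.normal(0.0, 1.0, n)
sig_noncentered = np.exp(loc + scale * z)

# Direct log-normal draw (the centered analogue)
sig_centered = rng.lognormal(loc, scale, n)

# Both samples should share the same log-mean and log-sd
print(np.log(sig_noncentered).mean(), np.log(sig_centered).mean())
```

The reparameterization changes the geometry the sampler sees (z_sig gets a fixed standard-normal prior), not the model itself.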
TLDR:
I am attempting to estimate a linear trend between the log(standard deviation) of normally distributed data and a covariate. I want to estimate a single trend across all individuals analyzed (with a standard deviation estimated for each individual), so the slope, intercept, and variance parameters are hyperparameters, making this a hierarchical model. Using a non-centered parameterization for the standard deviation improves convergence relative to a centered parameterization, but does not eliminate problematic correlations in the posterior. Does anything in the above Stan model stick out as incorrect and a likely cause of these problems, or am I just limited by how much the likelihood informs my parameter estimates when there are only a couple of replicates per individual? Might I have to resort to a partially centered parameterization (and if so, any suggestions on how to choose the weighting parameter, and whether a single weight can be expected to be nearly optimal across different datasets)?
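For reference, the partial centering I have in mind is the standard interpolation scheme (not in my model above; w is a tuning weight, with w = 0 the non-centered and w = 1 the centered parameterization): draw theta_raw ~ Normal(w*mu, tau^w) and recover theta = mu + tau^(1-w) * (theta_raw - w*mu). A quick Python/NumPy check that theta is Normal(mu, tau) for any w:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = -1.5, 0.4   # hypothetical location/scale hyperparameters
n = 200_000

for w in (0.0, 0.5, 1.0):
    # Partially centered draw: theta_raw ~ Normal(w*mu, tau^w)
    theta_raw = rng.normal(w * mu, tau**w, n)
    # Recover theta; should be Normal(mu, tau) for every w,
    # since (theta_raw - w*mu) ~ Normal(0, tau^w)
    theta = mu + tau**(1 - w) * (theta_raw - w * mu)
    print(w, theta.mean().round(3), theta.std().round(3))
```

The open question is how to pick w (and whether one value works across datasets), which is exactly what I am asking about.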
Thank you again,
Isaac
P.S.
I would’ve assumed (e.g., from the incredibly helpful @betanalpha hierarchical model case study) that a centered parameterization for the RV would have outperformed a non-centered parameterization, since the number of individuals across which partial pooling is performed is high. If anyone knows why this is not the case, I would be very interested in hearing an explanation! Not of great importance though, just trying to learn more about how these things work.