Is it reasonable to model subject-level residual SD non-hierarchically in a hierarchical Stan model?

Hello fellow Stan users!

I am doing a hierarchical analysis of behavioral data (mouse behavior) with 3 levels: individual, group, and population. The model estimates 2 parameters (U and S), modeled in log and logit space respectively. The response variable is log-transformed and modeled with a normal likelihood. My question concerns how to model the standard deviation of that normal distribution. My initial approach was to model the observation-level standard deviation hierarchically (subject → group → population). However, this resulted in poor identification: the higher-level SD grew to absorb most of the variability, making the likelihood weakly sensitive to the main parameters (U and S). My second approach was to model the SD non-hierarchically, letting the sampler pick the best SD estimate for each subject.

The second approach was successful: no divergent transitions, reasonable parameter estimates, and good Rhat values. But I was wondering whether this is reasonable to do, and whether there is a more principled alternative.

I include the full model below, but the question is specifically about modeling the residual SD:

Thank you for any help provided!

Eq13_code <- "
data {
  int<lower=1> N;              // Number of observations
  int<lower=1> S;              // Number of subjects
  int<lower=1, upper=S> subject[N]; // Subject IDs
  real B[N];                   // Response variable (observed B)
  real w[N];                   // Known parameter w
  real t[N];                   // Known parameter t
  real N_val[N];               // Known parameter N
  real H[N];                   // Known parameter H
  real q[N];                   // Known parameter q
  int<lower=1, upper=2> G[S];  // Group scm or eth (subject-specific)
}
parameters {
  real U_log_pop;               // Population mean for log(U)
  real<lower=0> sigma_U;        // Population standard deviation for U
  real U_log_group[2];          // Group-level means for log(U)
  real U_log_offset[S];         // Subject-specific deviations for log(U)

  real S_logit_pop;             // Population mean for logit(S)
  real<lower=0> sigma_S;        // Population standard deviation for logit(S)
  real S_logit_group[2];        // Group-level means for logit(S)
  real S_logit_offset[S];       // Subject-specific deviations for logit(S)
  
  vector<lower=0>[S] sigma_B;  // Subject-specific residual standard deviations

}
transformed parameters {
  real U_subject[S];   // subject-level log(U)
  real S_subject[S];   // subject-level logit(S)

  for (s in 1:S) {
    U_subject[s] = U_log_group[G[s]] + U_log_offset[s] * sigma_U;     // non-centered
    S_subject[s] = S_logit_group[G[s]] + S_logit_offset[s] * sigma_S; // non-centered
  }
}
model {
  // Priors for population-level parameters
  target += normal_lpdf(U_log_pop | 0, 50);
  target += normal_lpdf(sigma_U | 0, 10) - log(0.5);    // half-normal; the -log(0.5) is a constant
  target += normal_lpdf(S_logit_pop | 0, 50);
  target += normal_lpdf(sigma_S | 0, 10) - log(0.5);    // half-normal; the -log(0.5) is a constant

  // Priors for group-level parameters
  target += normal_lpdf(U_log_group | U_log_pop, sigma_U);
  target += normal_lpdf(S_logit_group | S_logit_pop, sigma_S);

  // Priors for subject-level parameters (Non-centered)
  target += normal_lpdf(U_log_offset | 0, 1);
  target += normal_lpdf(S_logit_offset | 0, 1);
  
  target += normal_lpdf(sigma_B | 0, 10) - log(0.5);    // half-normal; the -log(0.5) is a constant

  // Likelihood (using a Gaussian approximation)
  for (n in 1:N) {
    int s = subject[n];
    real sigma = sigma_B[s];
    real Pn_star = 1 / (t[n] + (N_val[n] / (w[n] * q[n])));
    real Pn = (exp(U_subject[s]) * (Pn_star ^ inv_logit(S_subject[s]))) ^ (1 / (1 - inv_logit(S_subject[s])));
    real B_hat = log(N_val[n]) - log((t[n] * q[n]) + (N_val[n] / w[n])) - log(1 + 1 / Pn);
    target += normal_lpdf(B[n] | B_hat, sigma);
  }
}
"
data$Subject <- as.factor(data$Subject)
data_subject <- data %>%
  group_by(Subject) %>% 
  summarise(G = first(G))  

stan_data <- list(
  N = nrow(data),
  S = length(unique(data$Subject)),
  subject = as.numeric(data$Subject),
  B = data$B,
  N_val = data$N_val,
  w = data$w,
  q = data$q,
  t = data$t,
  H = data$H,
  G = data_subject$G  
)
stan_model <- stan_model(model_code = Eq13_code) 

Phase3_eq13 <- sampling(
  stan_model, 
  data = stan_data, 
  iter = 4000,
  chains = 4, 
  cores = 4,
  control = list(adapt_delta = 0.99, max_treedepth = 20) 
)

I couldn’t quite see where in the model you made this choice.

Scale parameters are usually shared across units in a hierarchical model to capture group-level variance. For example, if I’m modeling groups of rats in a clinical trial (like the first example of a hierarchical model in BDA), then we assume there’s a population variability for groups of rats, but it’s just a single scalar. If your model uses separate standard deviations for units, it’s no longer the same kind of hierarchical model. Often the standard deviation based on a single group won’t be well defined, but it sounds like your model is fitting, so maybe I’m not understanding what’s going on.
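
In code, the contrast is just this (a minimal sketch, not your model):

parameters {
  real<lower=0> sigma;            // (a) one shared residual scale: classic hierarchical setup
  vector<lower=0>[S] sigma_unit;  // (b) S separate, unpooled scales: a different kind of model
}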

P.S. You could vectorize that whole likelihood, which would run a lot faster. I’d also recommend consistent indentation and putting your Stan code in its own files so you can get meaningful line-number diagnostics, syntax highlighting, etc.
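
For example, if B_hat is computed as a vector (say, in transformed parameters), the loop collapses to a single statement; sigma_B[subject] multi-indexes the per-subject SDs into a length-N vector:

  // one vectorized statement instead of the loop over n
  B ~ normal(B_hat, sigma_B[subject]);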

P.P.S. Stan only needs its target defined up to a constant, so things like subtracting log(0.5) will have no effect on sampling because they drop out when we take derivatives.
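
Concretely, these two statements produce identical posterior draws:

  target += normal_lpdf(sigma_B | 0, 10) - log(0.5);  // -log(0.5) = +log(2), a constant
  target += normal_lpdf(sigma_B | 0, 10);             // same posterior; constants drop out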

Thanks for the feedback Bob, it was very helpful!
To clarify where this modeling choice is made: in the second approach, I model the standard deviation at the subject level via

vector<lower=0>[S] sigma_B;

with the prior

target += normal_lpdf(sigma_B | 0, 10);

and then use the subject-specific SD in the likelihood

real sigma = sigma_B[s];
target += normal_lpdf(B[n] | B_hat, sigma);

I did some testing and tried using the SD as a single scalar, as you suggested. The results were slightly worse than with a subject-specific SD: more shrinkage of the parameters towards the mean and a few divergent transitions (2), although both models had good Rhat and ESS values.
I have 10-15 data points per subject, so maybe that is why the SD is well identified at the individual level.
Conceptually, I am not really interested in the group SD; it is intended as a nuisance parameter capturing subject-specific measurement noise or unmodeled variability, rather than a population-level quantity of interest. I also agree that this departs from the classical hierarchical formulation, and I’m considering sticking with a shared SD (a single scalar, as you suggested) if the subject-specific version is hard to justify methodologically.
This is the current code with the SD as a single scalar, following your recommendations as best I could:

Eq13_code <- "
data {
  int<lower=1> N;              // Number of observations
  int<lower=1> S;              // Number of subjects
  int<lower=1, upper=S> subject[N]; // Subject IDs
  real B[N];                   // Response variable (observed B)
  real w[N];                   // Known parameter w
  real t[N];                   // Known parameter t
  real N_val[N];               // Known parameter N
  real H[N];                   // Known parameter H
  real q[N];                   // Known parameter q
  int<lower=1, upper=2> G[S];  // Group scm or eth (subject-specific)
}
parameters {
  real U_log_pop;               // Population mean for log(U)
  real<lower=0> sigma_U;        // Population standard deviation for U
  real U_log_group[2];          // Group-level means for log(U)
  real U_log_offset[S];         // Subject-specific deviations for log(U)

  real S_logit_pop;             // Population mean for logit(S)
  real<lower=0> sigma_S;        // Population standard deviation for logit(S)
  real S_logit_group[2];        // Group-level means for logit(S)
  real S_logit_offset[S];       // Subject-specific deviations for logit(S)
  
  real<lower=0> sigma_B;  // Shared (scalar) residual standard deviation
}
transformed parameters {
  vector[S] U_subject;
  vector[S] S_subject;
  vector[N] B_hat;

  for (s in 1:S) {
    U_subject[s] = U_log_group[G[s]] + U_log_offset[s] * sigma_U;
    S_subject[s] = S_logit_group[G[s]] + S_logit_offset[s] * sigma_S;
  }
  for (n in 1:N) {
    int s = subject[n];

    real Pn_star =
      1 / (t[n] + (N_val[n] / (w[n] * q[n])));

    real Pn =
      pow(
        exp(U_subject[s]) *
        pow(Pn_star, inv_logit(S_subject[s])),
        1 / (1 - inv_logit(S_subject[s]))
      );
    B_hat[n] =
      log(N_val[n]) -
      log((t[n] * q[n]) + (N_val[n] / w[n])) -
      log(1 + 1 / Pn);
  }
}
model {
  // Priors
  U_log_pop ~ normal(0, 50);
  sigma_U   ~ normal(0, 10);

  S_logit_pop ~ normal(0, 50);
  sigma_S     ~ normal(0, 10);

  U_log_group ~ normal(U_log_pop, sigma_U);
  S_logit_group ~ normal(S_logit_pop, sigma_S);

  U_log_offset ~ normal(0, 1);
  S_logit_offset ~ normal(0, 1);

  sigma_B ~ normal(0, 10);

  // Vectorized likelihood
  B ~ normal(B_hat, sigma_B);
}

generated quantities {
  vector[N] log_lik;
  vector[N] B_rep;

  for (n in 1:N) {
    log_lik[n] = normal_lpdf(B[n] | B_hat[n], sigma_B);
    B_rep[n]   = normal_rng(B_hat[n], sigma_B);
  }
}
"
data_subject <- data %>%
  group_by(Subject) %>%
  summarise(G = first(G))  # Take the first group ID for each subject
# Prepare data for Stan model
stan_data <- list(
  N = nrow(data),
  S = length(unique(data$Subject)),
  subject = as.numeric(data$Subject),
  B = data$B,
  N_val = data$N_val,
  w = data$w,
  q = data$q,
  t = data$t,
  H = data$H,
  G = data_subject$G  # Pass subject-specific group IDs, so each subject is correctly assigned to only one group.
)
stan_model <- stan_model(model_code = Eq13_code)
Phase3_eq13 <- sampling(
  stan_model, 
  data = stan_data,  
  iter = 4000,
  chains = 4, 
  cores = 4,
  control = list(adapt_delta = 0.99, max_treedepth = 20)  
)

Which approach makes more sense given the problem? Often, we want to use a single sigma when the B[n] are all drawn from the same population, which will regularize the B[n] toward B_hat and also reduce the size of posterior intervals.

You mean there are 15 different B[n] being drawn from the same sigma? I see sigma is grouped by s. In that case, if the different groups are the relevant unit, you want to do what you’re doing conceptually rather than regularize them all to a single common mean. You can then give the components hierarchical priors.
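
For the per-subject scales, one common choice is a log-normal hierarchy, something like this sketch (illustrative names, not your code):

parameters {
  real mu_log_sigma;            // population mean of the log SDs
  real<lower=0> tau_log_sigma;  // spread of the log SDs across subjects
  vector[S] log_sigma_offset;   // non-centered subject deviations
}
transformed parameters {
  // exp() keeps every subject SD positive; the standard-normal prior on
  // the offsets shrinks each subject toward the population mean
  vector<lower=0>[S] sigma_B = exp(mu_log_sigma + tau_log_sigma * log_sigma_offset);
}
model {
  mu_log_sigma ~ normal(0, 1);
  tau_log_sigma ~ normal(0, 0.5);
  log_sigma_offset ~ normal(0, 1);
}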

It only comes up when you want to do inference for new groups or posterior predictive inference for existing groups, because they will integrate over the uncertainty.
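
For example, with the log-normal hierarchy sketched above, the residual scale for a brand-new subject is just a fresh draw from the population distribution, which integrates over the population-level uncertainty:

generated quantities {
  // predictive residual SD for a hypothetical, unseen subject
  real sigma_B_new = exp(normal_rng(mu_log_sigma, tau_log_sigma));
}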

Yes, both the group and subject levels are relevant units in this problem. I am interested in subject-level parameter estimates as well as in group-level inference, particularly for comparing the posterior distributions of U and S across groups.

I attempted a new model with subject-specific SDs regularized through a hierarchical prior (as suggested). The outcome was very positive: no divergent transitions, less shrinkage of outliers, and good ESS and Rhat values. Thank you for your feedback Bob! Will include you in the acknowledgements of the paper haha.

The final code is below, in case there are any issues with the implementation, or for people who might run into similar issues in the future:

Eq13_code <- "
data {
  int<lower=1> N;              // Number of observations
  int<lower=1> S;              // Number of subjects
  int<lower=1, upper=S> subject[N]; // Subject IDs
  real B[N];                   // Response variable (observed B)
  real w[N];                   // Known parameter w
  real t[N];                   // Known parameter t
  real N_val[N];               // Known parameter N
  real H[N];                   // Known parameter H
  real q[N];                   // Known parameter q
  int<lower=1, upper=2> G[S];  // Group scm or eth (subject-specific)
}
parameters {
  real U_log_pop;               // Population mean for log(U)
  real<lower=0> sigma_U;        // Population standard deviation for U
  real U_log_group[2];          // Group-level means for log(U)
  real U_log_offset[S];         // Subject-specific deviations for log(U)

  real S_logit_pop;             // Population mean for logit(S)
  real<lower=0> sigma_S;        // Population standard deviation for logit(S)
  real S_logit_group[2];        // Group-level means for logit(S)
  real S_logit_offset[S];       // Subject-specific deviations for logit(S)
  
  real Pop_sigma_B;               // Population mean of the log SDs
  real<lower=0> SD_sigma_B;       // SD of the log SDs across subjects
  vector[S] sigma_B_offset;       // Non-centered subject SD offsets
}


transformed parameters {
  vector[S] U_subject;
  vector[S] S_subject;
  vector[N] B_hat;
  vector[S] sigma_B;
  vector[N] sigma_obs;

  for (s in 1:S) {
    U_subject[s] = U_log_group[G[s]] + U_log_offset[s] * sigma_U;
    S_subject[s] = S_logit_group[G[s]] + S_logit_offset[s] * sigma_S;
    sigma_B[s] = exp(Pop_sigma_B + SD_sigma_B * sigma_B_offset[s]);  // log-normal hierarchy keeps SDs positive
  }

  for (n in 1:N) {
    int s = subject[n];
    sigma_obs[n] = sigma_B[s];   // per-observation SD taken from the subject's SD

    real Pn_star =
      1 / (t[n] + (N_val[n] / (w[n] * q[n])));

    real Pn =
      pow(
        exp(U_subject[s]) *
        pow(Pn_star, inv_logit(S_subject[s])),
        1 / (1 - inv_logit(S_subject[s]))
      );

    B_hat[n] =
      log(N_val[n]) -
      log((t[n] * q[n]) + (N_val[n] / w[n])) -
      log(1 + 1 / Pn);
  }
}

model {
  // Priors
  U_log_pop ~ normal(0, 50);
  sigma_U   ~ normal(0, 10);

  S_logit_pop ~ normal(0, 50);
  sigma_S     ~ normal(0, 10);

  U_log_group ~ normal(U_log_pop, sigma_U);
  S_logit_group ~ normal(S_logit_pop, sigma_S);

  U_log_offset ~ normal(0, 1);
  S_logit_offset ~ normal(0, 1);

  

  Pop_sigma_B ~ normal(0, 1);        // prior on the mean log SD
  SD_sigma_B ~ normal(0, 0.5);       // half-normal via the <lower=0> constraint
  sigma_B_offset ~ normal(0, 1);     // non-centered parameterization

  // Vectorized likelihood
  B ~ normal(B_hat, sigma_obs);
}
"
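
In case it is useful to others: the generated quantities block from the single-sigma version carries over essentially unchanged, just swapping in the per-observation SDs (it goes after the model block, inside the model string):

generated quantities {
  vector[N] log_lik;
  vector[N] B_rep;

  for (n in 1:N) {
    // pointwise log-likelihood (e.g., for loo) and posterior predictive
    // draws, now using the subject-specific residual SDs
    log_lik[n] = normal_lpdf(B[n] | B_hat[n], sigma_obs[n]);
    B_rep[n]   = normal_rng(B_hat[n], sigma_obs[n]);
  }
}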