Estimating within and between variation

LuckyPal · April 28, 2023, 4:04pm

My question is, I guess, a rather general one about the stan language.

My goal is to evaluate the within-variance and the between-variance of some data. I have some vectors x and y, both of length N. My suspicion is that for each pair c(x[n], y[n]) for n in 1:N, the variance is smaller than the variance of all measurements within x and y, respectively. So, an example data could be generated like this in R: t(replicate(n=N, rnorm(n=2, mean=rnorm(n=1), sd=0.1))).

I have tried the following model but it appears to be very problematic with many divergent transitions. I guess, I should not assign two different sampling statements to each x[n]? But how else could one estimate the within- and between-variation?

data {
  int<lower=0> N;
  real x[N];
  real y[N];
}
parameters {
  real mu[N];
  real mu_x;
  real mu_y;
  real<lower=0> sigma[N];
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
}
model {
  for(n in 1:N){
    target += normal_lpdf(x[n] | mu[n], sigma[n]);
    target += normal_lpdf(y[n] | mu[n], sigma[n]);
    // priors
    target += normal_lpdf(mu[n] | 0, 5);
    target += cauchy_lpdf(sigma[n] | 0, 2.5);
  }
  target += normal_lpdf(x | mu_x, sigma_x);
  target += normal_lpdf(y | mu_y, sigma_y);
  // priors
  target += normal_lpdf(mu_x | 0, 5);
  target += normal_lpdf(mu_y | 0, 5);
  target += cauchy_lpdf(sigma_x | 0, 2.5);
  target += cauchy_lpdf(sigma_y | 0, 2.5);
}

mike-lawrence · April 28, 2023, 4:57pm

I think you’re overthinking it. Here’s what I’d do:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
trasformed data{
  vector[N] z = x - y ;
}
parameters {
  real mu_x ;
  real mu_y ;
  real mu_z ;
  real<lower=0> sigma_x ;
  real<lower=0> sigma_y ;
  real<lower=0> sigma_z ;
}
model {
  mu_x ~ normal( 0, 5) ;
  mu_y ~ normal( 0, 5) ;
  mu_z ~ normal( 0, 5) ;
  sigma_x ~ cauchy( 0, 2.5) ;
  sigma_y ~ cauchy( 0, 2.5) ;
  sigma_z ~ cauchy( 0, 2.5) ;
  x ~ normal( mu_x, sigma_x ) ;
  y ~ normal( mu_y, sigma_y ) ;
  z ~ normal( mu_z, sigma_z ) ;
}
generated quantities{
  real ratio_sigma_x_z = sigma_x / sigma_z ;
  real ratio_sigma_y_z = sigma_y / sigma_z ;
  real ratio_sigma_xy_z = (sigma_x+sigma_y)/2 / sigma_z ;
}

Another way to approach this is to reframe the parameterization as an intercept and an x-y effect, wherein you can talk about the variance of the intercept across replicates and variance of the x-y effect across replicates:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
trasformed data{
  vector[N] xy_diff = x - y ;
  vector[N] intercept = (x + y) /2;
}
parameters {
  real mu_xy_diff ;
  real mu_intercept ;
  real<lower=0> sigma_xy_diff ;
  real<lower=0> sigma_intercept ;
}
model {
  mu_xy_diff ~ normal( 0, 5) ;
  mu_intercept ~ normal( 0, 5) ;
  sigma_xy_diff ~ cauchy( 0, 2.5) ;
  sigma_intercept ~ cauchy( 0, 2.5) ;
  xy_diff ~ normal( mu_xy_diff, sigma_xy_diff ) ;
  intercept ~ normal( mu_intercept, sigma_intercept ) ;
}
generated quantities{
  real ratio_sigmas = sigma_xy_diff / sigma_intercept ;
}

mike-lawrence · April 28, 2023, 5:00pm

btw, this can also be expressed as a simple correlation estimation problem, as a correlation is the ratio between two variances, one that reflects the average variance of the two variates, and one that reflects their covariance.

LuckyPal · April 28, 2023, 10:36pm

Thanks a lot for your answer, Mike. That helped a lot in my thought process. You are absolutely right in saying I was probably overthinking this. Your models also represent what I roughly had in mind. I think the formulation that comes closest to my original idea is a simple hierarchical model:

data {
  int<lower=0> N;
  real x[N];
  real y[N];
}

parameters {
  real mu[N];
  real mu_0;
  real<lower=0> sigma_pop;
  real<lower=0> sigma_0;
}

transformed parameters{
  real<lower=0,upper=1> icc;
  icc = sigma_pop/(sigma_pop+sigma_0);
}

model {
  for(n in 1:N){
    target += normal_lpdf(x[n] | mu[n], sigma_pop);
    target += normal_lpdf(y[n] | mu[n], sigma_pop);
    // priors
    target += normal_lpdf(mu[n] | 0, 5);
  }
  target += normal_lpdf(mu | mu_0, sigma_0);
  // priors
  target += cauchy_lpdf(sigma_pop | 0, 2.5);
  target += normal_lpdf(mu_0 | 0, 5);
  target += cauchy_lpdf(sigma_0 | 0, 2.5);
}

mike-lawrence · April 29, 2023, 1:20pm

Hm. In your latest model, you have mu on the left of a | twice, once for each element in a loop and once as a vectorized expression, which isn’t sensical. And neither _0 parameter is connected to the data at all. (Oh, they are, but only bc you have two priors for mu, one of which involves them)

LuckyPal · May 4, 2023, 7:06am

Thanks again, Mike (and sorry for the late response). You are right, of course. Replacing both terms involving mu bei target += normal_lpdf(mu[n] | mu_0, sigma_0); (within the loop over 1:N), the model should make more sense, right? At least, it accurately estimates the posterior mu_0, sigma_pop and sigma_0 from the example data t(replicate(n=500, rnorm(n=2, mean=rnorm(n=1, mean=mu_0, sd=sigma_0), sd=sigma_pop))).

Topic		Replies	Views
Estimating parameters of two normal distribution, mixed at proportions Modeling	5	1425	May 10, 2019
Questions about Maximum Likelihood Estimation in Stan Modeling	1	1060	July 31, 2019
Mismatch when predicting inside and outside of STAN Modeling posterior-predictive	4	806	May 20, 2020
Coping with varying curvature Modeling fitting-issues	9	750	August 24, 2019
Getting different results when using target += Modeling rstan , fitting-issues	3	83	December 12, 2024

Estimating within and between variation

Related topics