Estimating within and between variation

My question is, I guess, a rather general one about the stan language.


My goal is to evaluate the within-variance and the between-variance of some data. I have some vectors x and y, both of length N. My suspicion is that for each pair c(x[n], y[n]) for n in 1:N, the variance is smaller than the variance of all measurements within x and y, respectively. So, an example data could be generated like this in R: t(replicate(n=N, rnorm(n=2, mean=rnorm(n=1), sd=0.1))).

I have tried the following model but it appears to be very problematic with many divergent transitions. I guess, I should not assign two different sampling statements to each x[n]? But how else could one estimate the within- and between-variation?

data {
  int<lower=0> N;
  real x[N];
  real y[N];
}
parameters {
  real mu[N];
  real mu_x;
  real mu_y;
  real<lower=0> sigma[N];
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
}
model {
  for(n in 1:N){
    target += normal_lpdf(x[n] | mu[n], sigma[n]);
    target += normal_lpdf(y[n] | mu[n], sigma[n]);
    // priors
    target += normal_lpdf(mu[n] | 0, 5);
    target += cauchy_lpdf(sigma[n] | 0, 2.5);
  }
  target += normal_lpdf(x | mu_x, sigma_x);
  target += normal_lpdf(y | mu_y, sigma_y);
  // priors
  target += normal_lpdf(mu_x | 0, 5);
  target += normal_lpdf(mu_y | 0, 5);
  target += cauchy_lpdf(sigma_x | 0, 2.5);
  target += cauchy_lpdf(sigma_y | 0, 2.5);
}

I think you’re overthinking it. Here’s what I’d do:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
trasformed data{
  vector[N] z = x - y ;
}
parameters {
  real mu_x ;
  real mu_y ;
  real mu_z ;
  real<lower=0> sigma_x ;
  real<lower=0> sigma_y ;
  real<lower=0> sigma_z ;
}
model {
  mu_x ~ normal( 0, 5) ;
  mu_y ~ normal( 0, 5) ;
  mu_z ~ normal( 0, 5) ;
  sigma_x ~ cauchy( 0, 2.5) ;
  sigma_y ~ cauchy( 0, 2.5) ;
  sigma_z ~ cauchy( 0, 2.5) ;
  x ~ normal( mu_x, sigma_x ) ;
  y ~ normal( mu_y, sigma_y ) ;
  z ~ normal( mu_z, sigma_z ) ;
}
generated quantities{
  real ratio_sigma_x_z = sigma_x / sigma_z ;
  real ratio_sigma_y_z = sigma_y / sigma_z ;
  real ratio_sigma_xy_z = (sigma_x+sigma_y)/2 / sigma_z ;
}

Another way to approach this is to reframe the parameterization as an intercept and an x-y effect, wherein you can talk about the variance of the intercept across replicates and variance of the x-y effect across replicates:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
trasformed data{
  vector[N] xy_diff = x - y ;
  vector[N] intercept = (x + y) /2;
}
parameters {
  real mu_xy_diff ;
  real mu_intercept ;
  real<lower=0> sigma_xy_diff ;
  real<lower=0> sigma_intercept ;
}
model {
  mu_xy_diff ~ normal( 0, 5) ;
  mu_intercept ~ normal( 0, 5) ;
  sigma_xy_diff ~ cauchy( 0, 2.5) ;
  sigma_intercept ~ cauchy( 0, 2.5) ;
  xy_diff ~ normal( mu_xy_diff, sigma_xy_diff ) ;
  intercept ~ normal( mu_intercept, sigma_intercept ) ;
}
generated quantities{
  real ratio_sigmas = sigma_xy_diff / sigma_intercept ;
}

btw, this can also be expressed as a simple correlation estimation problem, as a correlation is the ratio between two variances, one that reflects the average variance of the two variates, and one that reflects their covariance.

Thanks a lot for your answer, Mike. That helped a lot in my thought process. You are absolutely right in saying I was probably overthinking this. Your models also represent what I roughly had in mind. I think the formulation that comes closest to my original idea is a simple hierarchical model:

data {
  int<lower=0> N;
  real x[N];
  real y[N];
}

parameters {
  real mu[N];
  real mu_0;
  real<lower=0> sigma_pop;
  real<lower=0> sigma_0;
}

transformed parameters{
  real<lower=0,upper=1> icc;
  icc = sigma_pop/(sigma_pop+sigma_0);
}

model {
  for(n in 1:N){
    target += normal_lpdf(x[n] | mu[n], sigma_pop);
    target += normal_lpdf(y[n] | mu[n], sigma_pop);
    // priors
    target += normal_lpdf(mu[n] | 0, 5);
  }
  target += normal_lpdf(mu | mu_0, sigma_0);
  // priors
  target += cauchy_lpdf(sigma_pop | 0, 2.5);
  target += normal_lpdf(mu_0 | 0, 5);
  target += cauchy_lpdf(sigma_0 | 0, 2.5);
}


Hm. In your latest model, you have mu on the left of a | twice, once for each element in a loop and once as a vectorized expression, which isn’t sensical. And neither _0 parameter is connected to the data at all. (Oh, they are, but only bc you have two priors for mu, one of which involves them)

Thanks again, Mike (and sorry for the late response). You are right, of course. Replacing both terms involving mu bei target += normal_lpdf(mu[n] | mu_0, sigma_0); (within the loop over 1:N), the model should make more sense, right? At least, it accurately estimates the posterior mu_0, sigma_pop and sigma_0 from the example data t(replicate(n=500, rnorm(n=2, mean=rnorm(n=1, mean=mu_0, sd=sigma_0), sd=sigma_pop))).