Scale mixture of normals vs. Student-t

Hi all,

I’m attempting to fit a Student-t_4 model in Stan as a scale mixture of normals. The motivation is that I will also do this in another program, but I created a simpler example/simulation to test in Stan first, since Stan can fit a t likelihood directly. However, I am finding that Stan does not agree with itself when the two ways of fitting the model are compared.

Model 1:

y_{si} \vert \mu, V_s, \kappa_s \sim N(\mu, V_s \times \kappa_s^{-1})
\kappa_s \sim \Gamma(\nu/2, \nu/2)
\mu \sim N(0, 10^5)

where i indexes observations within group s, V_s is a known, fixed scaling parameter, \kappa_s is a group-specific precision parameter to be estimated, \mu is a mean common to all groups, and \nu is the degrees of freedom of the induced t distribution (in this case \nu = 4).

Model 2:

y_{si} \vert \mu, V_s, \nu \sim t_{\nu}(location = \mu, scale = \sqrt{V_s})
\mu \sim N(0, 10^5)

Fitting these two models in Stan gives different estimates for \mu. See the attached .png file.

Does anyone know why this might be? I’m attaching a data file and Stan code for the two models.

For what it’s worth, I know the marginal variances in these cases differ (by a factor of \nu/(\nu - 2)), but the problem persists after correcting for that.
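To confirm that the Model 1 mixture really does marginalize to a t_4, here is a quick Monte Carlo sanity check (a sketch with illustrative values for \mu, V, and \nu = 4; nothing here uses the attached data):

```python
# Monte Carlo check that the gamma scale mixture marginalizes to a Student-t:
# draw kappa ~ Gamma(nu/2, rate = nu/2), then y | kappa ~ N(mu, sqrt(V / kappa)).
# The standardized draws (y - mu) / sqrt(V) should look like a standard t_nu sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
nu, V, mu, n = 4.0, 0.2, 1.5, 200_000   # illustrative values, not from the data file

kappa = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)  # NumPy uses scale = 1/rate
y = rng.normal(loc=mu, scale=np.sqrt(V / kappa))         # scale mixture of normals

z = (y - mu) / np.sqrt(V)
print(np.quantile(z, [0.25, 0.5, 0.75]))
print(stats.t(df=nu).ppf([0.25, 0.5, 0.75]))             # should match closely
```

The quartiles of the standardized mixture draws line up with the t_4 quantiles, so the two marginal likelihoods agree distribution-by-distribution; the disagreement must come from somewhere else.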

Thank you!

stanModel_10_student.stan (1.4 KB)
stanModel_10.stan (1.3 KB)

dat.csv (95.7 KB)


I tried a narrower prior sd for mu (yours seems to be quite wide) and different numbers of iterations (thinking about precision in the posterior estimates), but it did not change anything…

Then, I will let wiser statisticians answer :)


Drawing on Michael Betancourt’s Fitting The Cauchy Distribution, I believe you want

y \sim N(\mu, \kappa^{-1/2})
\kappa \sim \Gamma(\nu/2, V/2)

instead of

y \sim N(\mu, \sqrt{V} \times \kappa^{-1/2})
\kappa \sim \Gamma(\nu/2, \nu/2)

I edited your stanModel_10.stan to stanModel_10_mix.stan (1.4 KB), from which I got very similar results for the parameter \mu from the student_t version and this mixture version, but with increased effective sample size as hoped for.


Thanks for trying! :)


I think it depends on whether one wants to fit a Cauchy distribution or a t-distribution with defined moments. It’s not that I simply want a heavy-tailed distribution. The V’s are important in our application, and the t-mixture gives us a practical interpretation of the \kappa's. The normal mixture I had written out is the proper way to induce a t_4 via normal mixtures, so I still don’t understand why Stan’s t_4 likelihood would give differing answers.

See the plot below: I still get differing answers with the mixtures vs. the Student-t distribution. A Cauchy is a t_1, so I have compared it with that.

You’re right, there seems to be something strange going on here. Mathematically, both your and my mixtures (accounting for my typo above) give Student t distributions. I should have written

\kappa \sim \Gamma(\nu/2, \nu V / 2)

Unfortunately, my typo made me think things were working, but I have now reliably reproduced your results and don’t have any reasonable explanation.
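One way to see that the two mixtures coincide: substituting \lambda_s = \kappa_s / V_s in Model 1, the rate of the gamma prior picks up a factor of V_s, so

y_{si} \vert \mu, \lambda_s \sim N(\mu, \lambda_s^{-1/2})
\lambda_s \sim \Gamma(\nu/2, \nu V_s / 2)

since \sqrt{V_s} \kappa_s^{-1/2} = \lambda_s^{-1/2}. This is exactly Model 1 reparameterized, so the \Gamma(\nu/2, \nu V_s / 2) version is not a different model, just a rescaling of \kappa_s.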


It’s very strange! Thanks for giving it a try!

@betanalpha, this seems up your alley. Any insight?

Models agree if you put a hierarchical prior structure on the mean:

which is to say you do something like this:

parameters {
  real mu[n];
  real mu_mean;
  real<lower=0> mu_sd;
  real<lower=0> kappa[n];      // precision for each data source
}
model {
  kappa[1] ~ gamma(2, 2 * V[1]);
  kappa[2] ~ gamma(2, 2 * V[2]);
  kappa[3] ~ gamma(2, 2 * V[3]);
  kappa[4] ~ gamma(2, 2 * V[4]);
  kappa[5] ~ gamma(2, 2 * V[5]);
  kappa[6] ~ gamma(2, 2 * V[6]);
  kappa[7] ~ gamma(2, 2 * V[7]);
  kappa[8] ~ gamma(2, 2 * V[8]);
  kappa[9] ~ gamma(2, 2 * V[9]);
  kappa[10] ~ gamma(2, 2 * V[10]);
  // mu ~ normal(0, 10000);
  y1 ~ normal(mu[1], sqrt(V[1] / kappa[1]));
  y2 ~ normal(mu[2], sqrt(V[2] / kappa[2]));
  y3 ~ normal(mu[3], sqrt(V[3] / kappa[3]));
  y4 ~ normal(mu[4], sqrt(V[4] / kappa[4]));
  y5 ~ normal(mu[5], sqrt(V[5] / kappa[5]));
  y6 ~ normal(mu[6], sqrt(V[6] / kappa[6]));
  y7 ~ normal(mu[7], sqrt(V[7] / kappa[7]));
  y8 ~ normal(mu[8], sqrt(V[8] / kappa[8]));
  y9 ~ normal(mu[9], sqrt(V[9] / kappa[9]));
  y10 ~ normal(mu[10], sqrt(V[10] / kappa[10]));

  mu ~ normal(mu_mean, mu_sd);
  mu_sd ~ normal(0, 1);
}
and looking at the boxplot of mu_mean. (An aside: use matrices! The code repetition is not ideal.)

So there is definitely something strange going on, due to some parameters being constant and some varying from data set to data set. I would suggest doing the maths to see why you cannot manipulate your current joint normal/gamma formulation into the desired Student-t formulation (like the manipulations here, but being careful with the indexing, because you have known V[n]).
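For a single observation, the marginalization is the standard normal–gamma integral (a sketch, writing the normal in terms of its variance V_s / \kappa_s):

p(y_{si} \mid \mu) = \int_0^\infty N(y_{si} \mid \mu, V_s / \kappa_s) \, \Gamma(\kappa_s \mid \nu/2, \nu/2) \, d\kappa_s = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu \pi V_s}} \left(1 + \frac{(y_{si} - \mu)^2}{\nu V_s}\right)^{-(\nu+1)/2}

which is the t_\nu(\mu, \sqrt{V_s}) density. The indexing is where care is needed: a single \kappa_s is shared by every observation i in group s, so the joint density of a whole group involves one integral over \kappa_s and is not the product of independent t densities that Model 2 assumes.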


Thank you! This is great!

I had the non-hierarchical prior on the mean because I was trying to replicate how I would implement this in INLA (which I am using because my more complicated model uses the sparse matrix representation in smoothing effects like ICARs and RWs). But I am glad to have this special case figured out.

Yes, thank you for the recommendations in my Stan code. This is my first time using Stan so I was not sure about loops, etc. :)

Something like this is easier to read:

data {
  int<lower=0> N;
  int<lower=0> n;
  matrix[n, N] y;
  real<lower=0> V[n];
}
parameters {
  real mu[n];
  real mu_mean;
  real<lower=0> mu_sd;
  real<lower=0> kappa[n];
}
model {
  for (ii in 1:n) {
    y[ii] ~ normal(mu[ii], sqrt(V[ii] / kappa[ii]));
    kappa[ii] ~ gamma(2, 2 * V[ii]);
  }
  mu ~ normal(mu_mean, mu_sd);
  mu_sd ~ normal(0, 1);
  mu_mean ~ normal(0, 5);
}

and you can reshape the data in the csv you have into the appropriate form pretty quickly:

data_in <- read.csv("dat.csv")
stan_dat <- as.list(data_in)
stan_dat$N <- 500
stan_dat$n <- 10
stan_dat$V <- rep(0.2, 10)
stan_dat[["y"]] <- matrix(
  unlist(stan_dat[5 : length(stan_dat)]),
  nrow = stan_dat$n,
  byrow = TRUE
)