Simple nested normal dist model help

I could use a little help. I’m trying to build a nested normal distribution model - no priors, just likelihood - as simple as possible for learning purposes.

I created some data below, and I first built a model without nesting and that works fine. For the second model, I’m just want to get a mean for each of 3 groups (so I should see 3 parameter values for grpMu around 10, 11 and 12). I think I don’t get the syntax / approach here. I’ve tried a number of approaches and keep getting various errors (mostly around indexing).

Could you please just show me the correct way to do that? Code below – thanks in advance. Code below:

library(tidyverse)
library(rstan)

mu <- 10
sd <- 1
n <- 100
distDF <- data.frame(x = 1, y = rnorm(100, mu, sd))
distDF <- rbind(distDF, data.frame(x = 2, y = rnorm(100, mu+1, sd)))
distDF <- rbind(distDF, data.frame(x = 3, y = rnorm(100, mu+2, sd)))
ggplot(distDF, aes(y, colour = as.factor(x))) + geom_density(bw = 1)

stanMod <- ’
data {
int N;
vector[N] y;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
y ~ normal(mu, sigma);
}

fit <- stan(model_code = stanMod, data = list(y= distDF$y, N = nrow(distDF)))
summary(fit)

stanMod2 <-

data {
int N; // number of observations
vector[N] y; // observations
vector[N] x; // group values
int J; // number of groups
int Grp[J]; // group indicators
}
parameters {
vector[J] grpMu;
}
model {
real mu;
real sigma;

mu ~ normal(11,1);
grpMu ~ normal(mu[Grp], sigma[Grp]);
y ~ normal(grpMu, sigma);
}

fit <- stan(model_code = stanMod2, data = list(N = nrow(distDF),
y = distDF$y,
J = length(unique(distDF$x)),
x = as.numeric(distDF$x),
Grp = as.integer(unique(distDF$x))

))

summary(fit)

Did you mean to have different sigma’s for each group?

I suspect you want something like this

mu ~ normal(11,1);
sigma_grp ~ normal(0, 2.5);
grpMu ~ normal(mu[Grp], sigma_grp);
y ~ normal(grpMu, sigma);

Thanks Stijn:

That’s the idea - I just want to see Stan produce 3 mu likelihood values so I can get a handle on the nested syntax. I tried your suggestion (and a number of variations) without success. With the model above, I get: “SYNTAX ERROR, MESSAGE(S) FROM PARSER: Indexed expression must have at least as many dimensions as number of indexes supplied: indexed expression dims=0; num indexes=1 …”. The complete code is below. I’d REALLY like to get a simple working model that I can use to build understanding: Thanks again.

library(tidyverse)
library(rstan)

mu <- 10
sd <- 1
n <- 100
distDF <- data.frame(x = 1, y = rnorm(100, mu, sd))
distDF <- rbind(distDF, data.frame(x = 2, y = rnorm(100, mu+1, sd)))
distDF <- rbind(distDF, data.frame(x = 3, y = rnorm(100, mu+2, sd)))
ggplot(distDF, aes(y, colour = as.factor(x))) + geom_density(bw = 1)

stanMod2 <-

data {
int N; // number of observations
vector[N] y; // observations
int J; // number of groups
int Grp[N]; // group indicators
}
parameters {
vector[J] grpMu;
}
model {
real mu;
real sigma_grp;
mu ~ normal(11,1);
sigma_grp ~ normal(0, 2.5);
grpMu ~ normal(mu[Grp], sigma_grp);
y ~ normal(grpMu, sigma);
}

fit <- stan(model_code = stanMod2, data = list(N = nrow(distDF),
y = distDF$y,
J = length(unique(distDF$x)),
Grp = as.integer(distDF$x)

))

print(fit)

This Stan program recovers your parameters.

data {
  int N; // number of observations
  vector[N] y; // observations
  int J; // number of groups
  int Grp[N]; // group indicators
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] grpMu;
  real<lower=0> sigma_grp;
}
model {
  mu ~ normal(11,1);
  sigma_grp ~ normal(0, 2.5);
  grpMu ~ normal(mu, sigma_grp);
  y ~ normal(grpMu[Grp], sigma);
}

Part of it was my mistake in the grpMu & y line. You also have to declare all parameters and set a lower bound of 0 on scales. The Grp data needs to have a length of N not J.

             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
mu          11.00    0.01 0.60    9.72   10.62   11.00   11.38   12.25  4000    1
sigma        1.01    0.00 0.04    0.94    0.99    1.01    1.04    1.10  4000    1
grpMu[1]    10.00    0.00 0.10    9.81    9.93   10.00   10.07   10.20  4000    1
grpMu[2]    11.08    0.00 0.10   10.88   11.01   11.08   11.15   11.28  4000    1
grpMu[3]    11.98    0.00 0.10   11.79   11.91   11.98   12.05   12.17  4000    1
sigma_grp    1.44    0.01 0.77    0.54    0.89    1.24    1.76    3.40  4000    1
lp__      -155.88    0.04 1.73 -160.06 -156.81 -155.57 -154.60 -153.46  1641    1

Thanks Stijn:

For people coming from a programming background, contained snippets like this are very valuable.

Stijn,

I thought I understood the hierarchical syntax, but I came across an error with a simple extension to our example. The error is: " normal_lpdf: Random variable has dimension = 3, expecting dimension = 600; Would you mind taking a look at this (maybe I did something stupid)? Thanks for your patience (learning).

The code extension (with the working model you suggested) is below:

library(tidyverse)
library(rstan)

mu <- 10
sd <- 1
N <- 100
J <- 2
K <- 3

mData <- matrix(nrow = 0, ncol = 3)
for (j in 1:J){
for (k in 1:K) {
tmp <- cbind(rep(j, N), rep(k, N), rnorm(N, mu, sd))
mData <- rbind(mData, tmp)
mu <- mu+1
}
mu <- mu-2
}

ggplot(data.frame(mData), aes(X3, colour = as.factor(X1))) + geom_density(bw = 1)

ggplot(data.frame(mData), aes(X3, colour = as.factor(X2))) +
geom_density(bw = 1) +
facet_grid(X1~.)

this is the model you gave me which works fine

stanMod1 <- ’
data {
int N; // number of observations
vector[N] y; // observations
int J; // number of groups
int L1[N]; // group indicators
}
parameters {
real mu;
real<lower=0> sigma;
vector[J] L1Mu;
real<lower=0> L1Sigma;
}
model {
mu ~ normal(11,1);
L1Sigma ~ normal(0, 2.5);
L1Mu ~ normal(mu, L1Sigma);
y ~ normal(L1Mu[L1], sigma);
}

fit <- stan(model_code = stanMod1, data = list(
N = nrow(mData),
y = mData[,3],
J = length(unique(mData[,1])),
L1 = as.integer(mData[,1])
))

print(fit)

this is the extension adding a level which throws the error

stanMod2 <- ’
data {

int N; // number of observations
vector[N] y; // observations

int J; // number of level1 grps
int L1[N]; // level1 indicators

int K; // number of level2 grps
int L2[N]; // level2 indicators

}
parameters {
real mu;
real<lower=0> sigma;

vector[J] L1Mu;
real<lower=0> L1Sigma;

vector[K] L2Mu;
real<lower=0> L2Sigma;

}
model {

mu ~ normal(11,1);

L1Sigma ~ normal(0, 2.5);
L2Sigma ~ normal(0, 2.5);

L1Mu ~ normal(mu, L1Sigma);
L2Mu ~ normal(L1Mu[L1], L2Sigma);
y ~ normal(L2Mu[L2], sigma);

}

fit <- stan(model_code = stanMod2, data = list(
N = nrow(mData),
y = mData[,3],
J = length(unique(mData[,1])),
L1 = as.integer(mData[,1]),
K = length(unique(mData[,2])),
L2 = as.integer(mData[,2])
))

print(fit)

I think there are some misunderstandings about multilevel models. I am not sure what you are trying to do but there is a mismatch between your data generating process in r and your stan program.

In your data generating process, you don’t have an extra level, more like an extra factor at the same level as L1. To make it more concrete, you can think of y as students’ test scores, L1 is the level of the class and L1mu is the average test score in each class. L2 is another factor, for instance gender where L2mu is the average test score per gender.

The last three lines in your stan program than should look something like this.

L1Mu ~ normal(mu1, L1Sigma);
L2Mu ~ normal(mu2, L2Sigma);
y ~ normal(L1Mu + L2Mu, sigma);

The way your Stan program is set-up looks more like an extra level. That is where L1 is the class of the student, and L2 is the school where the class is in.

I think that might explain why I had some difficulty understanding your data generating project (not because it’s wrong, just not how I think). I think if you want to understand stan better it might be worthwhile to spend some time to learn to generate data in a more “statistical notation/code” because that is also how stan code is written. I might also be helpful if you have a specific application in mind. That is why I used the student test score example.

PS: The errors you are getting are problems with indexing because the structure of your simulated data does not match the structure in your stan program.

I’m sorry Stijn, but I think I accidentally pulled a “switcheroo” on you. I modified the data in the first post (see the generating process in my last post above). The last one generates a hierarchical structure “where L1 is the class of the student, and L2 is the school where the class is in”. Just run it and see. There are 2 level 1 groups. And 3 level 2 groups that are dependent on level 1. If you run the code, you’ll get the distributions below.

So, I think the model and data generating process are aligned - it’s just the indexing syntax that doesn’t work for me. Could you please take another look at the model from this perspective?

dist

I used the data generating process in your new post. I still think I am right.

Your data generating process has the following combinations of j and k:

factor
j 1 1 1 2 2 2
k 1 2 3 1 2 3

In a two level model, the combinations of j and k would be something like

level
j 1 1 1 2 2 2
k 1 2 3 4 5 6

Stijn,

Let’s assume that we have hierarchical data. with 2 levels. What would the model look like in that case?

This is an R script for a dataset with 1000 observations, with 5 top level groups, and 20 middle level groups.

mu = 10; sigma = 1; N = 1000
sigma2 = 4; J = 5
sigma1 = 2; K = 20; index1 = sample(1:J, size = K, replace = TRUE)
index = sample(1:K, size = N, replace = TRUE)
 
mu2 = rnorm(J, mu, sigma2)
mu1 = rnorm(K, mu2[index1], sigma1) 
y = rnorm(N, mu1[index], sigma)