Help learning how to write a simple generative hierarchical model for test scores

I’m trying to write a generative model for predicting how a student will perform on a standardized test. As a learning exercise, I want to start simple and build complexity gradually. My first model is the following:

  • Let y_{i,s,t} be the test score for student i in subject s and period t
  • y_{i,s,t} \in [650,850]
  • s \in \{ELA, Math\}
\begin{aligned} y_{i,s,t} \sim N(a[i] + \gamma y_{i,s,t-1} , \sigma) \\ a[i] \sim N(\mu, \sigma_a) \\ \mu \sim N_{[650, 850]}(700, 75) \\ \sigma_a \sim N_+(0, 100) \\ \gamma \sim N_+(0.5,1) \\ \sigma \sim \Gamma(2,8) \end{aligned}

My idea is that in this model:

  • \mu tells me the average performance on the test (regardless of the subject)
  • \gamma tells me how persistent performance is (regardless of the subject)

This is my Stan code [comments on the code are always welcome, but this is not my question yet]:


data {
  int<lower=1> I ;                                                  // number of students
  int<lower=1> N ;                                                  // number of observations
  int<lower=1,upper=I> ii[N];                                       // student id index
  vector<lower=650,upper=850>[N] pre_test ;                         // pre-test
  vector<lower=650,upper=850>[N] raw_score ;                        // raw_score
}


parameters {
  real<lower=650, upper=850> mu;
  vector[I] a_std;              
  real<lower=0> sigma_a;
  real<lower=0> gamma;
  real<lower=0> sigma;
}

transformed parameters {
  vector[I] a =  mu + sigma_a * a_std;            // Matt trick (non-centered parameterization)
}

model {
  mu ~ normal(700,75);
  a_std ~ normal(0,1);
  sigma_a ~ normal(0,100);
  gamma ~ normal(0.5,1);
  sigma ~ gamma(2,8);
  raw_score ~ normal(a[ii] + gamma*pre_test, sigma);
}

generated quantities {
  vector[N] score_sim;
  vector[N] log_lik;
  vector[N] theta;
  for (n in 1:N)  {
    theta[n] = a[ii[n]] + gamma*pre_test[n] ;
    score_sim[n] = normal_rng(theta[n], sigma);
    log_lik[n] = normal_lpdf(raw_score[n] | theta[n], sigma);
  }
}

The first thing that I want to do is to allow a and \gamma to be functions of the subject. My idea is that some students are going to be better at math and some others better at reading. Similarly, persistence in grades will be different for math and ELA. This is my attempt to write this slightly more complex model:

\begin{aligned} y_{i,s,t} \sim N(a[i,s] + \gamma y_{i,s,t-1} , \sigma) \\ a[i,s] = \mu + b[i,s] \\ \mu \sim N_{[650, 850]}(700, 75) \\ (b[i,math], b[i,ELA]) \sim MVN(0, \Sigma) \\ \Sigma[1,1] = \sigma_{math}^2 \\ \Sigma[2,2] = \sigma_{ELA}^2 \\ \Sigma[1,2] = \Sigma[2,1] = \rho \sigma_{math} \sigma_{ELA} \\ \gamma = \gamma_0 + \gamma_{math} I[s==math] \\ \gamma_0 \sim N_+(0.5,1) \\ \gamma_{math} \sim N(0,1) \\ \sigma \sim \Gamma(2,8) \end{aligned}
  • My first question is whether there is something wrong with the model as I wrote it (I’m still very new at writing this type of model…)

  • My second question is about implementing this in Stan, but I will write a follow-up comment once someone gives me some feedback on my first question.

Thanks a lot for all the help! This community is amazing!

Ignacio

PS: The more I watch @bgoodri’s YouTube channel, the more I like the idea of thinking in terms of generative models. Thanks @bgoodri !

The question’s clear, but it’s hard to answer because it asks someone to look at your model and debug it. It’s hard to verify something’s right without a lot of work.

Before trying to model covariance, I’d try to fit b with independent priors.

Did you model \gamma as a baseline plus a math difference for convenience of priors? If not, you might try modeling them independently, depending on how \gamma_0 vs. \gamma_{\mathrm{math}} looks in the posterior.

You might get better computational behavior by pulling the scale down, so that \mu is modeled on a standardized scale and then multiplied back out to the 650-850 testing range.
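
Here’s a minimal sketch of the idea with the hierarchy stripped out (the names here are hypothetical, not from your model):

data {
  int<lower=1> N;
  vector<lower=650, upper=850>[N] raw_score;
}
transformed data {
  real score_mean = mean(raw_score);
  real score_sd = sd(raw_score);
  vector[N] score_std = (raw_score - score_mean) / score_sd;   // unit scale
}
parameters {
  real mu_std;                  // population mean on the standardized scale
  real<lower=0> sigma_std;
}
model {
  mu_std ~ normal(0, 1);        // unit-scale priors are easy to reason about
  sigma_std ~ normal(0, 1);
  score_std ~ normal(mu_std, sigma_std);
}
generated quantities {
  real mu_score = score_mean + score_sd * mu_std;   // back on the 650-850 scale
}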

It’ll probably be OK modeling a persistence difference for math, but why not model the two persistences independently? Sometimes the offset will fit better, but it’s also a matter of being able to formulate reasonable priors.
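
For example, a sketch of indexing the persistence by subject (not your full model; jj here is a hypothetical subject index you’d add to the data):

data {
  int<lower=1> N;
  int<lower=1> I;
  int<lower=1,upper=I> ii[N];       // student index
  int<lower=1,upper=2> jj[N];       // hypothetical subject index: 1 = math, 2 = ELA
  vector[N] pre_test;
  vector[N] raw_score;
}
parameters {
  vector[I] a;
  vector<lower=0>[2] gamma;         // one persistence per subject
  real<lower=0> sigma;
}
model {
  a ~ normal(700, 75);              // simplified, non-hierarchical prior for the sketch
  gamma ~ normal(0.5, 1);           // same weakly informative prior for both
  sigma ~ gamma(2, 8);
  for (n in 1:N)
    raw_score[n] ~ normal(a[ii[n]] + gamma[jj[n]] * pre_test[n], sigma);
}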

There’s a lot of discussion of these kinds of models in Gelman and Hill’s regression book. That’s being rewritten in Stan, but isn’t quite ready yet.

First, thanks a lot for all the help you are giving me in my “Bayesian learning journey.” I understand that what I’m asking is hard, so I appreciate any amount of help that this amazing community might be willing to provide :-)

This is my best attempt at doing that so far:

\begin{aligned} y_{i,s,t} \sim N(a[s=math,i] I(s=math) + a[s=ELA,i] I(s=ELA) + \gamma y_{i,s,t-1} , \sigma) \\ a[s=math,i] \sim N(\mu_{math}, \sigma_{aMath}) \\ a[s=ELA,i] \sim N(\mu_{ELA}, \sigma_{aELA}) \\ \mu_{math} \sim N(650, 75) \\ \mu_{ELA} \sim N(700, 75) \\ \sigma_{aMath} \sim N_+(0, 100) \\ \sigma_{aELA} \sim N_+(0, 75) \\ \gamma \sim N_+(0.5,1) \\ \sigma \sim \Gamma(2,8) \end{aligned}

What I tried to do is to say that each student i has a random effect for math and one for ELA, and those effects are independent (which is the assumption that I want to change in my next model).

This is how I implemented the model in Stan:

data {
  int<lower=1> I ;                                                  // number of students
  int<lower=1> N ;                                                  // number of observations
  int<lower=1,upper=I> ii[N];                                       // student id index
  vector<lower=0,upper=1>[N] math;                                  // math dummies
  vector<lower=650,upper=850>[N] pre_test ;                         // pre-test
  vector<lower=650,upper=850>[N] raw_score ;                        // raw_score
}

transformed data {
  vector<lower=0,upper=1>[N] ELA = 1 - math ;                       // ELA dummies
}


parameters {
  vector[2] mu;                                                     // mu[1] is math and mu[2] is ELA
  vector[I] a_math_std;   
  vector<lower=0>[2] sigma_a;
  vector[I] a_ELA_std;  
  real<lower=0> gamma;
  real<lower=0> sigma;
}

transformed parameters {
  vector[I] a_math =  mu[1] + sigma_a[1] * a_math_std;            // Matt trick
  vector[I] a_ELA =  mu[2] + sigma_a[2] * a_ELA_std;              // Matt trick
}

model {
  mu[1] ~ normal(650,75);
  mu[2] ~ normal(700,75);        // was missing; matches mu_ELA ~ N(700, 75) above
  a_math_std ~ normal(0,1);
  sigma_a[1] ~ normal(0,100);
  a_ELA_std ~ normal(0,1);
  sigma_a[2] ~ normal(0,75);     // matches sigma_aELA ~ N+(0, 75) above
  gamma ~ normal(0.5,1);
  sigma ~ gamma(2,8);
  raw_score ~ normal(a_math[ii] .* math + a_ELA[ii] .* ELA  + gamma*pre_test, sigma);
}

generated quantities {
  vector[N] score_sim;
  vector[N] log_lik;
  vector[N] theta;
  for (n in 1:N)  {
    theta[n] = a_math[ii[n]] * math[n]  + a_ELA[ii[n]] * ELA[n]  + gamma*pre_test[n] ;
    score_sim[n] = normal_rng(theta[n], sigma);
    log_lik[n] = normal_lpdf(raw_score[n] | theta[n], sigma);
  }
}

One thing that I’ve been trying to do is to declare a single parameter vector[2] a_std[I]; instead of one vector for math and one for ELA. Alas, when I do that I get the following error:

parameters {
  vector[2] mu;                                                     // mu[1] is math and mu[2] is ELA
  vector[2] a_std[I];                                               // this does not work!
  vector<lower=0>[2] sigma_a;
  real<lower=0> gamma;
  real<lower=0> sigma;
}

transformed parameters {
  vector[I] a_math =  mu[1] + sigma_a[1] * a_std[1];             // Matt trick
  vector[I] a_ELA =  mu[2] + sigma_a[2] * a_std[2];              // Matt trick
} 

[1] "Error in sampler$call_sampler(args_list[[i]]) : "                                                                                                      
[2] "  Exception: assign: Rows of left-hand-side (191) and rows of right-hand-side (2) must match in size  (in 'modelae820f4f6_stan_ae6476be8d' at line 24)"
error occurred during calling the sampler; sampling not done

My guess is that my misunderstanding of how to use arrays in Stan is the problem :-( Alas, I think I need to be able to define vector[2] a[I] ; in order to use multi_normal to set the prior for a.
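
What I’m hoping to eventually be able to write is something like this (untested sketch; priors for mu and Sigma omitted):

data {
  int<lower=1> I;
}
parameters {
  vector[2] mu;                  // mu[1] is math and mu[2] is ELA
  vector[2] a[I];                // a[i] = (a_math, a_ELA) for student i
  cov_matrix[2] Sigma;
}
model {
  a ~ multi_normal(mu, Sigma);   // vectorized over the array of 2-vectors
}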

Yes, I thought defining it in this way would be more convenient for defining the priors. Every model I’ve written until now has used weakly informative priors. One of the reasons I’m doing this as a learning exercise is to try using more informative priors for the first time. That said, I will try your suggestion and compare it with what I originally had in mind.

I want to try this too and compare the two models as a learning exercise.

I look forward to buying it!

What’s on line 24 of your program? You have a size mismatch in an assignment.

This is line 24:

vector[I] a_math =  mu[1] + sigma_a[1] * a_std[1];             // Matt trick

and this is how all those parameters are defined:

parameters {
  vector[2] mu;                                                     // mu[1] is math and mu[2] is ELA
  vector[2] a_std[I];                                               // this does not work!
  vector<lower=0>[2] sigma_a;
  real<lower=0> gamma;
  real<lower=0> sigma;
}

My guess is that I’m misunderstanding how to access the array a_std.

Thanks for the help!

When you have

vector[K] foo[N];

you are declaring an N-dimensional array of K-vectors.

In your case, a_std[1] is a vector of size 2, whereas a_math is a vector of size I. What are you trying to do? Your a_std is an I x 2 structure. Maybe you wanted

vector[I] a_std[2];
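
To spell out the indexing:

vector[K] foo[N];      // N-dimensional array of K-vectors
// foo[n]     is a vector[K]
// foo[n][k]  is a real
// So with vector[2] a_std[I], a_std[1] is a 2-vector and can't be assigned
// to vector[I] a_math; with vector[I] a_std[2], both a_std[1] and a_std[2]
// are I-vectors, which is what the Matt trick lines need.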

Thanks a lot, Bob, this actually works. In case this is helpful for someone in the future, this is my code right now:

data {
  int<lower=1> I ;                                                  // number of students
  int<lower=1> N ;                                                  // number of observations
  int<lower=1,upper=I> ii[N];                                       // student id index
  vector<lower=0,upper=1>[N] math;                                  // math dummies
  vector<lower=650,upper=850>[N] pre_test ;                         // pre-test
  vector<lower=650,upper=850>[N] raw_score ;                        // raw_score
}

transformed data {
  vector<lower=0,upper=1>[N] ELA = 1 - math ;                       // ELA dummies
}


parameters {
  vector[2] mu;                                                     // mu[1] is math and mu[2] is ELA
  vector[I] a_std[2];                                              
  vector<lower=0>[2] sigma_a;
  real<lower=0> gamma;
  real<lower=0> sigma;
}

transformed parameters {
  vector[I] a_math =  mu[1] + sigma_a[1] * a_std[1];            // Matt trick
  vector[I] a_ELA =  mu[2] + sigma_a[2] * a_std[2];              // Matt trick
}

model {
  mu[1] ~ normal(650,75);
  mu[2] ~ normal(700,75);        // was missing; matches mu_ELA ~ N(700, 75) above
  a_std[1] ~ normal(0,1);
  a_std[2] ~ normal(0,1);
  sigma_a[1] ~ normal(0,100);
  sigma_a[2] ~ normal(0,75);     // matches sigma_aELA ~ N+(0, 75) above
  gamma ~ normal(0.5,1);
  sigma ~ gamma(2,8);
  raw_score ~ normal(a_math[ii] .* math + a_ELA[ii] .* ELA  + gamma*pre_test, sigma);
}

generated quantities {
  vector[N] score_sim;
  vector[N] log_lik;
  vector[N] theta;
  for (n in 1:N)  {
    theta[n] = a_math[ii[n]] * math[n]  + a_ELA[ii[n]] * ELA[n]  + gamma*pre_test[n] ;
    score_sim[n] = normal_rng(theta[n], sigma);
    log_lik[n] = normal_lpdf(raw_score[n] | theta[n], sigma);
  }
}

My next goal is to write a_math and a_ELA as coming from multi_normal instead of two independent normals. My idea is that the random effects for math and English should be correlated. To do this I’m trying to follow the manual, which brings me to my next question. The example in the manual has a “group predictors” variable row_vector[L] u[J]; (page 149). I’m not sure what this should be in my example. This is how I tried to translate my code following that example:

data {
  int<lower=1> I ;                                                  // number of students
  int<lower=1> N ;                                                  // number of observations
  int<lower=1,upper=2> jj[N];                                       // subject index for each observation
  vector<lower=650,upper=850>[N] pre_test ;                         // pre-test
  vector<lower=650,upper=850>[N] raw_score ;                        // raw_score
}


parameters {
  corr_matrix[2] Omega ;                                            // prior correlation
  vector<lower=0>[2] tau ;                                          // prior scale
  matrix[I,2] eta ;                                                 // subject coeff
  vector[2] a[I] ;                                                  // individual coeff by subject
  real<lower=0> gamma;
  real<lower=0> sigma;
}


model {
  tau ~ cauchy(0,2.5) ;
  Omega ~ lkj_corr(2) ;
  to_vector(eta) ~ normal(0,5) ;
  {
  row_vector[2] u_eta[I];
  for (i in 1:I)
    u_eta[i] = u[i] * eta;                                          // WHAT should u[i] be here??? On page 149 it's the "group predictors" variable from the data block
  a ~ multi_normal(u_eta, quad_form_diag(Omega, tau)) ;
  }

  gamma ~ normal(0.5,1);
  sigma ~ gamma(2,8);
  for(n in 1:N)
    raw_score[n] ~ normal(a[jj[n]]  + gamma*pre_test[n], sigma);
}

generated quantities {
  vector[N] score_sim;
  vector[N] log_lik;
  vector[N] theta;
  for (n in 1:N)  {
    theta[n] = a[jj[n]] + gamma*pre_test[n] ;
    score_sim[n] = normal_rng(theta[n], sigma);
    log_lik[n] = normal_lpdf(raw_score[n] | theta[n], sigma);
  }
}

Thanks!

The Cholesky factor parameterization of a correlation matrix along with multi_normal_cholesky is a lot more stable and efficient.

You might also want the non-centered version, which is also explained in the manual.
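
The prior part of that pattern looks roughly like this (a sketch, not your full model):

data {
  int<lower=1> I;
}
parameters {
  cholesky_factor_corr[2] L_Omega;   // Cholesky factor of the correlation matrix
  vector<lower=0>[2] tau;            // scales
  vector[2] b_raw[I];                // standard normal primitives
}
transformed parameters {
  vector[2] b[I];                    // implies b[i] ~ multi_normal(0, Sigma)
  {
    matrix[2, 2] L_Sigma = diag_pre_multiply(tau, L_Omega);
    for (i in 1:I)
      b[i] = L_Sigma * b_raw[i];
  }
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ normal(0, 1);
  for (i in 1:I)
    b_raw[i] ~ normal(0, 1);
}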

Thanks Bob. My idea was to try the Cholesky factor parameterization after figuring out the code from page 149. The Cholesky code from page 151 still uses u, which is the part that I cannot figure out in my code above.

I finally found some time to get back to this, and I think I figured out how to do the Cholesky factor parameterization. This is the model I will try to fit with Stan:

\begin{aligned} y_{i,s,t} \sim N(a[s=math,i] I(s=math) + a[s=ELA,i] I(s=ELA) + \gamma y_{i,s,t-1} , \sigma) \\ a \sim MVN(\mu,\Sigma) \\ \Sigma = \begin{bmatrix} \sigma_{aMath} & 0 \\ 0 & \sigma_{aELA} \end{bmatrix} \Omega \begin{bmatrix} \sigma_{aMath} & 0 \\ 0 & \sigma_{aELA} \end{bmatrix} \\ \mu_{math} \sim N(0, 1) \\ \mu_{ELA} \sim N(0, 1) \\ \sigma_{aMath} \sim N_+(0, 20) \\ \sigma_{aELA} \sim N_+(0, 20) \\ \Omega \sim LKJ(2) \\ \gamma \sim N_+(1,0.05) \\ \sigma \sim N_+(0,2) \end{aligned}

and this is my Stan code:

data {
  int<lower = 0, upper = 1> run_estimation;               // a switch to evaluate the likelihood
  int N ;                                                 // number of observations
  int I ;                                                 // number of students
  int<lower=1,upper=I> ii[N];                             // student id index
  vector<lower=0,upper=1>[N] math;                        // math dummies
  vector<lower=650,upper=850>[N] pre_test ;               // pre-test
  vector[N] raw_score ;              // raw_score
  real mu_mean ;                                          //
  real<lower=0> mu_sigma;                                 //
  real gamma_mean;                                        //
  real<lower=0> gamma_sigma;                              //
  real<lower=0> sigma_sigma;                              //
  real<lower=0> sigma_a_sigma;                            //
}

transformed data {
  vector<lower=0,upper=1>[N] ELA = 1 - math ;             // ELA dummies
}

parameters {
  vector[2] mu;                                           // mu[1] is math and mu[2] is ELA
  vector<lower=0>[2] sigma_math_ela;
  cholesky_factor_corr[2] L_Omega ;  // Cholesky factor of Omega 
  vector[2] math_ela_i_raw[I] ;      // primitives for a_math and a_ela 
  real<lower=0> gamma;
  real<lower=0> sigma;
}

transformed parameters {
  vector[I] a_math_i ;
  vector[I] a_ELA_i ;
  {
    matrix[2, 2] L_Sigma ; // Cholesky factor of Sigma
    vector[2] math_ela_i[I] ;
    L_Sigma = diag_pre_multiply(sigma_math_ela, L_Omega) ;
    for (i in 1:I) {
      math_ela_i[i] = L_Sigma * math_ela_i_raw[i] ; 
      a_math_i[i] = mu[1] + math_ela_i[i][1] ; 
      a_ELA_i[i] = mu[2] + math_ela_i[i][2] ; 
    }
  }
}

model {
  // priors
  mu ~ normal(mu_mean,mu_sigma);
  sigma_math_ela ~ normal(0,sigma_a_sigma);
  L_Omega ~ lkj_corr_cholesky(2) ;
  for (i in 1:I) math_ela_i_raw[i] ~ normal(0,1) ;
  gamma ~ normal(gamma_mean,gamma_sigma);
  sigma ~ normal(0,sigma_sigma);
  // likelihood, which we only evaluate conditionally
  if(run_estimation==1){
    raw_score ~ normal(a_math_i[ii] .* math + a_ELA_i[ii] .* ELA  + gamma*pre_test, sigma);
  }
}

generated quantities {
  vector[N] score_sim;
  vector[N] log_lik;
  vector[N] theta;
  matrix[2,2] Omega;
  matrix[2,2] Sigma_math_ela;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
  Sigma_math_ela = quad_form_diag(Omega, sigma_math_ela);
  for (n in 1:N)  {
    theta[n] = a_math_i[ii[n]] * math[n]  + a_ELA_i[ii[n]] * ELA[n]  + gamma*pre_test[n] ;
    score_sim[n] = normal_rng(theta[n], sigma);
    log_lik[n] = normal_lpdf(raw_score[n] | theta[n], sigma);
  }
}

The question I have is whether the generated quantities code for Omega is correct. I tried to recover this parameter and it does not look great:

[Figure: posterior density of Omega[2,1] from the refit, with the true correlation from the simulation draw marked by a vertical red line]

Am I correct in thinking that this figure shows that something is wrong?

This is the R code that I use to produce it:

library(rstan)
library(dplyr)      # for %>% and mutate()
library(ggplot2)

fake_data <- data.frame(id = rep(1:100,2),
                        pre_test = sample(x = 650:850, size = 200),
                        raw_score = sample(x = 650:850, size = 200),
                        subject = c(rep(1,100),rep(0,100))) %>%
  mutate(name = paste(id, subject, sep = "_"),
         subject = as.character(subject))

compiled_model <- stan_model("model1.stan")

sim_out <- sampling(compiled_model, data = list(run_estimation = 0,
                                                N = nrow(fake_data), 
                                                I = length(unique(fake_data$id)),
                                                ii = fake_data$id,
                                                math = as.numeric(fake_data$subject),
                                                pre_test = fake_data$pre_test,
                                                raw_score = fake_data$raw_score,
                                                mu_mean = 0,
                                                mu_sigma = 1,
                                                gamma_mean = 1,
                                                gamma_sigma = 0.05,
                                                sigma_sigma = 2,
                                                sigma_a_sigma = 20),
                    seed = 42118)

fake_data_matrix  <- sim_out %>% 
  as.data.frame %>% 
  select(contains("score_sim"))


draw <- 21
score_sim <- as.numeric(fake_data_matrix[draw, ])                 # one simulated data set, used as raw_score below
true_correlation <- as.matrix(sim_out, pars = "Omega")[draw, 2]   # Omega[2,1] for that draw

# Now estimate the model with this "true" data, to see that we can estimate the known parameters

recapture_fit <- sampling(compiled_model, data = list(run_estimation = 1,
                                                      N = nrow(fake_data), 
                                                      I = length(unique(fake_data$id)),
                                                      ii = fake_data$id,
                                                      math = as.numeric(fake_data$subject),
                                                      pre_test = fake_data$pre_test,
                                                      raw_score = score_sim,
                                                      mu_mean = 0,
                                                      mu_sigma = 1,
                                                      gamma_mean = 1,
                                                      gamma_sigma = 0.05,
                                                      sigma_sigma = 2,
                                                      sigma_a_sigma = 20),
                          seed = 42118)

## correlation between math and ela?

correlation <- as.data.frame(recapture_fit, pars = "Omega") 

ggplot(data = correlation, aes(x = `Omega[2,1]`))  + 
  geom_density()  + geom_vline(aes(xintercept = true_correlation),color="red") 

As always, thanks for the help.

This is a question for @bgoodri.

In general, the correlation parameters are hard to fit, especially in more than 2 or 3 dimensions, and even more so when filtered through a GLM.


I think I finally understand what I’m doing wrong! I think my model has too many parameters. I’m following @James_Savage’s workflow for generating synthetic data, but I’m only generating “one year of data”. That is, I start with y_{i,s,t=0} and generate y_{i,s,t=1}. Therefore, in my example, I have 200 observations but many more parameters:

  • 100 student ability parameters for math
  • 100 student ability parameters for English
  • the correlation between pre and post
  • \Sigma

Am I right?

If so, my next step is to try to figure out how to adapt James’ workflow to generate many years of synthetic data. Any nudge in the right direction would be greatly appreciated!
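
This untested sketch is roughly what I have in mind: add T, the number of periods, to the data block and replace my generated quantities loop so that each simulated score feeds in as the next period’s pre-test:

generated quantities {
  matrix[N, T] score_sim_t;        // T simulated periods per observation
  for (n in 1:N) {
    real prev = pre_test[n];
    for (t in 1:T) {
      real theta_nt = a_math_i[ii[n]] * math[n]
                      + a_ELA_i[ii[n]] * ELA[n]
                      + gamma * prev;
      score_sim_t[n, t] = normal_rng(theta_nt, sigma);
      prev = score_sim_t[n, t];    // persistence links the periods
    }
  }
}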

As always, ¡gracias!

If you have proper priors on all those parameters, having more parameters than observations isn’t a technical hurdle. It can be a computational problem if there are too many ways to explain the data, especially if the parameters involved are correlated. Covariance matrices are hard to estimate.