Generative model for bounded outcomes?

I’m trying to write a generative model in which the outcome is the score from a standardized test. In this test, the lower bound is 650 and the upper bound 850.

I want to start with a very simple model, use informative priors, and add complexity little by little. Right now, this is my model:

\begin{aligned} y_{i,s,t} \sim N(a[i] + \gamma y_{i,s,t-1} , \sigma) \\ a[i] \sim N(\mu,sigma_a) \\ \mu \sim N(0, 1) \\ \sigma_a \sim N_+(0, 20) \\ \gamma \sim N_+(1,0.05) \\ \sigma \sim N_+(0,20) \end{aligned}

Should I include in the model the fact that y is bounded to be between 650 and 850? I read section 12 of the manual but I’m not sure if using a truncated distribution would correct or not.

Yes, presumably using a shifted and scaled beta distribution that has the same conditional mean.

1 Like

Thanks @bgoodri. If i understand you correctly, instead of modeling y as normal i should model it as Beta(\alpha, \beta). Is that correct?

If so, I have a couple follow-up questions before i try to code this:

  1. The Beta distribution is defined in the interval [0, 1], so my guess is that when you say to use a shifted and scaled beta distribution you mean that I should define a new outcome y_{std} = (y-650)/200 . Is this correct?

  2. If I understand what I read correctly in the manual, instead of defining priors for \alpha and \beta I should define priors for the mean (\phi) and the count parameter (\lambda). Is this correct? If so, should I make \phi or \lambda a function pre-test, student random effect, etc?

You need to subtract the minimum before dividing by the maximum. It is all explained at


The mean can be the same as the mean in your Gaussian model, but you should have a prior on the other parameter.

1 Like

@bgoodri: assuming all test scores are integers, is there a reason to prefer the beta distribution over the beta binomial for shifted data?

I’m trying to generate some fake data with the Beta distribution by following @James_Savage tutorial. I’m parameterizing Beta with its mean (phi) and count (lambda). I made the mean a function of the pre-test and a student random effect. Because it needs to be positive, I made a logit transformation. Is this the right approach? The problem that I have with this is that is hard to think about priors, for example on the parameter on the for the pretest.

The other problem that I have is that the code that I wrote complains that Exception: beta_rng: Second shape parameter is 0, but must be > 0! (in 'model_parccbeta' at line 69). How should I parameterize the beta so that I don’t have that problem?

This is my stan code as of right now:

data {
  int<lower = 0, upper = 1> run_estimation;               // a switch to evaluate the likelihood
  int N ;                                                 // number of observations
  int I ;                                                 // number of students
  int<lower=1,upper=I> ii[N];                             // student id index
  vector<lower=650,upper=850>[N] raw_pre_test ;           // raw pre-test
  vector<lower=650,upper=850>[N] raw_score ;              // raw score
  real nu_mean;                                           //
  real<lower=0> nu_sigma ;                                //
  real mu_mean ;                                          //
  real<lower=0> mu_sigma;                                 //
  real<lower=0> sigma_a_sigma;                            //
}

transformed data {
  vector<lower=0,upper=1>[N] pre_test_std = (raw_pre_test-650)/200;           // pre-test
  vector<lower=0,upper=1>[N] score_std = (raw_score-650)/200 ;                // score
}


parameters {
  real nu ;
  vector<lower=0.1>[N] lambda;
  real mu;
  vector[I] a_std;
  real<lower=0> sigma_a;
}

transformed parameters {
  vector[I] a =  mu + sigma_a * a_std;                    // Matt trick
}

model {
  vector[N] phi ;
  vector[N] expX ;
  vector[N] alpha ;
  vector[N] beta ;
  // priors
  nu ~ normal(nu_mean, nu_sigma) ;
  lambda ~ pareto(0.1, 1.5) ;
  mu ~ normal(mu_mean, mu_sigma);
  a_std ~ normal(0,1);
  sigma_a ~ normal(0,sigma_a_sigma);
  // transformed parameters
  for(n in 1:N){
    expX[n] = exp(nu * pre_test_std[n] + a[ii[n]]) ;
    phi[n] = expX[n]/(1+expX[n]) ;
    alpha[n] = lambda[n] * phi[n] ;
    beta[n] = lambda[n] * (1-phi[n]) ;
  }

  // likelihood, which we only evaluate conditionally
  if(run_estimation==1){
    score_std ~ beta(alpha,beta);
  }
}

generated quantities {
  vector[N] score_sim;
  vector[N] phi ;
  vector[N] expX ;
  vector[N] alpha ;
  vector[N] beta ;
  for (n in 1:N)  {
    expX[n] = exp(nu * pre_test_std[n] + a[ii[n]]) ;
    phi[n] = expX[n]/(1+expX[n]) ;
    alpha[n] = lambda[n] * phi[n] ;
    beta[n] = lambda[n] * (1-phi[n]) ;
    score_sim[n] = beta_rng(alpha[n], beta[n]) * 200 + 650;
  }
}

Thanks for the help!

1 Like

I was able to make some progress in understanding how to use the beta distribution, and I thought I should share this in case is useful to someone else. On addition to the wikipedia i read this paper and section 6.2.1 of the puppies book (I really like how that book is written).

I changed my stan code so the beta distribution is parametarized in terms of its mode and concentration. Something that i found useful was to write a tiny r function to plot the beta for different values of kappa:

my_beta <- function(omega, kappa){
  a <- omega*(kappa-2)+1
  b <- (1-omega)*(kappa-2)+1
  df.plot <- data.frame(x=rbeta(1000,a,b)*200+650) 
  p <- ggplot(data=df.plot, aes(x=x)) + geom_density()
  return(p)
}

my_beta(0.8, 4)

After playing a bit with that function, I rewrote my stan code as follows:

data {
  int<lower = 0, upper = 1> run_estimation;               // a switch to evaluate the likelihood
  int N ;                                                 // number of observations
  int I ;                                                 // number of students
  int<lower=1,upper=I> ii[N];                             // student id index
  vector<lower=650,upper=850>[N] raw_pre_test ;           // raw pre-test
  vector<lower=650,upper=850>[N] raw_score ;              // raw score
  real nu_mean;                                           //
  real<lower=0> nu_sigma ;                                //
  real mu_mean ;                                          //
  real<lower=0> mu_sigma;                                 //
  real<lower=0> sigma_student_sigma;                      //
}

transformed data {
  vector<lower=0,upper=1>[N] pre_test_std = (raw_pre_test-650)/200;           // pre-test
  vector<lower=0,upper=1>[N] score_std = (raw_score-650)/200 ;                // score
}


parameters {
  real<lower=0> nu ;
  vector<lower=2>[N] kappa;
  real mu;
  vector[I] student_std;
  real<lower=0> sigma_student;
}

transformed parameters {
  vector[I] student =  mu + sigma_student * student_std;                    // Matt trick
}



model {
  vector[N] omega ;
  vector[N] expX ;
  vector[N] a ;
  vector[N] b ;
  // priors
  nu ~ normal(nu_mean, nu_sigma) ;
  kappa ~ normal(10, 2) ;
  mu ~ normal(mu_mean, mu_sigma);
  student_std ~ normal(0,1);
  sigma_student ~ normal(0,sigma_student_sigma);

  // transformed parameters
  for(n in 1:N){
    expX[n] = exp(nu * pre_test_std[n] + student[ii[n]]) ;
    omega[n] = expX[n]/(1+expX[n]) ;
    a[n] = omega[n] * (kappa[n]-2) + 1 ;
    b[n] = (1-omega[n])*(kappa[n]-2)+1;
  }
  
  // likelihood, which we only evaluate conditionally
  if(run_estimation==1){
    score_std ~ beta(a,b);
  }
}

generated quantities {
  vector[N] score_sim;
  vector[N] omega ;
  vector[N] expX ;
  vector[N] a ;
  vector[N] b ;
  for (n in 1:N)  {
    expX[n] = exp(nu * pre_test_std[n] + student[ii[n]]) ;
    omega[n] = expX[n]/(1+expX[n]) ;
    a[n] = omega[n] * (kappa[n]-2) + 1;
    b[n] = (1-omega[n])*(kappa[n]-2)+1;
    score_sim[n] = beta_rng(a[n], b[n]) * 200 + 650;
  }
}

This code works! Now I plan to add more complexity to it. If any of you have any feedback to make my code better, I would really appreciate it.

1 Like

Awesome!

Tip: use built in functions like inv_logit() and, if possible, work on the logit scale directly (didn’t have a chance yet to look close enough here).

I made a couple of small changes to the code. Mainly use inv_logit() as @jonah suggested, and I also attempted to vectorize the calculations for a and b.

data {
  int<lower = 0, upper = 1> run_estimation;               // a switch to evaluate the likelihood
  int N ;                                                 // number of observations
  int I ;                                                 // number of students
  int<lower=1,upper=I> ii[N];                             // student id index
  vector<lower=650,upper=850>[N] raw_pre_test ;           // raw pre-test
  vector<lower=650,upper=850>[N] raw_score ;              // raw score
  real nu_mean;                                           //
  real<lower=0> nu_sigma ;                                //
  real mu_mean ;                                          //
  real<lower=0> mu_sigma;                                 //
  real<lower=0> sigma_student_sigma;                      //
}

transformed data {
  vector<lower=0,upper=1>[N] pre_test_std = (raw_pre_test-650)/200;           // pre-test
  vector<lower=0,upper=1>[N] score_std = (raw_score-650)/200 ;                // score
}


parameters {
  real<lower=0> nu ;
  vector<lower=2>[N] kappa;
  real mu;
  vector[I] student_std;
  real<lower=0> sigma_student;
}

transformed parameters {
  vector[I] student =  mu + sigma_student * student_std;                    // Matt trick
}



model {
  vector[N] omega ;
  vector[N] a ;
  vector[N] b ;
  // priors
  nu ~ normal(nu_mean, nu_sigma) ;
  kappa ~ normal(10, 2) ;
  mu ~ normal(mu_mean, mu_sigma);
  student_std ~ normal(0,1);
  sigma_student ~ normal(0,sigma_student_sigma);
  // transformed parameters
  omega = inv_logit(nu * pre_test_std + student[ii]) ;
  a = omega .* (kappa-2) + 1 ;
  b = (1-omega).*(kappa-2)+1 ;
  //for(n in 1:N){
  //  a[n] = omega[n] * (kappa[n]-2) + 1 ;
  //  b[n] = (1-omega[n])*(kappa[n]-2)+1;
  //}
  
  // likelihood, which we only evaluate conditionally
  if(run_estimation==1){
    score_std ~ beta(a,b);
  }
}

generated quantities {
  vector[N] score_sim;
  vector[N] omega ;
  //vector[N] expX ;
  vector[N] a ;
  vector[N] b ;
  omega = inv_logit(nu * pre_test_std + student[ii]) ;
  a = omega .* (kappa-2) + 1 ;
  b = (1-omega).*(kappa-2)+1 ;
  for (n in 1:N)  {
    //expX[n] = exp(nu * pre_test_std[n] + student[ii[n]]) ;
    //omega[n] = expX[n]/(1+expX[n]) ;
    //a[n] = omega[n] * (kappa[n]-2) + 1;
    //b[n] = (1-omega[n])*(kappa[n]-2)+1;
    score_sim[n] = beta_rng(a[n], b[n]) * 200 + 650;
  }
}

Following @James_Savage tutorial I tried to see whether or not I can recover my parameters:

library(tidyverse)
library(rstan)
set.seed(9782)
I <- 50; N <- 100 ; omega <- 0.8 ; kappa <- 10
a <- omega*(kappa-2)+1
b <- (1-omega)*(kappa-2)+1
raw_pre_test <- rbeta(N,a,b)*200+650
raw_score <- rbeta(N,a,b)*200+650

# Compile the stan code ---------------------------------------------------

compiled_model <- stan_model("beta.stan")

# First run ---------------------------------------------------------------

sim_out <- sampling(compiled_model, data = list(run_estimation = 0,
                                       N = N, 
                                       I = I,
                                       ii = rep(1:I,2),
                                       raw_pre_test = raw_pre_test,
                                       raw_score = raw_score,
                                       nu_mean = 1,
                                       nu_sigma = 0.5,
                                       mu_mean = 0,
                                       mu_sigma = 1,
                                       sigma_student_sigma = 0.1),
                    seed = 42118)

fake_data_matrix  <- sim_out %>% 
  as.data.frame %>% 
  select(contains("score_sim"))


# Can I recover the parameters? -------------------------------------------

draw <- 21
score_sim <- extract(sim_out, pars = "score_sim")[[1]][draw,]
true_nu <- extract(sim_out, pars = "nu")[[1]][draw]
true_kappa <- extract(sim_out, pars = "kappa")[[1]][draw,]
true_student <- extract(sim_out, pars = "student")[[1]][draw,]

# Now estimate the model with this "true" data, to see that we can estimate the known parameters

recapture_fit <- sampling(compiled_model, data = list(run_estimation = 1,
                                                      N = N, 
                                                      I = I,
                                                      ii = rep(1:I,2),
                                                      raw_pre_test = raw_pre_test,
                                                      raw_score = score_sim,
                                                      nu_mean = 1,
                                                      nu_sigma = 0.5,
                                                      mu_mean = 0,
                                                      mu_sigma = 1,
                                                      sigma_student_sigma = 0.1),
                          seed = 42118)

## Nu?

nu <- as.data.frame(recapture_fit, pars = "nu") 

ggplot(data = nu, aes(x = nu))  + 
  geom_density()  + geom_vline(aes(xintercept = true_nu), color="red") 

Based on this plot, should I conclude that the model succeded at recovering nu? If not, does anyone have any recommendation about what to do next?

I also tried to recover the students’ random effects:

students <- as.data.frame(recapture_fit, pars = "student") 
colnames(students) <- c(1:I)
students <- students %>% 
  gather("student", "effect", factor_key = T) %>% 
  inner_join(data.frame(student=as.factor(1:I), true_student=true_student), by = "student")
ggplot(data = students, aes(x = effect))  + 
  geom_density()  + geom_vline(aes(xintercept = true_student), color = "red") +
  facet_wrap( ~ student)

and the scores:

scores <- as.data.frame(recapture_fit, pars = "score_sim") 

colnames(scores) <- c(1:N)

scores <- scores %>% 
  gather("N", "score", factor_key = T) %>% 
  inner_join(data.frame(N=as.factor(1:N), true_score=score_sim), by = "N")


ggplot(data = scores, aes(x = score))  + 
  geom_density()  + geom_vline(aes(xintercept = true_score)) +
  facet_wrap( ~ N)

Finally, I checked whether or not the 50% credible interval covers 50 of the true values:

credible_intervals <- scores %>% group_by(N) %>% summarise(lb = quantile(score, probs = c(0.25)),
                                                           ub = quantile(score, probs = c(0.75))) %>% 
  inner_join(data.frame(N=as.factor(1:N), true_score=score_sim), by = "N") %>% 
  mutate(inside = case_when(true_score >= lb & true_score<=ub ~ 1,
                            TRUE ~ 0))

mean(credible_intervals$inside)

[1] 0.54

Thoughts? Should I check if i can fit real data using this model, or should I stay away from real data while i’m building the model?

Sorry I don’t have time right this moment to say more about your full post, but regarding this question, until the model can do in practice what it should do in theory (work well in simulations when it really is the true model) I would not expect it to fit real data. So develop your model and your understanding of the model using simulations where you have total control over the DGP and can see how the model performs under a variety of
possible scenarios. Once you’re happy with the model in simulation conditions,
it can also be helpful to use real data that’s not your actual data (until you’re ready), but other real data that’s already available and similar enough. There are always things that occur with real data you haven’t covered in simulations, so it’s good to get exposed to that too.

1 Like

I found that although the model above works with fake data, when i try do use real data I get the following error:

Rejecting initial value:
  Log probability evaluates to log(0), i.e. negative infinity.
  Stan can't start sampling from this initial value.

Any suggestions on how to debug this? I tried setting init=0.5 and some other values, but that did not help. In case it helps, this is how my fake data looks like:

image

and this is how the real data looks like:

image

You’re probably running into overflow/underflow problems. Or it can be missing constraints. Every legal value of the parameters should have nonzero density (finite log density).

The way to debug is to insert print statements and see when the target() or one of its subexpressions goes off the rails.

1 Like

If you are running a beta model with a beta distributed outcome variable and you really have 0 or 1 outcomes this can lead to problems. So I would first check that 0 < score_std < 1 (if your minimal observed score is 650, I would subtract 649, not 650). Alternatively you can implement a zero-inflated model.

More generally, what you are doing looks a lot like a hierachical (beta) regression model where you are using random effects to account for differences between students. In R formula:
score ~ pre_test + (1 | student_id). Such a model can be implemented easily in brms or rstanarm (whereby you would need to transform your variables to be proportions before feeding them to brms or rstanarm).

1 Like

If I understand this correctly, I should be printing a, b, and pre_test_std. This is what I did:

model {
  vector[N] omega ;
  vector[N] a ;
  vector[N] b ;
  // priors
  nu ~ normal(nu_mean, nu_sigma) ;
  kappa ~ normal(10, 2) ;
  mu ~ normal(mu_mean, mu_sigma);
  student_std ~ normal(0,1);
  sigma_student ~ normal(0,sigma_student_sigma);
  // transformed parameters
  omega = inv_logit(nu * pre_test_std + student[ii]) ;
  a = omega .* (kappa-2) + 1 ;
  b = (1-omega).*(kappa-2)+1 ;
  if(run_estimation==1){
    print("pre_test_std[1] = ", pre_test_std[1]);
    print("a[1] = ", a[1]);
    print("b[1] = ", b[1]);
    score_std ~ beta(a,b);
  }
}
pre_test_std[1] = 0.525
a[1] = 4.46113
b[1] = 2.57094

Rejecting initial value:
  Log probability evaluates to log(0), i.e. negative infinity.
  Stan can't start sampling from this initial value.

Initialization between (-2, 2) failed after 100 attempts. 
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
[1] "error occurred during calling the sampler; sampling not done"

How can I add a conditional statement to print only the relevant stuff? What i have in mind is: if Log probability evaluates to log(0), then print a, b, and pre_test_std.

My data will indeed take the values 0 and 1 sometimes. I changed how I’m standardizing the data to:

score_std = (raw_score-649)/202

Doing this did the trick!

Do you have a recommending reading for this? Will zero-inflated work if the outcome is bounded on both sides?

for conditional printing:

if (is_inf(beta_lpdf(x|a,b)) {
   print ...
}

However, it is more instructive to look at the pdf for the beta distribution,
p(x|\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},
from which you can for example see that p(x|\alpha,\beta)=0 when x=0 and \alpha \neq 1.

So it is next to impossible that beta_lpdf(x|a,b) does not become -Inf if you have values of zero, because only a combination of parameters that result in a=1 can achieve this. This is however impossible, because in your model a depends on the pre_test score, which is not the same for all people who get a zero in the final second test, so a can’t be zero for all of them.

You can find a discussion of zero inflated models in the stan reference manual, section 13.7 page 198 in stan-reference-2.17.0.pdf.

[Edit: Having said all that, I still think that a binomial or beta-binomial model is the natural choice for integer valued outcomes with a lower and upper bound (though I don’t think the results will be noticeably different).]

1 Like