Posterior predictive checks: which of these models should I choose?

Hi
I fitted three models with varying intercepts and slopes with normal, student_t and beta distributions for the response variable. I’m in the process of validating the models by conducting posterior predictive checks following the recommendations in the paper “Visualization in Bayesian workflow”.

Normal distribution:

student_t distribution:

beta distribution:

Based on these plots, which of these models fits the data best?

Thank you

1 Like

Hello!

My personal point of view is that if your response variable is restricted between 0 and 1, you should choose the beta distribution. The other distributions might predict negative values or values superior to 1, and I guess nobody wants that!

I noticed however that the model with beta distributed variable missed the mode. May be we should worry about that?

Have a good day!
Lucas

1 Like

I second @ldeschamps that if your outcome is guaranteed to be between 0 and 1, Beta is a good bet. The PPCs however seem to indicate that Beta is missing some aspects of the data. Hard to say what is the problem without knowing more about your data/question.

In the above, only student distribution seems to have no major problems, but at the cost of having much wider posterior distribution. So it is possible that it is not a good model, just a very uncertain model.

Thank you @ldeschamps and @martinmodrak. I too prefer the beta distribution for the reasons you stated, but thought I should consider other alternatives.

My response variable is (roughly speaking) the percentage of species in common between pairs of points. So, each observation is the percentage of species in common between a point A and a point B. I am working with observations taken in 11 eleven locations, which means that observations taken in the same location will be more similar to each other than to observations taken on other locations. Aside from this, there is also second “issue”. Observations calculated using the same point will not be independent. The percentage of species in common between point A and point B will not be independent from the percetange of species in common between point A and C.

To deal with the first issue I fitted a model with different intercepts and slopes (per location). To deal with the second issue I added a varying intercept for observations calculated with the same point.

Here’s the code that I used to fit the model with the rethinking package:

 m3<-ulam(alist(
   sor~beta(alpha,beta),
   alpha <- mu * phi,
   beta  <- (1-mu)*phi,
   logit(mu)<-a_samp[sample_id]+a_basin[basin_id]+bnet[basin_id]*net_dist_s+bstrah[basin_id]*strahler_s+
     bpp[basin_id]*pp_s+bdrain[basin_id]*da_diff_s+bflow[basin_id]*flow_distB,
   c(a_basin,bnet,bstrah,bpp,bdrain,bflow)[basin_id]~multi_normal(c(a,bn,bs,bp,bd,bf),Rho,sigma_basin),
   a~normal(0,10),
   c(bn,bs,bp,bd,bf)~normal(0,1),
   sigma_basin~exponential(2),
   Rho ~ lkj_corr(2),
   a_samp[sample_id]~normal(0,sigma_samp),
   sigma_samp~exponential(2),
   phi~exponential(0.1)
 ),data=df,warmup=2000,iter=5000,chains=2,cores=4,control = list(adapt_delta = 0.99),log_lik = TRUE
 )

The corresponding STAN code is:

 data{
     vector[7846] sor;
     int flow_distB[7846];
     vector[7846] da_diff_s;
     vector[7846] pp_s;
     vector[7846] strahler_s;
     vector[7846] net_dist_s;
     int basin_id[7846];
     int sample_id[7846];
 }
 parameters{
     vector[11] bflow;
     vector[11] bdrain;
     vector[11] bpp;
     vector[11] bstrah;
     vector[11] bnet;
     vector[11] a_basin;
     real a;
     real bf;
     real bd;
     real bp;
     real bs;
     real bn;
     vector<lower=0>[6] sigma_basin;
     corr_matrix[6] Rho;
     vector[327] a_samp;
     real<lower=0> sigma_samp;
     real<lower=0> phi;
 }
 model{
     vector[7846] alpha;
     vector[7846] beta;
     vector[7846] mu;
     phi ~ exponential( 0.1 );
     sigma_samp ~ exponential( 2 );
     a_samp ~ normal( 0 , sigma_samp );
     Rho ~ lkj_corr( 2 );
     sigma_basin ~ exponential( 2 );
     bn ~ normal( 0 , 1 );
     bs ~ normal( 0 , 1 );
     bp ~ normal( 0 , 1 );
     bd ~ normal( 0 , 1 );
     bf ~ normal( 0 , 1 );
     a ~ normal( 0 , 10 );
     {
     vector[6] YY[11];
     vector[6] MU;
     MU = [ a , bn , bs , bp , bd , bf ]';
     for ( j in 1:11 ) YY[j] = [ a_basin[j] , bnet[j] , bstrah[j] , bpp[j] , bdrain[j] , bflow[j] ]';
     YY ~ multi_normal( MU , quad_form_diag(Rho , sigma_basin) );
     }
     for ( i in 1:7846 ) {
         mu[i] = a_samp[sample_id[i]] + a_basin[basin_id[i]] + bnet[basin_id[i]] * net_dist_s[i] + bstrah[basin_id[i]] * strahler_s[i] + bpp[basin_id[i]] * pp_s[i] + bdrain[basin_id[i]] * da_diff_s[i] + bflow[basin_id[i]] * flow_distB[i];
         mu[i] = inv_logit(mu[i]);
     }
     for ( i in 1:7846 ) {
         beta[i] = (1 - mu[i]) * phi;
     }
     for ( i in 1:7846 ) {
         alpha[i] = mu[i] * phi;
     }
     sor ~ beta( alpha , beta );
 }
 generated quantities{
     vector[7846] log_lik;
     vector[7846] alpha;
     vector[7846] beta;
     vector[7846] mu;
     for ( i in 1:7846 ) {
         mu[i] = a_samp[sample_id[i]] + a_basin[basin_id[i]] + bnet[basin_id[i]] * net_dist_s[i] + bstrah[basin_id[i]] * strahler_s[i] + bpp[basin_id[i]] * pp_s[i] + bdrain[basin_id[i]] * da_diff_s[i] + bflow[basin_id[i]] * flow_distB[i];
         mu[i] = inv_logit(mu[i]);
     }
     for ( i in 1:7846 ) {
         beta[i] = (1 - mu[i]) * phi;
     }
     for ( i in 1:7846 ) {
         alpha[i] = mu[i] * phi;
     }
     for ( i in 1:7846 ) log_lik[i] = beta_lpdf( sor[i] | alpha[i] , beta[i] );
 }

I just started working with Bayesian statistics and I don’t know anyone who can check my work, so I’m very happy that you can take the time to look at this.

Then I think you might be missing out on using the structure of your data - you have no way to (easily) to take the dependencies you mention into account.

You should however be able to model the presence/absence of each species at each site (e.g. via logistic regression) and then use the posterior predictions of this model to compute the posterior for percentage of species in common between sites - would that make sense to you?

I also edited the post to take advantage of code formatting - you can use triple backticks ``` to start and end a code block. Adding stan just after the initial backticks turns on syntax highlighting for Stan!

1 Like