Positive log probability

I’m estimating a model using 6 chains, each with 400 warmup iterations and 100 sampling iterations.

I’m getting positive log probability values displayed in the stanfit summary.

How is this possible? Are there circumstances under which a model really can have a positive log probability, or does it indicate an error on my part or a bug in the code?

The model is complex, so for now I’ll just post the sampling statements and parameter definitions within the model. I can post more if that’s relevant.


parameters{
  real alpha_pr;
  real k_pr;//relative threshold
  real tau_pr;//non-decision time
  
  cholesky_factor_corr[TD_N] L_Omega;
  vector<lower=0>[TD_N] L_sigma;
  
}
model{
  vector[TD_N] td_var[LENGTH];//this is populated using some code not shown here,
  //with values derived from the data and from other parts of the model.

  alpha_pr ~ normal(0,1.5);
  k_pr ~ normal(log(.5),2);
  tau_pr ~ normal(log(.5),2);

  to_matrix([to_row_vector([response_time[i],response[i]])]) ~ lba(k,A,v,s,tau);
  //lba is a user-defined function from Annis, Miller, & Palmeri (2017);
  //it's complex, so to simplify debugging
  //I have constrained it so that the maximum log probability it can return is 0.

  L_Omega ~ lkj_corr_cholesky(1);
  L_sigma ~ cauchy(0,1); //these yield standard deviations of each individual value.
  td_var ~ multi_normal_cholesky(zeros,L_Sigma);
}   

Here is the full model


//lba function from Annis, Miller, & Palmeri (2017)
// changes made:
//    - replaced the '<-' operator with '='
//    - replaced 'normal_log' with 'normal_lpdf'
//    - added a print command to see what probabilities it's producing
functions{
     
     real lba_pdf(real t, real b, real A, real v, real s){
          //PDF of the LBA model
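          //i.e., the first-passage time density for a single linear
          //ballistic accumulator reaching threshold b (Brown & Heathcote, 2008)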
          
          real b_A_tv_ts;
          real b_tv_ts;
          real term_1;
          real term_2;
          real term_3;
          real term_4;
          real pdf;
          
          b_A_tv_ts = (b - A - t*v)/(t*s);
          b_tv_ts = (b - t*v)/(t*s);
          term_1 = v*Phi(b_A_tv_ts);
          term_2 = s*exp(normal_lpdf(b_A_tv_ts|0,1)); 
          term_3 = v*Phi(b_tv_ts);
          term_4 = s*exp(normal_lpdf(b_tv_ts|0,1)); 
          pdf = (1/A)*(-term_1 + term_2 + term_3 - term_4);
          
          return pdf;
     }
     
     real lba_cdf(real t, real b, real A, real v, real s){
          //CDF of the LBA model
          
          real b_A_tv;
          real b_tv;
          real ts;
          real term_1;
          real term_2;
          real term_3;
          real term_4;
          real cdf;	
          
          b_A_tv = b - A - t*v;
          b_tv = b - t*v;
          ts = t*s;
          term_1 = b_A_tv/A * Phi(b_A_tv/ts);	
          term_2 = b_tv/A   * Phi(b_tv/ts);
          term_3 = ts/A     * exp(normal_lpdf(b_A_tv/ts|0,1)); 
          term_4 = ts/A     * exp(normal_lpdf(b_tv/ts|0,1)); 
          cdf = 1 + term_1 - term_2 + term_3 - term_4;
          
          return cdf;
          
     }
     
     real lba_log(matrix RT, real k, real A, vector v, real s, real tau){
          
          real t;
          real b;
          real cdf;
          real pdf;		
          vector[rows(RT)] prob;
          real out;
          real prob_neg;

          b = A + k;
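          //Likelihood of each observed response: the winning accumulator's PDF
          //times the survivor functions (1 - CDF) of the losing accumulators,
          //normalized by the probability that at least one drift rate is positive.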
          for (i in 1:rows(RT)){	
               t = RT[i,1] - tau;
               if(t > 0){			
                    cdf = 1;
                    
                    for(j in 1:num_elements(v)){
                         if(RT[i,2] == j){
                              pdf = lba_pdf(t, b, A, v[j], s);
                         }else{	
                              cdf = (1-lba_cdf(t, b, A, v[j], s)) * cdf;
                         }
                    }
                    prob_neg = 1;
                    for(j in 1:num_elements(v)){
                         prob_neg = Phi(-v[j]/s) * prob_neg;    
                    }
                    prob[i] = pdf*cdf;		
                    prob[i] = prob[i]/(1-prob_neg);	
                    if(prob[i] < 1e-10){
                         prob[i] = 1e-10;				
                    }
                    
               }else{
                    prob[i] = 1e-10;			
               }		
          }
          out = sum(log(prob));
          // if(out>=0){
          //   print("lba lprob:",out," with RT:",RT,"; k:",k,"; A:",A,"; v:",v,"; s:",s,"; tau:",tau);
          //   print("minimizing this to log(prob)=0,prob=1 to avoid errors.");
          // }
          
          return fmin(out, 0);
     }
     
    vector lba_rng(real k, real A, vector v, real s, real tau){
          
          int get_pos_drift;	
          int no_pos_drift;
          int get_first_pos;
          vector[num_elements(v)] drift;
          int max_iter;
          int iter;
          real start[num_elements(v)];
          real ttf[num_elements(v)];
          int resp[num_elements(v)];
          real rt;
          vector[2] pred;
          real b;
          
          //try to get a positive drift rate
          get_pos_drift = 1;
          no_pos_drift = 0;
          max_iter = 1000;
          iter = 0;
          while(get_pos_drift){
               for(j in 1:num_elements(v)){
                    drift[j] = normal_rng(v[j],s);
                    if(drift[j] > 0){
                         get_pos_drift = 0;
                    }
               }
               iter = iter + 1;
               if(iter > max_iter){
                    get_pos_drift = 0;
                    no_pos_drift = 1;
               }	
          }
          //if both drift rates are <= 0
          //return an infinite response time
          if(no_pos_drift){
               pred[1] = -1;
               pred[2] = -1;
          }else{
               b = A + k;
               for(i in 1:num_elements(v)){
                    //start time of each accumulator	
                    start[i] = uniform_rng(0,A);
                    //finish times
                    ttf[i] = (b-start[i])/drift[i];
               }
               //rt is the fastest accumulator finish time	
               //if one is negative get the positive drift
               resp = sort_indices_asc(ttf);
               ttf = sort_asc(ttf);
               get_first_pos = 1;
               iter = 1;
               while(get_first_pos){
                    if(ttf[iter] > 0){
                         pred[1] = ttf[iter] + tau;
                         pred[2] = resp[iter]; 
                         get_first_pos = 0;
                    }
                    iter = iter + 1;
               }
          }
          return pred;	
     }
}
data{
   int<lower=1> LENGTH;
   int<lower=2> NUM_CHOICES;
   real<lower=0> A;
   vector[LENGTH] response_time;
   int response[LENGTH];
   int required_choice[LENGTH];//the choice which would be reinforced on each round.
   int cue[LENGTH];
     
  //////neural model
  int DELTA_N;
  int THETA_N;
  matrix[LENGTH,DELTA_N] neural_data;
  
  ////////////\begin{joint model machinery}
  vector[THETA_N+DELTA_N] td_mu_prior;
  vector[THETA_N+DELTA_N] td_sd_prior;
  
  ////////////\end{joint model machinery}   
}
transformed data{
  ////////////\begin{joint model machinery}
  ////////////\end{joint model machinery}
  int THETA_rpe=1;
  int THETA_ev=2;
  //int THETA_oc=1;
  int TD_rpe=THETA_rpe;
  //int TD_oc=THETA_oc;
  
  //int THETA_N=THETA_oc;
  int TD_ev=THETA_ev;
  int TD_N=DELTA_N+THETA_N;
  
  vector[TD_N] zeros = rep_vector(0,TD_N);
  
  real<lower=0> s = 1;

  matrix[LENGTH,NUM_CHOICES] choice_outcomes;
  for (i in 1:LENGTH){
    //if the required choice was chosen then 
    //assign the value of 1 to it and distribute the value of -1 across all alternatives.
    //or if the required choice was not chosen, 
    //assign the value of -1 to it and distribute the value of 1 across all alternatives.
    for (j in 1:NUM_CHOICES){
      if(j==response[i]){
        choice_outcomes[i,j]=((response[i]==required_choice[i])-0.5)*2;
      }else{
        choice_outcomes[i,j]=-((response[i]==required_choice[i])-0.5)*2/(NUM_CHOICES-1);
      }
    }
  }
  
  
}

parameters {
  real alpha_pr;
  real k_pr;//relative threshold
  real tau_pr;//non-decision time
  
  // ////////////\begin{joint model machinery}
  //vector[TD_N] td_mu;
  cholesky_factor_corr[TD_N] L_Omega;
  vector<lower=0>[TD_N] L_sigma;
  // ////////////\end{joint model machinery}
}

transformed parameters {
  real <lower=0,upper=1> alpha;
  real<lower=0> k;
  real<lower=0> tau;
  
  // ////////////\begin{joint model machinery}
  matrix[TD_N, TD_N] L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
  matrix[TD_N, TD_N] Sigma = L_Sigma * L_Sigma';
  // ////////////\end{joint model machinery}
  
  alpha = inv_logit(alpha_pr);
  k = exp(k_pr);
  tau = exp(tau_pr);
}

model {
  matrix[max(cue), NUM_CHOICES] exp_val = rep_matrix(0,max(cue), NUM_CHOICES);
  real pred_err;
  vector[LENGTH] run_pred_err_c2;
  vector[LENGTH] trial_expected_val;
  vector[LENGTH] actual_outcome;
  real outcome;
  vector[NUM_CHOICES] v;
  matrix[LENGTH,NUM_CHOICES] v_record;
  // ////////////\begin{joint model machinery}
  vector[TD_N] td_var[LENGTH];
  matrix[LENGTH, TD_N] theta_delta;
  vector[TD_N] td_mean;
  vector[TD_N] td_sd;
  // ////////////\end{joint model machinery}
  
  //these use VERY weak priors because we are going to use this data to set priors for future analyses,
  //so it's important that we don't unduly bias the analysis at this level.
  //this prior works out to ROUGHLY a uniform distribution between 0 and 1 when transformed.
  alpha_pr ~ normal(0,1.5);//weak prior, agnostic about learning rate, but still biased toward low learning rates,
    //as a compromise between an absolute 0 learning rate (-Inf) and a learning rate flat in the middle of the possible distribution (0).
  
  k_pr ~ normal(log(.5),2); //other thing we could do with these is replace them with half-normals as in the Annis paper.
  //A ~ normal(.5,1)T[0,];
  //tau_pr ~ normal(log(.5),1);//normal(.5,.5)T[0,];
  tau_pr ~ normal(log(.5),2);
  //now we need to loop through the trials, modeling each one.
  //so how do we model choices in this context?
  for (i in 1:LENGTH){//loop through timesteps.
    //the EXPECTED value for this choice, which *should* always be positive(?) but will differ in degree of how positive you'd expect.
    
    trial_expected_val[i] = exp_val[cue[i],response[i]];
    
    for(j in 1:NUM_CHOICES){
      v[j]=logit(exp_val[cue[i],j]/4+0.75);//We could simplify this right down to just passing in the expected value. Should we???
      v_record[i,j]=v[j];
      //if j was the reinforced choice and it was the response value,
      pred_err=choice_outcomes[i,j]-exp_val[cue[i],j]; 
      
      if(j==response[i]){
        //the recorded reward prediction error has to be based on the prediction of *what the subject thought would happen*,
        //i.e., the RPE for the particular choice they made.
        run_pred_err_c2[i] = pred_err;
      }
      
      exp_val[cue[i],j] = exp_val[cue[i],j] + alpha*pred_err;
      
      //the value of the outcome obtained for this choice.
      actual_outcome[i]=choice_outcomes[i,response[i]];
    }
    
    //I don't know what to do with non-response time... but let's work that out later.
    to_matrix([to_row_vector([response_time[i],response[i]])]) ~ lba(k,A,v,s,tau);
  }
  
  ////////////\begin{joint model machinery}
   
  //transfer the one theta into the theta-delta matrix, transforming it at the same time.
  theta_delta[:,THETA_rpe]=logit(run_pred_err_c2/4+0.5);
  //theta_delta[:,THETA_oc]=(actual_outcome-mean(actual_outcome))./sd(actual_outcome);
  theta_delta[:,THETA_ev]=logit(trial_expected_val/2+0.5);
  //maybe this could make a difference, could check??
  
  //transfer the deltas into the theta-delta matrix.
  for (d_i in 1:DELTA_N){
    theta_delta[:,THETA_N+d_i]=neural_data[:, d_i];
  }
  
  // //In order to run a faster model, we don't estimate these means; we just calculate them.
  // //I've got no interest in actually estimating means here, so I think that's the best approach.
  for (tdi in 1:TD_N){
    td_mean[tdi] = mean(theta_delta[:,tdi]);
    td_sd[tdi] = sd(theta_delta[:,tdi]);
  }
  // 
  //standardize the variance.
  for (i in 1:LENGTH){
    //td_var[i,:] = (to_vector(theta_delta[i,:]) - td_mu) ./ td_sd;
    td_var[i,:] = (to_vector(theta_delta[i,:]) - td_mean) ./ td_sd;
  }
  
  //predict the variance from the remaining information.
  L_Omega ~ lkj_corr_cholesky(1);
  L_sigma ~ cauchy(0,1); //these yield standard deviations of each individual value.
  td_var ~ multi_normal_cholesky(zeros,L_Sigma);
  ////////////\end{joint model machinery}

    
}


lp__, get_log_probability, etc. do not actually reflect logarithms of probability. Rather, they are logarithms of the posterior PDF (perhaps dropping constants but including Jacobians) and can be positive or negative. In general, they only have a relative interpretation across different proposals for the parameters, so whether they are positive or negative is not important.
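
To see how a positive value can arise (a minimal sketch of my own, not part of the reply above): a continuous density can exceed 1, so its log can be positive. For example, with a tiny scale parameter:

data{
  real y;
}
parameters{
  real mu;
}
model{
  //the normal density at its mode is 1/(0.01*sqrt(2*pi)), about 39.9,
  //so its log is about 3.69 > 0. Using target += keeps the normalizing
  //constants that a '~' statement would drop, so lp__ near mu = y is positive.
  target += normal_lpdf(y | mu, 0.01);
}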

Oh I see, thank you!

In that case, can they be used to evaluate the relative probabilities of two models sampled separately?

You mean with the same outcome and the same number of observations? If so, then yes, but it is not a good idea because there is no accounting for the different numbers of (effective) parameters. Use loo in the loo package, which is designed for model comparison, unlike lp__.

I have been comparing a variety of different models with the same outcome and number of trials in the task. I’ll take a look at loo, thanks!

However, actually, I suppose it is not the “same number of observations”. Some of the simpler models ignore some of the data while others use all of it: for instance, a model that incorporates response time into a hybrid reinforcement-learning/linear ballistic accumulator model, compared with a pure reinforcement-learning model that ignores response times and only uses response outcomes.

There is no model comparison for models with different numbers of observations. You can, however, compare them using the subset of the bigger dataset that the smaller model uses.


OK - just to make sure I understand:
Let’s say I have two models, one that predicts y from x1 and x2, and another that tries to predict y from only x1:

Here’s the first model:

data{
  vector[100] x1;
  vector[100] x2;
  vector[100] y;
}
parameters{
  real beta1;
  real beta2;
}
model{
  beta1 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  y ~ normal(beta1 * x1 + beta2 * x2, 1);
}

and the second model:


data{
  vector[100] x1;
  vector[100] x2;
  vector[100] y;
}

parameters{
  real beta1;
}

model{
  beta1 ~ normal(0, 1);
  y ~ normal(beta1 * x1, 1);
}

Apologies if I haven’t got this quite right (I haven’t actually tested it, as it’s just an example).

But this is fairly analogous to the difference between my models. Surely there’s a way to compare the fit of adding an extra variable into the model?

Extra predictor, yes (presuming it is the same 100 observations in both models). Use loo in the loo package.
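
For concreteness, here is a minimal, untested sketch (my addition) of a generated quantities block for the first model above that stores the pointwise log-likelihood loo needs; the second model is analogous:

generated quantities{
  vector[100] log_lik;
  for (n in 1:100){
    //pointwise log-likelihood of each observation, saved for loo
    log_lik[n] = normal_lpdf(y[n] | beta1 * x1[n] + beta2 * x2[n], 1);
  }
}

In R, extract_log_lik() from the loo package pulls log_lik out of the stanfit object, and loo() computes the comparison criterion for each model.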


OK thanks very much

One more clarification if you don’t mind: is the only problem with using lp__ as a measure of relative fit that it doesn’t do appropriate penalization for adding/removing additional factors?

Then I might think that if

  • I had a model with the same number of parameters, or
  • I compared two quite different models, and the one with more parameters performed less well

I could be justified in using lp__ to conclude which model was a better fit, so as to avoid having to modify the models to add a separate log-likelihood calculation.

No, that is just one problem with it, not the only one. As I said above, it is the effective number of parameters that matters, rather than the nominal number. Different models with the same number of nominal parameters can have different numbers of effective parameters or a model with more nominal parameters can have fewer effective parameters. Also, lp__ includes priors and Jacobian adjustments, whereas model comparison is best done with the likelihood only. Basically, use loo in the loo package, which avoids these problems and has diagnostics for when the assumptions underpinning it do not hold, instead of trying to come up with a rationalization for using something that is not intended for model comparison.