When adding random effects to a model, Log probability evaluates to log(0), i.e. negative infinity

Dear Stan community,

I am currently fitting a shifted Wald (or shifted inverse gaussian) to reaction time data from an experiment
using rstan. I already asked a question about the same model, but it’s very different from this one, so I figured I should start a new thread, I hope that’s the way I should have proceeded.

I would like to evaluate the effect of experimental conditions A, B and their interaction on the parameters of this distribution.

The model converges nicely if I’m only using fixed-effects, but given the large inter-individual variability among my participants, I would really like to add random effects for all (or some of) my predictors.
Another feature I wanted to implement, is an across-trial variation of the parameter gamma of my distribution (varying drift as in a drift diffusion process).

When I try running the model (below) with everything at once, I can’t get it to start, with the following output:

Chain 1: Rejecting initial value:
Chain 1: Log probability evaluates to log(0), i.e. negative infinity.
Chain 1: Initialization between (-2, 2) failed after 100 attempts.
Chain 1: Try specifying initial values, reducing ranges of constrained values, or reparameterizing the >model.
[1] “Error in sampler$call_sampler(args_list[[i]]) : Initialization failed.”
[1] “error occurred during calling the sampler; sampling not done”

I can get the model to start after several tries, if I only add random effects by participant (no u_drift) and set the range of initial values to 10e-15, which seems really odd.

My dataset is of about 4400 datapoints split across 20 participants, two subconditions A and two subconditions B (so an average of 55 points per subcondition per participant).

I’ve pieced this model together using tutorials, but I’m probably doing something wrong.

I would be grateful for suggestions or tips pointing me in the direction of what is causing this error, and how to solve it.

Thank you very much!

functions {
  real shifted_Wald_lpdf(real x, real gamma, real alpha, real theta){
    real tmp1;
    real tmp2;
        tmp1 = log(alpha / (sqrt(2 * pi() * (pow((x - theta), 3)))));
        tmp2 = -1 * ((alpha - gamma * (x-theta))^2/(2*(x-theta)));
        return tmp1+tmp2 ;
  int<lower=0> N_obs; // observed rts
  int<lower=1> J;                     // n of subjects
  vector<lower=0>[N_obs] rt_obs;
  int<lower=1, upper=2> soa[N_obs];                         // condition A
  int<lower=0,upper=1> cg[N_obs];      // condition B
  int<lower=1,upper=J> id[N_obs];         // subj identifier
  int<lower=0> Tnum[N_obs]; // trial number
parameters {
  // fixed-effects parameters
  vector[12] b;
  // random effects parameters
  vector<lower=0>[4] sigma_u_g;             // random effects for gamma (drift)
  vector<lower=0>[4] sigma_u_a;             // random effects for alpha (boundary)
  vector<lower=0>[4] sigma_u_t;             // random effects for theta (non-decision time)
cholesky_factor_corr[4] L_u_g;          
  cholesky_factor_corr[4] L_u_a;         
  cholesky_factor_corr[4] L_u_t;           
  matrix[4,J] z_u_g;                        // random effect matrix
  matrix[4,J] z_u_a;
  matrix[4,J] z_u_t;

  vector<lower=0>[2] sigma_drift;               // for variable drift rates across trials
  cholesky_factor_corr[2] L_drift;              
  matrix[2,N_obs] z_drift;


transformed parameters {
  matrix[4,J] u_g;
  matrix[4,J] u_a;
  matrix[4,J] u_t;
  matrix[2, N_obs] u_drift;
    u_g = diag_pre_multiply(sigma_u_g, L_u_g) * z_u_g; 
    u_a = diag_pre_multiply(sigma_u_a, L_u_a) * z_u_a;
    u_t = diag_pre_multiply(sigma_u_t, L_u_t) * z_u_t; 

  u_drift = diag_pre_multiply(sigma_drift, L_drift) * z_drift;


    // model parameters
  real gamma;                // drift
  real alpha;                // boundary
  real theta;                // ndt

  L_u_g ~ lkj_corr_cholesky(2);    
  L_u_a ~ lkj_corr_cholesky(2);   
  L_u_t ~ lkj_corr_cholesky(2);  
  to_vector(z_u_g) ~ normal(0,1); 
  to_vector(z_u_a) ~ normal(0,1); 
  to_vector(z_u_t) ~ normal(0,1);
  sigma_u_g ~ cauchy(0, 1);       
  sigma_u_a ~ cauchy(0, 1);       
  sigma_u_t ~ cauchy(0, 1);      

  sigma_drift ~ cauchy(0,1);       
  L_drift ~ lkj_corr_cholesky(2);  
  to_vector(z_drift) ~ normal(0,1); 

  b ~ normal(0, 100);        // diffuse prior for beta coefficients

  for(i in 1 : N_obs) {
   gamma = b[1] + u_g[1,id[i]] + u_drift[1,Tnum[i]] + (b[2] + u_g[2,id[i]])*cg[i] + (b[3] + u_g[3,id[i]] + u_drift[2,Tnum[i]]  + (b[4] + u_g[4,id[i]])*cg[i])*soa[i];
   alpha = b[5] + u_a[1,id[i]] + (b[6] + u_a[2,id[i]])*cg[i] + (b[7] + u_a[3,id[i]] + (b[8] + u_a[4,id[i]])*cg[i])*soa[i];
   theta = b[9] + u_t[1,id[i]] + (b[10] + u_t[2,id[i]])*cg[i] + (b[11] + u_t[3,id[i]] + (b[12] + u_t[4,id[i]])*cg[i])*soa[i];

  target += shifted_Wald_lpdf(rt_obs[i]|gamma,alpha,theta);

I agree with you approach.

The problem might lie with this expression:
log(alpha / (sqrt(2 * pi() * (pow((x - theta), 3)))))

It asumes:

  • alpha is positive
  • x - theta is positive

But neither of those assumptions seems to follow from your set of priors, which might cause the initialisation failures.

Also, the line should be quivalent to something like
log(alpha) - 0.5 * (log(2) + log(pi()) + 3 * log(x - theta))
which is likely to be numerically more stable (and makes the assumptions explicit)

Lastly, your code seems to diverge from the inverse gaussian definition on Wikipedia: https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution
in multiple places.


I’ll make the change to have a more numerically stable expression. This way of writing the pdf of the shifted inverse-gaussian (or shifted Wald) comes from this article.

I’ll look for a way of better specifying my priors then! Otherwise running 5 chains with a very small r_init (10e-15) usually works. I realize it’s very clumsy, but it shouldn’t affect the result of the model-fitting process, right?

Thanks for the valuable response.

I would not be exactly certain about that - you should definitely check other diagnostics - do you get reasonable n_eff and Rhat? How do the autocorrelation plots look like?

The more I look at your parametrization the more I wonder how it could work even with good inits. If I understand the parametrization you linked to, you need: \alpha > 0, \gamma > 0, X > \theta, although the former two are not explicitly stated:

  • \alpha < 0 would lead to negative density
  • \gamma < 0 would make E(X) < \theta and Var(X) < 0 which is impossible. It is hard to see that directly from the PDF formula, but I guess the PDF would not integrate to 1 for \gamma < 0.

The most simple way to avoid this would be to work with \alpha and \gamma on the log scale and use logistic sigmoid to transform \theta between 0 and min(X), e.g. change your code to:

target += shifted_Wald_lpdf(rt_obs[i] | exp(gamma),exp(alpha),inv_logit(theta) * min(rt_obs));

Note that this changes interpretation and sensible prior choices for your coefficients.

My Rhat and n_eff for the parameterization used in my first post seemed fine to me, but I may be wrong. Here they are for my main predictors:

Inference for Stan model: shifted_wald2.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean    sd   2.5%  97.5% n_eff  Rhat
b[1]   6.112   0.016 0.517  5.134  7.136  1062 1.006
b[2]   0.241   0.009 0.352 -0.448  0.945  1579 1.003
b[3]  -1.226   0.007 0.260 -1.745 -0.731  1298 1.002
b[4]   0.052   0.006 0.235 -0.407  0.506  1491 1.003
b[5]   2.189   0.013 0.360  1.520  2.902   742 1.004
b[6]   0.190   0.005 0.169 -0.141  0.534  1336 1.002
b[7]  -0.599   0.006 0.168 -0.942 -0.289   788 1.004
b[8]  -0.061   0.003 0.104 -0.267  0.149  1273 1.002
b[9]   0.115   0.002 0.048  0.017  0.207   591 1.010
b[10] -0.044   0.001 0.022 -0.087 -0.002  1494 1.003
b[11]  0.103   0.001 0.022  0.060  0.149   708 1.009
b[12]  0.014   0.000 0.014 -0.012  0.041  1446 1.003

Samples were drawn using NUTS(diag_e) at Sat Feb 23 12:46:14 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

I did look at autocorrelation plots, I may have misunderstood how to read them. Here is one, they all resemble it quite closely for my main predictors:

Indeed, I don’t expect much of anything out of my model for γ<0 or α<0. I assumed that failed initializations came from something like that but I’m struggling a bit with parameterization and prior choice in general. I’ll make the change, a first run of the new model seems to give different results indeed. Parameterization choices are clearer to me now, thanks!

If you don’t mind, but I can otherwise keep reading and eventually I’ll figure it out, could you explain how this changes my prior choices? I was using what seemed like weakly informative priors for my main predictors and default priors for my random effects.

Thank you for the help you provided.

Yes, those diagnostics look good.

It may change what is a big effect. On the linear scale, a coefficient of 5 for a yes/no predictor may be very reasonable. On the log scale it would translate to e^5 \simeq 148 which is IMHO huge for almost all applications. Also log scale means your predictors are assumed to have multiplicative instead of additive effects.

Prior choice is hard and heavily problem/dataset dependent - using prior predictive checks (https://arxiv.org/abs/1709.01449) is currently the best way to choose priors, but it can be a lot of work.

Sorry for the long delay, I took some time to try things out and read a bit more.

I understand what changing for the log-scale implicates a bit better, and this actually may be an issue for my current dataset. I don’t expect my effects to be multiplicative, quite the contrary actually, and the output of the modified model, although interesting, is a bit difficult to interpret.

Are there references that could help me design a model that stays on the linear scale and the positive half-line?
Does my first expression of the model make any sense at all, even though impossible values can exist for my parameters?


You could try softplus (log1p_exp in Stan) - this behaves roughly like exp for values below 0 and closely approximates linear function above 1 - see

But keep in mind that having additive effects but positive outcome is a bit inconsistent. For positive outcomes, negative effects cannot be additive. If you however can transform your inputs/problem so that all effects are known to be positive a priori, than you can simply constrain the relevant variables and you can be sure the result will always be positive even without a transformation.

OK great, thanks ! I didn’t know it existed.

If you don’t mind, maybe I’m understanding something poorly.
In the case of my dataset the intercept values should outweigh the sum of other predictors, the effects of experimental conditions are quite small in comparison. I kind of assumed in this case allowing for incoherent values was not an issue, because the sampling process would never converge towards impossible solutions. It was just a matter of providing Stan with appropriate starting values. But I’m starting to think I’m wrong? Does the fact that the model allow for inconsistent outcomes come in play for parameter estimation?

I’ll think about constraining my beta coefficients, I think that’s actually easily doable in my case. It’s a bit more complicated for random effects, but I’ll give it a shot.

I don’t think that’s necessarily wrong, but in my experience it is often a symptom of deeper problems in your theory/model. Also sometimes the model would work initially and then start to break after adding more complexity - and it might not be necessarily immediately visible the problem is that you allowed incoherent values in the first place…

However, if you feel like you know what you are doing and you carefully check your model fit (via posterior predictive checks and the like) than it might turn out to not be an issue.

OK, all of that makes sense. I’m not entirely sure I know what I am doing, but I also don’t plan to make my model any more complicated.
I’ll keep everything in mind, see if I can find a simpler way of answering the questions I have about my data.

I learned a lot, thank you.