Parameters during warmup far outside prior range

Dear all,

I’ve been trying to track down the following error message:
“Exception: bernoulli_logit_lpmf: Logit transformed probability parameter[2815] is nan, but must not be nan!” This comes from a model whose code is shown at the end of this post (since it may not be relevant to the question).

To do this, I put a print statement in the code, which is executed once per Stan evaluation of the model. For the parameter I’m monitoring (rewMagDist, with 5 entries), I’ve put a normal(1,0.7) prior together with a lower=0 constraint. The print command gives the following output for the first few steps after “output-1.csv: Iteration: 1 / 20 [ 5%] (Warmup)”:
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [48.5049,0,inf,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [34.2568,0,inf,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [0.978113,0,inf,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [0.662151,0,inf,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [0.439123,0,inf,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [0.366358,0,inf,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [0.383564,0,inf,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [0.388283,0,8.15364e+162,0,0]
output-1.csv: rewMagDist [0.391818,6.87112,1.86618,4.96533,5.44635]
output-1.csv: rewMagDist [0.390574,0,8.43455e+40,0,0]

I’ve noticed that some of the monitored values are inf or very large (e.g. 8.43455e+40). Is this normal given that I’ve put a normal(1,0.7) prior on them, or does it suggest that something is wrong with my model?
[When I changed the model to print only when the NaN actually occurs (i.e. when invTxUtil, see the model code, is nan), the parameters were always very large at that point.]

Many thanks
Jacquie

In case this is useful, here is the model code for a prospect theory model with ~70 human participants and ~1000 trials per participant. I did not manage to get this to finish sampling (I gave up after 10 h with sampling still at 20%); one way I found to make it run faster is to impose hard bounds of <lower=0,upper=2> on rewMagDist and lossMagDist, which I think mitigates the problem because these parameters appear in an exponent.

data{
  int ntr;          // number of total data points
  int nsub;         // total number of participants
  int subInd[ntr];
  int rewM[ntr,2];  // reward magnitude
  int lossM[ntr,2]; // loss magnitude
  real prob[ntr,2]; // reward probability
  int choice[ntr];  // choice on each trial (1 for left, 0 for right)
}

parameters{
  // individual subject level
  real<lower=0> invTemp[nsub];     // inverse temperature
  real<lower=0> probDist[nsub];    // probability distortion
  real<lower=0> rewMagDist[nsub];  // reward magnitude distortion
  real<lower=0> lossMagDist[nsub]; // loss magnitude distortion
}

model{
  // transformed reward and loss probabilities and magnitudes
  real rewProbTr[ntr,2];
  real lossProbTr[ntr,2];
  real rewMagTr[ntr,2];
  real lossMagTr[ntr,2];
  real utilities[ntr,2];
  // some short-cuts
  real g;
  real invTxUtil[ntr];

  // Priors
  invTemp     ~ normal(0,1); // it's a very small number
  // if there is no bias: value=1, therefore centre prior on 1.
  probDist    ~ normal(1,0.7);
  rewMagDist  ~ normal(1,0.7); // values get crazy quickly, so only allow small range
  lossMagDist ~ normal(1,0.7);

  // Going through each trial (because Stan can only do exponentiation on single numbers)
  for (it in 1:ntr){
    for (io in 1:2){
      // Transform probabilities
      g = probDist[subInd[it]];
      rewProbTr[it,io]  = prob[it,io]^g/((prob[it,io]^g+(1-prob[it,io])^g)^(1/g));
      lossProbTr[it,io] = 1- rewProbTr[it,io];
      // Transform reward magnitudes
      rewMagTr[it,io]  = rewM[it,io]^rewMagDist[subInd[it]];
      lossMagTr[it,io] = lossM[it,io]^lossMagDist[subInd[it]];
      // Compute utility
      utilities[it,io] = rewProbTr[it,io]*rewMagTr[it,io] - lossProbTr[it,io]*lossMagTr[it,io];
    }
    // combine utilities and choice stochasticity
    invTxUtil[it] = invTemp[subInd[it]]*(utilities[it,1]-utilities[it,2]); 
  }
  print("rewMagDist ", rewMagDist);
  // link utilities to actual choices
  choice ~ bernoulli_logit(invTxUtil);
}

So by default the parameters are initialized uniformly between -2 and 2 on the unconstrained space. The initialization really doesn’t take the priors into account, so the very first numbers are just totally random (priors aren’t special in any way in Stan – just more terms for the log density). Since you’ve got the > 0 constraint in there, the initial parameters should land somewhere between exp(-2) and exp(2) (not uniformly – but somewhere).
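Concretely, for a parameter declared with lower=0, Stan samples an unconstrained value u and reports exp(u), so a uniform init on (-2, 2) lands in (exp(-2), exp(2)) ≈ (0.135, 7.39). The initial vector printed above, [0.391818, 6.87112, 1.86618, 4.96533, 5.44635], sits exactly in that range.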

Yeah, there’s probably something wrong. What the printout you included seems to indicate is that every time the sampler tries to take a step, it moves from its initial condition ([0.391818,6.87112,1.86618,4.96533,5.44635]) to something totally crazy ([0.390574,0,8.43455e+40,0,0]).

I don’t quite follow the model, but you’ve got the right idea for debugging: just do more printing. Also verify that the data are what you think they are – for instance, that prob is actually a matrix of probabilities (with no values less than zero or greater than one).
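One cheap way to get such a check is to declare the bounds on the data itself; Stan validates constrained data when it is read in and stops with a clear error if anything is out of range. A minimal sketch, reusing the names from the data block above:

data{
  int ntr;
  // Stan checks these bounds once when the data is loaded, so a value
  // outside [0,1] aborts immediately with an informative error message
  real<lower=0,upper=1> prob[ntr,2];
}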

Presumably the problem is with initialization as Ben’s suggesting.

All these powers can be problematic with random inits. You can either try to rescale the parameters, or you can try initializing in a smaller range than (-2, 2) on the unconstrained scale.
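For what it’s worth, here is a minimal sketch of the hard-bound version described earlier in the thread (only the changed declarations are shown). With an upper bound, the constraining transform becomes a scaled inv_logit instead of exp, so no random init can produce a huge exponent:

data{
  int nsub;
}
parameters{
  // the unconstrained draw u maps to 2*inv_logit(u), so the exponent
  // in rewM^rewMagDist can never exceed 2
  real<lower=0,upper=2> rewMagDist[nsub];
  real<lower=0,upper=2> lossMagDist[nsub];
}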

So if my choices for getting the model to sample OK are either to use inits that are close to the true values or to use hard constraints on the parameters, which one is less likely to bias my results? With hard constraints it now works well, and the parameters are not estimated too close to the hard boundary. I have also simulated data and recovered the parameters well (for a slightly different version of the model, anyway).

However, the thing that I actually wanted to do with this model was to make a hierarchical version where I have group (2) > subject (~ 35 each) > session (~50 per subject) and 20 measurements per session.
My main question of interest is whether the two groups differ in how variable participants’ behaviour is between sessions (called p_groupSD_mu in the model below). That model did not run quickly at all, so I have now stripped it down to a single type of parameter (describing decision noise, with a slightly different formulation than the model above, because the model above did not show as good parameter recovery or model fit), and I am using a non-centred parameterisation.

I’m wondering whether there is something obvious I’m doing wrong, since this is my first 3-level hierarchical model; e.g. I’m not sure whether something is wrong with how I define the hierarchies. One issue might be that I’m using a lot of transformations, so that random inits push the parameter of interest to values it should not take given the group-level prior (i.e. noiseP_sess near 1). I have therefore tried non-random inits closer to the true values, but sampling is still very slow; and when I add more parameters it fails to finish sampling: after ~4 days it had fewer than 1000 samples per chain, and runs with fewer samples did not converge [though I did not put inits on all parameters; I will try this now]. The reason I think I need the transformation for e.g. the noiseP parameter is that in the non-hierarchical model I could see that, across people, the parameter does not follow a normal distribution, but its transformation does.

data{
  int ntr;          // number of total data points
  int nsub;         // total number of participants
  int maxSess;      // maximum number of sessions per subject
  int subInd[ntr];  // which subject which data point belongs to
  int sessInd[ntr]; // which session each data point belongs to
  int subXGroupInd[nsub]; // which subject is in which group
  int rewM[ntr,2];  // reward magnitude
  int lossM[ntr,2]; // loss magnitude
  real prob[ntr,2]; // reward probability
  int choice[ntr];  // choice on each trial (1 for left, 0 for right)
}

parameters{
  // group level parameters - means
  real<lower=0,upper=1> noiseP_MU_mu;     // random noise (rather than invTemp)
  real<lower=0> utilScale;                  // scaling the range of the utilities so that the softmax is not a hardmax
  // group level parameters standard deviations
  vector<lower=0>[1] p_groupMU_sd;       // standard deviations for drawing the individual means
  vector<lower=0>[1] p_groupSD_mu[2];       // mean of the standard deviations for drawing sessions
  vector<lower=0>[1] p_groupSD_sd;       // stdev of the standard deviations for drawing sessions
  
  // individual subject level - means
  real noiseP_subj_pr_pr[nsub];    
  // subject level standard deviations
  vector[2] p_subj_sd_pr_pr[nsub];

  //Session level parameters
  real noiseP_sess_pr[nsub,maxSess];

}

transformed parameters {
  // group
  real noiseP_MU_mu_pr;
  vector[1] p_groupSD_mu_pr[2];
  
  // individual subject
  real noiseP_subj_pr[nsub];
  real<lower=0,upper=1> noiseP_subj[nsub];
  vector<lower=0>[1] p_subj_sd[nsub];
  
  // session
  real<lower=0,upper=1> noiseP_sess[nsub,maxSess];  

  noiseP_MU_mu_pr    = logit(noiseP_MU_mu);
  for (ig in 1:2){
    p_groupSD_mu_pr[ig]    = log(p_groupSD_mu[ig]);
  }

  for (is in 1:nsub){
    noiseP_subj_pr[is]  =  noiseP_subj_pr_pr[is]*    p_groupMU_sd[1] + noiseP_MU_mu_pr; //normal(noiseP_MU_mu_pr[subXGroupInd[is]], p_groupMU_sd[subXGroupInd[is],1]);
    p_subj_sd[is,1]   = exp(p_subj_sd_pr_pr[is,1].*  p_groupSD_sd[1] + p_groupSD_mu_pr[subXGroupInd[is],1]); //normal(p_groupSD_mu_pr[subXGroupInd[is]], p_groupSD_sd[subXGroupInd[is]]) 

    for (isess in 1:maxSess){
      noiseP_sess[is,isess]     = inv_logit(noiseP_sess_pr[is,isess]   *  p_subj_sd[is,1] + noiseP_subj_pr[is]);  //implies normal(noiseP_subj_pr[is], p_subj_sd[is,1]);
    }
  }
}

model{
  // transformed reward and loss probabilities and magnitudes
  real rewProbTr[ntr,2];
  real lossProbTr[ntr,2];
  real utilities[ntr,2];
  real invTxUtil[ntr];

  // Group level priors
  utilScale    ~ normal(0,0.1);
  noiseP_MU_mu ~ normal(0,0.5);
  p_groupMU_sd ~ normal(0,1);
  p_groupSD_sd ~ normal(0,1);
  for (ig in 1:2){
    p_groupSD_mu[ig] ~ normal(0,1);
  }
  // Individual level priors
  for (is in 1:nsub){
    noiseP_subj_pr_pr[is]     ~ normal(0,1); // to give: normal(noiseP_MU_mu_pr[subXGroupInd[is]], p_groupMU_sd[subXGroupInd[is],1]);
    p_subj_sd_pr_pr[is]        ~ normal(0,1);
  }
  // Session level priors
  for (is in 1:nsub){
    for (isess in 1:maxSess){
      noiseP_sess_pr[is]     ~ normal(0,1); // to give: normal(noiseP_subj_pr[is], p_subj_sd[is,1]);
    }
  }
  // Going through each trial (because Stan can only do exponentiation on single numbers)
  for (it in 1:ntr){
    for (io in 1:2){
      // Transform probabilities
      rewProbTr[it,io]  = prob[it,io];
      lossProbTr[it,io] = 1- rewProbTr[it,io];
      // Compute utility
      utilities[it,io] = rewProbTr[it,io]*rewM[it,io]- lossProbTr[it,io]*lossM[it,io];
    }
    // combine utilities and choice stochasticity
    invTxUtil[it] = inv_logit(utilScale*(utilities[it,1]-utilities[it,2]))*(1-noiseP_sess[subInd[it],sessInd[it]]) + noiseP_sess[subInd[it],sessInd[it]]/2;
  }
  // link utilities to actual choices 
  choice ~ bernoulli(invTxUtil);
}

Yikes! Is it possible that ntr is much greater than nsub*maxSess (i.e. replicated observations of the outcome for each combination of subject & session)? If so, a trick to reduce computation is to compute only the unique rows of the design matrix and then index into them when expressing the likelihood.
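A minimal sketch of that trick (hypothetical names; it assumes the likelihood inputs really are identical within each unique row):

data{
  int ntr;                                  // number of observations
  int nuniq;                                // number of unique design rows
  int<lower=1,upper=nuniq> uniqInd[ntr];    // maps each observation to its row
  int<lower=0,upper=1> choice[ntr];
}
parameters{
  vector[nuniq] eta;                        // one logit per unique row
}
model{
  eta ~ normal(0,1);
  // the expensive per-row computation runs nuniq times instead of ntr times;
  // multi-indexing expands the result to full length only in the likelihood
  choice ~ bernoulli_logit(eta[uniqInd]);
}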

I think there’s a problem with:

for (is in 1:nsub) {
  for (isess in 1:maxSess) {
    noiseP_sess_pr[is] ~ normal(0,1); // to give: normal(noiseP_subj_pr[is], p_subj_sd[is,1]);
  }
}

I think either the inner index needs to be [is, isess] or you can get rid of the loops entirely and do:

to_array_1d(noiseP_sess_pr) ~ normal(0.0, 1.0);

Thank you so much for spotting this!! I’ve looked through this code so many times and didn’t see it, so thank you very, very much! Let’s see whether this goes any better now…

Yes, this is true, but I’m sorry, I don’t understand what you’re suggesting. Is it something about the line:

choice ~ bernoulli(invTxUtil)

I would have thought that each entry in invTxUtil is only used once? [Is there maybe a reference to this in the manual I could look at?]

OK, this now actually works :)

But it still does not converge when I add in more parameters. I am wondering whether the following features of my model may be causing problems:

  • Per session (the lowest level in the hierarchy), I only have 20 measurements, and I am trying to fit 4 parameters. Maybe this is not a good idea? [Though it did work for a hierarchical logistic regression model on the same data that also had 4 regressors.] I will now try making a set of models, each with only one parameter varying across sessions. This is not ideal, because then I have 4 models etc., but maybe it would be a step forward…
  • I have an ‘unused parameters’ situation: not every participant (mid-level of the hierarchy) has 50 sessions. Thus, I have some unused session-level parameters. They have priors from the subject level. Could this be a problem? I could recode it so that I don’t include those parameters, but it would make the code messier.

Per session (the lowest level in the hierarchy), I only have 20 measurements, and I am trying to fit 4 parameters. Maybe this is not a good idea?

20 isn’t a lot for sure haha. Since it’s 4 parameters, start by looking at the pairplots of all these and see if you find correlations n’ such. Do you have posterior predictives for this model? Do those remain good?

Thus, I have some unused session-level parameters

Yeah, this definitely could be a problem – if you don’t have priors on them especially. Check out the diagnostics for them in the output. If the diagnostics look fine (and their n_effs are relatively high and stuff) then maybe you can get away with this? Definitely not ideal, and definitely better to do ragged arrays.
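A sketch of the ragged-array approach (hypothetical names): one flat vector holds only the sessions that actually exist, and each subject’s block is addressed by a start index and a count:

data{
  int nsub;
  int nsessTotal;          // total number of sessions, summed over subjects
  int nsess[nsub];         // number of sessions for each subject
  int sessStart[nsub];     // index of each subject's first session
}
parameters{
  vector[nsessTotal] noiseP_sess_pr;   // no unused entries to sample
}
model{
  noiseP_sess_pr ~ normal(0,1);
  // subject is's sessions are segment(noiseP_sess_pr, sessStart[is], nsess[is])
}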

but it would make the code messier.

Yeah :(

Thanks for the quick reply. Just to check: to look at pair plots and diagnostics, does my model need to have converged?

does my model need to have converged?

It’s always fair to look at pairplots/traceplots/diagnostics. Those are the best ways of figuring out what’s going wrong.

They’ll likely be all funky, but you’re looking for the funky bit anyway so you can fix it.

Are you modelling each session with its own parameter? If so, that’s probably costing you a lot computation-wise. Maybe model a smooth function across sessions? Try a linear function and see whether you get unreasonable posterior predictions, and if so, move to splines or a GP.

Yes, each session has its own parameter (in the simplest version of the model), and the parameters for each session are drawn from a normal distribution whose mean and stdev are determined per subject. This is because my main interest in the study is whether these stdevs differ between my two groups of participants.

I don’t quite see how I could instead describe the session-level effect by a linear function (I don’t think there should be a linear trend between sessions)? Or is what you’re suggesting that I have something like this:
noiseP_sess[is,1] = noiseP_subj[is];
noiseP_sess[is,2] ~ normal(noiseP_sess[is,1], sd2_subj[is]);
noiseP_sess[is,3] ~ normal(noiseP_sess[is,2], sd2_subj[is]);
I don’t know whether this would be any easier for my model?

If you have multiple subjects, each of which participate in multiple sessions, then you can model a group-level slope across session and by-participant slope deviations. That is a far simpler model than having a group-level value and subject deviations for every session.
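A minimal sketch of that structure (hypothetical names), written non-centred as in the code above: a group-level slope over session number plus a per-subject slope deviation, instead of a free parameter for every subject-session:

data{
  int nsub;
}
parameters{
  real slope_mu;              // group-level slope across sessions
  real<lower=0> slope_sd;     // spread of the subject-level slopes
  vector[nsub] slope_z;       // non-centred subject deviations
}
transformed parameters{
  vector[nsub] slope;         // per-subject slope over session number
  slope = slope_mu + slope_sd * slope_z;
}
model{
  slope_mu ~ normal(0,1);
  slope_sd ~ normal(0,1);
  slope_z  ~ normal(0,1);
  // the session effect for subject is at session isess is slope[is] * isess
}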

Now I understand what you mean. This is a great suggestion for reducing my number of parameters. I’m so glad you suggested this; I’ve just tried it (only with variational Bayes, so the results might still change somewhat with proper sampling) and the results look very promising!
In fact, I tried an even simpler version in which I also got rid of the variability across days at the group level (I could put this back in); instead, I now have (using a non-centred parameterisation in my actual code):

p_sess[is,isess] ~ normal(p_groupMU[subXGroupInd[is]], p_groupSD[subXGroupInd[is]]);

From my samples, in Matlab (I don’t think I could do this directly in Stan?), I then computed the standard deviation across sessions for each participant, and then, for each of the 2 groups, the standard deviation of the (log) standard deviations of the subjects in that group. I then make histograms of the samples and compute confidence intervals etc., as I would normally do with parameter samples that I get directly from Stan.
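(Perhaps something like this in a generated quantities block would do the per-draw computation directly in Stan; a hypothetical sketch I have not tried:)

generated quantities{
  // per-subject standard deviation across sessions, computed per posterior draw;
  // assumes p_sess is declared as real p_sess[nsub,maxSess]
  vector[nsub] subj_sess_sd;
  for (is in 1:nsub)
    subj_sess_sd[is] = sd(p_sess[is]);
}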

The results are similar to what I had gotten with another version of the model (a regression) for which the original 3-level hierarchy worked, so I would think this approach is OK. And even though the model now doesn’t explicitly capture that each participant’s performance is similar across sessions, I don’t see how omitting this could bias my results…?

A demo of what I mean here: Tip for speeding up models with repeated observations