Rejecting initial value

Hi,

I'm trying to fit the following model with brms:

brm(
  data = data,
  family = skew_normal(),
  formula = bf(y ~ 1 + factor1 * factor2 + covariate1 + covariate2 +
                 (1 + factor2 + covariate1 + covariate2 | subject),
               sigma ~ 1 + factor1 * factor2 + covariate1 + covariate2),
  control = list(adapt_delta = 0.95, max_treedepth = 15)
)

For some models I get the following message in the output:

Chain 4: Rejecting initial value:
Chain 4:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 4:   Stan can't start sampling from this initial value.

It seems that Stan just tries different initial values by itself, so the model can still be fitted and I get estimates. I wonder whether these estimates are reliable, or whether I have to change something about the model specification?

This means that the default method of finding initial values failed for that chain. By default, initial values are drawn between -2 and 2 on the unconstrained scale (which is probably the actual scale for most of your variables; see the Stan manual for details). Those values might not work well if, for example, you have very large predictor values: say -2 * covariate1 is 10000 while y is around 10 - the difference between prediction and actual value is then so big that the log probability underflows to negative infinity.
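To make the underflow concrete, and to show one possible workaround (starting the chains at zero on the unconstrained scale), here is a minimal R sketch; note that the argument is called inits in older brms versions and init in newer ones:

# With a huge linear predictor, the normal density is numerically 0,
# and log(0) is -Inf - exactly the message Stan prints above.
log(dnorm(10, mean = -2 * 5000, sd = 1))  # -Inf

# Possible workaround: start all chains at zero on the unconstrained
# scale instead of the default uniform(-2, 2) draws.
fit <- brm(y ~ 1 + covariate1, data = data, inits = "0")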

The large adapt_delta and max_treedepth also indicate there are other problems. I would suggest that you:

  • normalize your predictors (AFAIK brms centers them by default but does not scale them)
  • start with a simple model (e.g. y ~ 1 + factor1*factor2 + covariate1 + covariate2 with the normal family and without predicting sigma), and once it fits the data, see where posterior predictions fail (see posterior predictive checks) and add complexity gradually, only when justified by a poor model fit. If you encounter problems this way, you know they are very likely caused by the last element you added. (A minimal sketch of both steps follows below.)
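Something like this (scale() both centers and scales; the variable names are just the placeholders from the original post):

# Normalize the continuous predictors.
data$covariate1 <- as.numeric(scale(data$covariate1))
data$covariate2 <- as.numeric(scale(data$covariate2))

# Start simple: fixed effects only, normal family, constant sigma.
fit_simple <- brm(
  y ~ 1 + factor1 * factor2 + covariate1 + covariate2,
  data = data,
  family = gaussian()
)

# See where posterior predictions fail before adding complexity.
pp_check(fit_simple)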

Also, this is weird:

y ~ 1 + factor1*factor2 + covariate1 + covariate2 + 
      (1 + factor1*factor2 + covariate1 + covariate2)

why do you repeat the same thing twice in the formula? Didn’t you intend to have something like (1 | factor1*factor2)?

Hi Martin,

…I can’t believe it, but yeah, this is a typo that slipped in, and it did not even lead to an error. I should take more breaks… It should have been:

formula = bf(y ~ 1 + factor1 * factor2 + covariate1 + covariate2 +
               (1 + factor2 + covariate1 + covariate2 | subject),
             sigma ~ 1 + factor1 * factor2 + covariate1 + covariate2)

I have now fixed it above as well. That does not resolve the issue, though. I did normalize all continuous variables before fitting the model.

About trying simpler models: I am not sure whether this is something that could have been fixed by updates in brms or Stan, but in fact I started with simple models months ago when I began this work, and those very often led to divergent transitions. The skew normal is much better able to reproduce the data, because even after log-transformation the outcome remains skewed. According to posterior predictive checks, the skew_normal does better than the gaussian or student families. The control options are required to make any model I have tried so far fit (tested months ago with brms 2.3 or 2.4, I guess).

But for some models the pp-check shows that the model with fixed sigma does better, because letting sigma be estimated from the predictors causes extreme, unrealistic outliers in the predictions for some outcomes.

I do not really understand yet why this model does worse in those specific cases…

That’s good to know. However, in my experience, divergences rarely go away by making the model more complex, so I would suggest debugging with the simplest model that causes problems anyway.

One more possible cause is the priors - the default priors in brms often don't work well for non-standard models. Check the results of get_prior, definitely add a prior where there is none, and possibly also replace the wide student_t defaults with something like normal(0,1) and see if that helps.
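For example (the formula below is just the one from the original post):

# See which parameters currently have flat or very wide default priors.
get_prior(
  bf(y ~ 1 + factor1 * factor2 + covariate1 + covariate2 +
       (1 + factor2 + covariate1 + covariate2 | subject),
     sigma ~ 1 + factor1 * factor2 + covariate1 + covariate2),
  data = data, family = skew_normal()
)

# Replace the wide defaults with something tighter.
my_priors <- c(
  prior(normal(0, 1), class = b),   # population-level coefficients
  prior(normal(0, 1), class = sd)   # group-level standard deviations
)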


I found out back then that putting a strong prior on the sd parameter, such as prior(exponential(35), class = sd), did help models converge without divergent transitions. But setting priors on the Intercept or b did not really help here.

Would you say that, if the model converges better when I do not fit individual slopes for the covariates and one of the factors, I should leave them out? I put those in mainly because there are papers suggesting that a “maximal model” is better than only allowing (1|subject). Theoretically the structure makes sense here, too, because there could be differences in how individuals respond to these predictors… Back then I figured that there was not much I could do about the divergent transitions other than changing control parameters and trying different priors. I see the problem that I am not given the time to learn this properly. Hopefully that changes soon… It took too much time to dive that deep. However, that work needs to be finished now.

I think the problem I initially posted about does not appear if I keep \sigma fixed. I relaxed the assumption of equal \sigma because of the paper published by Kruschke. As I said, I do not understand why this model does worse for some of my outcomes when it comes to prediction (point estimates are fine). If I do not understand it soon, I will go with a fixed \sigma for now, even though Kruschke's paper suggests this should be less accurate here…

Edit: I understand, of course, that if the model estimates a large sigma for certain predictors, then outliers are produced in the predictions. But those outliers were really extreme (e.g. -20, whereas realistic outcome values were in the range of -5 to +5).

I don’t think this is a strong prior; for normalized predictors I often use prior(normal(0,1), class = sd) and it seems to work OK (but that also depends on the scale of the outcomes).

A non-converging model is generally a no-go; its estimates can be very wrong. So unless you have investigated and are certain that the divergences are actually not a problem (hard to do and rarely the case), you should avoid the non-converging model.

Yeah, reality can be annoying at times :-) And doing things properly tends to be a loooot of work. You should however really do some posterior predictive checks to see if your model is not grossly wrong. This is not a lot of work and can catch a lot of errors.
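Such a check is quick in brms (here fit stands for your fitted model object; the argument is nsamples in older brms versions and ndraws in newer ones):

# Overlay densities of replicated datasets on the observed outcome.
pp_check(fit, type = "dens_overlay", nsamples = 100)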

If sigma is the main issue, you might consider keeping it semi-fixed, i.e. putting very narrow priors on the coefficients affecting sigma (except for the intercept). Note that sigma is fit on the log scale, so as a rule of thumb, when you have 6 coefficients, each given a normal(0,1) prior, the model considers sigmas up to exp(6 * 1.96) ~= 128027 as a priori plausible and up to exp(6) ~= 403 as reasonably likely.
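The arithmetic behind that rule of thumb, checked in R:

# Each of the 6 coefficients at ~2 prior standard deviations,
# then back-transformed from the log scale:
exp(6 * 1.96)  # ~128027: still a priori plausible sigma
exp(6)         # ~403: reasonably likely sigma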

Finally, setting priors is best done with prior predictive checks (see [1709.01449] Visualization in Bayesian workflow), but that can be a lot of work, so it's probably not suitable under time pressure :-)
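In brms a prior predictive check is just the same model fitted with sample_prior = "only" (formula and my_priors stand for your formula and prior choices):

# Ignore the likelihood and sample from the priors alone, then look
# at the outcomes the priors imply.
fit_prior <- brm(formula, data = data, family = skew_normal(),
                 prior = my_priors, sample_prior = "only")
pp_check(fit_prior, nsamples = 100)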


Thanks so much for your input. Really helpful.

First of all: unfortunately, I get similar issues even when sigma is fixed. I do not get the message in all chains, only in the fourth in this case. I think I overlooked this months ago because brms did not always print these messages. I tried the same model with the gaussian family and the issue disappears (at least this time), but then I get divergent transitions again.

I don’t think this is a strong prior; for normalized predictors I often use prior(normal(0,1), class = sd) and it seems to work OK (but that also depends on the scale of the outcomes).

Does brms then automatically use a half-gaussian for the sd parameter (because that parameter is always positive, right)? Or does brms, just like with sigma, model it as log(sd)? I need to make sure I do not misunderstand something now: if I plot that prior distribution:

qplot(rexp(1e4, rate = 35), geom = "density") vs. e.g. the normal you use, then the exponential(35) prior makes values higher than 0.10 already extremely unlikely, so this is actually a stronger prior than the normal(0,1), contrary to what I thought.
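Checking the tails directly (assuming the normal(0,1) on sd effectively acts as a half-normal because of the positivity constraint):

pexp(0.10, rate = 35, lower.tail = FALSE)  # ~0.03 for exponential(35)
2 * pnorm(0.10, lower.tail = FALSE)        # ~0.92 for half-normal(0, 1)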

About convergence: yes, I always made sure models converged. But so far I have only managed that with the skew_normal and that exponential prior on the sd parameter.

If sigma is the main issue, you might consider keeping it semi-fixed, i.e. putting very narrow priors on the coefficients affecting sigma (except for the intercept). Note that sigma is fit on the log scale, so as a rule of thumb, when you have 6 coefficients, each given a normal(0,1) prior, the model considers sigmas up to exp(6 * 1.96) ~= 128027 as a priori plausible and up to exp(6) ~= 403 as reasonably likely.

So, in my case, when it comes to the prior for sigma, I should pick a much stronger prior, such as maybe normal(0, 0.25). And/or I could limit it to 1-2 coefficients (actually that is the main idea of the Kruschke paper anyway, so that the assumption of homogeneity of variance between the groups of interest is relaxed)…
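A sketch of such a semi-fixed sigma, using set_prior with dpar = "sigma" to target only the coefficients of the sigma part (the intercept stays free):

# Very narrow priors on the sigma-part coefficients.
sigma_priors <- set_prior("normal(0, 0.25)", class = "b", dpar = "sigma")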

EDIT: Indeed, if I use strong priors here and reduce the number of coefficients, the problem of unrealistic predictions disappears!

I will try a bit more to figure out why I only get the issue with the skew_normal and not with the gaussian…

Yes, it does - sd parameters are constrained to be positive, so a normal(0,1) prior effectively becomes a half-normal.

You are correct, I messed up the exponential definition (I thought its mean is \lambda while it actually is \lambda^{-1}), sorry for the confusion.

Great, best of luck! I think the problem might be that the skew is not well identified by the data (the data give you little information about the skew). This might be because you have a lot of predictors. Also, you may want to put a strong prior on the skew, so it is included only when the data strongly speak in favor of having some skew.
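In brms the skewness parameter of the skew_normal family is called alpha, so such a prior could look like this (the normal(0, 1) scale is just an illustration):

# Shrink the skew toward zero unless the data strongly support it.
skew_prior <- set_prior("normal(0, 1)", class = "alpha")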
