If one imputes data in brms, there is by default an implicit -Inf to +Inf prior on the imputed values.
While it is helpful that one can use the stanvar function to set a prior on imputed values, in many cases it would also be useful to set bounds on them.
Is this already possible in brms, and am I just overlooking it?
Thanks in advance - G
@paul.buerkner pointed out that one can use the trunc function to set bounds on imputed values.
bf(y | mi() + trunc(0, 1) ~ x)
will set the bounds for imputed values of y to 0 and 1.
However, estimating such a model is a bit tricky, because one has to set the priors for the imputation model carefully so that the likelihood for the imputed values does not become 0. It does work, but if one is not careful with the priors, the model will have lots of divergences and fail to converge.
To be more specific, if x has missing data bounded between 0 and 1 and we use an imputation model x \sim \alpha + \beta X, the priors for \alpha and \beta must be such that if we calculate mu_x = a + X %*% b, then the density given by normal_lpdf(x | mu_x, sigma_x) is unlikely to become zero. In practice, this means that the prior for \alpha is something like N(0.5, 0.1) and the prior for \beta has to induce a fair amount of regularisation (how much obviously also depends on the particular data).
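A minimal sketch of the setup described above, under stated assumptions: the toy data, the predictor z, and the prior on the slope are hypothetical and only for illustration, and the N(0.5, 0.1) intercept prior is the value mentioned above. It only generates the Stan code with make_stancode (newer brms versions also offer stancode), so nothing is compiled or fitted.

```r
library(brms)

# Hypothetical toy data: x lies in [0, 1] and has missing values,
# z is a fully observed predictor used in the imputation model.
set.seed(1)
n <- 100
z <- rnorm(n)
x <- plogis(0.5 * z + rnorm(n, sd = 0.5))
y <- 1 + 2 * x + rnorm(n)
x[sample(n, 20)] <- NA
dat <- data.frame(y = y, x = x, z = z)

# Outcome model uses mi(x); the imputation model for x is truncated to [0, 1].
form <- bf(y ~ mi(x)) +
  bf(x | mi() + trunc(lb = 0, ub = 1) ~ z) +
  set_rescor(FALSE)

# Tight priors on the imputation model so that mu_x stays near the bounds.
# The slope prior is an assumed example of "a fair amount of regularisation".
priors <- c(
  set_prior("normal(0.5, 0.1)", class = "Intercept", resp = "x"),
  set_prior("normal(0, 0.1)", class = "b", resp = "x")
)

# Inspect the generated Stan code without fitting the model.
code <- make_stancode(form, data = dat, prior = priors)
print(code)

# To actually fit: fit <- brm(form, data = dat, prior = priors)
```

How tight the slope prior needs to be will depend on the scale of the predictors, so the values above should be treated as a starting point rather than a recommendation.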
Perhaps brms should actually use the truncation bounds as hard bounds for the missing values on that variable? I think that would actually be consistent behavior. What do you think? If you agree, please open the GitHub issue again and remind me of this idea.
I am not exactly sure what you mean.
As far as I can see, using trunc already defines a variable with bounds and allows only values within those bounds. I also think that Stan “samples” from the bounded values without one having to put a prior on the imputed values.
Or am I missing something here?
I think what makes the model hard to estimate in my particular case is that I am using tight bounds [0,1] for all imputed values of x and am effectively estimating a linear probability model. If all observed and imputed x are between 0 and 1, mu_x is large (let's say 15), and sigma_x is small (let's say 0.1), this could cause underflow problems in this part of the likelihood of the Stan model, specifically the log_diff_exp:
target += normal_lpdf(Y_lx[n] | mu_x[n], sigma_x) -
  log_diff_exp(normal_lcdf(ub_x[n] | mu_x[n], sigma_x),
               normal_lcdf(lb_x[n] | mu_x[n], sigma_x));
However, I am not 100% sure this is the issue: on the one hand, I can tame the model by setting shrinkage priors for b_x and Intercept_x; on the other hand, Stan implements log_diff_exp in a manner that protects against underflow problems (I think). I also see that the model is easy to fit if I comment out the log_diff_exp term of the likelihood (which makes it a true linear probability model). So while it empirically looks like the problem is with this part of the likelihood, I don’t think I fully understand why this happens.
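To make the arithmetic concrete, here is a small base-R sketch using the illustrative values from above (bounds [0, 1], mu_x = 15, sigma_x = 0.1). It compares a naive probability-scale normalizer with a log-scale version that mirrors log_diff_exp. This only illustrates the numbers in R; it is an assumption that it says anything about how Stan's normal_lcdf behaves this far in the tail, which may differ between Stan versions.

```r
# Illustrative values from the discussion: bounds [0, 1], mu_x = 15, sigma_x = 0.1.
mu <- 15; sigma <- 0.1; lb <- 0; ub <- 1

# Naive normalizer on the probability scale: both CDF values underflow to 0,
# so the log of their difference is -Inf.
naive_log_norm <- log(pnorm(ub, mu, sigma) - pnorm(lb, mu, sigma))

# Same quantity on the log scale, mirroring log_diff_exp(lcdf_ub, lcdf_lb).
lcdf_ub <- pnorm(ub, mu, sigma, log.p = TRUE)
lcdf_lb <- pnorm(lb, mu, sigma, log.p = TRUE)
stable_log_norm <- lcdf_ub + log1p(-exp(lcdf_lb - lcdf_ub))

# Truncated-normal log density for a value inside the bounds.
x <- 0.5
log_dens <- dnorm(x, mu, sigma, log = TRUE) - stable_log_norm

print(c(naive = naive_log_norm, stable = stable_log_norm, log_dens = log_dens))
```

In this sketch the log-scale normalizer stays finite while the naive one is -Inf, so the differencing step itself need not be the culprit. The finite values are still extreme, which is compatible with the observation that shrinkage priors keeping mu_x near [0, 1] make the model much easier to fit.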
PS: I think the model would be easier to fit if one could use non-Gaussian likelihoods for imputed variables. But I think implementing this would likely necessitate a lot of downstream changes, so I am refraining from making a feature request for this. :-)