Thank you! This solution is accurate and wicked fast!
I’m still trying to unpack what you’ve done. Apologies if this is getting cumbersome. I’m just very thirsty to learn and this thread has become a fountain of knowledge.
Logit transformation
The logit transform is kind of blowing my mind, but it seems to be the most important element of removing the divergence I was observing, so I’m just putting down some thoughts.
I’ve not tried working in transformed response space before (I’ve only ever worked with transformed independent variables), so this is a novel realm for me and I don’t entirely understand it, yet. Looking at a graphical comparison of theta versus logit(theta) below, it appears that the relationship is essentially linear, given the domain of theta:
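To convince myself of that near-linearity numerically, here is a small pure-Python sketch (the theta range 0.10–0.50 is a made-up stand-in for my data's actual domain): the correlation between theta and logit(theta) over a narrow interior interval is very close to 1.

```python
import math

def logit(p):
    """Log-odds transform: maps (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

# Assumed domain for theta (hypothetical; substitute your data's range)
thetas = [0.10 + 0.01 * i for i in range(41)]   # 0.10, 0.11, ..., 0.50
ys = [logit(t) for t in thetas]

# Pearson correlation between theta and logit(theta)
n = len(thetas)
mx, my = sum(thetas) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(thetas, ys))
sx = math.sqrt(sum((x - mx) ** 2 for x in thetas))
sy = math.sqrt(sum((y - my) ** 2 for y in ys))
r = cov / (sx * sy)
print(r)   # close to 1: nearly linear on this interval
```

So on a narrow interior interval the transform is almost a straight line, but it is not exactly linear (the slope 1/(theta·(1−theta)) grows toward the endpoints).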
In an effort to understand the utility of this transformation, I replaced some of the code with a simple linear transform that multiplied theta by 10, effectively expanding the response space.
The transformed data block became:
transformed data {
  vector[N] theta_logit = 10 * theta;
}
And the model block became:
model {
  …
  theta_logit ~ normal(10 * theta_pred, sigma);
}
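One thing worth noting about this comparison: multiplying the response by a constant is an exact reparameterization of the normal model, so the structural parameters should be untouched and only sigma should scale by the constant. A pure-Python sketch with made-up data (ordinary least squares standing in for the Stan fit) shows the slope, intercept, and residual sd all scaling by exactly 10:

```python
import math

# Toy data (hypothetical): a smooth response with a small wiggle as "noise"
xs = [0.1 * i for i in range(20)]
ys = [0.5 * x + 0.05 * math.sin(7 * x) for x in xs]

def ols(xs, ys):
    """Slope, intercept, and residual sd of a least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return b, a, s

b1, a1, s1 = ols(xs, ys)                       # original scale
b2, a2, s2 = ols(xs, [10 * y for y in ys])     # response times 10
print(b2 / b1, s2 / s1)   # both about 10: same fit in different units
```

So the 10x run's sigma of 0.175 corresponds to about 0.0175 on the original theta scale; it isn't directly comparable to the logit run's sigma, which lives on the logit scale.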
Using iter = 10000, warmup = 5000, thin = 2, and chains = 4, I got the following results:
10x transform:
extract(vgBayes, pars = c("s", "r", "a", "n", "sigma", "RMSE")) %>%
purrr::map_df(mean)
s r a n sigma RMSE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.802 0.200 0.210 1.54 0.175 0.0174
Logit transform:
s r a n sigma RMSE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.802 0.201 0.203 1.56 0.100 0.0176
It appears that the logit transform is more accurate than the 10x transform, maybe because it varies across zero?
Does anyone have any thoughts on why this is? Or perhaps a favorite book, website, YouTube series, or article that you’d recommend, but is still comprehensible for someone who has only taken up to linear algebra and integral calculus?
Parameters and desperate voodoo
Uninformative priors are okay?
That’s really interesting that you can get away with not explicitly estimating the r and s parameters. From the manual, this suggests an implicit uniform prior between 0 and 2, which I guess makes sense, since it is in the ballpark of the expected result. I’m guessing that’s not a problem because, as you say, the function is relatively well behaved (now).
I played around with removing a and n from the model block, and I got slightly worse results, but not by much, suggesting a moderately informative prior is still useful – probably because a and n are highly correlated, so adding priors helps decorrelate them a bit.
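That decorrelation effect shows up even in a tiny Gaussian caricature (a sketch, not my actual model): if the data only inform the sum a + n, the likelihood alone leaves a and n perfectly negatively correlated (the posterior is flat along a + n = const), and adding independent priors pulls the correlation back toward zero.

```python
import math

# Likelihood: one observation y = a + n + noise (sd 1) informs only a + n.
# Its information matrix is the singular [[1, 1], [1, 1]], so with no
# priors corr(a, n) degenerates to -1.

# Add independent standard-normal priors on a and n:
# posterior precision = prior precision (identity) + likelihood information.
P = [[1 + 1, 1],
     [1, 1 + 1]]                       # [[2, 1], [1, 2]]

# Invert the 2x2 precision matrix to get the posterior covariance.
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
cov = [[ P[1][1] / det, -P[0][1] / det],
       [-P[1][0] / det,  P[0][0] / det]]

corr = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])
print(corr)   # about -0.5: still correlated, but no longer degenerate
```

The priors don't remove the correlation, but they make the posterior proper and much easier for the sampler to explore.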
No need to truncate?
I also find it interesting that you can get away with no truncation. I’m guessing this is because the truncation would only come into play if the posterior crept up against the truncation bounds?
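That intuition checks out arithmetically (a sketch with made-up numbers): truncating a normal just divides its density by the probability mass inside the bounds, and when the posterior sits several sd away from the bounds that mass is essentially 1, so the truncation term changes nothing.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def mass_inside(mu, sigma, lo, hi):
    """Normalizing constant a truncated normal divides its density by."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# Mass well away from bounds (hypothetical 0-2 bounds): truncation is a no-op.
m_far = mass_inside(1.0, 0.1, 0.0, 2.0)

# Mass piled up near a bound: the truncation term really matters.
m_edge = mass_inside(0.05, 0.1, 0.0, 2.0)

print(m_far, m_edge)   # about 1.0 versus about 0.69
```

So dropping the truncation is harmless exactly when the posterior stays in the interior, which seems to be the case here now that the divergences are gone.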
In any case, I really appreciate the effort you put into this @andre.pfeuffer!