Hi! I’ve spent over a year on this problem and need help!
I am trying to model RMSE values produced by external algorithms.
I have 4 factors (for simplicity, being modeled as unordered):
size (3 levels)
noise (3 levels)
shape (2 levels)
interps (9 levels; the repeated measure)
Also, I have:
g_subj (unique id)
g_rep (20 levels; unique id; subjects are nested within replicates; each rep is a replication of the entire experiment)
The RMSE values are y and I am using the custom gamma2() distribution from @paul.buerkner's notebook to model the heteroscedasticity.
In fitting the variance for the Gamma, are we estimating different shape parameters (essentially, different gamma distributions)? Does this completely eliminate the variance-mean relation?
But I keep getting different combinations of max tree depth warnings and divergent transitions for every iteration after warmup.
I was able to successfully implement:
y ~ g_size * g_noise * g_shape * g_interps
v ~ g_size * g_noise * g_interps
I originally started with nlme but it had trouble modeling 0 variances and 0 random effects. I then moved to lme4 but it could not model the heteroscedasticity, so now I’m using brms.
Run times for the models range from 4 hours to 3 days, depending on the complexity.
Thank you so much @paul.buerkner for responding. I’ve pored over so many of your and @martinmodrak's docs and blogs! They are by far the most helpful things on Bayesian modeling.
I have tried simpler models. From my LMM books, the modeling approach I am taking is the top-down strategy: first fitting the loaded means model with all the fixed effects and then adding the varying group-specific effects.
I can successfully model without any group effects:
y ~ g_size * g_noise * g_shape * g_interps
v ~ g_size * g_noise * g_interps
It is when I add the varying intercepts for g_rep and g_subj on y that I have convergence issues.
From the glmer and lme models I ran, I know I need to model at least + (g_noise * g_interps | g_rep) and, if possible, + (g_interps | g_subj), as both have certain levels with a significant effect.
I think the problem stems from some levels of the factors having rather large varying-effect SDs while several others are zero and others still are extremely small, say 0.00025. I am having trouble specifying priors across these different ranges. Since all my predictors are factors, there is no standardizing to do. Also, I’ve read on this forum that setting priors on SDs smaller than 0.02 should never be done.
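As a sketch of the formula structure only (family, priors, and data omitted; this is what I'm aiming for, not a model I've been able to fit), in brms terms that would be roughly:
bf(y ~ g_size * g_noise * g_shape * g_interps +
       (g_noise * g_interps | g_rep) + (g_interps | g_subj),
   v ~ g_size * g_noise * g_interps)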
Additionally, I don’t know that I have the prior knowledge to make such a strong prior.
My first concern would be if you have enough data to inform all the parameters - it seems you have well over 300 fixed effects + all the varying intercepts/effects. Data not being informative about all parameters can result in all sorts of pathologies as discussed in https://betanalpha.github.io/assets/case_studies/underdetermined_linear_regression.html
A good way to see if you can add additional parameters to your model and still learn something is via posterior predictive checks (see docs for pp_check, other posts here or https://arxiv.org/abs/1709.01449). If you can devise a posterior predictive check that shows a systematic inability of the model to capture some aspect of the data - e.g. if you do grouped PP check for your replicates and some replicates are “above” what your model expects and some are “below” you might be able to add replicates as a predictor. If on the other hand the posterior variability encompasses the between-replicate variability quite well, it is much less likely that your model is able to inform the per-replicate varying intercept.
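As a minimal sketch of such a grouped check (with fit standing in for your fitted brmsfit; the stat and group arguments are just the ones you would vary):
pp_check(fit, type = "stat_grouped", stat = "mean", group = "g_rep")
pp_check(fit, type = "stat_grouped", stat = "sd", group = "g_rep")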
Such good information @martinmodrak. Thanks for pointing me in a direction.
Several points of confusion:
I thought this was a perk of Bayesian models: that they allow for modeling when you don’t have enough data.
Number of obs: 3240; groups: g_subj, 360; g_rep, 20. I can scale up to include more reps, but this costs some time. Each new rep adds 18 subjects with 9 repeated measures. I limited it to 20 reps because it takes a long time to run. Should I add more reps and see what happens?
I’ve read the underdetermined blog and, while I understand it on the surface, I don’t think I fully grasp what it is saying; I am having trouble connecting the dots with what I know and cementing it in my head. Is there an additional reference, maybe one that clarifies the hyperplane of degenerate model configurations? Is this not the same as nonidentifiability and multimodality issues?
I have done pp_checks on models with and without group effects and for both it fails to capture many regions. Are pp_checks valid when the model doesn’t converge?
Here is the pp_check grouped by g_rep for the nonconverging model with only the g_rep varying-intercept group effect: pp_check(model, type = 'stat_grouped', stat = "mean", group = "g_rep")
However, if I were to model with gamma instead of gamma2 (estimating only a single shape parameter instead of multiple v parameters as in gamma2), the pp_check of a nonconverging model with all g_rep group effects and the pp_check of a nonconverging model with varying g_interps and intercept g_subj group effects both have seemingly perfect fits, with outrageously large estimates for the gamma shape parameter, and the DHARMa plots show the heteroscedasticity I am expecting. I would assume this perfect-fit behavior comes from not having enough data. But again, I thought this was the whole point of Bayesian: that with priors you can overcome this lack of data. Perhaps this is the Bayesian way of overfitting, in that you have to make ridiculously strong priors?
I thought when modeling one is to always include the predictors that convey the structure in the data, regardless of whether they are significant. How can adding the varying intercepts cause there not to be enough data? I’ve read in posts on here that the amount of data needed for additional group-effect estimates is irrelevant for Bayesian models.
When you say “Data not being informative about all parameters”, what exactly do you mean? Are you saying that there is not enough information in the data to estimate a particular effect and more data would NOT be helpful unless it conveys new information? Or just that not enough data was collected to distinguish every effect from the others?
I tried running a model with g_size and g_noise as numerical variables instead of factors, and I get “Rejecting initial values… Shape parameter is 0, but must be > 0! … Shape parameter is inf, but must be finite!”
which comes from 0/# and #/0 operations in the gamma2 reparameterization that defines the shape parameter. Why was this not a problem when using factors, and how would I fix this? @paul.buerkner
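For reference, my understanding (an assumption on my part, not copied from the notebook) is that gamma2 converts the mean mu and variance v to shape and rate in the standard way, shape = mu^2 / v and rate = mu / v, which is where the zero and infinite shapes would come from:
# illustration only, assuming the standard mean/variance parameterization
gamma2_shape <- function(mu, v) mu^2 / v
gamma2_shape(mu = 0, v = 1)   # 0   -> "Shape parameter is 0, but must be > 0!"
gamma2_shape(mu = 2, v = 0)   # Inf -> "Shape parameter is inf, but must be finite!"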
Looking at this a bit more, I think model mismatch is a slightly more likely cause of your problem than underdetermined regression. Details below.
Well the Bayes approach helps when there is little data, but there are some limits for computing the posteriors. A simple example is fitting a normal distribution with unknown mean and unknown variance. When we have at least 2 observations, both mean and variance are constrained mostly independently, making the posterior more or less a “blob” in 2D space. But when we only have 1 observation the situation changes qualitatively - either we can have small variance and the mean is constrained close to the single observation or large variance and then the mean is allowed a wide range of values. This dependency structure results in a funnel-like posterior (the extent of the funnel is constrained only by the prior). This is not a problem in theory, but in practice such dependencies and structures make it hard to explore the posterior with an algorithm (including Stan). The problem is similar to the “Neal’s funnel”, but cannot be easily resolved by reparametrization.
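A tiny sketch if you want to see this for yourself (illustrative data and default priors, nothing from your model; any divergence warnings here are exactly the point):
library(brms)
library(bayesplot)
# a single observation with unknown mean and sigma: the posterior shows the
# funnel-like dependency between the two parameters described above
d1 <- data.frame(y = 1.5)
fit1 <- brm(y ~ 1, data = d1, chains = 4, seed = 1)
mcmc_pairs(as.array(fit1), pars = c("b_Intercept", "sigma"))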
Similarly, if you try to fit a linear regression (even with known variance) to a single point, the coefficients become unconstrained individually - any value of intercept is consistent with the data if you choose an appropriate slope. The data only carve out a subspace of possible pairs.
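A quick numerical illustration (made-up numbers): for a single point (x0, y0), every pair with intercept = y0 - slope * x0 reproduces the data exactly.
x0 <- 2; y0 <- 5
slopes <- c(-1, 0, 1, 10)
intercepts <- y0 - slopes * x0
intercepts + slopes * x0   # always exactly y0, whatever the slope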
In both cases the problem is related to non-identifiability - you end up with many very different model configurations consistent with the data.
While you probably have quite a lot of data to fit the mean, I don’t have intuition for how this might interact with you also having a predictor for the variance of the gamma, as variances require more data to learn than means. I would guess it is most likely not a problem.
I would also note that if you are constrained by runtime, you might want to try INLA - it is a bit more cumbersome to use, but it scales well beyond what you can do with brms (but it doesn’t let you put predictors on variance)
Interpreting pp_checks for non-converging models is sometimes still helpful, but you cannot rely on them too much.
This might be an important clue. Just to be clear: using the gamma family with a fixed shape still doesn’t converge? Could you share the pp_checks? I am unfortunately not familiar with DHARMa plots, so I can’t comment on those. Also, what do you mean by “outrageously large” estimates? Is the posterior narrow around a large estimate, or is it very wide, including both small and large shapes?
Once again the amount of data is irrelevant for Bayesian theory, but might be a problem for Bayesian computation. I meant “not informative” in the sense I mentioned above, that data only limit combinations of parameters but not individual parameter values.
Anyway, I think you are most likely struggling with your model not being a good match for the data. Your current model is too large to debug anything, so I would suggest you start with modelling only a small subset - say a single size, single noise, single shape and single interp (i.e. eliminating all the fixed effects) and only a few subjects and replicates. Does the model converge? Do the pp_checks look healthy? I would guess that maybe the problem is that the gamma2 response is not a good model and you need a more heavy-tailed distribution (say log-normal). pp_checks should be able to show if this is the case. I would definitely look at the dens_overlay, and stat checks (for sd and min / max stats).
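A sketch of what I mean (the factor levels picked and the stock Gamma family are just placeholders to keep the example self-contained):
library(dplyr)
library(brms)
# keep a single size/noise/shape/interp combination and fit a minimal model
d_sub <- t_long_CNSubset20red1_sub3_unord %>%
  filter(g_size == levels(g_size)[1],
         g_noise == levels(g_noise)[1],
         g_shape == levels(g_shape)[1],
         g_interps == levels(g_interps)[1])
fit_sub <- brm(y ~ 1, data = d_sub, family = Gamma(link = "log"), chains = 4)
pp_check(fit_sub, type = "dens_overlay")
pp_check(fit_sub, type = "stat", stat = "sd")
pp_check(fit_sub, type = "stat", stat = "min")
pp_check(fit_sub, type = "stat", stat = "max")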
OOOOh…it took me reading this a few times to understand. With 2 points, we can average to estimate the mean and average to estimate the variance; the mean and variance are independent in the posterior.
But if we have 1 point, we can no longer average and may either have an exact fit through the data point, which gives a variance of 0, or a smoother fit with a larger variance. This is the dependency between the mean and variance that arises when there is not enough data. I believe I read this in one of my books; I’ll have to go back and find those sections.
Please correct me: we can see the same phenomenon (this qualitative change) in the underdetermined blog. The hyperplane of degenerate configurations has this same dependency, with the measurement variance either decreasing to 0 for an exact fit or increasing for a smoother fit. This change in measurement variance, and how it affects the fit, can be seen as the concentration of the likelihood (and of the posterior density) varying around this hyperplane (i.e., around the exact fit).
As the variance decreases to 0, the likelihood concentrates ever more tightly around the hyperplane and the posterior density becomes more focused there. As the variance increases, the likelihood spreads further from the hyperplane and the posterior density becomes weaker around it. The hyperplane of degenerate configurations contains the potential solutions: when the variance is 0 there is an exact fit through one of the configurations, but as the variance increases, more configurations become plausible solutions. It is this fluctuation of the likelihood function and posterior density that makes sampling difficult; it creates the funnel, where a large step size is needed when the variance is large and the posterior density is spread out (too small a step size may lead to max tree depth warnings), and a small step size is needed when the variance is near 0 (too large a step size may lead to divergences).
So now I wonder: how does this relate to multiple measurement variances in heterogeneous-variance situations or non-normal variance situations?
Yes, more coefficients to estimate than data, underdetermined, underconstrained system with many solutions defined by the hyperplane of degenerate configurations.
I am still confused as to how priors fit into all of this and why adding stronger priors would not alleviate the problem. Wouldn’t the priors serve to additionally constrain the problem for a solution such that it wouldn’t be considered underdetermined anymore?
Why would you guess it is not a problem? Is there no way to tell whether you have enough data or whether additional, stronger priors will help? I thought heterogeneous variances were a commonly encountered problem, but materials explaining this clearly in a Bayesian sense are sparse. From the examples I’ve seen, I believed I had enough data.
I believe modeling the variances is crucial; otherwise nlme with log(y) or glmer would have sufficed :(. But sadly, the heterogeneous variances are clearly visible.
The plots for this severely non-converging model in brms look great, if I can just get it to converge!:
# mean formula restored from the description above (fixed effects plus only the g_rep varying intercept)
brm(bf(y ~ g_size * g_noise * g_shape * g_interps + (1 | g_rep),
       v ~ g_size * g_noise * g_interps),
    prior = c(
      # y ~ Intercept + b
      prior(normal(4, 3), class = "Intercept"),
      prior(normal(2, 3), class = "b"),
      # v ~ Intercept + b
      prior(normal(0, 4), class = "Intercept", dpar = "v"),
      prior(normal(0, 5), class = "b", dpar = "v"),
      # y ~ sd
      prior(normal(0, 0.2), class = "sd")),
    family = gamma2, stanvars = stanvars_gamma2,
    sample_prior = "yes",
    seed = seed3,
    data = t_long_CNSubset20red1_sub3_unord,
    warmup = 1000, iter = 2000, chains = 4, cores = 4,
    control = list(adapt_delta = 0.90, max_treedepth = 10))
In both of these more complex models, the shape parameter is estimated in the tens of thousands.
Not a problem for Bayesian theory because it works in the limit of infinite computation, while in practice we are finite and must contend with pathological geometries?
But aren’t these the same thing? Doesn’t limiting combinations of parameters also limit individual parameter values? As you limit enough combinations, you’ll eventually arrive at a unique solution for the parameter values. Oh, but wait, I shouldn’t be switching between frequentist and Bayesian thinking here. In the Bayesian view, the data define the likelihood; once you have enough data to properly represent the likelihood, more data is irrelevant. So the data define the likelihood, which defines the range of values over the parameters?
How can the pp_checks look healthy when the model is so blatantly wrong?
How can gamma2 be wrong when the data is gamma? RMSE is a chi distribution. I tried log-normal and it did not go well, but I will take your advice and try again. How would the pp_check be able to show whether a more heavy-tailed distribution is appropriate versus poor priors, which I’ve seen in the brms workflow example?
Sometimes they do. But sometimes you would need unreasonably strong priors. And I think (or more precisely: I guess) that in high-dimensional settings things become much more difficult, as there are many more ways for the geometry to become problematic. And the higher the dimension, the stronger the priors you would likely need. But I am just guessing. As I said, it is quite possible this is not at all the issue with your model.
Yes, in the sense that you could get the posterior right if you had infinite computational resources (the limit with respect to infinite data is not what matters here). With finite resources you might be unable to explore the posterior fully.
I am sorry if I am not very clear - the point I was trying to make is similar to the case of fitting a line to a single point.
Note: I am not suggesting you fit your whole dataset with a simple model, but to filter your data to only contain values that were generated with the same single size/noise/shape/interp. In this setting, the larger model is correct only if this small model is correct for each individual subset.
The PP checks you sent look a bit like you might be overfitting; I would expect a bit more variability in the densities, but it is far from clear. The main problem I see with your model is that it is very big and complex and thus hard to debug.
If you absolutely know it is gamma then stay with gamma. But I don’t see why RMSE would be chi distributed unless you have some guarantees on the process that generated those RMSE values. I am really just guessing, the model is very big and the data complex.
One way to show this is with the stat check, using the maximum or a high quantile as the statistic. If the maximum in the data is systematically below the maxima from your model, you might need a heavier tail. But this is not a hard rule. A failing PP check just tells you something is wrong. It is then often a bit of detective work to figure out what is wrong.
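For example (a sketch, with fit standing in for whichever fitted model you are checking):
pp_check(fit, type = "stat", stat = "max")
# or a high quantile as a custom statistic
pp_check(fit, type = "stat", stat = function(y) quantile(y, probs = 0.99))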
OMG! I can’t believe I didn’t think to do this!! frump!
That’s what I thought, but this is for the complex models with many group effects. However, when working with model 10b (varying intercepts for g_rep and g_subj with gamma2), the results are closer to what I would expect (reasonably captures the data without overfitting; good-looking residuals), except it doesn’t converge.
I goofed in posting the results for model 10b. You should be able to see them now.
The data is a synthetic data set. I have latitude, longitude, and depth measurements for a 51x51 grid. Depths are computed from 2 different shape functions (g_shape). These are my truth data sets. From these truth data sets I create test cases: I apply 3 levels of down-sampling to make the data set sparser (g_size), and a random sample of noise is added to the depths, with 3 levels of noise magnification applied (g_noise). This constitutes 18 g_subj for 1 g_rep. This is replicated by drawing a new sample of noise; the only thing that differs between g_reps is the sampled noise, and the only thing that differs between the 18 g_subj within the same g_rep is the combination of experimental factors applied. Essentially, each g_rep is a remeasuring of depths at specific/precise locations. Thus, g_rep and g_subj are blocking factors. For each test case, software is run to reconstruct the 51x51 grid via various interpolators. From the results, I compute the RMSE between the interpolated surface and truth. From the definition of RMSE, I believe it to be a chi distribution. According to https://en.wikipedia.org/wiki/Chi_distribution: It [the chi distribution] is the distribution of the positive square root of the sum of squares of a set of independent random variables each following a standard normal distribution.
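To make the RMSE computation concrete, here is a toy sketch (made-up grids and noise level, not my actual pipeline):
set.seed(1)
truth  <- matrix(rnorm(51 * 51), nrow = 51)   # stand-in for the true 51x51 depth grid
interp <- truth + rnorm(51 * 51, sd = 0.1)    # stand-in for one interpolated surface
rmse   <- sqrt(mean((interp - truth)^2))      # one RMSE value = one row of y
rmse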
Interestingly, I could not find any example where a model was fit to RMSE values.
I expect the RMSE to be small for low noise and dense test cases using the standard interpolator and to increase as sparsity and noise increase depending on the interpolator being used.
Ok… so I followed your advice @martinmodrak and reduced the data set to include only the g_subj for the intercept (g_size = 1, g_noise = 1, g_shape = 0, g_interps = EM031). This gives 20 g_subj, which are all the g_reps for a single experimental factor combination. And I ran into a slew of problems.
In the plot, the residuals are better, but we see the problem with the estimates as nlme fails to handle 0 residuals.
I tried to replicate these results in brms but ran into convergence issues even for the simplest model. I remembered you saying that a mis-specified model, such as the wrong distribution, could cause these issues, so I tried the gamma distribution and did have somewhat better luck converging.
This brings me back to the confusion regarding gamma2 in @paul.buerkner's notebook.
Is the variance of gamma2 still a function of the mean even though it is estimated independently of the mean?
What is the difference between gamma and shifted_gamma? Is this a scenario for shifted_gamma?
How would I model each factor combination as having a different variance-mean function, where the variance increases with the mean but at a different rate depending on the factor combination? For example, larger fitted values might vary less than smaller fitted values, or two estimated mean values that are equal may have different variances.
I believe I should be able to get brms to match nlme HRV1. How do I implement varPower? I’ve tried different variations of
while this converges fine, I get a large shape parameter estimate:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape   454.64    166.20   189.24   842.70 1.00     5466     2458
I would expect this to capture the variance-mean structure since it is one factor combination but it does not. I don’t know where to go from here… :’(
When I try to model shape ~ 1 there are convergence issues. Do you know why explicitly modeling shape in the formula would cause problems with convergence? Aren’t they the same thing?
I don’t really understand nlme, but to me, the residual plot doesn’t necessarily suggest heteroskedasticity - you overestimate small values and underestimate big values, so it might simply be that you need a more heavy-tailed distribution (like the log-normal). Still, the regularity of the residuals is puzzling and might suggest other issues.
Good. I would then suggest you repeat the process (e.g. by focusing only on one/few subjects/replicates) to find the smallest model/dataset that reliably has convergence problems and the biggest model/dataset that reliably fits. You can then use a combination of posterior predictive checks on the biggest converging model to suggest changes to the response distribution, and use visualisations (as in the “Visual MCMC diagnostics using the bayesplot package” vignette) to pinpoint problems in the diverging models (which is much easier to do with fewer parameters).
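For instance (a sketch; fit_small stands in for one of your smaller diverging fits, and the parameter names assume a model with a (1 | g_rep) term):
library(bayesplot)
np <- nuts_params(fit_small)    # divergence / treedepth info from the sampler
lp <- log_posterior(fit_small)
mcmc_nuts_divergence(np, lp)
mcmc_pairs(as.array(fit_small), np = np,
           pars = c("b_Intercept", "sd_g_rep__Intercept"))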
I am not sure a large fitted shape necessarily means that you failed to capture the variance-mean structure. Importantly, the varying intercepts for subject and replicate can effectively act somewhat similarly to increasing variance in the response distribution.
That should IMHO really be the same… Weird. But note that convergence issues are often a bit stochastic - do you get the same results when you repeat the fits? Also, you can use make_stancode to show the generated Stan code for each model and inspect where they differ.
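For example (a sketch using your data object; substitute the exact formulas of the two models you compared):
code_fixed_shape   <- make_stancode(y ~ 1, family = Gamma(link = "log"),
                                    data = t_long_CNSubset20red1_sub3_unord)
code_modeled_shape <- make_stancode(bf(y ~ 1, shape ~ 1), family = Gamma(link = "log"),
                                    data = t_long_CNSubset20red1_sub3_unord)
cat(code_fixed_shape)
cat(code_modeled_shape)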
I think they should be roughly the same, but as I noted above, convergence (and especially the specific way non-convergence shows) can be a bit stochastic, so I would rerun the fits and only worry if the results are always different.
I think that you are quite unlikely to have the RMSE be derived from “independent random variables each following a standard normal distribution” (unless this is guaranteed by your interpolator, but I don’t see how you could do this - in particular, RMSE between nearby points on the grid are unlikely to be independent), so this is a useful starting point, but not a strong guarantee.
Ok, I’m so confused about whether Bayesian models can handle 0-variance group effects, perfect correlations, and 0 residual variance. Could you clarify this for me? What if it’s truly 0 or perfect? Does a truly 0 variance in the model create the same rank-deficiency problems? Does a converging Bayesian model with 0 variance and perfect correlations mean the model is overparameterized and untrustworthy?
No, brms generally expects everything to be non-zero. However, I’d be hugely skeptical of 0 variance or perfect correlation in data for anything except the most well-done physics experiments (and even there, I would pause and recheck).
Hallelujah! It worked! I’ve got a model!!!
I finally got a working model without error or warnings, and the pp_check and DHARMa plots are beautiful!
All thanks to the parallelization merged into master on GitHub 2 days ago! It appears I was too impatient and assumed the long run times and failures to converge meant there had to be something wrong with the model, when it seems it just needed a larger max tree depth of about 17. So yesterday I got the parallelization up and running, and here’s what I got for the first model I ran. It completed in roughly 13 hours! Let me know if you see something I’ve overlooked or any problems with the model or results; I’d appreciate any feedback.
brms_20_lognormal_mdl4_2 <- brm(
  bf(y ~ g_size * g_noise * g_shape * g_interps + (1 | g_rep),
     sigma ~ g_size * g_noise * g_interps),
  family = lognormal(),  # lognormal(link = "identity", link_sigma = "log")
  prior = c(
    # y ~ Intercept + b
    prior(normal(0, 5), class = "Intercept"),
    prior(normal(0, 5), class = "b"),
    # sigma ~ Intercept + b
    prior(normal(0, 5), class = "Intercept", dpar = "sigma"),
    prior(normal(0, 5), class = "b", dpar = "sigma"),
    # y ~ sd
    prior(student_t(3, 0, 0.25), class = "sd")),
  sample_prior = "yes",
  seed = seed3,
  data = t_long_CNSubset20red1_sub3_unord,
  warmup = 1000, iter = 2000, chains = 4,
  threads = threading(8, grainsize = 100),
  backend = "cmdstanr", cores = 4,
  control = list(adapt_delta = 0.80, max_treedepth = 20))
The seed is set to 54647 at the beginning of the vignette, and the initial models are run without specifying a seed. Then the benchmark_threading function calls update without a seed for the scaling_model, followed by another update with seed = 1234 in the run_benchmark function. Are all of these right? This might explain the weird dips at the beginning of the graphs.
The same question applies to iter in the update call for the scaling_model.
In the first pic, were the chunks run serially on the one core?