Problems converging using custom gamma2 distribution

Wow the speedups look quite nice especially with 100 iterations! Tagging @wds15 I am sure here is interested.

2 Likes

Eek! Exciting!
A couple questions:

  1. seed is set to to 54647 in the beginning of the vignette, and the initial models are ran without specifying a seed. Then the benchmark_threading function calls update without a seed for the scaling_model, followed by another update with seed=1234 in the run_benchmark function. Are all these right? This might explain the weird dips in the beginning of the graphs.
  2. The same for iter in the update for the scaling_model.
  3. In the first pic, were the chunks run serially on the one core?

Good point about the seeds. @wds15 set the seeds so perhaps he knows more.

on second thought, the seed and iter look fine.

If that the latency in Amdahl’s law, then with 8 cores there’s about an 87% improvement.

Here it is with increased reps from 20 (n=3240) to 200 (n=32400):

speedup_1_200
speedup_2_200
speedup_3_200

Very nice to see reduce_sum in brms show its utility.

So we talk about 13h runtime on 8 cores instead of about 100h? That’s cool.

The speedup which you see with more chunking on a single core is possible due to better CPU cache use when working on smaller parts. I think it’s usually a good thing to increase the number of chunks until you see a slowdown.

BTW, your machine has how many cores? 32? Are these all physical cores?

So we talk about 13h runtime on 8 cores instead of about 100h? That’s cool.

Did you get that from eyeballing the Runtime with Varying Cores Plot?

To clarify, the slowdown plot runs them sequentially to gauge the effect of chunking. And chucking is the number of partial sums? so you’re saying to increase the number of partitions into partial sums until you see a slowdown when running sequentially on one core, and this helps gauge the number of within-chain threads? Why does one chunk have slowdown? (It’s not completely clear to me how to read this plot.)

Also, should we expect the pattern moving from 50 to 100 iter to continue when moving to 2000? What about when max_treedepth is increased? In my last plot, I see that for grainsize 450, speedup stalls for 2, 4, and 8 cores at 100 iters but not with iter 50. And in all plots, the model maxes out treedepth of 10 and I’ll be needing a depth of 17.

I’m running an AWS EC2 C4.8xlarge. There are 36 vCPUs.

I did eyeball your speed up for 8 cores which is about 7x

This helps gauge the grainsize.

I thought that’s extensively explained in the vignette, no?

It should converge to a stable non changing curve.

Deeper trees are more leapfrog steps which is just like increasing number of iterations in this context from my understanding.

I had very mixed experience with these virtual cpus…you do not know if these are really on just one physical machine.

It explained the overhead of chunking but it also said that 1 chunk would be simply using all the data, which to me is just the original problem without chunking so how would it be slower than itself. It was just vague enough to confuse me lol.

hmm…were those experiences bad? Is there something in particular I should be aware of? I doubt they are on the same one physical machine. I haven’t had issues with parallel cloud computing before but they weren’t with aws. It is a Compute Optimized Instances:

Compute Optimized instances are ideal for compute bound applications that benefit from high performance processors. Instances belonging to this family are well suited for batch processing workloads, media transcoding, high performance web servers, high performance computing (HPC), scientific modeling, dedicated gaming servers and ad server engines, machine learning inference and other compute intensive applications.

I was affraid that the text is not clear…will need to review then. The point is: 1 chunk is the original problem, yes. For each additional chunk we have to create copies of all the parameters in memory. Thus, for each additional chunk we end-up having one more copy in memory (and acutally even more things are needed to do for more chunks). As a result calculating 2 chunks is more work than calculating just 1 chunk as there is the overhead which increases with more chunks. This translates into a slowdown when constraining to just 1 core. However, as you have observed a speedup with few chunks its clear that more things are ongoing. This is due to better CPU cache use with smaller chunks… so it‘s complex. And after all the key thing to take away is that this game is not trivial and one can be very happy when one has a problem like yours which scales not so bad with up to 8 cores.

Virtual CPUs were bad in my experience, but then I don‘t know if I had a „compute tailored node“.

1 Like

Few more questions:

  1. in the paper describing the custom gamma2 distribution, @paul.buerkner chooses to fit the variance instead of the mean. I don’t understand how this is different (or better) than fitting the to the shape? I get that the variance is now independent from the mean but so what…the shape was already independent and wouldn’t the shape fit to the data the same way?
  2. and since shape is on a smaller scale than v, wouldn’t it be easier to model (sample)?
  3. will priors affect benchmarking?

I think modelling variance in this case is simply that: a modelling choice. Whether it is a good choice for your data is hard to guess, though a good use of posterior predictive checks could tell you if one of those has problems for your data. (e.g. looking at the “stat_grouped” check with stats like sd, min, max over various subgroups of your data could give some clues).

The difference between fitting shape and fitting variance would manifest if you have sets of observations which differ in predictors for mean, but have the same predictors for shape/variance. This is easiest to see if we model only the mean . Changing mean (\mu) and holding shape (\alpha) constant means the variance is (if my math is correct) \sigma^2 = \frac{\mu^2}{\alpha}. Obviously, if we hold \sigma^2 constant, the implied \alpha will change with \mu. I can imagine cases when one is preferred and case where the other is preferred.

Generally yes, this can happen - poorly chosen priors will affect your model and your inferences. However, if your data contain enough information about all the model parameters, your inferences will be the same under a broad range of priors. If you have the time, it is IMHO good practice to check if your inferences change with a bit wider and a bit narrower priors. If your inferences do not change, you are quite likely OK.

Does that make sense?

how would this imply one over the other and not problems with priors?

I don’t understand. I can’t see the difference…could you share those examples?
wait…are you saying the difference is whether we want the variance to change with the mean or whether we want the shape change with the mean?
If we model the variance, then we want the shape to change with the mean
and if we model the shape, then we want the variance to change with the mean?
…nope, still don’t get it.

In my problem, it is possible to have the same mean but different variances depending on which group you’re in. For example, with low noise and dense data, I expect the standard interpolator to perform reasonably well with RMSE that are small and close together. However, with a different worse-performing interp, I could get the same mean but a larger variance. As the noise and sparsity increase, I expect RMSE to increase. If the interp is worse-performing, I expect the RMSE to increase. As the RMSE increases, I expect there to be less stability, hence, larger variances. but i expect this variance-to-mean relationship or shape to be different for each size, noise, interp group.
Does that make sense? I feel like I should model shape but I’m not sure and don’t know if its worth the time and effort to wait for custom distributions with within chain threading.

ok. I initially ran the benchmarking with very broad priors as the initial guess and I was not sure if i should re-run the benchmarking with more narrow priors to make sure I didn’t choose a terrible grainsize.

Understanding a model is always a bit of detective work, so no hard rules. But let’s say you find that the variability (sd) is not well modelled for a subgroup, while the mean is modelled well. One possibility is that the relation between mean and shape is wrong. If you use the mean + variance version and see that the sd is lower than should be in cases where the mean is large and higher than it should when the mean is low, than it suggests that modelling shape instead might help. That sort of thing.

Generally posterior predictive checks (unlike prior predictive checks) can uncover a broad range of problems with the model, not just bad priors. But going from “something is bad with the model” to “this specific thing is bad” can be challenging.

Yes that’s what I meant.

I don’t have very specific examples, but I think that when the gamma distribution is used because I actually assume the process to be roughly a sum of exponentials (say time until failure of multi-component system where multiple components have to break to cause a failure of the whole), then modelling shape seems like a good idea as it has direct interpretation. If I use gamma, just because I have a positive outcome and lognormal has too large variance than maybe modelling variance is more natural.

This is IMHO possible with both parametrizations. What differs is exactly how the mean/variance relationship is allowed to behave.

1 Like

but how does this affect sampling? why would modeling variance instead of shape allow for better sampling when they are all related?

If all your predictors are discrete (booleans/factors) and you use the same predictors for both mean and shape/variance than the difference should be negligible - you allow each subgroup with the same predictors to have its own mean and shape/variance (although priors could also be affecting this to a lesser extent - what is a relatively wide prior for shape might turn out to be a narrow prior for variance and vice versa). In any other scenario there might (and might not) be a substantial difference in both model fit to data and the inferences you will make.

Why is that important? The short answer would be: it is a different model (mathematically) and different models can have different properties including how well they converge on a given dataset. To be a bit more specific, Andrew has written a lot about the “Folk theorem” which is the observation that often a model that is hard to compute is also a bad model for the data. So if one of the parametrizations fits your data while the other “fights” your data, this may (and may not) manifest as differences in speed of convergence (or whether the model converges at all).

Does that make sense?

I guess… I have only factors. The mean is predicted by g_shape, g_size,g_noise,g_interps. Shape/Variance is predicted by all but g_shape because they are similar enough. Then from what you say, there should be negligible difference but so far in my runs I see that for the large dataset with 200 reps, I was able to converge using Gamma2 in 3 days, while with Gamma I am going on a week with only about 60% done on each chain. So its having a much more difficult time converging/sampling when modeling shape. I have set the appropriate priors for shape and for variance. Priors for shape are smaller N(0,2) and shape_intercept N(8,2) while priors for variance are N(0,3) and variance_intercept N(3,1).
idk I still don’t get how this variance parameterization is having such an effect on sampling.I guess I need a more theoretical understanding of the mechanisms taking place that are affected by parameterization. I’ll look more into it and then get back with more questions.

I honestly don’t know exactly what is happening, just that something is weird. But one thing that should help you figure it out would be to reduce the dataset - e.g. choose some combination of g_size, g_noise, g_interps, select only this subset and fit a simpler model y ~ g_shape + (1 | grep), v ~ 1. This should let you iterate faster and also make it clear whether the problems are with the shape/variance parametrization… (you can obviously try multiple such subsets to see if things are different in other parts of the model).

Some observations:

  1. it was easy to set the priors and run log(y) with Gaussian distribution, but setting the priors for a lognormal distribution has been difficult as determined by the pp_check looking great for gaussian but having too long of a tail for lognormal. I tried prior predictive checks but got nowhere. I read from another post according to paul that prior predictive checks are not possible yet. I haven’t been able to reduce the tail of the lognormal, so either the failures of the gaussian are hidden in the pp_check or I’m not understanding how to get the two to match up, but more than likely, It seems when there is not enough data, the priors are more important for the lognormal, which is not something i expected.
  2. When using the Gamma, I have low Bulk and Tail ESS for only the Intercept (about 200-300) with an Rhat of 1.01. From shinystan, I get warnings of params with < 10% neff and mcse. Increasing the number of iterations doesn’t seem to do much to increase ess. However, for the lognormal, while the Rhat is about the same, the ESS were better. Gamma2 while finished faster, had terribly low ESS (all much less than 100). So maybe this parameterization is not the way to go after all.
  3. I did not model 1|r|g_rep for both y and shape, even though this is how it was done in all brms docs. This is because that model would be too complex, overparameterized. However, I don’t believe not modeling 1|r|g_rep for shape would cause the low ess for intercept. When I tried modeling it, it was overparameterized, the model converged quickly with high ess for all params but the est.error was enormous for all (about 2-3 as opposed to the 0.2).
  4. monotonic effects, I read the paper and vingette on this. Monotonic effects force this monotonic behavior. And wrongly assuming this has the same affect as any other wrong modeling assumption. Failing to use monotonic effects affect ess, although this is probably not what I’m seeing. I believe my g_size and g_noise should be monotonic but that is not what I’m seeing when modeling as unorded factors. This is similar to Figure 12 in the paper. But I didn’t understand the explanation very well.
1 Like