GPU GLMs: float or double

I implemented generalized linear models on GPU. Some discussion on GPU GLMS and graphs with speedups are in issue 1184. Then the idea came up, that GPU implementation can use float calculations instead of double to be even faster. As a prototype I implemented poisson_log_glm_lpmf with floats. Here are the speedups compared to implementation with doubles (K is number of attributes and N is number of instances):

An important thing to consder are also numerical errors due to reduced precision. So I compared both floatand double implementations with CPU implementation. On next graph there are maximum relative errors among logposterior and all derivatives. I generated three test cases for each size. y is generated between 0 and 100, all other inputs between -1 and 1.

Now the question is: Do we want GPU implementations of GLMs to use float or double?

EDIT: added version with kahan summation

@syclik, @seantalts, @Bob_Carpenter what’s your take on this? This has broader implications, because it can be done for all GLMs and possibly other types of models.

Does anyone have an idea how these errors would affect HMC and if they would translate into errors in parameter estimates?

What would be the criteria for adding this to Stan - if it passes the test’s, that’s it? Or do we need more extensive experiments.

We definitely need more tests. Based on what you’re reporting, running with CPU vs GPU with floats would provide different results conditioned on the same seed. So far, we haven’t dealt with this sort of behavior (if you set the seed, it’s reproducible on the same machine with the same compiler, etc… it still falls within that definition, but given any machine / compiler combination, we’d assume that it was ok.).

I’m not sure yet what would be enough testing. (I want to set some expectations… I can’t prespecify what’s enough here; as we learn more, we can be more precise.)

What I’d like to see is an argument / justification to explain the trade-off between performance and correctness in this case, how it’ll affect users and how we (the developers) will address misconceptions about floating point errors, and essentially something well thought through and one proposal for how we would pull off this change in expected behavior. If it’s well justified and well thought through, I can support it (if the results are right… if they’re wrong, that’s a hard pass from me).

Does that make sense? Those are just suggestions… I’m just trying to think through what would make the decision easier. Right now, I’ve been told that it’s faster and there’s error. That’s not quite enough.

GPU will in general not reproduce results from CPU with same seed even if it is doubles, no? So that’s not an argument (just) against floats.

I don’t think there is a way to comprehensively test everything that could go wrong. And tests would have to include how HMC interacts with these errors.

What if using floats is just an (unsafe) option on the GPU?

Really? That’s news to me. Can you confirm?

If that’s true, we should put that doc clearly where users well know that is expected behavior.

Definitely true. Opencl uses other compiler and math library (made by device vendor). Also there are some small diferences in code, such as order of summation when calculating sum across values calculated by different threads,

@rok_cesnovar or @tadej can probably give more background on why, but yes, you’ll get different results. From Tadej’s plots above you can already see that doubles also have an error (relative to CPU).

I’ve been running a lot of end-to-end tests recently and I can confirm that you get numerically different parameter estimates on different hardware (even different GPUs give different results). But the differences are not beyond what you would expect from using a different seed (that is, it’s within MCMC error).

You also get different parameter estimates on the same CPU using the same seed if you use a GLM primitive or if you do your own linear term. For example, bernoulli_logit_lpmf( do the linear term yourself) vs bernoulli_logit_glm_lpmf(…).

1 Like

A bunch of other people are relevant when the discussion is numerics for HMC, but I’d like to also hear from @Daniel_Simpson, @yizhang, and @betanalpha

We definitely need the double-precision for Cholesky decomposition, but we might be able to get away without it for matrix multiplies.

If models were just Poisson GLMs, I think it’d be fine, as 10^{-5} is good enough for HMC when the geometry’s simple, as it would be in a GLM prior with independent predictors. In our ODE solvers, we have found issues when precision gets down around 10^{-4} or 10^{-5} (is that right, @charlesm93 or @yizhang?). But for simple sampling from normals, I’ve found even 10^{-2} accuracy is sufficient (in that the error on expectation estimates is undistinguishable from 10^{-14} precision).

But I’m not sure what increasing relative error from 10^{-14} to 10^{-5} is going to do in general when that GLM is just one component of a model. I would imagine that unless something interacted directly with it, like a hierarchical prior, then it shouldn’t matter.

This is the easiest possible test case, with independent, unit-scale, centered predictors. It’d be interesting to see what happens in more structured prediction matrices. One way to do that is generate different scales, different centerings, and different degress of correlation among the parameters.

Another test would be to add hierarchical structure to priors and see if that induces any bad numerical interactions. That’s the situation we normally use these in.

Are there other contexts in which a matrix multiply is just one operation in a sequence of operations that might degrade if the matrix multiply is too imprecise? I think that’s the real question we need to answer.

1 Like

In my experiences I’ve seen problems arise for errors as low as 10^{-8} in complex models, so I think that the general issue is that there will be no universal thresholds unless the error in the calculation farmed off to the GPU can be uniformly bounded across all inputs.

The characteristic sign of numerical problems for the gradients is HMC is the acceptance probability proxy statistic dropping despite changes to the step size (equivalently for the Stan adaptation, changes to adapt_delta). See https://arxiv.org/abs/1502.01510 for more details.

The challenge with general numerical accuracy questions is that the log target density will an approximation in addition to the gradient, so the acceptance probability proxy statistic might itself not be accurate enough to capture drifts in the numerical integration.

1 Like

That’s super helpful. I must not have read that paper on subsampling carefully enough!

With really easy models like high-dimensional standard Gaussians, 10^{-2} is sufficient (at least as evaluated by the distribution of square errors across runs). But the real question is how this scales to harder problems.

Making easy problems fast isn’t my highest priority for Stan—I’d like to focus on making hard problems tractable.

Do you know if it’s possible to use float with specific sum etc algorithms, where floating point error is not propagated or is there too much overhead for just using double?

e.g.

Yes. That’s why nobody uses these algorithms.

1 Like

Is this the case in GPU too, where number of float-cores can be much larger than double-cores.

(Even if this would be “doable” I think ot would need to be implemented in gpu libs not in our)

I checked this out. It is quite interesting. I believe I could use this in GLMs with little impact on performance.

In above question, do we have to read the or as xor? More precisely how expensive is it to have
both, double and float?

The amount of code would be doubled. Also I don’t think we can somehow automatically select which implementaqtion to use for a particular model so that falls on the user.

You could default double and have a compiler flag for float. Do you mean the programming effort doubles? I see. Just my 2 cents.

Double versions are already implemented (but not jet merged into stan math). Some more programming effort is needed to implement float versions of the remaining 3 models. Maintenece effort doubles.

I tested Kahan summation and updated graphs in first post with it. It reduces errors a bit without affecting performance.

1 Like