GPU GLMs: float or double

By performance, you mean speed here?

And like @tadej, my main concern is how we make the decision of which to use case-by-case and what it does to our maintenance burden.

I have no idea what the ramifications are of using both float and double in a single program—it’d be easier to control if, instead of a single GPU flag, we had two: GPU_FLOAT and GPU_DOUBLE.

I think the maintenance burden is much worse than doubled as we now have to maintain some mechanism to configure everything.

Yes, speed. You can look at updated graphs in the first post.

I guess if we wanted to have both float and double support we could do something clever to avoid actually duplicating code.

We’d just have to make sure that the small clever thing wasn’t harder to maintain than two larger, simpler things. It’s always a tough judgement call.

@Bob_Carpenter The Kahan summation algorithm indeed introduces minimal overhead. I also think, by writing a custom addition/subtraction operator/routine, the maintainability shouldn’t be too much of a problem.
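
For concreteness, the textbook compensated-summation routine looks roughly like this (just a sketch of the idea, not Stan Math code; note that flags like -ffast-math can optimize the compensation away):

    #include <vector>

    // Sketch of Kahan (compensated) summation.
    double kahan_sum(const std::vector<double>& xs) {
      double sum = 0.0;
      double c = 0.0;  // running compensation for lost low-order bits
      for (double x : xs) {
        double y = x - c;    // apply the compensation
        double t = sum + y;  // low-order bits of y are lost here
        c = (t - sum) - y;   // recover what was lost
        sum = t;
      }
      return sum;
    }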

The real problem is that I doubt Kahan summation will help in this case.

As @Erik_Strumbelj said, the loss of precision in GPUs is not because of IEEE-754, but because of the hardware itself. GPUs use intrinsics to speed up basic arithmetic and that introduces hard-to-predict rounding errors here and there.

Compared to what?

I’ve never seen a problem outside a theory textbook where it was useful, but I didn’t do much numerical computing until about 10 years ago.

Even IEEE-754 provides only minimum guidelines on the precision of intermediate calculations.

I’ve worked on some PDE solvers that support multiple real & int types; yes, maintenance is a pain, and so is writing unit tests.

It’s in general hard to say how much abs tolerance is sufficient, especially for ODEs with sensitivity unknowns (because we want abs tol to lead to insignificant error, and the definition of insignificant is case-dependent). It’s actually quite easy to come up with a benign-looking simple ODE that fails with a seemingly reasonable tolerance setting. My favorite example is https://www.radford.edu/~thompson/vodef90web/ErrorControlMatters.pdf. The case study is rather relevant because rk45 is Stan’s most popular integrator.

Speaking of too clever, to be honest I don’t think it would be that hard to have a preprocessor for the kernel code. We have like 90% of the stuff there to manage that.

Let’s say we wanted to do this for add because it’s simple

    __kernel void add(__global double *C, __global double *A,
                      __global double *B, int rows, int cols) {
      const int i = get_global_id(0);
      const int j = get_global_id(1);
      if (i < rows && j < cols) {
        C(i, j) = A(i, j) + B(i, j);
      }
    }

Let’s add templating

    template <typename T1, typename T2, typename T3>
    __kernel void add(__global T1 *C, __global T2 *A, __global T3 *B,
                      int rows, int cols) {
      const int i = get_global_id(0);
      const int j = get_global_id(1);
      if (i < rows && j < cols) {
        C(i, j) = A(i, j) + B(i, j);
      }
    }

Currently, we can’t really do anything with the above because we set up the kernel to compile at the construction of the kernel_cl struct. But we really want to be able to call code like the example below and have a kernel generated for those types.

    matrix_cl<int> A;
    matrix_cl<double> B;
    // Return will be the type A and B should be promoted/preferred to
    matrix_cl<promote<A::type, B::type>> C;
    // Add an int and a double
    opencl_kernel::add(C, A, B, A.rows(), A.cols());

The above should call the kernel

    // However we want to mangle the name
    __kernel void double_int_double_add(__global double *C, __global int *A,
                                        __global double *B, int rows, int cols) {
      const int i = get_global_id(0);
      const int j = get_global_id(1);
      if (i < rows && j < cols) {
        C(i, j) = A(i, j) + B(i, j);
      }
    }

And we can do this! We’ve been treating the fact that kernels have to come in as strings as more of a bug than a feature, but I don’t think it would be hard to write a little preprocessor over the strings to get what we want.

The step-ish things that need to happen for the above to compile and execute are

  1. The type for matrix_cl<T> needs to be passed to the OpenCL JIT compiler
  • We can just store the type in the matrix_cl so it can be accessed like A::type
  • The kernel_cl constructor does not compile the kernel. It is only given the kernel string and other compile options.
  • Compilation can then actually happen JIT at the call site.
  2. Make a mangled signature

So for the above we need to parse the kernel string for a template line above a __kernel signature. Cool, we see there are T1, T2, and T3 template parameters. We also see the matrix_cl arguments come in with types double, int, and double. So we start by mangling the name like

    __kernel void __T1_template___T2_template___T3_template_add(__global __T1_template *C,
                                                                __global __T2_template *A,
                                                                __global __T3_template *B,
                                                                int rows, int cols) {
      const int i = get_global_id(0);
      const int j = get_global_id(1);
      if (i < rows && j < cols) {
        C(i, j) = A(i, j) + B(i, j);
      }
    }

Now we know our template parameters, we have the signature, and we know the type each one should be. So we can do a loop over each template type to clean everything up. Essentially a gsub replacing __T1_template with double, etc.
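
The gsub step itself could be something like the helper below (a rough sketch with a hypothetical name, not actual Stan Math code):

    #include <cstddef>
    #include <map>
    #include <string>

    // Sketch: replace each placeholder (e.g. "__T1_template") in the kernel
    // source with the concrete OpenCL type (e.g. "double").
    std::string substitute_types(std::string src,
                                 const std::map<std::string, std::string>& types) {
      for (const auto& t : types) {
        for (std::size_t pos = src.find(t.first); pos != std::string::npos;
             pos = src.find(t.first, pos + t.second.size())) {
          src.replace(pos, t.first.size(), t.second);
        }
      }
      return src;
    }

Running that (plus the final rename of the kernel itself) over the signature above leaves us with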

    // However we want to mangle the name
    __kernel void double_int_double_add(__global double *C, __global int *A,
                                        __global double *B, int rows, int cols) {
      const int i = get_global_id(0);
      const int j = get_global_id(1);
      if (i < rows && j < cols) {
        C(i, j) = A(i, j) + B(i, j);
      }
    }

Cool! Done deal! Now we need to make sure we don’t compile that kernel twice

  1. Compile that kernel and store it in a map inside of opencl_context (key = signature and value = kernel)
  2. When we call a kernel, check the map for <types>_kernel_name. If it exists, just make a copy of it, hand it to that kernel_cl, pass it the kernel args, and execute it. Otherwise compile it first (see the sketch below).
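
A minimal sketch of that lookup, assuming the standard OpenCL C++ wrapper (cl2.hpp) and a made-up helper name, could be:

    #include <CL/cl2.hpp>
    #include <map>
    #include <string>

    // Sketch: cache compiled kernels keyed by their mangled name so each
    // type combination is JIT-compiled only once.
    std::map<std::string, cl::Kernel> compiled_kernels;

    cl::Kernel get_or_compile(const std::string& mangled_name,
                              const std::string& instantiated_source) {
      auto it = compiled_kernels.find(mangled_name);
      if (it == compiled_kernels.end()) {
        // First use of this type combination: compile and cache it.
        cl::Program program(instantiated_source, /* build = */ true);
        it = compiled_kernels
                 .emplace(mangled_name, cl::Kernel(program, mangled_name.c_str()))
                 .first;
      }
      return it->second;  // kernel handles are cheap to copy
    }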

I think that makes sense. Does that make sense? “C++ without classes”, I guess.

That sounds nice, but I don’t think it would be that simple to write a preprocessor. We don’t actually need templates - all this can be done using macros. Switching between float and double in kernel code can be as easy as #define real double or #define real float and then using real instead of float/double. If for some use case we need multiple input types we can have real1, real2, etc. There is no need to write a separate preprocessor.
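
For illustration, the add kernel from above would then look something like this (a sketch of the idea, not the actual kernel source):

    // Pick the precision with a single define (or a -D build option).
    #define real double  // or: #define real float

    __kernel void add(__global real *C, __global real *A,
                      __global real *B, int rows, int cols) {
      const int i = get_global_id(0);
      const int j = get_global_id(1);
      if (i < rows && j < cols) {
        C(i, j) = A(i, j) + B(i, j);
      }
    }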

How would you want to have Stan construct and call the right kernel? The same map thing I have above?

If it’s similar to what I have above I’d rather do the preprocessor. In my mindspace I think I have some reasonable ways to make the preprocessor look nice

Yep, same or similar.

What advantages do you see in having a preprocessor? To me it only seems as more code to maintain.

tbh I just hate macros, haha. I’m honestly not even a fan of the A(i,j) ones we wrote. But we are stuck in C99 land so they are one of the few tools we have available for metaprogramming directly.

The only objective benefit of the preprocessor is mixed types in signatures and following a standard. And I think it would be more flexible if we wanted to do things in the future. I think the thing I’m proposing above could support both type and value template parameters. Though it’s also bulkier, so there’s definitely a tradeoff here.

For yours we could also just pass a compile flag like -DREAL_TYPE=type and have something like

    #ifdef REAL_TYPE
    #define real REAL_TYPE
    #else
    #define real double
    #endif
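
On the host side, the define can be passed through the OpenCL build options when the program is JIT-compiled, roughly like this (a sketch using the OpenCL C++ wrapper, assuming kernel_source holds the kernel string; not our actual setup code):

    // Select the kernel's real type at JIT-compile time.
    cl::Program program(kernel_source);
    program.build("-DREAL_TYPE=float");  // or "-DREAL_TYPE=double"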

Sorry to post back so late. I wanted to provide a little more context.

I’m thinking about these questions in terms of reproducibility. As it’s been discussed in this thread, we really care about MCMC estimates. That’s true.

What we care about in software is that we are able to reproduce a result. That means understanding the conditions under which it will reproduce.

Right now, as far as we know, if we have the same operating system and compiler, then running a program with the same seed will reproduce results identically (except if there’s something wrong with the hardware). What I don’t know is what happens with GPUs. If you have two identical computers with GPUs, do they provide the same results? What if the GPUs are from the same manufacturer, but different models? Across manufacturers? Will different firmware change it?

I can believe that whatever we trust as our baseline precision is actually arbitrary in terms of trying to get at MCMC estimates. As long as the log prob and derivatives are computed accurately enough, we’re ok. At the Math level, we just need to know how to reproduce. I think I was asking questions from that perspective.

I think they should.

Different GPU firmware or GPU driver version might change it. Different GPU model definitely will change it.

The issue of numerical reproducibility is on two different levels here:

a) differences in the implementation of floating-point arithmetic
b) non-determinism due to parallel execution

Point a) is why we won’t see exactly the same results when comparing x86 CPU and GPU results, even if we only ran 1 thread on the GPU and even though both comply with the IEEE-754 standard. They do differ in how they handle rounding. But in terms of reproducibility on the exact same hardware this is not problematic. An extreme and useless case of running 1 thread on the GPU would produce the exact same result. The same goes for all GPUs that have the exact same handling of rounding, which is at least all GPUs that fall in the same architecture family.

The far larger “issue” is the different order of arithmetic operations and the non-determinism of this order. If the algorithm has any sort of parallel reduction, even for a simple case where 10 threads perform an atomic add operation on a variable, you are not guaranteed to get the exact same results even between 2 runs on the same hardware. This is true for CPU multi-threading and GPUs. It is a bit more evident on GPUs as you have 1000s of threads, but the problem is the same.
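
The underlying reason is simply that floating-point addition is not associative, so the order of a reduction changes the rounded result. A tiny self-contained illustration:

    #include <cstdio>

    int main() {
      double a = 1e16, b = -1e16, c = 1.0;
      // The same three numbers, grouped differently:
      std::printf("%g\n", (a + b) + c);  // prints 1
      std::printf("%g\n", a + (b + c));  // prints 0 (b + c rounds back to b)
      return 0;
    }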

But this is common to all scientific HPC applications and there is no way around it.

Thank you for breaking this down so clearly! That makes sense. So this means that on a GPU (or with threading), there may be run-to-run variations with the same seed. And that implies that we can’t guarantee exact reproducibility with a GPU (or threading).

That complicates life, but that’s fine. If we haven’t documented this, we should take the bulk of your post and include it in the doc.

Yes, we can definitely make a note of this somewhere. Do you have any suggestions where to put it? Cmdstan manual under “B.8 Optional Parallelization Support”? Or did you mean somewhere in the Math wiki or readme?

Maybe this section should go in the Stan User’s Guide? I really like the idea that we’re trying to essentially bring good scientific computing to scientists who are not really trained in computational issues, and I think a lot of those topics end up in the Stan User’s Guide. @mitzimorris you’re the lead for that, right? What do you think of putting Rok’s section on determinism in scientific computing there?

the repo is https://github.com/stan-dev/docs

@rok_cesnovar - please file an issue and I’m happy to help you implement it.

Thank you. The issue is here, if anyone else is interested.

the Stan User’s Guide has a section on Floating Point here: