Speeding up hierarchical models

Comments on this thread got me thinking about speeding up likelihood evaluation for hierarchical models: https://github.com/stan-dev/stan/issues/2912

The calculations are memory bound right now (see 1, 2).

Excluding multiple node parallelism, if we want to speed up a memory bound calculation we can:

  1. Use the GPU for more memory bandwidth
  2. Change our calculations to use less memory
  3. Change our calculations to not be memory bound

I think there are ways to attack each of these points. I’ll list them here.

A basic hierarchical likelihood is something like:

y_n \sim N(X_{n} \beta + \alpha_{i[n]} + \ldots, \sigma)


y is a length N array of observations
X is an NxK matrix of covariates
\beta is a length K vector of non-hierarchical parameters
\alpha is a length I vector of hierarchical parameters
i is a length N array of indices into alpha in the range 1 to I

The \ldots s area to indicate there can be many hierarchical terms.

The hierarchical terms commonly get more complicated in two ways. First, the term above corresponds to the lme4 syntax:

(1 | group)

We might also have:

(z | group)

Where z is some covariate. A model with such a term expands to:

y_n \sim N(X_{n} \beta + z_n \alpha_{i[n]} + \ldots, \sigma)

z here is a length N vector that elementwise scales the hierarchical term

Another way hierarchical models get more complicated are interactions in the groups, i.e.

(1 | group1:group2)

In this case, each \alpha might have multiple indices:

y_n \sim N(X_{n} \beta + z_n \alpha_{i[n], j[n]} + \ldots, \sigma)

This can be reordered to be the first, but this notation will be more efficient (no need to create multi-indexes).

The distribution itself will probably change as well. The normal distribution is just being used as a placeholder.

A GPU interface for hierarchical models

Alright, so if we want to run something like:

y_n \sim N(X_{n} \beta, \sigma)

on GPUs, it’s fine. The bulk of the work is the matrix-vector multiply, and that is something that is cleanly implemented on the GPU.

We already have the glm interfaces that work for likelihoods like:

y_n \sim N(X_{n} \beta + \hat{\alpha}_n, \sigma)

where \hat{\alpha} is the indexed version of alpha above (like, all the n terms \alpha_{i[n]} in a single length N vector).

That interface is:

real normal_id_glm_lpdf(real y | matrix x,
                                 real alpha,
                                 vector beta,
                                 real sigma)

I think we can expand this interface to support all the hierarchical models above. The problem currently is we write the hierarchical terms with for loops, and we can’t put that on the GPU for free. The new thing is to develop an interface to sum all the random effects, so if the above computations were broken in two, something that can compute the \mu terms here:

\mu_n = z_n \alpha_{i[n], j[n]} + \ldots \\ y_n \sim N(X_{n} \beta + \mu_n, \sigma)

The Stan signature would be:

real hierarchy(int[,] idxs, args...)

which probably isn’t too exciting, but what we can do is use function overloading to peel out groups of arguments that define each hierarchical term.

I’m switching to Stan-pseudocode since we don’t have function overloading or a way to handle variadic arguments other than special cases, but presumably this bit would be written in C++. If we have the overloads:

vector hierarchy(int[,] idxs,
               vector alpha, int m,
               args...) {
  return alpha[idxs[m]] + hierarchy(idxs, args...);

This corresponds to peeling off a term: \alpha_{i[n]}. There are a couple new assumptions we need to make:

M is the number of hierarchical terms in the model
idxs is a length M array of length N arrays of indices
m is the index into idxs of which index alpha should use.

vector hierarchy(int[,] idxs,
               vector z, vector alpha, int m,
               args...) {
  return z .* alpha[idxs[m]] + hierarchy(idxs, args...);

And this corresponds to peeling off a term: z_n \alpha_{i[n]}

vector hierarchy(int[,] idxs,
               matrix alpha, int m1, int m2,
               args...) {
  vector[N] mu;
  for(n in 1:N) {
    mu[n] = alpha[idxs[m1, n], idxs[m2, n]];
  return mu + hierarchy(idxs, args...);

And this corresponds to peeling off a term: \alpha_{i[n], j[n]}. In this case, alpha is a matrix and has two indices, so there are two m terms.

vector hierarchy(int[,] idxs,
               vector z, matrix alpha, int m1, int m2,
               args...) {
  vector[N] mu;
  for(n in 1:N) {
    mu[n] = z[n] * alpha[idxs[m1, n], idxs[m2, n]];
  return mu + hierarchy(idxs, args...);

And this peels off a term: z_n \alpha_{i[n], j[n]}.

The implementation should be different from what is written there, but the point is it’s possible to write down an interface for the hierarchical terms above so that we can work on an efficient implementation for an expanded hierarchical GLM function which could look like:

real normal_id_glm_lpdf(real y | matrix x,
                                 vector beta,
                                 int[,] idxs,
                                 real sigma)

which would handle all the different hierarchical likelihood variations described above.

Just as an example, we might have the regression:

y ~ x0 + x1 + (1 + z | g1) + (1 | g1:g2)

This could translate to the Stan code:

target += normal_id_glm_lpdf(y | X, beta,
                                 { gidx1, gidx2 },
                                 alpha1, 1,
                                 z, alpha2, 1,
                                 alpha3, 1, 2,

where X holds the covariates x0 and x1, gidx1 and gidx2 are length N indices of the group ids for each outcome, alpha1 is the group1 hierarchical intercepts, alpha2 are the group1 hierarchical slopes, and alpha3 are the hierarchical intercepts for the group1:group2 interaction.

Presumably with this interface defined we could think about how to implement this on the GPU. GPUs have like 10x the memory bandwidth as CPUs.

Using Less Memory

If there are N observations, K non-hierarchical parameters, M indices, and Z hierarchical slopes, then the forward evaluation requires roughly:

N * K + N * M + N * Z reads (this assumes N is much larger than the number of random effects and so is dominating the memory requirements)

We might not be able to get away with using floats in our calculations, but I don’t see why we couldn’t get away with making all our covariates (X and all the zs) floats. This would make our matrix-vector products twice as fast, because we are reading half the memory, and also lower the memory needs for the hierarchical slopes.

We could also make integers in the data block shorter versions if the limits are statically defined. Probably most hierarchical models we will work with in the short term have less than 65536 groups. In this case, all the group indices could be uint16s, which would cut the memory requirements for reading the indices by four.

This is safe, because we always do size checks on reads. So if the data block says:

data {
  int<lower = 1, upper = 1000> gidx1;
  int<lower = 1, upper = 50> gidx2;

We use uint16s for gidx1 and uint8s for gidx2.

Unfortunately we wouldn’t be able to speed up:

data {
  int G;
  int<lower = 1, upper = G> gidx1;

Change our calculations to not be memory bound

Alright now this is the hard one, but also maybe the coolest.

So just the basic linear regression is a good enough example to get memory bound computations:

y \sim \mathcal{N}(X \beta, \sigma)

We see that here.

This is because a matrix-vector multiply is memory bound. The way to not be memory bound with matrix-vector multiplies, is to try to rearrange your calculations to be matrix-matrix multiplies.

matrix-matrix multiplies are computation bound, and so it is possible to do them much more efficiently (and make use of all cores on a computer, etc.).

Now first of all, we do have multiple matrix-vector multiplies we need to do.

If \beta_1, \beta_2, \beta_3, … correspond to the \beta parameters on different chains, then we need to do the matrix-matrix multiply:

X [ \beta_1, \beta_2, ... ]

This would work perfectly fine with our autodiff. Numerically none of the terms from the different chains interfere with each other – we could break it all apart afterwards easily.

The problem is this is not how our models get compiled to C++. Our models get compiled as single chains. I think we should look at introducing a megamodel concept that computes likelihoods for multiple chains at a time and brings them together.

This would change how the samplers interfaced with the model class (multiple chains would need to be coordinated), but the model interfaces would need to be expanded from things that look like:

real log_prob(vector params);

To things that look like:

real[] log_prob(vector[] params);

And the compiler would need to be able to recognize situations where it could promote matrix-vectors into matrix-multiplies. Also the hierarchical terms I defined above should have sparse matrix-matrix versions (they are currently basically sparse matrix-vectors), but that is a detail for now.

Anyhow, that’s it for the description, but as an example of the possible utility of this I wrote a model that is the likelihood for C independent chains together, and compared the efficiency of that running that model on one core against running a C = 1 version of that model on C cores with C chains.

The comparison isn’t quite accurate, cause in the first model with the C likelihoods, the u-turn condition will have to wait on all the independent things to u-turn, which is a bit weird, but whatever.

So here’s the model:

data {
  int N; // N data points
  int K; // K parameters
  int C; // C chains
  matrix[N, K] X;
  real y[N];

parameters {
  matrix[K, C] beta;

model {
  matrix[N, C] mu = X * beta;
  for(c in 1:C) {
    y ~ normal(mu[, c], 1.0);

Here’s code to run it:


N = 4000
K = 50
C = 6

beta = rnorm(K)
X = matrix(rnorm(N * K), nrow = N)
y = rnorm(N, X %*% beta, 1.0)

m = cmdstan_model("megamodel.stan")

system.time({ f1 <- m$sample(list(N = N,
                                  K = K,
                                  C = 1,
                                  X = X,
                                  y = y), num_cores = C, num_chains = C, save_extra_diagnostics=TRUE) })

system.time({ f3 <- m$sample(list(N = N,
                                  K = K,
                                  C = C,
                                  X = X,
                                  y = y), num_cores = 1, num_chains = 1, save_extra_diagnostics=TRUE) })

Running 6 chains on 6 cores takes 27s on my computer. Running the 6 chain 1 core megamodel takes 46s.

Which is cool. So we did the same thing with 1/6 the computation in twice the time. So 3x speedup.

Now, I’m running some big calculations on the other two cores of my computer I didn’t want to stop (they’ve been running for like 10 hours now), so this comparison is probly a bit off, but whatever.

Since the u-turn condition in the megamodel is weird (the megamodel did about twice as many leapfrogs as the other), we can also compare the ESS/s for each model.

We can do this with posterior with (arranging to show lowest bulk ESS):

# Multiple chains single model
f1$summary() %>%

# Single chain multi-model
f3$draws() %>%
  as_draws_df() %>%
  pivot_longer(cols = starts_with('beta'),
    names_to = c("n", "c"),
    names_pattern = "beta\\[([0-9]+),([0-9]+)\\]",
    names_ptypes = c(integer(), integer()),
    values_to = "beta") %>%
  mutate(.chain = c) %>%
  select(-c, -.draw) %>%
  pivot_wider(names_prefix = "beta",
              names_from = n,
              values_from = beta) %>%
  summarise_draws() %>%

For the 6 core 6 chain run we get:

   variable        mean    median     sd    mad        q5       q95  rhat ess_bulk ess_tail
   <chr>          <dbl>     <dbl>  <dbl>  <dbl>     <dbl>     <dbl> <dbl>    <dbl>    <dbl>
 1 lp__       -2044.    -2044.    4.98   4.95   -2053.    -2037.     1.00    2510.    3517.
 2 beta[24,1]    -1.68     -1.67  0.0161 0.0163    -1.70     -1.65   1.00    8005.    4342.
 3 beta[28,1]     0.157     0.157 0.0155 0.0151     0.132     0.183  1.00    8681.    4619.

For the 6 chain 1 core megamodel we get:

   variable        mean      median      sd     mad          q5        q95  rhat ess_bulk ess_tail
   <chr>          <dbl>       <dbl>   <dbl>   <dbl>       <dbl>      <dbl> <dbl>    <dbl>    <dbl>
 1 lp__     -12266.     -12267.     12.8    13.0    -12287.     -12245.     1.00    1789.    2386.
 2 beta47       -0.622      -0.623   0.0159  0.0157     -0.649      -0.596  1.00   14946.    3768.
 3 beta4         0.448       0.448   0.0158  0.0156      0.422       0.474  1.00   15074.    3927.

So if we ignore lp_ and the ess_tails (cuz we’re doing wishful thinking), we actually get 300 ESS/s for the 6 core 6 chain model and 320 ESS/s for the 1 core 6 chain megamodel.

So we did better than 6 cores on 1 core. This is for the simplest model possible, so take it with a grain of salt, but that’s pretty cool.

To check the treedepths for the 6 core 6 chain model:

f1$sampler_diagnostics() %>% summarise_draws()
  variable          mean   median     sd    mad       q5      q95   rhat ess_bulk ess_tail
3 treedepth__      3        3     0      0         3        3      NA       NA         NA 
4 n_leapfrog__     7.00     7     0.103  0         7        7       1     6024.      6024.

For the 6 chain 1 core megamodel:

f1$sampler_diagnostics() %>% summarise_draws()
  variable           mean    median     sd    mad        q5       q95  rhat ess_bulk ess_tail
3 treedepth__       4         4      0      0         4         4     NA         NA       NA 
4 n_leapfrog__     15        15      0      0        15        15     NA         NA       NA

And we can benchmark the gradient calls themselves with Rstan:

# Check the performance of the gradients
m2 = stan_model("megamodel.stan")

# Make N really big so each gradient takes a while so timings work
N2 = 400000

X2 = matrix(rnorm(N2 * K), nrow = N2)
y2 = rnorm(N2, X %*% beta, 1.0)

f4 = sampling(m2, iter = 1, data = list(N = N2,
                                        K = K,
                                        C = 1,
                                        X = X2,
                                        y = y2))

f5 = sampling(m2, iter = 1, data = list(N = N2,
                                        K = K,
                                        C = C,
                                        X = X2,
                                        y = y2))

beta = rnorm(get_num_upars(f4))
system.time({ grad_log_prob(f4, beta) })
system.time({ grad_log_prob(f5, rep(beta, C)) })

The single chain model takes about 80ms to evaluate a gradient. The megamodel takes about 160ms.

This would all be contingent on how possible it is to do static analysis and generate code for efficiently evaluating multiple chains together.

Anyway the overall efficiency here would go up as we added more chains, though we’d still be memory bound unless we started running hundreds of chains. In that case we’d probably be losing efficiency trying to piece things back together and we presumably wouldn’t have our within-chain diagnostics, etc.

Anyway, I think we could think about these things in terms of:

  1. GPU gives us 10x memory bandwidth
  2. We could maybe require half the memory for the calculations we’re currently doing
  3. We could code-gen megamodels to make the memory bottleneck less of an issue

Long post. Hope that stuff is right. @Bob_Carpenter, @rok_cesnovar, @stevebronder, @tadej, @avehtari, @wds15, @yizhang

Edit: Fixed m1/m2 indexing error pointed out by @stevebronder here


Oh yeah I @'ed a bunch of people to get discussion. Maybe I missed something and it sucks or whatever. I don’t even really know how to do half of this. Adding @rybern and @nhuurre for the compiler stuff.

Just pointing out that rstanarm does something equivalent to \mathbf{Z} \mathbf{b} with \mathbf{Z} a sparse matrix, which helps on the memory but there are no *_glm functions with signatures like that. I can make \mathbf{Z} \mathbf{b} part of the intercept, I guess.

Oh yeah we could have an actual sparse matrix * matrix. If you actually build a Z b. I never read the lme4 thing so I’m still not sure which matrices in that are real or not lol.

I am currently using something equivalent to csr_sparse_matrix_multiply for that. But once we have actual sparse matrices in the Stan language, it would be better if the _glm functions had signatures for that.

Having sparse matrix support for the glm functions sounds like a huge win (once we have proper sparse matrices).

I did not yet have time to absorb this material in detail… will do that later. However, I wonder if can actually already now try out some of these ideas. For example, when running a threaded map_rect, then you basically get independent AD sweeps and the data is shared (same for reduce_sum). So maybe we can use that in an efficient way to explore some the numerical experiments we should do here?

EDIT: Actually, I think it should be possible to use a big threaded run to inform 4 independent chains using the same data. Essentially we need to declare the parameters 4 times in the model block and just formulate 4x the same thing. In order to mimic different inital seeds, we merely add as data argument offsets for each parameter (which would be understood as raw parameters). This way we have a shared data model and the entire computational load is in just one chain. This should give an idea on how well data sharing fares, I would hope.

EDIT2: This way we also get to know what efficiency gains there are if all warmup info is shared (though it would be in-efficient for dense mass matrices… but for diagonal ones it should be ok)… though the sharing only happens for the stepsize; not the mass matrix…

Oh yeah, speaking of shared parameters I forgot, but there’s another thing.

Scope of variables defined in data block

When we autodiff an operation like X \beta, even if X is data, we’re forced to save a copy of X on the autodiff stack.

This is necessary because variables defined in the program forward pass can go out of scope by the time we do the backward pass (something defined in a for loop, for instance).

So when we read X (which is expensive), we currently also write an X (which is also expensive) and then during the backwards pass we use that alternative X. We can save the write if we recognize somehow that X isn’t going out of scope. The same thing goes for the indices in the hierarchy portion of the model (though these indices wouldn’t exist in the base for-loop autodiff version).

So that’s 2/3 of the memory accesses.

Edit: changed “memory reads” to “memory accesses”.

Yeah,I had to think about this a while, but I guess we could reduce_sum over C in the first mega_model and see what happens. The issue there is that we’d end up with C copies of X in the autodiff stack.

functions {
  real func(vector[] beta, int start, int end, real[] y, matrix X) {
    real out = 0.0;
    for(c in 1:size(beta)) {
      out += normal_lpdf(y | X * beta[c], 1.0);
    return out;

data {
  int N; // N data points
  int K; // K parameters
  int C; // C chains
  matrix[N, K] X;
  real y[N];

parameters {
  vector[K] beta[C];

model {
  target += reduce_sum(func, beta, 1, y, X);

Then you’d run this with set_num_threads(C), C = C in the data arguments, num_chains = 1, num_cores = 1.

To check the posteriors you need swap names_to = c("n", "c") to names_to = c("c", "n") in the pivot_longer code.

I’m not sure about this cause one thread per matrix vector leans heavily on cache + threads being synchronized to work properly (in the sense that it depends on threads all reading the same parts of X at the same time). I just don’t know if that would work. I guess things will be synchronized because we’re starting threads together.

It’s also less friendly to something like a GPU where there will only be one calculation at a time.

An upside is this would parallelize the whole computation, not just the matrix-vector multiplies.

I’m not sure if threads vs. processes or anything computational is the way to think about this stuff. We need to rearrange computation to minimize how much memory it uses. Threading leans on processes synchronizing through cache to do that, which is one way of doing it.

I assume the easiest thing for us to do is write custom likelihoods in C++ and test. Presumably the compiler work could be done independently of interface work (at least once a specification is set up), but I don’t know the difficulty of that effort. It’s probably immense. Presumably we hack some C++ from generated models for a while if we go this route.

Not quite. We get C copies, yes… but each copy is on its own nested stack which is not as bad as having C copies on a single huge stack as CPU caches are used more efficiently.

But each of those C copies is big enough to bust the stack.

I guess this is a use for recursive reduce_sum lol.

I guess it isn’t recursive I guess it’d just be a reduce_sum inside another reduce_sum. The other one to synchronize the computation over C and the inner one to stay in cache. You’d hope that those inner things get executed in some sane way and don’t get mixed up?

Something like:

functions {
  real func2(real[] y, int start, int end, matrix X, vector beta) {
    return normal_lpdf(y | X[start:end] * beta, 1.0);
  real func1(vector[] beta, int start, int end, real[] y, matrix X) {
    real out = 0.0;
    for(c in 1:size(beta)) {
      out += reduce_sum(func2, y, 1, X, beta[c]);
      //out += normal_lpdf(y | X * beta[c], 1.0);
    return out;

data {
  int N; // N data points
  int K; // K parameters
  int C; // C chains
  matrix[N, K] X;
  real y[N];

parameters {
  vector[K] beta[C];

model {
  target += reduce_sum_static(func1, beta, 1, y, X);

This is right up my alley with GPU + stanc3, so sign me up for “A GPU interface for hierarchical models”. I will l discuss this with Tadej and break down the challenges.

Yes. This would be a major use case for sparse matrices. (You can (and should) marginalise out the random effects for a linear mixed models, just like him do with a GP with Gaussian observations).

Not completely sure these have the arithmetic intensity to make GPUs worth while, but I’ll leave that to the experts.

You definitely shouldn’t do any of this without reading the lme4 paper.


We should be able to adj_jac_apply matrix multiplication, but one thing I have never understood is why we need to copy each element of the matrix. Since Eigen utilizes contiguous memory, it seems like we should only have to store a pointer to the first element of the matrix and the number of rows and columns in order to call Eigen::Map<...> when it reverses through.

It’s a scope thing.

So if we had something like:

real func(real a) {
  real b = 1.0;

  return a * b;

model {
  target += func(a);

After func returns, b is out of scope (it was a stack variable). The a * b operation needs to save a copy of b so that when it’s time to do reverse mode it has a copy. We don’t know the scope of the variables being passed in, so we can’t make any assumptions about how long they’ll live (other than the varis, I guess).

Doesn’t this depend on the number of observations (N) and parameters (P)?

Assuming \beta are my parameters I’ll integrate out and \tau are my hyperparameters, I’d start out with p(\beta | \tau) and p(y | \beta) and integrate out \beta to get something like p(y | \tau). Now all the fanciness of my hierarchical model has been packed into the covariance in p(y | \tau).

If I have N = 10000, then that’s a 10kx10k matrix, and even if the covariance has all sorts of block structure, it’s still quite a big thing to deal with right?

Presumably if we integrated out P = 10000 parameters that rocks, but integrating out P = 1000 parameters seems less so. Or does can and should mean always?

Well I skipped the estimation part, but I just went and read through how Z is constructed. They mention that they order groups to reduce fill in. This would be easy if there was only ever one group. Do you recall any discussion of what happens with multiple groups?

+ @bgoodri, for sparse matrix glms, all we’re looking for is something like (and including all the other glm functions, not just normal) this right?:

real normal_id_glm_lpdf(real y | matrix x,
                                 vector beta,
                                 sparse_matrix z,
                                 vector alpha,
                                 real sigma)

It looked like the likelihood stuff all got packed into the Z matrix. I think all the prior stuff can remain separate.

Within a Stan program, it seems as if a matrix of doubles should live until the class destructor is called because it ordinarily would have been declared in data or transformed data. Does

model {
  matix[K, K] foo = rep_matrix(0, K, K);

get made into a matrix of doubles or vars? If the latter, maybe that is something the stanc3 optimizer could move to transformed data?

I think @rok_cesnovar wants to make up a way so that data can live for the whole sampling process (so GPU/MPI copies only happen once and that is controlled by the compiler).

I’m not sure. It seems like something should be possible. I’m not sure what the implications will be on the Math library, or what.

Isn‘t this where perfect forwarding aka universal references would do the trick? Ping @stevebronder

Also, one more cool thing about the small chunking of reduce sum is that while it triggers copies from stuff, we do immediate reverse sweeps such that things stay in the cpu cache which is a big thing as it drastically improves cpu cache use. This is why we should profile a bit with the Linux perf tools, I think.

I think you are missing m1, m2 here

In the proposal for static matrices one of the benefits I footnote is that making vari_type<T> also allows for vari_type<float>.

I think we need to start allowing for attributes on types in the stan language. For instance with sparse, complex, and static on the horizon (however distant) the combinations of explicit matrix types would explode. If we had attributes on our base types we could have

static short matrix[N, M] A;
short int idx[M, N];
static long matrix[N, M] B;

short would be float, long is long double, and short for int can be uint16.

I wonder if there would be some dirty trick we could do with like a second stack that holds just data. Then the first time that data is allocated we keep it on some stack and reference it then on.

Possibly, it would allow for b to be moved into the struct that holds the chain method. So it would be allocated once and forwarded along until it came to it’s final resting place in the struct that holds the chain method. That’s useful for arrays

This is a bit crazy but it’s 10:30 on Sunday so whatever. We could do something like a data_stack where it has a counter and a vector of std::pair<int, double**>. When we do data_stack.allocate(1024) it pushes that amount as the lhs of the pair, allocates the memory on the rhs side, returns back a reference to the rhs of the pair, and increments the counter by 1. Then when we do recover_memory it doesn’t actually clean anything up but resets the counter to 0. Then the next time it calls allocate(1024) it checks if the size matches up with with what was already there and returns the already created data else throws (or whatever we want to happen in the else clause).

At least the GPU part is already done and is live in cmdstan for all 6 GLMs since we moved to stanc3 in 2.22.

What happens is that in addition to

Eigen::Matrix<double, -1, -1> X_d;
Eigen::Matrix<double, -1, 1> y_v_d;
std::vector<int> y_vi_d;

stanc3 adds

matrix_cl<double> X_d_opencl__;
matrix_cl<double> y_v_d_opencl__;
matrix_cl<int> y_vi_d_opencl__;

In the model constructor it adds

X_d_opencl__ = to_matrix_cl(X_d);
y_v_d_opencl__ = to_matrix_cl(y_v_d);
y_vi_d_opencl__ = to_matrix_cl(y_vi_d);

And the call is changed to

   normal_id_glm_lpdf<false>(y_v_d_opencl__, X_d_opencl__, alpha,
   beta, sigma));

With that all GLMs get substantial speedups (table below is for bernoulli but its mostly the same for others as well).

n k chains gpu speedup
65k 10 1 1
65k 10 2 1.5
65k 10 4 2.4
262k 10 1 2.7
262k 10 2 3.4
262k 10 4 7
4M 10 1 6
4M 10 2 7.5
4M 10 4 11
30k 100 1 2.7
30k 100 2 4.3
30k 100 4 7.5
30k 1000 1 12
30k 1000 2 18
30k 1000 4 25
10k 2000 1 12
10k 2000 2 18
10k 2000 4 30

Without this we would have transfers each iteration, and then these GLMs would probably never be faster on a GPU due to the lack of arithmetic intensity as @anon75146577 pointed out.

Given that this expanded hierarchical GLM should only have more arithmetic intensity its bound to be even faster.

EDIT: This works for data as well as transformed data in the exact same way.

1 Like