Using the Apple M1 GPUs: question from a noob

mike-lawrence · July 11, 2021, 12:00am

Take a look at your RAM usage as you increase the number of subjects. I wonder if you’re maxing out, at which point your OS will start paging to your disks, which would cause a huge sudden slow down. It could also be a bottleneck in cmdstan writing to CSV after every iteration when there are a huge number of parameters.

Joshua_Hartshorne · July 11, 2021, 12:22am

Memory actually looks good. I still have a few gigs of physical memory available. I’m not using the cmdstan backend, I don’t think.

Joshua_Hartshorne · July 11, 2021, 1:16am

Here is seconds (y-axis) by number subjects (x-axis) for

gamm4(logRT ~ s(age, by=congruent, bs='gp'), random = ~ (1|id), data = flanker[flanker$switch_target==TRUE,])

So I can probably do all 16000 subjects in less time than it was taking me before (by which I never managed to run an RT mixed model using brms, I always gave up) … but it’s not going to be quick.

secbysubj.pdf (25.0 KB)

mike-lawrence · July 11, 2021, 1:49am

brms has backend options of rstan or cmdstanr, but both in turn use cmdstan.

Joshua_Hartshorne · July 11, 2021, 5:26pm

@martinmodrak @paul.buerkner On the topic of what’s taking so long, here’s some reports from system.time() for some different runs with different numbers of subjects:

   user  system elapsed 
  4.388   0.898 258.292 

    user   system  elapsed 
  33.256   18.779 7359.998

Should ‘elapsed’ be so much larger than ‘system’ and ‘user’? I did some googling. The most informative I found was this, but I’m still unsure what to do.

FWIW here is current usage, with 2000 subjects:

Joshua_Hartshorne · July 12, 2021, 1:28am

EDIT: Realized stupid Excel was plotting my x-axis as categorical. I fixed that, which changes interpretation.

So here are some time tests. Here’s the gamm4 formula:

logRT ~ s(age, by=congruent, bs='gp'), random = ~ (1|id)

It’s roughly the same for bam and brm, just in their respective formats. For brm, I’m using 2000 iterations and 4 chains.

Keep in mind that I have 16,336 subjects – roughly right where the plot area ends.

So … things slow down quickly. brm is clearly unrealistic, at least without something to speed it up a lot. Even with gamm4, I’m not sure it’s realistic to have random slopes in addition to random intercepts. I haven’t run that, but I’d expect it to be a lot slower, and it’s already pretty slow once we extrapolate out to 16,000 subjects.

So any ideas as to how I can speed up brm? Or should I just go with gamm4? And what’s up with the crazy ‘elapsed’ times I mentioned above?

wds15 · July 12, 2021, 7:42am

use threading from brms to enable within-chain parallelisation. That should work nice in this setting, I’d expect.

jroon · July 12, 2021, 7:52am

Joshua_Hartshorne:

@martinmodrak @paul.buerkner On the topic of what’s taking so long, here’s some reports from system.time() for some different runs with different numbers of subjects:
   user  system elapsed 
  4.388   0.898 258.292 

    user   system  elapsed 
  33.256   18.779 7359.998 
Should ‘elapsed’ be so much larger than ‘system’ and ‘user’? I did some googling. The most informative I found was this , but I’m still unsure what to do.

So from R help, system.time() calls proc.time(), which returns an object of type proc_time which is described in the help as:

An object of class "proc_time" which is a numeric vector of length 5, containing the user, system, and total elapsed times for the currently running R process, and the cumulative sum of user and system times of any child processes spawned by it on which it has waited. (The print method uses the summary method to combine the child times with those of the main process.)

Since your parallel R sessions you see in your Activity Monitor screen shot are child processes, the elapsed time sums up the time of each individual parallel R session. Hence your elapsed time is much bigger than user and system time.

martinmodrak · July 12, 2021, 9:09am

A few more notes: you can use the optimizing and variational (ADVI) methods of Stan (via brms) to get faster approximate results (I would expect optimizing to give very similar results to any other penalized maximum likelihood approach), ADVI can have a bit more robust uncertainty quantification but is also more costly. In both cases (and also with mgcv) comparing the results to MCMC on smaller datasets is a good safeguard.

Looking at this plot for a smaller dataset could still be very useful - if there are problems, they might affect all versions of the model in all packages, although MCMC is the only algorithm to signal a problem.

Joshua_Hartshorne · July 12, 2021, 1:31pm

@martinmodrak sorry, you are exceeding my knowledge. Do I understand that you are saying ADVI would be at least as good as gamm4 with a gaussian process basis function?

Also, how do I use ADVI through brms? Looking at the docs for brm, I see an algorithm parameter with a few options for variational inference:

algorithm: Character string naming the estimation approach to use. Options are “sampling”
for MCMC (the default), “meanfield” for variational inference with independent normal distributions, “fullrank” for variational inference with a multivariate normal distribution, or “fixed_param” for sampling from fixed parameter values. Can be set globally for the current R session via the “brms.algorithm”
option (see options).

But perhaps you meant something passed through control, which case can you point me to an example?

So here’s the gaussian process model with 500 subjects, 2000 iterations, and 4 chains. Note that this reported 18 divergent transitions. You’ll need to zoom in a fair bit to see the divergent transitions.

What I don’t know is how to interpret this. For what it’s worth, below is a plot from a much simpler model using thin plane splines and all the data. There are no subject random effects and only one smooth term. I post it simply as another example in case it’s helpful to explain what it is I’m supposed to get from these plots:

Thank you so much everybody! I really appreciate you taking the time to help me figure this out.

Joshua_Hartshorne · July 12, 2021, 1:44pm

Ahh, that makes sense. I would have thought that was system time. Well, that eliminates one possible issue, at least.

Based on the like, looks like @martinmodrak thinks this is a good idea? Just confirming. I’ll run a few tests.

UPDATE: I’ve tested the brms model with 2 chains, 3 threads/chain, and 6 cores. It’s faster, but still painfully slow:

Of course I haven’t tried the approximations suggested by @martinmodrak because I’m still not sure how to do them.

mike-lawrence · July 12, 2021, 5:01pm

Still no GP for age, but in case you’re curious, here’s the model I promised above:

See the pdf for a figure describing the structure. There’s a somewhat unconventional thing whereby instead of one big multivariate normal for all the subject coefficients, I instead break them into k sets of 3-variable mvns, which I’m hoping achieves a good/principled balance between the desire to let the three parameter classes (influences on errors, influences on log-RT location, influences on log-RT scale) mutually inform without the bias-to-zero on this mutual-informativity that the full mvn would impose.

franzsf · July 12, 2021, 11:02pm

Minor suggestion here, given that I know nothing about your data or this field, but you could try different parameterizations of the splines following the functional variations (Figure 4) in this excellent paper:

While it doesn’t support (afaik) the ti or te smoothers, brms does support the t2. I’ve also had luck significantly reducing computation with specifying lower values of k explicitly. I think the default is 10.

martinmodrak · July 15, 2021, 7:46pm

Yes, that’s what I had in mind, forgot the exact brms syntax. Meanfield and fullrank are two variants of ADVI, see Stan manual for more details. It also seems you cannot do optimizing directly via brm, but you can always use make_stancode and make_standata to get a Stan model you can use with the optimization mode.

To a first approximation, you want to see nice uncorrelatec 2D Gaussian blobs everywhere. The first plot does not show something obviously wrong as the pairs that have weird shapes are AFAIK parameters that are transformed from the scales used internally by brms (not on computer so cannot check the naming, but I think the z variables are the untransformed ones). It is also usually better to log transform the sd parameters as those are sampled on the log scale.

The second plot looks more funky, especially the Intercept vs. sd, but I am honestly quite unsure what to make of it.

rok_cesnovar · July 16, 2021, 5:26am

Hey all,

this thread is super interesting and I dont really have anything to add at this point in the discussion.

I would maybe suggest editing the title to more accurately describe what the majority of the thread has been about as it has really gone in a different direction since the first post. Might help others that would benefit from this discussion and get others to chime in that might not open the thread given the title.

Joshua_Hartshorne · July 23, 2021, 7:33pm

@rok_cesnovar – stupid question: How do I edit the title? I don’t see an edit option.

Joshua_Hartshorne · July 23, 2021, 7:35pm

BTW I have partly answered my original problem. I had two problems. First, I was using RStudio 1.1, which apparently had a known bug with parallel, now fixed. Upgrading solved the problem.

But this raised a second problem: opencl support is deprecated on MacOS. So I think I probably can’t make use of parallelization in brms as described in the brms documentation.

Right?

martinmodrak · July 27, 2021, 2:35pm

Discourse put some limits on what new users can do. I see an edit icon immediately next to the title. If you don’t just giveme the intended title (preferably in a personal message to not spam the thread) and I’ll edit it for you.

Not necessarily: OpenCL has been deprecated but not removed (at least until Catalina). Not sure if everything will run on OpenCL 1.2 and I am not a Mac user myself, but it does not seem hopeless.

ucfagls · May 20, 2022, 11:21am

I’ts not just that the s(age, by = congruent) terms are centred about the overall model constant term, the basis for each of the smooths that this creates doesn’t include the span of constant functions. As a result, unless you include congruent as a parametric term, there is nothing in the model that is explicitly there to model the deviations of group means from the constant term. That means the centred splines have to work very hard to model both the variation in Y as a function of age and the group means (or at least deviations from the constant term).

It is almost certainly a mistake to not include congruent as a parametric factor term.

maedoc · May 21, 2022, 8:58pm

I don’t expect they’ll drop CL, since they seem to be using a CL to Metal translator when using CL on macOS now. Though, confusingly, when one uses CL via native arm64, it goes to GPU, while when using CL via Intel emulation, it runs on CPU. That’s kind of weird but still works, suggesting it will keep working in the future.

Topic		Replies	Views
Brms: How do I confirm that the GPU is working for speedup by opencl？ brms	7	2144	August 16, 2021
Large longitudinal models run in brms crashing on one Mac but not another brms	3	204	March 9, 2024
The analyses using brms has slowed down, any hint? General performance	9	1349	April 24, 2020
Brms: Error in opencl(c(0, 0)) : could not find function "opencl" brms	4	688	August 15, 2021
brms：OpenCL function on MacOS Catalina brms mac , gpu	2	966	November 13, 2023

Using the Apple M1 GPUs: question from a noob

Related topics