Stan_lmer really slow

I have a very similar situation to OP.

I haven’t noticed much difference between brms with the cmdstanr backend and with rstan. What I have found is that adding `threads = threading(2)` tends to slow down sampling for `brm(y ~ a + b + c + (1|id), chains = 4, cores = 4)` by a factor of 2-3 on a Ryzen with 12 physical cores.
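For anyone who wants to reproduce the comparison, here is a minimal sketch. It runs on simulated data, since my real data aren't shown here, so the sizes and coefficients are made up and `time_fit` is just a hypothetical helper name. It assumes brms and a working CmdStan install, and the elapsed time includes model compilation, so treat the numbers as rough.

```r
# Simulated stand-in data (assumption: 50 groups x 40 observations each).
set.seed(1)
n_id <- 50
n_per <- 40
d <- data.frame(
  id = factor(rep(seq_len(n_id), each = n_per)),
  a = rnorm(n_id * n_per),
  b = rnorm(n_id * n_per),
  c = rnorm(n_id * n_per)
)
id_eff <- rnorm(n_id)  # one random intercept per id
d$y <- 1 + 0.5 * d$a - 0.3 * d$b + 0.2 * d$c +
  id_eff[as.integer(d$id)] + rnorm(nrow(d))

# Hypothetical helper: fit the model once and return elapsed seconds.
# n_threads = NULL means no within-chain threading.
time_fit <- function(data, n_threads = NULL) {
  thr <- if (is.null(n_threads)) NULL else brms::threading(n_threads)
  system.time(
    brms::brm(
      y ~ a + b + c + (1 | id),
      data = data, chains = 4, cores = 4,
      backend = "cmdstanr", threads = thr, refresh = 0
    )
  )[["elapsed"]]
}

# Only run the (slow) fits in an interactive session with brms available.
if (interactive() && requireNamespace("brms", quietly = TRUE)) {
  t_plain <- time_fit(d)
  t_threaded <- time_fit(d, n_threads = 2)
  print(c(plain = t_plain, threaded = t_threaded))
}
```

The sketch leaves `grainsize` in `threading()` at its default, so it only covers the plain `threading(2)` case from above.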

Is this thread the best reference on using GPU parallelism?

What is the main difference between writing that model by hand and letting brms generate it that would make the hand-written version sample faster?