When working with `reduce_sum`, any arguments that aren't sliced get copied in full to every core/thread. This copy cost is much higher for parameters than for data, since for parameters both the value and the gradient need to be copied.
When calling `array[N] real Xb = to_array_1d(X * beta); // linear predictors`, all N values of `Xb` become parameters. It will probably be more efficient to pass `X` and `beta` to `reduce_sum` as separate arguments, so that only the M entries of `beta` are parameters.
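As a rough sketch of what that looks like (names like `partial_sum` and `grainsize` are placeholders, and I'm assuming a plain normal likelihood here): `X` is data and `beta` is the only parameter argument, so each thread copies just M parameter values plus `sigma`, not N linear predictors.

```stan
functions {
  // y_slice is the sliced argument; X and beta are passed separately,
  // so only beta (M values) and sigma are copied as parameters
  real partial_sum(array[] real y_slice, int start, int end,
                   matrix X, vector beta, real sigma) {
    return normal_lpdf(y_slice | X[start:end] * beta, sigma);
  }
}
...
model {
  target += reduce_sum(partial_sum, y, grainsize, X, beta, sigma);
}
```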
And to second @jsocolar's point, `reduce_sum` needs the complexity of the likelihood to outweigh the overhead of parallelism. If you have access to a GPU, you'll likely get much better performance by using `normal_id_glm` with OpenCL.
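For reference, the GLM version replaces the whole `X * beta` construction with a single fused call (assuming an intercept `alpha` in the model; drop it or set it to 0 if you don't have one). With a cmdstan build compiled with `STAN_OPENCL`, this call can be offloaded to the GPU:

```stan
model {
  // fused linear predictor + normal likelihood, OpenCL-capable
  y ~ normal_id_glm(X, alpha, beta, sigma);
}
```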