Presumably this is something you need to run many times for many datasets?
Take a look at the tricks I used here; you’ll have to add the group structure but otherwise I think the reduced-redundant-computation and sufficient-stats tricks should give you substantial speed ups.