Stan model is very slow with large data (only uses one CPU!)

Hi Community,

I am currently using cmdstanpy to train a model, but with more than 3k records it gets very slow. I have noticed that it only uses one core per chain, no matter what I do. So I read about the reduce_sum method and want to implement it in my code to utilize all available cores (I have 96 cores available). My model block is shown below. Can you help me with the correct reduce_sum implementation for this?

for (b in 1:B) {
    s_level[b] ~ std_normal();

    for (r in 1:R) {
        intercept_raw[b,r] ~ std_normal();
        sigma_raw[b,r] ~ std_normal();

        beta_seasonality_raw[b,r] ~ std_normal();

        value[I[b,r,1]:I[b,r,2]] ~ normal(mu[I[b,r,1]:I[b,r,2]], sigma[b,r]);
    }
}

It’s hard to say much without more information, but the first thing I’d recommend is vectorizing your likelihoods and priors, instead of doing them one-by-one in the for() loop. This may reduce the need for reduce_sum().

// declare an in-block variable
s_level ~ std_normal();
to_vector(intercept_raw) ~ std_normal();
to_vector(sigma_raw) ~ std_normal();
to_vector(beta_seasonality_raw) ~ std_normal();

// Do something to align value, mu, and sigma to the same set of indices; may require a for loop

value ~ normal(mu_aligned, sigma_aligned);
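One possible way to do the alignment step is to build the per-observation indices on the Python side and pass them in as data. The sketch below is only an illustration under assumptions: it takes I to hold 1-based, inclusive start/end positions (as the slicing in the original model block suggests), represents it as a nested Python list, and uses hypothetical names (observation_groups, b_idx, r_idx) that are not in the original model.

```python
def observation_groups(I):
    """Map every observation to its (b, r) group.

    I[b][r] is assumed to be a (start, end) pair of 1-based, inclusive
    positions into `value`, mirroring I[b,r,1]:I[b,r,2] in the Stan model.
    Returns two lists of 1-based indices, one entry per observation.
    Observations not covered by any range keep the placeholder 0.
    """
    n_obs = max(end for row in I for (_, end) in row)
    b_idx = [0] * n_obs
    r_idx = [0] * n_obs
    for b, row in enumerate(I, start=1):
        for r, (start, end) in enumerate(row, start=1):
            for n in range(start, end + 1):
                b_idx[n - 1] = b
                r_idx[n - 1] = r
    return b_idx, r_idx


# Toy example with B = 2, R = 2 and 9 observations (made-up ranges).
I = [[(1, 2), (3, 5)],
     [(6, 6), (7, 9)]]
b_idx, r_idx = observation_groups(I)
print(b_idx)  # [1, 1, 1, 1, 1, 2, 2, 2, 2]
print(r_idx)  # [1, 1, 2, 2, 2, 1, 2, 2, 2]

# Hypothetical use inside the Stan program, after declaring b_idx and r_idx
# as integer arrays of length N in the data block:
#   vector[N] sigma_aligned;
#   for (n in 1:N) sigma_aligned[n] = sigma[b_idx[n], r_idx[n]];
#   value ~ normal(mu, sigma_aligned);
```

If the ranges in I already cover value in order, mu needs no reindexing and only sigma has to be expanded this way. The two lists would then go into the data dictionary passed to cmdstanpy alongside the existing inputs.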

Thank you so much @Christopher-Peterson for your reply. I am new to Stan programming and don’t know how to implement this. I would appreciate it if you could help me vectorize this block. I am really grateful for your response.

I can help, but I’ll probably need to see the whole model code.

Incidentally, it may be worth seeing if brms can fit your model (although that would be based in R, not Python); it can automatically generate Stan code that uses reduce_sum(), and generally uses efficient parameterizations.
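On the cmdstanpy side, it may also be worth checking how the model is compiled and run: a chain only uses more than one core when the Stan program itself contains reduce_sum() (or map_rect()), the model is compiled with threading enabled, and threads_per_chain is set when sampling. Below is a minimal sketch of that setup. The helper names (threading_plan, fit), the even split of cores across chains, and the default of 4 chains are assumptions for illustration, not something taken from the original post.

```python
def threading_plan(total_cores, chains):
    """Split the available cores evenly across chains (at least 1 thread each)."""
    threads_per_chain = max(1, total_cores // chains)
    return {
        "chains": chains,
        "parallel_chains": chains,
        "threads_per_chain": threads_per_chain,
    }


def fit(stan_file, data, total_cores=96, chains=4):
    """Compile with threading support and sample with several threads per chain.

    cmdstanpy is imported lazily so the helper above can be used without it.
    Extra threads only help if the Stan program calls reduce_sum() or map_rect().
    """
    from cmdstanpy import CmdStanModel

    model = CmdStanModel(stan_file=stan_file, cpp_options={"STAN_THREADS": True})
    return model.sample(data=data, **threading_plan(total_cores, chains))


print(threading_plan(96, 4))
# {'chains': 4, 'parallel_chains': 4, 'threads_per_chain': 24}
```

With 96 cores and 4 chains this requests 24 threads per chain. Whether that many threads actually speeds things up depends on the model and on the grainsize chosen for reduce_sum(), so it is worth timing a few settings.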