# How can I speed up my reduce_sum Stan code?

I have 30,000 samples to run through my program. I use the `reduce_sum` method, but it still takes a long time to run.

My code is below. Can I rewrite the loop inside `reduce_sum` as a vectorized expression?

``````functions {
  real partial_sum(array[] int slice_y,
                   int start, int end,
                   matrix x1, matrix x2, array[] vector thresh,
                   vector beta1, array[] vector beta2, array[] int g) {
    real lp = 0;
    for (i in start:end)
      lp += ordered_logistic_lpmf(slice_y[i - start + 1] | x1[i] * beta1 + x2[i] * beta2[g[i]], thresh[g[i]]);
    return lp;
  }

  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);

    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];

    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;

    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = -rho;
      J[k - 1, k] = rho;
    }
    return dirichlet_lpdf(p | alpha) + log_determinant(J);
  }
}

data {
  int<lower=1> N;                    // Number of observations
  int<lower=1> K;                    // Number of ordinal categories
  int<lower=1> D;                    // Total number of predictors
  int<lower=1> DN;                   // Number of group-varying predictors
  array[N] int<lower=1, upper=K> y;  // Observed ordinals
  matrix[N, D - DN] x1;
  matrix[N, DN] x2;
  array[N] int g;                    // Group index per observation
  int<lower=1> P;                    // P different thresholds
}
parameters {
  vector[D - DN] beta1;
  vector[DN] beta21;                 // beta2 for the first group
  real<lower=0> sigma;
  vector<lower=0>[DN] Omega;
  array[P] ordered[K - 1] thresh;
  array[P - 1] vector[DN] yita;      // Standardized random-walk increments
  //cholesky_factor_corr[DN] L_Omega;
}

transformed parameters {
  array[P] vector[DN] beta2;

  beta2[1] = beta21;
  for (i in 2:P)
    beta2[i] = beta2[i - 1] + Omega .* yita[i - 1];
}
model {
  //vector[DN] Zero = rep_vector(0, DN);
  beta1 ~ normal(0, 10);
  beta21 ~ normal(0, 10);
  thresh[1] ~ induced_dirichlet(rep_vector(1, K), 0);
  sigma ~ normal(0, 1);
  Omega ~ normal(0, 1);
  //yita[1] ~ normal(0, 1);
  for (i in 1:(P - 1)) {
    (thresh[i + 1] - thresh[i]) ~ normal(0, sigma);
    //(beta2[i + 1] - beta2[i]) ~ normal(0, Omega);
    yita[i] ~ normal(0, 1);
  }
  target += reduce_sum(partial_sum, y, 1, x1, x2, thresh, beta1, beta2, g);
}

``````

Any suggestions?

You can do a few things within `partial_sum`, though that may not be the bottleneck. You can sort the data so that the groups `g` are contiguous, which reduces scattered access into the sliced array. You can also accumulate `lp` into an array and sum once in the return statement:

``````int len = end - start + 1;
array[len] real lp;
for (i in 1:len) {
  lp[i] = ...;
}
return sum(lp);
``````
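To answer the vectorization question directly: recent Stan versions provide a vectorized `ordered_logistic_lpmf` signature that takes an array of outcomes, a vector of linear predictors, and an array of cutpoint vectors. The loop then only assembles arguments, and the log likelihood is computed in a single call. A sketch, keeping your argument layout (check that your Stan version supports this signature):

``````functions {
  real partial_sum(array[] int slice_y, int start, int end,
                   matrix x1, matrix x2, array[] vector thresh,
                   vector beta1, array[] vector beta2, array[] int g) {
    int len = end - start + 1;
    vector[len] eta;
    array[len] vector[num_elements(thresh[1])] c;
    for (i in 1:len) {
      int j = start + i - 1;
      eta[i] = x1[j] * beta1 + x2[j] * beta2[g[j]];
      c[i] = thresh[g[j]];  // cutpoints for this observation's group
    }
    // One vectorized lpmf call instead of len separate evaluations
    return ordered_logistic_lpmf(slice_y | eta, c);
  }
}
``````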

cmdstanr has profiling that you can use to see what parts of the model are bottlenecks.
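For example, Stan's `profile` statements let cmdstanr report the time spent in each labeled section; the labels below are arbitrary and the snippet is only a sketch of wrapping your existing model statements:

``````model {
  profile("priors") {
    beta1 ~ normal(0, 10);
    beta21 ~ normal(0, 10);
  }
  profile("likelihood") {
    target += reduce_sum(partial_sum, y, 1, x1, x2, thresh, beta1, beta2, g);
  }
}
``````

After fitting, `fit$profiles()` in cmdstanr returns the timing table per label.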

Also, in `reduce_sum` you are setting the `grainsize` to 1, which tells Stan to decide the partitioning automatically. If memory copies are the limiting factor rather than CPU work, you can start with a more conservative split: roughly, divide the number of observations by the number of physical cores per chain. That way you use as many physical cores as you have available while minimizing memory copies. Alternatively, depending on how many groups you have, you might split the data across groups; then there is no loop at all inside `partial_sum`.
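A sketch of setting an explicit grainsize along those lines, assuming for illustration 4 physical cores per chain (whatever you pass as `threads_per_chain`):

``````transformed data {
  // One slice per physical core; 4 is an assumed core count, adjust to your machine
  int grainsize = N %/% 4;
}
model {
  target += reduce_sum(partial_sum, y, grainsize, x1, x2, thresh, beta1, beta2, g);
}
``````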