How can I speed up my reduce_sum Stan code?

I have 30,000 samples to run in my program. I use the reduce_sum method, but it still takes a long time to run.

My code is below. Can I rewrite the loop inside reduce_sum as a vectorized statement?

functions {
  real partial_sum(array[] int slice_y,
                   int start, int end,
                   matrix x1, matrix x2, array[] vector thresh,
                   vector beta1, array[] vector beta2, array[] int g) {
    real lp = 0;
    for (i in start:end)
      lp += ordered_logistic_lpmf(slice_y[i - start + 1] |
              x1[i] * beta1 + x2[i] * beta2[g[i]], thresh[g[i]]);
    return lp;
  }

  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);
    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];
    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;
    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = -rho;
      J[k - 1, k] = rho;
    }
    return dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
}

data {
  int<lower=1> N;                   // number of observations
  int<lower=1> K;                   // number of ordinal categories
  int<lower=1> D;
  int<lower=1> DN;
  array[N] int<lower=1, upper=K> y; // observed ordinals
  matrix[N, D - DN] x1;
  matrix[N, DN] x2;
  array[N] int g;                   // group index per observation
  int<lower=1> P;                   // P different thresholds
}
parameters {
  vector[D - DN] beta1;
  vector[DN] beta21;
  real<lower=0> sigma;
  vector<lower=0>[DN] Omega;
  array[P] ordered[K - 1] thresh;
  array[P - 1] vector[DN] yita;
  //cholesky_factor_corr[DN] L_Omega;
}


transformed parameters {
  array[P] vector[DN] beta2;
  beta2[1] = beta21;
  for (i in 2:P)
    beta2[i] = beta2[i - 1] + Omega .* yita[i - 1];
}
model {
  //vector[DN] Zero = rep_vector(0, DN);
  beta1 ~ normal(0, 10);
  beta21 ~ normal(0, 10);
  thresh[1] ~ induced_dirichlet(rep_vector(1, K), 0);
  sigma ~ normal(0, 1);
  Omega ~ normal(0, 1);
  //yita[1] ~ normal(0, 1);
  for (i in 1:(P - 1)) {
    (thresh[i + 1] - thresh[i]) ~ normal(0, sigma);
    //(beta2[i + 1] - beta2[i]) ~ normal(0, Omega);
    yita[i] ~ normal(0, 1);
  }
  target += reduce_sum(partial_sum, y, 1, x1, x2, thresh, beta1, beta2, g);
}

Any suggestions?

You can do a few things within partial_sum, though that may not be the bottleneck. You can sort the data so that the groups g are contiguous, which minimizes slicing of the array. You can also make lp an array and sum just once in the return statement:

int len = end - start + 1;
array[len] real lp;
for (i in 1:len) {
  lp[i] = ...;
}
return sum(lp);

cmdstanr has profiling that you can use to see what parts of the model are bottlenecks.
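For example, you could wrap the prior and likelihood contributions in profile blocks (available since Stan 2.26) and then read the timings from the fit with cmdstanr's fit$profiles(). A minimal sketch of the model block, assuming the same data and parameters as above:

model {
  profile("priors") {
    beta1 ~ normal(0, 10);
    // ... remaining prior statements ...
  }
  profile("likelihood") {
    target += reduce_sum(partial_sum, y, 1, x1, x2, thresh, beta1, beta2, g);
  }
}

If the "likelihood" row dominates the total time, the reduce_sum body is worth optimizing; if the gradient time is dominated elsewhere, parallelizing further will not help much.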

Also, in reduce_sum you are setting the grainsize to 1, which tells Stan to decide the split automatically. If memory copies, rather than CPU work, are the limit, you can start with a more conservative split: for example, divide the number of observations by the number of physical cores per chain. That way you use as many physical cores as you have available while minimizing memory copies. Or, depending on how many groups you have, you might slice the data across groups; then inside partial_sum there is no per-observation loop at all.
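A sketch of that group-sliced variant, assuming the data are sorted by g and you precompute hypothetical per-group start/end row indices gs and ge in transformed data (these names are not from the original model):

functions {
  // Slice over group indices rather than observations; each group's
  // likelihood is one vectorized ordered_logistic_lpmf call.
  real partial_sum_groups(array[] int slice_groups, int start, int end,
                          array[] int y, matrix x1, matrix x2,
                          array[] vector thresh, vector beta1,
                          array[] vector beta2,
                          array[] int gs, array[] int ge) {
    real lp = 0;
    for (j in 1:size(slice_groups)) {
      int p = slice_groups[j];
      lp += ordered_logistic_lpmf(y[gs[p]:ge[p]] |
              x1[gs[p]:ge[p]] * beta1 + x2[gs[p]:ge[p]] * beta2[p],
              thresh[p]);
    }
    return lp;
  }
}

You would then call it as target += reduce_sum(partial_sum_groups, group_ids, 1, y, x1, x2, thresh, beta1, beta2, gs, ge), where group_ids is simply the array {1, ..., P}. With only P slice elements, each partial does one matrix-vector multiply and one vectorized lpmf per group, which is usually much cheaper than N scalar lpmf calls.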