Slow computing speed in fitting AR1xAR1 model

Thank you @bbbales2 for replying to my question. I updated the function chunk with rstan code provided from this thread Multivariate animal model.

stanmodelcode5 <- "
functions {
 matrix chol_AR_matrix(real rho,int d){
   matrix[d,d] MatAR;
   MatAR = rep_matrix(0,d,d);
   for(i in 1:d){
     for(j in i:d){
       if(j>=i && i==1) MatAR[j,i] = rho^(j-1);
       else if(i>=2 && j>=i) MatAR[j,i] = rho^(j-i)*sqrt(1-rho^2);
     }
   }
   return MatAR;
 }
 
 matrix as_matrix(vector X, int N, int K) { 
   matrix[N, K] Y; 
   for (i in 1:N) {
     Y[i] = to_row_vector(X[((i - 1) * K + 1):(i * K)]); 
   }
   return Y; 
 }
 vector chol_kronecker_product(matrix LA, matrix LG, vector a) {
   vector[num_elements(a)] new_a;
   new_a = rep_vector(0, num_elements(a));
   for(iA in 1:cols(LA)){
     for(jA in 1:iA){
       if(LA[iA, jA] > 1e-10){ // avoid calculating products between unrelated individuals
         for(iG in 1:cols(LG)){
           for(jG in 1:iG){
             new_a[(cols(LG)*(iA-1))+iG] = new_a[(cols(LG)*(iA-1))+iG] + 
                                           LA[iA, jA] * LG[iG, jG] * a[(cols(LG)*(jA-1))+jG];
           }
         }
       }
     }
   }
   return new_a;
 }
}
data {
 int<lower=0> N;
 real y[N];
 int nrow;
 int ncol;
} 
parameters {
 real mu;
 real<lower=0,upper=1> rho_r;
 real<lower=0,upper=1> rho_c;
} 
model {
 matrix[nrow*ncol, 1] a;
 a = as_matrix(chol_kronecker_product(chol_AR_matrix(rho_r,nrow),chol_AR_matrix(rho_c,ncol), rep_vector(1,nrow*ncol)), nrow*ncol, 1);
 
 target += normal_lpdf(mu | 0, 10);
 target += normal_lpdf(y  | mu, 1);
} 
"

y <- rnorm(20) 
dat5 <- list(N = 20, y = y,nrow=18,ncol=93); 
fit5 <- stan(model_code = stanmodelcode5, model_name = "example", 
            data = dat5, iter = 100, chains = 1)
print(fit5)

The cost time for 100 iterations is

Elapsed Time: 135.205 seconds (Warm-up)
Chain 1:                102.643 seconds (Sampling)
Chain 1:                237.848 seconds (Total)

The output from check_hmc_diagnostics(fit5) is

Divergences:
0 of 50 iterations ended with a divergence.
Tree depth:
0 of 50 iterations saturated the maximum tree depth of 10.
Energy:
E-BFMI indicated no pathological behavior.

The output of get_num_leapfrog_per_iteration(fit5) is

 [1]  3 31 13  3 15 31 23 11 31  1  7 31 31 15 15 15  3 31 15 15  7  7  3
[24] 11 11 31 31 31  7 31 11 31 31 31 31 11 11  3 15 31 23 15 19  7 19 31
[47]  3 31 31  7

Compared to the model with given \rho s, the get_num_leapfrog_per_iteration is

 [1] 3 3 1 7 3 3 7 3 3 3 3 3 3 1 3 3 3 1 3 7 3 3 3 3 3 3 3 3 3 3 1 7 3 1 1
[36] 1 3 3 3 1 1 1 1 3 1 3 3 3 1 3

The problem is the model itself is not efficient. I will have a look at my code and any comments are welcome.

Thank you so much for your time and help.