Slow computing speed in fitting AR1xAR1 model

Zhanglong · July 13, 2020, 2:09am

Hi Guys, I am using rstan to fit the model where the residual term is AR1\otimes AR1. Well, the problem is that the computing speed is extremely slow even if I put the term in a simple model.

stanmodelcode <- "
functions {
// cholesky decomposition of an AR matrix
  matrix chol_AR_matrix(real rho,int d){
    matrix[d,d] MatAR;
    MatAR = rep_matrix(0,d,d);
    for(i in 1:d){
      for(j in i:d){
        if(j>=i && i==1) MatAR[j,i] = rho^(j-1);
        else if(i>=2 && j>=i) MatAR[j,i] = rho^(j-i)*sqrt(1-rho^2);
      }
    }
    return MatAR;
  }
// Kronecker product of cholesky decompistions of two matrices
  matrix chol_kronecker_product(matrix matA, matrix matB) {
    matrix[rows(matA)*rows(matB),rows(matA)*rows(matB)] matC;
    matC = rep_matrix(0,rows(matA)*rows(matB),rows(matA)*rows(matB));
    for (k in 1:rows(matA)){
      for (l in 1:k){
        for (m in 1:rows(matB)){
          for (n in 1:m){
            matC[rows(matB)*(k-1)+m, cols(matB)*(l-1)+n] = matA[k,l] * matB[m,n];
          }
        }
      }
    }
    return matC;
  }
}
data {
  int<lower=0> N;
  real y[N];
  int n1;
  int n2;
} 
parameters {
  real mu;
  real<lower=0,upper=1> rho_c; // rho columns
  real<lower=0,upper=1> rho_r; // rho rows
} 
transformed parameters {
  matrix[n1*n2,n1*n2] ar1byar1; 
  ar1byar1 = chol_kronecker_product(chol_AR_matrix(rho_c,n1),chol_AR_matrix(rho_r,n2)); 
}
model {
  target += normal_lpdf(mu | 0, 10);
  target += normal_lpdf(y  | mu, 1);
} 
"
y <- rnorm(20) 
dat <- list(N = 20, y = y,n1=98,n2=13); 
fit <- stan(model_code = stanmodelcode, model_name = "example", 
            data = dat, iter = 100, chains = 1) 
print(fit)

As you can see that I didn’t estimate the two parameters, but only put them in the stan code to test the computing speed.
With unknown \rho s, the computing speed is extremely slow (350 seconds). However, if I give \rho_1=0.5 and \rho_2=0.6, for example, in the data chunk, the speed is much faster and runs as a normal simple model (1.5 seconds).

Alternatively, if I only calculate AR1 matrices without calculating Korenecher product, the computing speed is still okay to estimate unknown \rho s.

I don’t know why it is so slow if I put AR1 and Korenecher product together.

Thank you all in advance for your help and time.

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] LC_COLLATE=English_Australia.1252 
[2] LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252

Update:
I have put print("chains") somewhere in the code to monitor the process and noticed that rstan is still running but didn’t achieve any valid chains.

bbbales2 · July 13, 2020, 6:17pm

Is this the total Stan runtime? That can get longer cause the posterior is harder to sample (and in this case maybe the gradients wouldn’t be that expensive).

In Rstan, get_num_leapfrog_per_iteration(fit) * number_post_warmup_draws / walltime should be a better measure of exactly how expensive each function calculation is (Check HMC diagnostics after sampling — check_hmc_diagnostics • rstan for get_num_leapfrog_per_iteration docs).

Zhanglong · July 14, 2020, 2:15am

Thank you @bbbales2 for replying to my question. I updated the function chunk with rstan code provided from this thread Multivariate animal model.

stanmodelcode5 <- "
functions {
 matrix chol_AR_matrix(real rho,int d){
   matrix[d,d] MatAR;
   MatAR = rep_matrix(0,d,d);
   for(i in 1:d){
     for(j in i:d){
       if(j>=i && i==1) MatAR[j,i] = rho^(j-1);
       else if(i>=2 && j>=i) MatAR[j,i] = rho^(j-i)*sqrt(1-rho^2);
     }
   }
   return MatAR;
 }
 
 matrix as_matrix(vector X, int N, int K) { 
   matrix[N, K] Y; 
   for (i in 1:N) {
     Y[i] = to_row_vector(X[((i - 1) * K + 1):(i * K)]); 
   }
   return Y; 
 }
 vector chol_kronecker_product(matrix LA, matrix LG, vector a) {
   vector[num_elements(a)] new_a;
   new_a = rep_vector(0, num_elements(a));
   for(iA in 1:cols(LA)){
     for(jA in 1:iA){
       if(LA[iA, jA] > 1e-10){ // avoid calculating products between unrelated individuals
         for(iG in 1:cols(LG)){
           for(jG in 1:iG){
             new_a[(cols(LG)*(iA-1))+iG] = new_a[(cols(LG)*(iA-1))+iG] + 
                                           LA[iA, jA] * LG[iG, jG] * a[(cols(LG)*(jA-1))+jG];
           }
         }
       }
     }
   }
   return new_a;
 }
}
data {
 int<lower=0> N;
 real y[N];
 int nrow;
 int ncol;
} 
parameters {
 real mu;
 real<lower=0,upper=1> rho_r;
 real<lower=0,upper=1> rho_c;
} 
model {
 matrix[nrow*ncol, 1] a;
 a = as_matrix(chol_kronecker_product(chol_AR_matrix(rho_r,nrow),chol_AR_matrix(rho_c,ncol), rep_vector(1,nrow*ncol)), nrow*ncol, 1);
 
 target += normal_lpdf(mu | 0, 10);
 target += normal_lpdf(y  | mu, 1);
} 
"

y <- rnorm(20) 
dat5 <- list(N = 20, y = y,nrow=18,ncol=93); 
fit5 <- stan(model_code = stanmodelcode5, model_name = "example", 
            data = dat5, iter = 100, chains = 1)
print(fit5)

The cost time for 100 iterations is

Elapsed Time: 135.205 seconds (Warm-up)
Chain 1:                102.643 seconds (Sampling)
Chain 1:                237.848 seconds (Total)

The output from check_hmc_diagnostics(fit5) is

Divergences:
0 of 50 iterations ended with a divergence.
Tree depth:
0 of 50 iterations saturated the maximum tree depth of 10.
Energy:
E-BFMI indicated no pathological behavior.

The output of get_num_leapfrog_per_iteration(fit5) is

 [1]  3 31 13  3 15 31 23 11 31  1  7 31 31 15 15 15  3 31 15 15  7  7  3
[24] 11 11 31 31 31  7 31 11 31 31 31 31 11 11  3 15 31 23 15 19  7 19 31
[47]  3 31 31  7

Compared to the model with given \rho s, the get_num_leapfrog_per_iteration is

 [1] 3 3 1 7 3 3 7 3 3 3 3 3 3 1 3 3 3 1 3 7 3 3 3 3 3 3 3 3 3 3 1 7 3 1 1
[36] 1 3 3 3 1 1 1 1 3 1 3 3 3 1 3

The problem is the model itself is not efficient. I will have a look at my code and any comments are welcome.

Thank you so much for your time and help.

bbbales2 · July 14, 2020, 1:52pm

The number of leapfrog iterations for the model with the parameters is on the low side. I wouldn’t be surprised to see 128 or 256 occasionally in a model I thought was working alright.

It’s just with fixed parameters the number of leapfrog iterations are extremely low lol. The posterior must basically be a normal distribution.

Zhanglong · July 15, 2020, 2:42am

Thanks for the reply. I still could not solve the problem. The model looks good and the code is all right. Perhaps with a long time computation, the outcome is okay, but it is not efficient.

I encountered this problem a few weeks ago when I was reading the reference “Flexible modelling of spatial variation in agricultural field trials with the R package INLA” and tried to use INLA to fit the model with AR1\otimes AR1 spatial residual covariance. I used brms in the first place and found that it is not capable to do so. Then I coded up in rstan myself and am facing this problem.

bbbales2 · July 15, 2020, 6:29pm

Wait a minute:

model {
 matrix[nrow*ncol, 1] a;
 a = as_matrix(chol_kronecker_product(chol_AR_matrix(rho_r,nrow),chol_AR_matrix(rho_c,ncol), rep_vector(1,nrow*ncol)), nrow*ncol, 1);
 
 target += normal_lpdf(mu | 0, 10);
 target += normal_lpdf(y  | mu, 1);
}

a isn’t used anywhere, so it seems like there’s a problem with thtis model.

Zhanglong · July 16, 2020, 12:35am

Thanks for the reply. That’s right. I didn’t use a in the model. I put it there only for testing the computing and sampling speed. If I put it in the model, such as

// another parameter in the parameter chunk, and assume mu is a vector
// real<lower=0> sig; 
vector[N] b;
to_vector(b) ~ normal(0,1);
a = sig*chol_kronecker_product(chol_AR_matrix(rho_r,nrow),chol_AR_matrix(rho_c,ncol), b);
target += normal_lpdf(y | mu, exp(a));

it is still slow.

For a small size of data set, such as 10 by 10, it works okay. But for large size, it is too slow.

Zhanglong · July 20, 2020, 12:35pm

It is not a solution, but I would like to keep this thread up to date.
The model is a linear mixed model Y=X\beta+Zu+e, where the residual term is var(e) = \sigma^2 \Sigma = \sigma^2AR1(\rho_1)\otimes AR1(\rho_2).
I used the following stan code to fit the model, and the result is the same as given by asreml. The computing speed is still slow, but, at least, the outcome proves that the model is correct.

// generated with brms 2.13.0
functions {
  matrix chol_AR_matrix(real rho,int d){
    matrix[d,d] MatAR;
    MatAR = rep_matrix(0,d,d);
    for(i in 1:d){
      for(j in i:d){
        if(j>=i && i==1) MatAR[j,i] = rho^(j-1);
        else if(i>=2 && j>=i) MatAR[j,i] = rho^(j-i)*sqrt(1-rho^2);
      }
    }
    return MatAR;
  }
  
  matrix chol_kronecker_product(matrix matA, matrix matB) {
    matrix[rows(matA)*rows(matB),rows(matA)*rows(matB)] matC;
    matC = rep_matrix(0,rows(matA)*rows(matB),rows(matA)*rows(matB));
    for (k in 1:rows(matA)){
      for (l in 1:k){
        for (m in 1:rows(matB)){
          for (n in 1:m){
            matC[rows(matB)*(k-1)+m, cols(matB)*(l-1)+n] = matA[k,l] * matB[m,n];
          }
        }
      }
    }
    return matC;
  }  
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  int prior_only;  // should the likelihood be ignored?

  int nrow;
  int ncol;
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // residual SD
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  
  real<lower=0,upper=1> rho_r;
  real<lower=0,upper=1> rho_c;
}
transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  r_1_1 = (sd_1[1] * (z_1[1]));
}
model {
  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  
  matrix[N,N] BigSig = sigma*chol_kronecker_product(chol_AR_matrix(rho_r,nrow),chol_AR_matrix(rho_c,ncol));

  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
  }

  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 84.7, 31.3);
  target += student_t_lpdf(sd_1 | 3, 0, 31.3) - 1 * student_t_lccdf(0 | 3, 0, 31.3);
  target += std_normal_lpdf(z_1[1]);
  target += student_t_lpdf(sigma | 3, 0, 31.3) - 1 * student_t_lccdf(0 | 3, 0, 31.3);

  // likelihood including all constants
  if (!prior_only) {
    target += multi_normal_cholesky_lpdf(Y | mu, BigSig);
  }
}
generated quantities {
  vector[N] y_rep;
  matrix[N,N] BigSig;
  vector[N] mu = Intercept + Xc * b;
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
  }
  BigSig = sigma*chol_kronecker_product(chol_AR_matrix(rho_r,nrow),chol_AR_matrix(rho_c,ncol));
  y_rep = multi_normal_cholesky_rng(mu, BigSig);
}

I will investigate more on this problem and try to find a solution that the computation could finish within a reasonable amount of time.

Zhanglong · September 21, 2020, 8:31am

I probably have resolved the problem. With the property of Kronecker product:
A\otimes B = (L_A L_A^\top)\otimes (L_B L_B^\top) = (L_A\otimes L_B)(L_A \otimes L_B)^\top, I found that in stan doing Kronecker product on vectors is faster than doing it on matrices.

This post transforms the Kolesky decomposition matrix L to a vector by L*z. However, it is still slow.

My idea is using the formula (L_A\otimes L_B)(z_1\otimes z_2) = (L_Az_1)\otimes (L_B z_2), with which we do Kronecker product on two vectors. It proves that this solution is much faster. I computed three matrices for 1000 times as an example and here are the results (suppose we have already got the three L matrices with sizes n1,n2,n3 and M=1000:

system.time({
   for(i in 1:M) r1 = as_matrix(chol_kronecker_product_three(L1,L2,L3,z5),n1*n2,n3)
 })
   user  system elapsed 
  28.37    0.02   28.52 
 system.time({
   for(i in 1:M) r2 = kroVecToMat(chol_kronecker_product_two(L1,L2,z4),L3%*%z3)
 })
   user  system elapsed 
   2.77    0.00    2.76 
 system.time({
   for(i in 1:M) r3 = kroVecToMat(kroVec(L1%*%z1,L2%*%z2),L3%*%z3)
 })
   user  system elapsed 
   0.16    0.00    0.16

R and rstan functions

z1 <- rnorm(n1)
z2 <- rnorm(n2)
z3 <- rnorm(n3)
z4 <- kronecker(z1,z2)
z5 <- kronecker(z4,z3)

stanfun <- 
  "
functions{
  matrix kroVecToMat(vector A, vector B){
    vector[rows(A)*rows(B)] C;
    for (j in 1:rows(A)) {
      C[((j - 1)*rows(B) + 1):(j*rows(B))] = B*A[j];
    }
    return to_matrix(C,rows(A),rows(B),0);
  }
  
  vector kroVec(vector A, vector B){
    vector[rows(A)*rows(B)] C;
    for (j in 1:rows(A)) {
      C[((j - 1)*rows(B) + 1):(j*rows(B))] = B*A[j];
    }
    return C;
  } 
  
  matrix as_matrix(vector X, int N, int K) { 
    matrix[N, K] Y; 
      for (i in 1:N) {
        Y[i] = to_row_vector(X[((i - 1) * K + 1):(i * K)]); 
      }
    return Y; 
  }

  vector chol_kronecker_product_three(matrix LA,matrix LB,matrix LC, vector d) {
    vector[num_elements(d)] new_d;
    new_d = rep_vector(0, num_elements(d));
    for(iA in 1:cols(LA)){
      for(jA in 1:iA){
        for(iB in 1:cols(LB)){
          for(jB in 1:iB){
            for(iC in 1:cols(LC)){
              for(jC in 1:iC){
                new_d[cols(LC)*(cols(LB)*(iA-1)+iB-1)+iC] = new_d[cols(LC)*(cols(LB)*(iA-1)+iB-1)+iC] + LA[iA,jA]*LB[iB,jB]*LC[iC,jC]*d[cols(LC)*(cols(LB)*(jA-1)+jB-1)+jC];
              }
            }
          }
        }
      }
    }
    return new_d;
  }
  
  vector chol_kronecker_product_two(matrix LA,matrix LB,vector d) {
    vector[num_elements(d)] new_d;
    new_d = rep_vector(0, num_elements(d));
    for(iA in 1:cols(LA)){
      for(jA in 1:iA){
        for(iB in 1:cols(LB)){
          for(jB in 1:iB){
            new_d[cols(LB)*(iA-1)+iB] = new_d[cols(LB)*(iA-1)+iB] + LA[iA,jA]*LB[iB,jB]*d[cols(LB)*(jA-1)+jB];
          }
        }
      }
    }
    return new_d;
  }
  
  matrix chol_three(matrix LA,matrix LB,matrix LC, vector d) {
    vector[num_elements(d)] new_d;
    new_d = rep_vector(0, num_elements(d));
    for(iA in 1:cols(LA)){
      for(jA in 1:iA){
        for(iB in 1:cols(LB)){
          for(jB in 1:iB){
            for(iC in 1:cols(LC)){
              for(jC in 1:iC){
                new_d[cols(LC)*(cols(LB)*(iA-1)+iB-1)+iC] = new_d[cols(LC)*(cols(LB)*(iA-1)+iB-1)+iC] + LA[iA,jA]*LB[iB,jB]*LC[iC,jC]*d[cols(LC)*(cols(LB)*(jA-1)+jB-1)+jC];
              }
            }
          }
        }
      }
    }
    return to_matrix(new_d,cols(LA)*cols(LB),cols(LC),0);
  }
  
  
}
"
expose_stan_functions(stanc(model_code = stanfun))

Comment (on 23/09/2020): Perhaps I am overthinking, but I do have a concern. The original method uses (L_1\otimes L_2)z, where z\sim N(0,1), to calculate the Kronecker product of two matrices and convert it to a vector. Then I used (L_1z_1)\otimes(L_2z_2) to calculate the Kronecker product of two vectors, where z_1\sim N(0,1), z_2\sim N(0,1).
My concern is: are these two methods identical? Numerically, I would say yes, but theoretically, I am not sure. If we expand the formula to (L_1z_1)\otimes(L_2z_2)=(L_1\otimes L_2)(z_1\otimes z_2), on the right side, z_1\otimes z_2 is supposed to be the same as z. But it is a tensor space, not a vector. I also have realized that if two variables a,b are Gaussian, the product ab is NOT Gaussian.
Please feel free to put comments below and let me know whether it is correct. Thank you.

bbbales2 · September 21, 2020, 4:29pm

Thanks for sharing your solution!

Ederdbs · November 6, 2021, 5:49pm

Hi @Zhanglong,
I am new user of stan, and have interesting in this model with AR1xAR1 structure.
in the begins of code has “// generated with brms 2.13.0”, you reproduce this model (AR1xAR1) in brms or only direct in stan?
is it possible to share the minimal example code for this example?
Thanks

Zhanglong · November 8, 2021, 1:37am

Hi @Ederdbs, thanks for the post. I reproduced the LMM in brms with the formula
form <- brmsformula(yield ~ 1 + x1 + (1 + x2 | rep ))
and
code<- make_stancode(form, data=df)
to produce the stand code. I modified the residual term by myself in rstan.

Topic		Replies	Views
Stan efficient! Modeling	7	1043	February 17, 2022
Improving efficiency for dynamic coevolutionary Stan model Modeling	12	134	August 8, 2024
Rstan much slower than self-coded MCMC Modeling performance	10	2416	January 14, 2022
Improving efficiency when modeling autocorrelation Modeling performance	8	1758	October 21, 2019
Elementwise multiplications with row sums appear to be slow Modeling fitting-issues , specification	32	1206	August 23, 2021

Slow computing speed in fitting AR1xAR1 model

Related topics