Question about custom function (vectorization possible?)

Dear All,

We have simulated datasets of log-normally distributed response times that are assumed to be the sum of two latent response time components that are also log-normally distributed, i.e. rt = rt_{1} + rt_{2}. Therefore, I have implemented the Fenton-Wilkinson-Approximation as a custom function (see code below). The custom function is based on Eq. 3-6 in Pirinen (2003) and adapted from a Python implementation by Nakamura (2015). The approximation works, and the whole model converges more often than not, in reasonable time (90 minutes for I = 20 items and J = 500 persons). Nevertheless, profiling the model shows that most of the time is spent in the custom function.
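For intuition, the Fenton-Wilkinson approximation matches the first two moments of a sum of independent lognormals with a single lognormal. Here is a minimal NumPy sketch of that moment matching (my own illustration, independent of the Stan code below; the variable names are mine), checked against a Monte Carlo estimate:

```python
import numpy as np

def fenton_wilkinson(m, s2):
    """Fenton-Wilkinson approximation for a sum of independent lognormals.

    m, s2: arrays of the underlying normal means and variances.
    Returns (mu, sigma) of the lognormal approximating S = sum_i exp(N(m_i, s2_i)),
    chosen so that E[S] and E[S^2] match exactly.
    """
    m, s2 = np.asarray(m, float), np.asarray(s2, float)
    # E[S] = sum_i exp(m_i + s2_i / 2)
    log_ES = np.logaddexp.reduce(m + 0.5 * s2)
    # E[S^2] = sum_{i,j} E[X_i X_j]; by independence, for i != j
    # E[X_i X_j] = exp(m_i + m_j + (s2_i + s2_j) / 2),
    # while E[X_i^2] = exp(2 m_i + 2 s2_i), i.e. an extra s2_i on the diagonal.
    M = m[:, None] + m[None, :] + 0.5 * (s2[:, None] + s2[None, :]) + np.diag(s2)
    log_ES2 = np.logaddexp.reduce(M.ravel())
    sigma2 = log_ES2 - 2.0 * log_ES    # sigma_Z^2 = log E[S^2] - 2 log E[S]
    mu = 2.0 * log_ES - 0.5 * log_ES2  # mu_Z = log E[S] - sigma_Z^2 / 2
    return mu, np.sqrt(sigma2)

# sanity check against Monte Carlo draws of the sum
rng = np.random.default_rng(1)
m, s2 = np.array([0.2, 0.5]), np.array([0.3, 0.4])
draws = np.exp(rng.normal(m, np.sqrt(s2), size=(200_000, 2))).sum(axis=1)
mu, sigma = fenton_wilkinson(m, s2)
print(mu, sigma)
print(np.exp(mu + sigma**2 / 2), draws.mean())  # rough agreement expected
```

The two formulas in the `return` statement are the same quantities computed by the Stan function in this thread, just written with explicit moments instead of `log_sum_exp` over a term matrix.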

The custom function has three inputs: (1) a vector of two means and (2) a vector of two standard deviations, both calculated/sampled in the model, and (3) an integer specifying the number of response time components. My question is as follows: are there any possibilities to make the custom function more efficient, for instance by vectorizing it?

Here is the function:

fenton1_stan <- "
  functions {
    row_vector fenton1(vector m, vector s, int D) {
      matrix[D, D] term1;
      matrix[D, D] term2;
      row_vector[2] out;

      for (i in 1:D) {
        for (j in 1:D) {
          term1[i, j] = m[i] + m[j];
          term2[i, j] = s[i] + s[j];
        }
      }

      out = [2 * log_sum_exp(m + 0.5 * s) - 0.5 * log_sum_exp(term1 + 0.5 * term2 + diag_matrix(s)),
             sqrt(log_sum_exp(term1 + 0.5 * term2 + diag_matrix(s)) - 2 * log_sum_exp(m + 0.5 * s))];

      return out;
    }
  }
"

rstan::expose_stan_functions(rstan::stanc(model_code = fenton1_stan))

# example in Cobb & Rumi (2012); the result is correct!
fenton1(m = c(rep(0.69,5)), s = c(rep(1.074^2,5)), D = 5)

The loop is essentially equivalent to outer(m, m, FUN = '+') in R.
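For reference, the same outer-sum pattern in NumPy via broadcasting (a snippet of my own, not from the model; the values are arbitrary):

```python
import numpy as np

m = np.array([1.0, 2.0, 3.0])
s = np.array([1.0, 2.0, 3.0])

# term1[i, j] = m[i] + m[j], like R's outer(m, m, FUN = "+")
term1 = m[:, None] + m[None, :]
term2 = s[:, None] + s[None, :]

print(term1)
# [[2. 3. 4.]
#  [3. 4. 5.]
#  [4. 5. 6.]]
```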

Any help/suggestions appreciated. If you need more information, please let me know.

Best wishes
Chris


I checked out the Pirinen paper and see that the matrices term1 and term2 are symmetric. You only need to loop over the lower triangle. The sum can be completed in one step so we don’t need to construct two matrices.

Instead of constructing a matrix, which would allocate the unnecessary upper triangle (Stan has no lower-triangular matrix type), we can store the lower triangle in a vector.

A log(2) needs to be added to the log_sum_exp on the off-diagonal values to account for both triangles of the matrix.
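To make the symmetry argument concrete, here is a small NumPy check (my own sketch, not part of the Stan code): `log_sum_exp` over the full D×D term matrix equals `log_sum_exp` over the lower triangle with log(2) added to the off-diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
m = rng.normal(size=D)
s = rng.normal(1.0, 1.0, size=D)

# full D x D matrix built by the original double loop
A = m[:, None] + m[None, :] + 0.5 * (s[:, None] + s[None, :]) + np.diag(s)

# lower triangle only: each off-diagonal entry appears twice in A,
# so adding log(2) to it accounts for the missing upper triangle
i, j = np.tril_indices(D)
term = A[i, j] + np.where(i != j, np.log(2.0), 0.0)

full = np.logaddexp.reduce(A.ravel())
tri = np.logaddexp.reduce(term)
print(np.isclose(full, tri))  # True
```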

functions {
  row_vector fenton3(vector m, vector s, int D) {
    int N_lower = (D * (D - 1)) / 2 + D;
    vector[N_lower] term;
    real cache1;
    real cache2 = log_sum_exp(m + 0.5 * s);
    int iter = 1;

    term[iter] = (m[1] + m[1]) + 0.5 * (s[1] + s[1]) + s[1];

    for (i in 2:D) {
      for (j in 1:(i - 1)) {
        iter += 1;
        term[iter] = (m[i] + m[j]) + 0.5 * (s[i] + s[j]) + log2();
      }
      iter += 1;
      term[iter] = (m[i] + m[i]) + 0.5 * (s[i] + s[i]) + s[i];
    }

    cache1 = log_sum_exp(term);

    return [2 * cache2 - 0.5 * cache1,
            sqrt(cache1 - 2 * cache2)];
  }
}

Benchmarking with your original, as well as the lower-triangle matrix version (mat), gives

> mbm = microbenchmark(
+   original =  fenton1(m = rnorm(n), s = rnorm(n, 1, 1), D = n),
+   mat <- fenton2(m = rnorm(n), s = rnorm(n, 1, 1), D = n),
+   vec <- fenton3(m = rnorm(n), s = rnorm(n, 1, 1), D = n),
+   times = 500
+ )
> mbm
Unit: microseconds
                                                    expr    min      lq     mean  median      uq    max neval
                                                original 13.741 15.8030 18.27840 16.7695 19.3365 59.603   500
 mat <- fenton2(m = rnorm(n), s = rnorm(n, 1, 1), D = n) 11.172 13.1315 15.61566 14.1235 17.1325 66.951   500
 vec <- fenton3(m = rnorm(n), s = rnorm(n, 1, 1), D = n) 10.569 12.3105 14.66244 13.0260 16.6530 49.638   500

Not a great deal of speed up for small matrices but the larger the matrix size the more speedup you’ll see, for e.g.,

> n <- 100
> mbm = microbenchmark(
+   original =  fenton1(m = rnorm(n), s = rnorm(n, 1, 1), D = n),
+   mat <- fenton2(m = rnorm(n), s = rnorm(n, 1, 1), D = n),
+   vec <- fenton3(m = rnorm(n), s = rnorm(n, 1, 1), D = n),
+   times = 500
+ )
> mbm
Unit: microseconds
                                                    expr     min       lq     mean  median       uq      max neval
                                                original 225.931 287.2425 593.5566 415.102 798.6665 5944.074   500
 mat <- fenton2(m = rnorm(n), s = rnorm(n, 1, 1), D = n) 121.637 158.8060 280.0016 196.691 360.1385 3080.141   500
 vec <- fenton3(m = rnorm(n), s = rnorm(n, 1, 1), D = n)  74.827  99.4935 187.6425 132.416 237.3530 2475.996   500

Hi spinkney,

Thank you so much for looking into this, and the optimized code. I tested it in my model and it works like a charm. In the meantime, I adapted the code to work on lists in R (see below). Am I correct that, to make this work in Stan, I need to construct m and s to be arrays of vectors?

fenton3_stan <- '
functions {
  row_vector[] fenton3(vector[] m, vector[] s, int D) {
    int N = size(m);
    int N_lower = (D * (D - 1)) / 2 + D;
    row_vector[2] out[N];
    real cache1;
    real cache2;

    for (n in 1:N) {
      int iter = 1;
      vector[N_lower] term;

      cache2 = log_sum_exp(m[n] + 0.5 * s[n]);

      term[iter] = (m[n][1] + m[n][1]) + 0.5 * (s[n][1] + s[n][1]) + s[n][1];

      for (i in 2:D) {
        for (j in 1:(i - 1)) {
          iter += 1;
          term[iter] = (m[n][i] + m[n][j]) + 0.5 * (s[n][i] + s[n][j]) + log2();
        }
        iter += 1;
        term[iter] = (m[n][i] + m[n][i]) + 0.5 * (s[n][i] + s[n][i]) + s[n][i];
      }

      cache1 = log_sum_exp(term);

      out[n] = [2 * cache2 - 0.5 * cache1,
                sqrt(cache1 - 2 * cache2)];
    }

    return out;
  }
}'

rstan::expose_stan_functions(rstan::stanc(model_code = fenton3_stan))

On an unrelated note: I was wondering about the relationship between the generated quantities block and the rest of the Stan model specification. More specifically, I asked myself whether any quantities calculated in this block have any regularizing influence on the parameters sampled in the model?

Thanks and best wishes
Christoph

Yes, arrays of vectors are necessary for it to work in Stan. The way you have it works, though I'd break it out into two functions to clarify that this is just looping over a function. It has the additional benefit of making it easier to update if D happens to vary across the N-array.

functions {
  row_vector fenton3(vector m, vector s, int D, int N_lower) {
    vector[N_lower] term;
    real cache1;
    real cache2 = log_sum_exp(m + 0.5 * s);
    int iter = 1;

    term[iter] = (m[1] + m[1]) + 0.5 * (s[1] + s[1]) + s[1];

    for (i in 2:D) {
      for (j in 1:(i - 1)) {
        iter += 1;
        term[iter] = (m[i] + m[j]) + 0.5 * (s[i] + s[j]) + log2();
      }
      iter += 1;
      term[iter] = (m[i] + m[i]) + 0.5 * (s[i] + s[i]) + s[i];
    }

    cache1 = log_sum_exp(term);

    return [2 * cache2 - 0.5 * cache1,
            sqrt(cache1 - 2 * cache2)];
  }

  row_vector[] fenton_array(vector[] m, vector[] s, int D) {
    int N = size(m);
    int N_lower = (D * (D - 1)) / 2 + D;
    row_vector[2] out[N];

    for (n in 1:N)
      out[n] = fenton3(m[n], s[n], D, N_lower);

    return out;
  }
}

The generated quantities block shouldn't have any influence on parameters declared in the parameters block. It is run after each post-warmup HMC iteration, which is also why any issues in generated quantities will typically show up only after warmup.
