Hello,

I’m a novice to Stan and have now gone through a couple of iterations of coding to try to make things run faster. I’ve narrowed the bottleneck to a single routine that calculates the density of the sum of two dependent random variables. I originally coded this routine with the integrate_1d function, then with the integrate_ode_rk45 function, but neither of these worked (errors from Boost, presumably). The function I need to implement is:

f_C(t_C)=\int_m^{t_C}f_O(t_O)\,f_D(t_C-t_O|t_O)\,dt_O

So, I’ve implemented methods to use discrete approximations instead of integrate functions. The code runs now and gets reasonable answers, but the gradient evaluation is slow. Using only 1/4 of the dataset, I get the following time estimate:

```
Gradient evaluation took 0.987295 seconds
```

OK, I’m trying to improve on that. I’m hoping there is a more efficient way to implement the code that is causing the bottleneck: two nested for loops. I haven’t figured out how to “vectorize” or “matricize” these with built-in vector and/or matrix operations because of the dependence of the random variable D on the random variable O. Each iteration requires the D value (i.e., t_C - t_O) to be rescaled based on the O value in order not to violate this dependence; assuming independence results in significant inaccuracies. In this implementation, the random variables O and D are scaled and translated beta-distributed, although they don’t need to be.
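To make the dependence concrete, here is a minimal Monte Carlo sketch (the bounds and beta shape parameters below are made up for illustration, not taken from the model): D given O is a beta variable rescaled to the remaining interval [0, M - O], so C = O + D can never exceed M — something an independence assumption would violate.

```python
import numpy as np

rng = np.random.default_rng(0)
m, M = 0.5, 2.5               # assumed bounds (illustrative)
sO1, sO2 = 2.0, 2.0           # assumed beta shapes for O
sD1, sD2 = 2.0, 3.0           # assumed beta shapes for D | O

# O is a scaled and translated beta on [m, M]
O = m + (M - m) * rng.beta(sO1, sO2, size=10_000)
# D | O is a beta rescaled to [0, M - O]: its support depends on O
D = (M - O) * rng.beta(sD1, sD2, size=10_000)
C = O + D

# the dependence guarantees O <= C <= M
assert np.all(C >= O) and np.all(C <= M)
```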

Below is the Stan code that implements this bottleneck function. If someone would be willing to have a look and point out any ways this code could be made more efficient (computationally speaking), I would be very grateful.

```
array[] real fC_discretized(real m, real M, array[] real ps_fO,
                            real sD1, real sD2, array[] real xs) {
  int n = size(ps_fO);
  if (n != size(xs)) {
    reject("Number of fO bins doesn't match that of xs in fC");
  }
  array[n] real val;
  for (j in 1:n) {
    val[j] = 0.0;
    for (i in 1:j) {
      // xs[j] is C, xs[i] is O, xs[j] - xs[i] is D
      // M is the maximum possible value, n is the number of bins
      real pfD = 0.0;
      if (xs[i] != M) {
        real h = (M - xs[i]) / n;  // D-bin width, conditional on O = xs[i]
        real high = (xs[j] - xs[i]) / (M - xs[i]);
        real low = (xs[j] - xs[i] - h) / (M - xs[i]);
        if (low >= 0) {
          // vertical-bar syntax; recent Stan versions reject the old comma form
          pfD = beta_cdf(high | sD1, sD2) - beta_cdf(low | sD1, sD2);
        }
      }
      val[j] += ps_fO[i] * pfD;
    }
  }
  // Renormalize: shouldn't be needed, but the result doesn't quite sum to 1
  // (a scaling factor is missing somewhere). Once rescaled, it matches the
  // theoretical distribution very well.
  real tot = sum(val);
  for (i in 1:n) {
    val[i] = val[i] / tot;
  }
  return val;
}
```
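One restructuring worth trying: the inner loop is a dot product, so the whole double loop is a matrix–vector product val = K * ps_fO, where K[j, i] holds the D-bin probability for O-bin i evaluated at grid point j. In Stan you could fill a matrix K (row by row, using vectorized beta_cdf if your Stan version supports it) and replace the accumulation with one matrix–vector multiply, which can reduce autodiff overhead — though the O(n^2) beta_cdf evaluations themselves remain. A sketch of the equivalence in NumPy/SciPy (grid and shape parameters are made up; the grid is kept strictly below M so no xs[i] == M guard is needed):

```python
import numpy as np
from scipy.stats import beta

def fC_loops(xs, ps_fO, M, sD1, sD2):
    """Direct transcription of the nested-loop Stan code."""
    n = len(xs)
    val = np.zeros(n)
    for j in range(n):
        for i in range(j + 1):
            h = (M - xs[i]) / n
            high = (xs[j] - xs[i]) / (M - xs[i])
            low = (xs[j] - xs[i] - h) / (M - xs[i])
            if low >= 0:
                val[j] += ps_fO[i] * (beta.cdf(high, sD1, sD2)
                                      - beta.cdf(low, sD1, sD2))
    return val / val.sum()

def fC_matrix(xs, ps_fO, M, sD1, sD2):
    """Same computation as one kernel matrix times a vector."""
    n = len(xs)
    Xi, Xj = xs[None, :], xs[:, None]   # O grid (columns), C grid (rows)
    h = (M - Xi) / n                    # D-bin width depends on O
    high = (Xj - Xi) / (M - Xi)
    low = (Xj - Xi - h) / (M - Xi)
    # entries with low < 0 (including i >= j) contribute nothing,
    # exactly as in the loop version
    K = np.where(low >= 0,
                 beta.cdf(np.clip(high, 0.0, 1.0), sD1, sD2)
                 - beta.cdf(np.clip(low, 0.0, 1.0), sD1, sD2),
                 0.0)
    val = K @ ps_fO
    return val / val.sum()

m, M = 0.5, 2.5                         # assumed bounds (illustrative)
n = 40
xs = m + (M - m) * (np.arange(1, n + 1) - 0.5) / n   # bin midpoints, all < M
ps_fO = beta.pdf((xs - m) / (M - m), 2.0, 2.0)
ps_fO /= ps_fO.sum()

a = fC_loops(xs, ps_fO, M, 2.0, 3.0)
b = fC_matrix(xs, ps_fO, M, 2.0, 3.0)
assert np.allclose(a, b)
```

The same K could be reused across data points that share the grid, which is where the real savings would come from if many observations hit this function.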

Thank you for your thoughts and help with this.

In case this is of use, the original code using integrate_1d was:

```
real fC_integrand(real x, real xc, array[] real theta,
                  data array[] real x_r, data array[] int x_i) {
  real m = theta[1];   // minimum possible value
  real M = theta[2];   // maximum possible value
  real sO1 = theta[3]; // beta shape 1
  real sO2 = theta[4]; // beta shape 2
  real sD1 = theta[5]; // beta shape 1
  real sD2 = theta[6]; // beta shape 2
  real tC = theta[7];  // value of RV C (the evaluation point)
  // fO and fD are scaled and translated beta pdfs
  // x is the value of RV O, tC - x is the value of RV D
  return fO(x, m, M, sO1, sO2) * fD(tC - x, x, M, sD1, sD2);
}

real fC(real tC, real m, real M, real sO1, real sO2, real sD1, real sD2) {
  // not sure how to deal with x_r and x_i... (these dummy values are unused)
  return integrate_1d(fC_integrand, m, tC,
                      { m, M, sO1, sO2, sD1, sD2, tC }, {1.1}, {1});
}
```

Function fC is integrated in another function Pt, and Pt is integrated in another function fX, resulting in a triple integral that refused to cooperate. I’d prefer to go the “integrate” route if at all possible. Has anyone had success handling triple and quadruple integrals in an efficient way?
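As a cross-check outside Stan (a sketch with made-up bounds and shape parameters, not the actual model), the inner integral for f_C is well behaved for smooth betas, and nesting adaptive quadrature confirms f_C integrates to 1 over [m, M]. Comparing integrate_1d's output against something like this at a few fixed parameter values can help isolate whether the failures come from the integrand itself or from the tolerances:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

m, M = 0.5, 2.5                            # assumed bounds (illustrative)
sO1, sO2, sD1, sD2 = 2.0, 2.0, 2.0, 3.0    # assumed beta shapes

def f_O(t):
    # scaled and translated beta pdf on [m, M]
    return beta.pdf((t - m) / (M - m), sO1, sO2) / (M - m)

def f_D(d, t):
    # beta pdf rescaled to [0, M - t]: the support depends on O = t
    return beta.pdf(d / (M - t), sD1, sD2) / (M - t)

def f_C(tc):
    # f_C(t_C) = int_m^{t_C} f_O(t) f_D(t_C - t | t) dt
    return quad(lambda t: f_O(t) * f_D(tc - t, t), m, tc)[0]

# sanity check: f_C should be a proper density on [m, M]
total, _ = quad(f_C, m, M, limit=200)
assert abs(total - 1.0) < 1e-3
```

For the outer layers (Pt and fX), the same nesting works in principle but the cost multiplies; that is one reason quadrature-within-quadrature inside HMC gets expensive, since every evaluation also has to be differentiated.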