# Efficient code to sum two dependent random variables

Hello,

I’m a novice to stan and have now gone through a couple iterations of coding to try to make things run faster. I’ve narrowed the bottleneck to a single routine that calculates the sum of two dependent random variables. I originally coded this routine with the integrate_1d function, then with the integrate_ode_rk45 function, but neither of these worked (errors from boost, presumably). The function I need to implement is:

f_C(t_C|t_O)=\int_m^{t_C}f_O(t_O)f_D(t_C-t_O|t_O)dt_O

So, I’ve implemented methods to use discrete approximations instead of integrate functions. The code runs now and gets reasonable answers, but the gradient evaluation is slow. Using only 1/4 of the dataset, I get the following time estimate:

Gradient evaluation took 0.987295 seconds


Ok, I’m trying to improve on that. I’m hoping there is a more efficient way to implement the code that is causing the bottleneck. It involves two nested for loops. I haven’t figured out how to “vectorize” or “matricize” these to use built-in vector and / or matrix operations due to the dependence of the D random variable on the O random variable. Each iteration requires the D value (i.e., t_C-t_O) to be updated based on the O value in order not to violate this dependence. Assuming independence results in significant inaccuracies. For this implementation, the random variables O and D are scaled and translated beta-distributed, although they don’t need to be.

Below is the stan code that implements this bottle-necking function. If someone would be willing to have a look and point out any ways this code could be made more efficient (computationally speaking), I would be very grateful.

        array[] real fC_discretized(real m, real M, array[] real ps_fO, real sD1, real sD2, array[] real xs) {
int n = size(ps_fO);
if(n != size(xs)) {
reject("Number of fO bins doesn't match that of xs in fC");
}
array[n] real val;
for(j in 1:n) {
val[j] = 0.0;
real pfD = 0.0;
for(i in 1:j) {
if(xs[i] == M) {
pfD = 0.0;
}
else {
//xs[j] is C, xs[i] is O, xs[j]-xs[i] D
//M is max possible value, n is number of bins
real h = (M-xs[i]) / n;
real high = (xs[j]-xs[i])/(M-xs[i]);
real low = (xs[j]-xs[i]-h)/(M-xs[i]);
if(low<0) {
pfD = 0.0;
}
else {
pfD = beta_cdf(high, sD1, sD2) - beta_cdf(low, sD1, sD2);
}
}
val[j] += ps_fO[i] * pfD;
}
}
//renormalized - shouldn't need to, but seems to not sum to 1; missing scaling somewhere... otherwise, matches the theoretical distribution very well when scaled
real tot = sum(val);
for(i in 1:n) {
val[i] = val[i]/tot;
}
return val;
}



Thank you for your thoughts and help with this.

In case this is of use, the original code using integrate_1d was:

real fC_integrand(real x, real xc, array[] real theta, array[] real x_r, array[] int x_i ) {
real m = theta; //minimum possible value
real M = theta; //maximum possible value
real sO1 = theta; //beta shape 1
real sO2 = theta; //beta shape 2
real sD1 = theta; //beta shape 1
real sD2 = theta; //beta shape 2
real tC = theta; //value of RV O
//fO and fD are scaled and translated beta pdfs
//x is value of RV O, tC-x is value of RV D
real res = fO(x, m, M, sO1, sO2) * fD(tC-x, x, M, sD1, sD2);
return res;
}

real fC(real tC,real m, real M, real sO1, real sO2, real sD1, real sD2) {
//not sure how to deal with x_r and x_i...
real res = integrate_1d(fC_integrand, m, tC, { m, M, sO1, sO2, sD1, sD2, tC }, {1.1}, {1});
return res;
}


Function fC is integrated in another function Pt, and Pt is integrated in another function fX, resulting in a triple integral that refused to cooperate. I’d prefer go the “integrate” route if at all possible. Has anyone had success handling triple and quadruple integrals in an efficient way?

2 Likes

The multiple nested integrals seem nasty and hard. I have no idea how to attack that part of the problem. However, there might be good enough analytically tractable (and cheap) approximations to the sum - which is where I have a little bit of experience.

But before we go there, I’d rather make sure I am not missing something important about your problem specification. First, in fC_discretized you assume that the shape parameters are shared between the two Beta distributions, while in fC_integrand they are allowed to differ. Second, it appears that your code assumes, that given the shape parameters, the two beta distributions are independent, so I am not sure where the dependency you speak of arises - is there something missing in the problem description?

If the two variables are indeed dependent only via the shape parameters, than two candidates for approximate distribution of the sum are what I’d call “moment matching” and the saddlepoint approximation. Using those for sums of negative binomial distributions is described in my blog at: Approximate Densities for Sums of Variables: Negative Binomials and Saddlepoint

For moment matching you can compute the mean and variance of C as sum of the mean and variance of O and D and then find a member of suitable family of two-parameter distributions that matches the mean and variance computed in this way. For the scaled & shifted Beta sum, I’d expect a Beta distribution scaled and shifted to the range of the sum to be a decent candidate. Both Beta and normal approximations to sum of (non-scaled) Beta are worked-out at probability - Sum of n i.i.d Beta-distributed variables - Mathematics Stack Exchange In my simulations with the negative binomial, this type of approximation was very cheap to compute and already mostly indistinguishable from the real deal in most cases.

The saddlepoint approximation is a bit more accurate and also more involved and I’d refer to the blog and other sources for details. I am not completely sure it would be tractable for the Beta distribution as the moment-generating function of the Beta distribution with shape parameters \alpha, \beta is M_X(t) = {}_1 F_1(\alpha,\alpha+\beta,t) which sounds like an annoying formula to work with. There’s a paper on the saddlepoint for beta: Nadarajah, S., Jiang, X., & Chu, J. (2014). A saddlepoint approximation to the distribution of the sum of independent non-identically beta random variables.. @andrjohns apparently has been working on support for the confluent hypergeometric function in Stan but it appears to not have been finished yet: add hypergeometric_1F1, log_hypergeometric_1F1 and their gradients · Issue #2764 · stan-dev/math · GitHub

I believe both approaches would generalize even to the case where the two distributions have some correlation (beyond sharing shape parameters), albeit with some difficulties, so I won’t be going there unless you are sure that’s what is needed.

Best of luck with your model!

3 Likes

Hi martinmodrak,

Thank you very (very!) much for your feedback and thoughts! I’ll look into the saddlepoint and moment matching approximations. The saddlepoint approximation is new to me, so that will be good to check.

The following is a bit more detail to give some clarifications. The scenario is that there is a finite period of time with a constant set start point, m, and a constant set endpoint M. Within this interval there are two events of interest. The first event, the onset, modeled by R.V. O, can occur at any point in time between m and M. The second event, the cessation, C, is constrained to occur before M but after O, so it occurs somewhere within a time interval of length M-O. The time between O and C is the duration, D, which is the difference C-O.

In short, there is an event at time O when a process starts, the process lasts an amount of time D, and finally the process ends at time C.

The events are dependent because m \le O \le M and 0 \le D \le M-O. The dependence between O and D comes through the fact that D is constrained to be \le M-O. When D is modeled by a beta distribution, the beta distribution is scaled to be between 0 and M-O (rather than between 0 and 1). This is the only dependence between O and D. So, the shape parameters for O and the shape parameters for D are not shared.

Here is the broader picture. There is a population of individuals. Each individual enters a state of interest at time O and leaves the state of interest at time C. This happens only once during the interval of time between m and M, or for some individuals, possibly not at all. The individuals are observed at random points in time, and it is recorded whether the individual is in the state of interest or not. Our data consist of a random draw from the distribution of observed times when individuals are in the state of interest. The goal is to infer the distribution of the onset times O and the distribution of the durations D using a Bayesian analysis. For this, one needs the likelihood function, derived below:

The PDF for the onset is f_O(t_O)=\frac{f_{O_{unscaled}}(\frac{t_0-m}{M-m}|\theta_O)}{M-m} where f_{O_{unscaled}} is here modeled as a standard beta PDF with shape parameters \theta_O=(aO,bO) (but any distribution with finite support could be used), and m and M are as above and t_O is the onset time. This is derived from a change of variable.

The PDF for the duration is f_D(t_D|t_O)=\frac{f_{D_{unscaled}}(\frac{t_D}{M-t_O}|\theta_D)}{M-t_O} where f_{D_{unscaled}} is here modeled a standard beta PDF with shape parameters \theta_D=(aD,bD) (but any distribution with finite support could be used), and m and M are as above and t_D is the duration of the state of interest. This is derived from a change of variable.

The PDF for the cessation, f_C(t_C|t_O)=\int_m^{t_C}f_O(t_O)f_D(t_C-t_O|t_O)dt_O, is derived by considering that C is just the sum of two random variables: C = O+D. In other words, any possible combination of onset times and durations that add up to the cessation time must be considered (these are disjoint events), then the “or” rule of probability is applied to get the (summation) integral for (discrete) continuous random variables. t_C is the cessation time. (This is the first integral)

In order for an individual to be in the state of interest at a randomly observed time, t, this time must be between O and C. In other words, the onset time must be earlier than t and the cessation time must be after t. Given CDFs F_*, apply the “and” rule of probability (or apply the relationship between the joint distribution and conditional distribution) to obtain the probability, P_t, that an individual that is observed at a random time, t, is in the state of interest:

P_t=F_O(t)(1-F_C(t|O))
(This is the second integral)

Normalizing P_t results in the PDF f_X(t) of observed times, t, when individuals are in the state of interest:

f_X(t) = \frac{P_t}{\int_m^MP_tdt}
(This is the third integral)

With the assumption that individuals enter and leave the state of interest independently of one another, the likelihood function is therefore:

L(m,M,\theta_O, \theta_D;\textbf{t})=P(\textbf{t}|m,M,\theta_O, \theta_D) = \prod_1^Nf_X(t_i)

To understand the factors that contribute to the onset times and durations, the mean of the beta distribution (e.g., m_O=\frac{aO}{aO+bO} in the case of the unscaled beta-distributed O_{unscaled}) is modeled as an affine function of the predictor variables X_i of interest (although, it could be an arbitrary polynomial rather than an affine function, and rjMCMC could be applied to infer a probability distribution over models):

m_O=O_{intercept} +\beta_O\cdot\textbf{X}
m_D =D_{intercept} +\beta_D\cdot\textbf{X}

\beta_* is a vector of the coefficients of the affine function.

The variance \sigma_O^2=\frac{m_O(1-m_O)}{aO+bO+1} for O_{unscaled} is a single parameter, and the variance \sigma_{D_{unscaled}}^2=\frac{m_D(1-m_D)}{aD+bD+1} for D_{unscaled} is a single parameter as well. The variances could be modeled as functions of the factors, but to keep things “simple”, I’m assuming homoscedasticity. Once values for the \beta_*, itercepts, and variances are obtained during the MCMC sampling, and the means are found by applying the affine functions for the means, then the shape parameters a_* and b_* are obtained by solving the system of equations \sigma_*^2=\frac{m_*(1-m_*)}{a*+b*+1} and m_*=\frac{a*}{a*+b*} for a_* and b_*.

The Bayesian analysis is then implemented, more or less, as a standard hierarchical Bayesian analysis. The priors are critical for identifiability and act as “penalizers”.

Ultimately, the goal is to make this inference process as computationally efficient as possible. I derived the above theory, am in the process of publishing it, and the journal editor requested a robust software tool to carry out the inference. Any thoughts that would make the implementation of this inference as computationally efficient as possible would be hugely appreciated. If you are able to contribute ideas that substantially increase efficiency, I’d be very willing to include you as co-author if you’d like. Thank you again for the two approaches you recommended. I’ve also considered using a Taylor series approximation, but I’m not quite sure how to do that…

Thank you again for your thoughts!

I find the setup somewhat suspicious: if I understand the description correctly, you assume, that whatever causes the time to onset to be longer causes the time from onset to cessation to be shorter. I am having a hard time imagining a real-world process that works this way (but hey, you said nothing about the real-world counterpart for the model, so I might easily be missing something). Some similar situations that would sound more generally plausible to me would be: a) the time from onset to cessation is in fact independent from the time to onset, but only measurment units that experienced both events end up in your dataset, effectively acting as truncation and b) time of both events is independently chosen within the given interval, but only measurement units where onset happens before cessation end up in your dataset. In both cases, there still would be correlation between the even time, but that would be an artifact of the measurement and not a causal relationship.

The last sentence strongly suggests that some form of truncation is taking place…

Just to be clear: so you don’t have any data on individuals that weren’t in the state of interest at the chosen time? Not even the time of measurement (with possibly the covaraites missing)?

With that said, if you are sure the description you have is correct, I am quite sure the computation can be substantially simplified. In the following, I’ll assume familiairty with censoring and its implementation in Stan as discussed at 4.3 Censored data | Stan User’s Guide

First a fun but important fact about Bayesian analysis: any sort of selective reporting (i.e. you observe the data and then decide whether to analyse it or whether to redo the experiment to get new data) has no effect on the posterior (a simple proof is at Rejection sampling in simulations • SBC). So we can imagine we take a random sample from all individuals, measure them at random times and record 0 or 1 depending on whether they are in the “state of interest”. If all individuals are in the state of interest, we analyze the data, if not, we redo the experiment until we get all individuals in the “state of interest”. As long as the probability of “all 1s” is > 0 this works, even it the probability is ridiculously low.

So what would we do if our observations actually consisted of 0 or 1 for individuals?

First, let’s imagine that we observed the unit at time t in the “state of interest” and we know the time of onset, t_O. Now we can compute the possible range for t_D and know it is left-censored at t - t_O, so the probability is simply 1 - F_D(t - t_O|t_O) which translates to a single call to beta_proportion_lccdf.

But we do not know t_O. So we have two options:

1. treat t_O for reach individual in the data as an explicit paramater in the model. It is right-censored (i.e. we know that t_O < t and can use it as upper bound) and beta-distributed (akin to the “Estimating censored values” case in the Stan manual referenced above).
2. use integrate_1d to integrate t_O out i.e.:
P(\text{state of interest} | t) = \int_m^t f_O(t_O) (1 - F_D(t - t_O|t_O) ) \mathrm{d} t_O

Since we never have to handle a unit that’s not in the state of interest, we can end here.

Since Stan is pretty good at integration, I’d guess that the latent-variable approach would work better for small to medium-sized datasets, but the computation effort scales super-linearly in the number of data points so for large datasets you might get better results with the explicit integration (which scales +/- linearly). Explicit integration would also allow you to get explicit point-wise likelihoods if you wanted to use stuff like loo.

Does that make sense?

2 Likes

Actually, there’s even a simpler reason why calculating P(\text{state of interest} | t) works. By Bayes theorem we have:

p(t | \text{state of interest}) \propto P(\text{state of interest} | t)p(t)

We don’t care about the normalizing constant, as Stan doesn’t care about constant terms. Since t is known, p(t) is a constant and drops out as well.

2 Likes

Hi martinmodrak,

Wow. Just wow. Not sure why that conceptualization didn’t come to me before. I think algebraically your P(state of interest|t) may be the same as my P_t (just based on comparing the graphs), but your function is literally >10 times faster to run for a fine-scale discretization. I did a bit of checking (far from an algebraic proof, but convincing given the arbitrariness of the parameters): The above graph has some actual data (histogram), I used MLE to get parameter values, then scaled and overlaid your function (black, solid) and my previous P_t function (red, dotted). Spot on!

Here is some basic R code to simulate the data and overlay your (scaled) function on top of the histogram.

simulate = function(n=10000,sO1,sO2,sD1,sD2) {
os = rbeta(n,sO1,sO2)
ds = rbeta(n,sD1,sD2)*(1-os)
ts = runif(n,0,1)
ts_soi = ts[ts>=os & ts<=(os+ds)]
}

Pt = function(t,sO1,sO2,sD1,sD2) {
val = integrate(function(x) {
dbeta(x, sO1, sO2) * (1 - pbeta((t-x)/(1-x),sD1,sD2))
},0,t)$value } Pt_v = Vectorize(Pt,"t") N=1000000 n=1000 data = simulate(n=N,sO1=90.28,sO2=244.83,sD1=42.8,sD2=233.62) hist.data = hist(data,prob=TRUE) x = seq(0,1,length=n) y=Pt_v(x,sO1=90.28,sO2=244.83,sD1=42.8,sD2=233.62) y = max(hist.data$density)*y/(max(y)-min(y))
points(x,y,type='l')


I’m going to respond to your follow-up message separately.

OK, so I would certainly count your contribution as “significant” improvement in efficiency, so, like I said, I’d be happy to include you as co-author, if you’d like, or, minimally, I’d like to include you in acknowledgements. I’m not sure the best way to share personal information in this forum…

Thank you, this is a game changer!

1 Like

Thanks for the follow up message.

Could you clarify about normalization of the posterior? I understand (at least with M-H MCMC) that the normalization constants cancel in the acceptance ratio. However, here, the P_t function (or your function) defines the likelihood. For my analyses, each data point will have a separate set of onset and duration distribution shape parameters. These shape parameters are defined through a function of multiple predictor variables, and it’s the coefficients of the predictor variables that the MCMC is sampling. The relationships between the coefficients and the onset and duration shape parameters are provided in my second post in this thread. In this case, I assume that the normalization constants are needed so that the P_t of each data point given the sampled parameters is scaled appropriately, resulting in the f_X(t) function in my previous post. I can see why no normalization would be needed if the shape parameters were the same for all data points, but unfortunately, that is not the case here… My confusion might simply be because I don’t know the details of stan implementation / hierarchical Bayesian inference. Could you please clarify?

Addendum: On the other hand, perhaps no normalization is needed, after all, if some sort of asymptotic equipartition property applies and the amount of data is enough… assuming some symmetry and ergodicity conditions. Here’s my thinking:

L=\prod_1^N\frac{P_t(t_i)}{c_i}=\frac{P_t(t_1)*P_t(t_2)*P_t(t_3)*P_t(t_4)*\ldots*P_t(t_N)}{c_1*c_2*c_3*...c_N}

Paying attention to the denominator, which is the product of normalization constants c_i:

c_1*c_2*c_3*\ldots* c_N\approx c_{k_1}^{N*f(c_{k_i})}c_{k_2}^{N*f(c_{k_2})}c_{k_3}^{N*f(c_{k_3})}\ldots

The c_{k_i} represent all the possible values that the normalization constants can take, binned into a countable set, and f(c_{k_i}) is the frequency (i.e., probabilities) that a normalization constant is in bin c_{k_i}. Taking the log:

log(c_1*c_2*c_3*\ldots*c_N)\approx\\ log(c_{k_1}^{Nf(c_{k_1})}c_{k_2}^{Nf(c_{k_2})}c_{k_3}^{Nf(c_{k_3})}*\ldots)=\\ N\sum_1^\infty f(c_{k_i})log(c_{k_i})\approx \\N\int_0^\infty f(c_k)log(c_k)dc_k=\\NE[log(c_k)]

Given a large enough random sample of size N, the average of log(c_{k_i}) should converge to E[log(c_k)] due to a LLN. In other words, independent of the particular sequence of normalization constants c_i, the log of the product of the normalization constants should “converge” to a constant value when scaled by the reciprocal of N.

For this constant to cancel out in the acceptance ratio so that the MCMC can ignore it, a strong symmetry property is required, i.e., f(*) is invariant with respect to the values of the coefficients of the affine functions that define the mean onset and mean duration. I doubt that will be true in general, although something about the ergodicity of the Markov chain gives me an intuition that “on average” it might be true. I don’t think, in general, it will be true, so I feel like we are stuck needing to calculate the normalization constants, c_i.

Does this make sense? Thoughts?

Thank you again!

1 Like

You are right, I was somewhat sloppy in my reasoning above. Let’s try again, but this time more carefully. We’ll denote \theta the set of all parameters in the model, then the conditional version of Bayes’ rule gives us:

p(t|\text{state = 1}, \theta) = \frac{p(\text{state = 1} | t, \theta) p(t | \theta)}{p(\text{state = 1}|\theta)}

Since t is chosen independent of \theta and known, we have p(t|\theta) = p(t) constant.

Now what is p(\text{state = 1}|\theta)? This the probability of the individual entering our sample (this is basically the logic of inverse probability weighting in survey analysis). If t_O is known and t is chosen uniformly over the whole range, this is simply the proportion of time spent in the state of interest, i.e. \frac{E(t_D|t_O, \theta)}{M - m}, which is just scaled m_D.

EDIT: So unless I am messing up, to get p(\text{state = 1}|\theta) we can actually ignore all the scaling to the [m,M] range, as we only care about proportions, which are readily expressed by the unscaled beta variables. My good friend Wolfram then assures me that in this case, assuming a_O,b_O, mu_O = \frac{a_O}{a_O + b_O} are the parameters of the Beta distribution for t_O and \mu_D = \frac{a_D}{a_D + b_D} are the parameters of the Beta distribution for t_D, we have:

p(\text{state = 1}|a_O, b_O, \mu_D) = \frac{b_O \mu_D}{a_O + b_O} = (1 - \mu_O)\mu_D

2 Likes

Once again, thank you so much! I’ve been trying to wrap my brain around this all day… I think your solution concerning p(state=1|\theta) will really simplify things, and based on it, I’ve come up with some results that, again, will make the MCMC run much quicker since yet another integration is no longer needed to calculate the normalization constant.

If I’m reasoning correctly, then my f_X(t)=\frac{P_t}{\int_m^M P_t dt} (post 3, Jul 6) equals your p(t|state=1,\theta), in which case \int_m^M P_t dt=(\frac{p(t|\theta)}{p(state=1|\theta)})^{-1}. If the times are uniformly randomly selected, then p(t|\theta)=\frac{1}{M-m}, and p(state=1|\theta)=\frac{b_Oa_D}{(a_O+b_O)(a_D+b_D)}. So, \int_m^M P_t dt=(M-m)\frac{b_Oa_D}{(a_O+b_O)(a_D+b_D)}.

Well, it works! I did several checks in R, and the formulae match. Below is R code to simulate the times of the state of interest along with the new f_X(t) function. Unlike the previous R code block that I posted (post 6, July 8) that had an arbitrary scaling to approximate the normalization, that is no longer needed. Much better! No more double and triple integration. Just a single integration.

#generate some data by Monte Carlo simulation
#assumes onsets and durations are scaled and translated beta distributions
#assumes times are randomly picked in the time interval M-m
simulate = function(n=10000,m=0,M=365,sO1,sO2,sD1,sD2) {
os = m+(M-m)*rbeta(n,sO1,sO2)
ds = (M-os)*rbeta(n,sD1,sD2)
ts = runif(n,m,M)
ts_soi = ts[ts>=os & ts<=(os+ds)]
}

#probability of observing a time t given an individual is in the state of interest and given the beta shape parameters onset: sO1, sO2, duration: sD1, sD2
fX = function(t,m=0,M=365,sO1,sO2,sD1,sD2) {
nc = (M-m)*sO2*sD1/((sD1+sD2)*(sO1+sO2))
val = integrate(function(x) {
(dbeta((x-m)/(M-m), sO1, sO2)/(M-m)) * (1 - pbeta((t-x)/(M-x),sD1,sD2))
},0,t)\$value
return(val/nc)
}
fXv = Vectorize(fX,"t")

m=0     #earliest start time
M=365   #last end time

N=1000000       #number of randomly selected times, not all of which are associated with individuals in the state of interest
n=1000  #number of points to evaluate for the plots
nIter=100 #number of iterations to test different combinations of parameter values and match to theory

#MLE values for some empirical data
sO1 = 90.27611
sO2 = 244.8296
sD1 = 42.84
sD2 = 233.6238

#make some data and graphs
for(i in 1:100) {
data = simulate(n=N,m=m,M=M,sO1=sO1,sO2=sO2,sD1=sD1,sD2=sD2)
hist(data,prob=TRUE,xlim=c(m,M))
x = seq(m,M,length=n)
s=Sys.time()
y = fXv(x,m=m,M=M,sO1=sO1,sO2=sO2,sD1=sD1,sD2=sD2)
e=Sys.time()
print(e-s)
points(x,y,type='l',col="red",lwd=3)
legend(0,max(y), legend=c("Simulated histogram", "Theory"), col=c("gray", "red"), lty=c(1,1), lwd=c(1,3),cex=0.8)
{
print("Click enter to continue")
graphics.off()
}

sO1 = runif(1,1,1000)
sO2 = runif(1,1,1000)
sD1 = runif(1,1,1000)
sD2 = runif(1,1,1000)
}



At this point, I think the procedure is about as good as can be hoped! I’m planning to try out different approaches to the integration in f_X(t) and find the fastest approach. It may well be fastest to use integrate_1d, but since f_X(t) is such a smooth function (at least when the underlying distributions for the onset and duration are beta-distributed), there might be a quicker approximation.

Thank you again! This has been an invaluable exchange.

Just one more little thing: you seem to be doing a lot of testing of performance in R. Note that while performance in R will be moderately correlated with performance in an actual Stan model, there are a couple reasons why those may actually differ substantially, so I’d advise to move to testing in Stan models sooner rather than later.

The two main reasons for performance differences is that a) Stan is implemented with very different tradeoffs than R, so some operations have vastly different speed in those languages (e.g. for loops are substantially cheaper in Stan) b) In most cases, the computation of the value of the density is a relatively minor part of the work done in Stan - the runtime is often dominated by the automatic differentiation of that density to get gradient (which obviously doesn’t happen in R). AFAIK the ratio of the “forward pass” (computing the density) vs. the “reverse pass” (computing the gradients) is quite uneven across operations.

I’ll also repeat, that I’d definitely try to avoid explicit integration in the model altogether and use latent variables instead (as described above) as that may be the fastest approach in many cases.

When comparing runtime of models (especially between the latent variable vs. explicit integration case), bear in mind that pure runtime is not the best metric, rather you want to get something like effective sample size (ESS) per second as some formulations of the model may lead to worse sampling and thus provide less information per iteration.

1 Like

Thanks very much for the additional thoughts!

I’ve moved to testing in Stan, although a lot of visualizations are quicker and easier in R for me. The run time of Stan seems to be pretty fast (~2 hours for ESSes in the high hundreds to low thousands), but thus far I haven’t figured out a way to deal with high divergent transition rates using the explicit integration route.

So, I’d be very interested in implementing the latent variable approach. I must admit, I am not clear about how to implement this approach. Just to make sure I’m following, was the latent variable approach your first option from the 4/10 Jul 7 post?:

First, let’s imagine that we observed the unit at time t in the “state of interest” and we know the time of onset, t_O. Now we can compute the possible range for t_D and know it is left-censored at t-t_O, so the probability is simply 1-F_D(t-t_O|t_O) which translates to a single call to beta_proportion_lccdf.

But we do not know t_O. So we have two options.

1. treat t_O for each individual in the data as an explicit parameter in the model. It is right-censored (i.e. we know that t_O<t and can use it as upper bound) and beta-distributed (akin to the “Estimating censored values” case in the Stan manual reference above.

I’m having trouble matching up the description at 4.3 Censored data | Stan User’s Guide. Could I ask a few questions for clarification? So, would the approach be to randomly sample the t_O from a beta with the current parameters for the onset distribution, but to reject values greater than t, then take the log of the beta density for that sampled value and add it to the beta_proportion_lccdf of the t, like a Monte Carlo simulation of the t_O nested within the HMC? So, no integrating out t_O, but rather a rejection sampling of the t_O to simulate its values, then calculation of the probability of the observed t based on the simulated value? I don’t think this is quite what you had in mind in your latent variable approach, but I’m just not quite connecting the dots… sorry. If you could provide a bit more detail, I’d really appreciate it.

Addendum: I’ve been trying to figure out the latent variable approach. I think I’m missing something very basic. If I’m understanding, the t_O's are the latent variables. There are N input t's. For each t_i, there is an associated t_{O_i}. How is the constraint that each t_{O_i} must be lower than t_i implemented in Stan? I can only see definitions in which the x in <lower=x> is a constant scalar, but the t_O's will depend on variably-valued t's. How would this be implemented in Stan?

Aside: An approach I was considering that uses importance sampling to approximate f_X won’t work since random number generation doesn’t seem to be allowed in parameters, transformed parameters, and model blocks. Is there a way to implement importance sampling to estimate integrals of arbitrary functions in Stan?

I think you are misunderstanding, really the approach is really directly treating t_{O,i} as unknown variables. Here’s a quick sketch how that might look like for a simplified version of the model (it compiles, but I didn’t test it much beyond that, hopefully it gets you started)

functions {
real log_p_t_given_tO(real t, real tO, real min_time, real max_time, real mu_D, real kappa_D) {
return beta_proportion_lccdf( (t - tO) / (max_time - tO) | mu_D, kappa_D);
}
}

data {
real min_time;
real<lower=min_time> max_time;
int<lower=0> N;
vector<lower=min_time, upper=max_time>[N] t;
}

parameters {
vector<lower=min_time, upper=t>[N] tO;

real<lower=0, upper=1> mu_O;
real<lower=0> kappa_O;
real<lower=0, upper=1> mu_D;
real<lower=0> kappa_D;
}

model {
target += beta_proportion_lpdf((tO - min_time) / (max_time - min_time) | mu_O, kappa_O) -
N * log(max_time - min_time); //technically, the last term is a constant that can be omitted
for(n in 1:N) {
target += log_p_t_given_tO(t[n], tO[n], min_time, max_time, mu_D, kappa_D);
}
// The prior is not well chosen, just something random
kappa_O ~ exponential(1);
kappa_D ~ exponential(1);
}


In recent Stan, it should be possible to pass a vector of an appropriate length as an upper/lower constraint (as I do in the sketch above). So if you are using cmdstanr, you should be good (the version bundled with rstan is unfortunately quite old at this point…).

Not easily - remember, Stan needs both the function value AND its gradient w.r.t. the parameters - and importance sampling doesn’t AFAIK really directly provide the latter. You could in principle implement importance sampling in C++ with the Stan math library (which you can then plugin into Stan) and let Stan’s autodiff to differentiate through the whole importance sampling step, but I’d be surprised if it worked well. Stan really doesn’t like noise in the density or gradients so you’d need to do the sampling step to very high precision…

1 Like

Thank you! This is all super helpful. Good to learn that one can pass a vector as constraints.

Unfortunately, I’m still a bit stuck. I would like to find the factors that contribute to the onset time. In particular, in my full model in which t_O is integrated out, the variance of the onset distribution is constant (1 parameter, homoscedastic), but the mean of the onset distribution is defined as a function O_{mean}= \alpha + \beta \cdot x where the parameters are the intercept \alpha and \beta coefficients and x is the “design matrix” of observations associated with the predictor variables. So, I am unclear how to constrain the t_O values when they are defined in terms of a function of predictor variables. This is a bit like a simplex in which the set of values is constrained so that the weighted sum of them is less than t. But in this case, I fear the manifold may be a bit more complex than a simplex, but I really haven’t figured this out. I alluded to this issue in my first post, although my thinking has progressed a lot since that time, but the core issue persists. Is there a way to put “joint” constraints on parameters so that a function of the parameters is constrained to be less than a particular value? Or, is there a better way to implement this?

I don’t think you need to constrain the mean (and thus the regression parameters) - some true values will be above mean and some below, there is no reason to constrain. Note that in the sketch of the Stan code I posted above, there is no constrain on mu_O (beyond the natural limit to [0,1]). Since predictors act to replace the single mu_O with different values per observation, there is no need to constrain the predictors as well.

Another way to see that is that the latent variables (and their constraints) are in place to model that we don’t directly observe the onset time, we only know that it is within a certain range. If you continually shrunk the range, you would in the limit get a direct observation of the onset time and the integration over the latent variables would be replaced with just the beta density - once again, if you directly observed draws from the beta distribution, you wouldn’t need to constrain the predictors in any way…

Does that make sense?