I have a bit of a conceptual question that I'm thinking through before diving into the loo package. Briefly: how should I specify the pointwise log-likelihoods in the generated quantities block (to be analysed post hoc with the loo package in R) when my model uses data augmentation for censored observations? More details below.
I'm working with a dataset in which a subset of observations is left-censored. More specifically, I (assume that I) know an observation lies below some threshold concentration (the detection limit of a lab assay), but I don't know where it lies between zero and that threshold. My model does, however, predict a specific concentration. For computational efficiency, I model left-censored observations with data augmentation: for each censored observation, I declare a parameter y_log_true in the parameters block that is bounded above by log(threshold). Then, in the model block, I include the statement y_log_true ~ normal(y_log_hat, sigma);, where y_log_hat is the model-predicted log-concentration for that data point and sigma is the measurement error of the lab assay. This works like a charm, and I find it more computationally efficient than integrating out the observation with a CDF.
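For concreteness, here is a minimal sketch of that data-augmentation setup. All names not mentioned above (x, alpha, beta, L_log) are illustrative placeholders, and the linear predictor stands in for whatever the actual model computes:

```stan
// Minimal sketch of the data-augmentation approach; placeholder names,
// not the actual model's identifiers.
data {
  int<lower=0> N_cens;       // number of left-censored observations
  vector[N_cens] x;          // covariate for the censored observations
  real L_log;                // log of the detection limit (threshold)
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;       // measurement error of the lab assay
  // one latent log-concentration per censored observation,
  // constrained to lie below the detection limit
  vector<upper=L_log>[N_cens] y_log_true;
}
transformed parameters {
  // model-predicted log-concentrations (simple linear predictor here)
  vector[N_cens] y_log_hat = alpha + beta * x;
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ normal(0, 1);
  // the augmented data are sampled like (constrained) parameters
  y_log_true ~ normal(y_log_hat, sigma);
}
```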
Now, for calculating the pointwise log-likelihoods for the censored data points (to be used with loo later on), should I calculate the log-likelihood as either:

1. the log-probability of the observation being censored, given expectation y_log_hat and measurement error sigma (using the CDF, as one would in the model block when integrating out censored observations), or
2. the log-probability density corresponding to the sampling statement in the model block, y_log_true ~ normal(y_log_hat, sigma); (including normalising constants, of course)?
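To make the two candidates concrete, here is how they would look in generated quantities. L_log stands for log(threshold), and log_lik_cdf / log_lik_aug are placeholder names for the first and second option respectively:

```stan
generated quantities {
  vector[N_cens] log_lik_cdf;
  vector[N_cens] log_lik_aug;
  for (i in 1:N_cens) {
    // first option: Pr(observation below the detection limit),
    // via the normal CDF, as when integrating out censoring
    log_lik_cdf[i] = normal_lcdf(L_log | y_log_hat[i], sigma);
    // second option: density of the augmented latent value,
    // matching the sampling statement in the model block
    log_lik_aug[i] = normal_lpdf(y_log_true[i] | y_log_hat[i], sigma);
  }
}
```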
The model particulars are described in another post here, although I don’t think they are particularly relevant to the question at hand.
The only thing I can think of is that, for some censored data points, the log-likelihood will be zero or very close to zero (i.e. a probability of essentially 1.0) for most if not all posterior draws. I'm not sure whether that will be a problem for the loo procedure, but I guess I'll find out.
Also, thanks for inviting me to look into the efficiency matter more thoroughly. I ran my model with and without data augmentation on the same data set (4 chains, 500 warm-up iterations, 500 sampling iterations). As it turns out, the computational efficiency in terms of n_eff per second was very similar between the two approaches. To calculate the efficiency, I averaged the n_eff over all 120 parameters in my model and divided that either by the sum of the chain times or by the net time Stan needed to complete the job:
| | Data augmentation of censored observations | Integrating out censored observations with normal CDF |
|---|---|---|
| Average n_eff | 1588 | 1761 |
| Chain 1 time (s) | 123.00 | 142.00 |
| Chain 2 time (s) | 130.00 | 152.00 |
| Chain 3 time (s) | 136.00 | 154.00 |
| Chain 4 time (s) | 177.00 | 168.00 |
| Net time (parallel chains + processing) (s) | 188.00 | 178.00 |
| Avg. efficiency (n_eff/s), based on sum of chain times | 2.81 | 2.86 |
| Avg. efficiency (n_eff/s), based on net Stan time (parallel chains) | 8.45 | 9.90 |
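For reference, the efficiency figures are just the average n_eff divided by the respective timings; a quick sketch of the computation (values copied from the table above):

```python
# Reproduce the efficiency figures from the table above.
def efficiency(avg_neff, chain_times, net_time):
    """Return n_eff/s based on (sum of chain times, net wall-clock time)."""
    return avg_neff / sum(chain_times), avg_neff / net_time

# values copied from the table
aug = efficiency(1588, [123, 130, 136, 177], 188)   # data augmentation
cdf = efficiency(1761, [142, 152, 154, 168], 178)   # integrating out

print(f"augmentation: {aug[0]:.2f} n_eff/s (summed), {aug[1]:.2f} n_eff/s (net)")
print(f"integrating:  {cdf[0]:.2f} n_eff/s (summed), {cdf[1]:.2f} n_eff/s (net)")
```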
Interesting note: the approach based on integrating out censored observations was more prone to negative infinity log-prob at initialisation. I think this may have been part of my original motivation to use data augmentation (can’t remember, honestly).
Thanks for the report. The efficiency of Stan's dynamic HMC is sometimes surprising (this is not the only such case). Can you also check the number of log density evaluations in each case? That is, is it possible that normal_cdf has so much higher a computational cost that the reduced number of log density evaluations doesn't help?
I re-ran both analyses (data augmentation and integrating out with the CDF), but now with 8 chains each, as I had the impression that there was quite a lot of variation in computation time between chains. It turns out that my previous run of four chains each was a bit of a lucky hit: integrating out seems to be more efficient after all.
| | Data augmentation of censored observations | Integrating out censored observations with normal CDF |
|---|---|---|
| Average n_eff | 3138 | 3414 |
| Chain 1 time (s) | 182.00 | 153.00 |
| Chain 2 time (s) | 187.00 | 164.00 |
| Chain 3 time (s) | 191.00 | 167.00 |
| Chain 4 time (s) | 200.00 | 173.00 |
| Chain 5 time (s) | 209.00 | 175.00 |
| Chain 6 time (s) | 218.00 | 178.00 |
| Chain 7 time (s) | 239.00 | 188.00 |
| Chain 8 time (s) | 267.00 | 205.00 |
| Net time (parallel chains + processing) (s) | 278.00 | 217.00 |
| Avg. efficiency (n_eff/s), based on sum of chain times | 1.85 | 2.43 |
| Avg. efficiency (n_eff/s), based on net Stan time (parallel chains) | 11.29 | 15.73 |

[Histogram: frequency distribution of the number of leapfrog steps per iteration for both approaches]
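In case anyone wants to reproduce this check: assuming the model was fit with rstan and the stanfit object is called `fit` (a hypothetical name), the per-iteration leapfrog counts can be pulled from the sampler parameters:

```r
# `fit` is a hypothetical stanfit object name.
sp <- get_sampler_params(fit, inc_warmup = FALSE)
# total number of leapfrog steps per chain
sapply(sp, function(x) sum(x[, "n_leapfrog__"]))
# frequency distribution of leapfrog steps per iteration, pooled over chains
table(unlist(lapply(sp, function(x) x[, "n_leapfrog__"])))
```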
Ah, so the difference you saw at first might have been a difference in warmup time? Anyway, the conclusion now is that integrating out is better, as we would expect a priori. Integrating out makes the posterior easier to sample, which leads to slightly fewer leapfrog steps per iteration. It is also possible that reduced memory usage improves the running time.