Probability of latent TB infection from observed active TB infection?



My general problem is as follows:

  1. We used to vaccinate for TB from (say) 2000-2010, then stopped vaccinating
  2. We want to see if this was a bad decision (e.g. has TB infection risen since then?)
  3. A person is first infected with latent TB (LTBI) [unobservable in my data], which is then dormant for 1-40 years, and then activates into active TB infection [observable endpoint]

So I’ve been trying to gradually build up my models to answer this problem.

My first model:
No vaccination effect here. We get a sample of people in 2017, and can test to see if they have ever had LTBI or not. We then calculate the annual risk of LTBI. [This works]

My second model:
Vaccination stops after 10 years. We get a sample of people in 2017, and can test to see if they have ever had LTBI or not. We then calculate the annual risk of LTBI. [This works].

My third model:
No vaccination effect. People get directly infected with ATBI (no LTBI middle step), and we can directly count them in the year that they get infected. This requires survival analysis to estimate the annual risk of ATBI. [This works].


My fourth model:
No vaccination effect. People get infected with LTBI, and then later on can get infected with ATBI (conditional upon having LTBI).

I assumed that time-to-LTBI follows a geometric distribution, and that time-to-ATBI (after acquiring LTBI) follows another geometric distribution. I then modelled the observed time to ATBI as "time-to-LTBI + time-to-ATBI", calculated the PMF and the survival function of that sum, and put them into Stan. The problem is that the estimated annual risks of LTBI and ATBI are not correct (they don't match my simulated data) and I can't figure out why. Any help with understanding why they don't match would be greatly appreciated!
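For concreteness, here is a minimal Python sketch of the two quantities involved, assuming each waiting time is geometric on {1, 2, ...} with P(T = k) = p(1 - p)^(k-1) (the parameter values are the simulated annual risks mentioned below):

```python
# Sketch: observed time to ATBI is T1 + T2, a sum of two independent
# geometric waiting times with P(T = k) = p * (1 - p)**(k - 1).
def geom_pmf(k, p):
    return p * (1.0 - p) ** (k - 1)

def sum_pmf(n, p1, p2):
    # P(T1 + T2 = n) by discrete convolution; the smallest possible value is n = 2
    return sum(geom_pmf(k, p1) * geom_pmf(n - k, p2) for k in range(1, n))

def sum_survival(n, p1, p2):
    # P(T1 + T2 > n), used for people still ATBI-free after n years at risk
    return 1.0 - sum(sum_pmf(m, p1, p2) for m in range(2, n + 1))

# Annual risks used in the simulation
p_ltbi, p_atbi = 0.05, 0.025
```

The Stan model then uses these two pieces: the PMF for observed ATBI cases and the survival function for people who never develop ATBI during follow-up.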

A more detailed explanation, with all of the formulas and statistics is available here:

The code, data, and results are available here:

(“tb_vaccination.Rmd” is the important file)

Thanks for your time!

What do you observe for the fourth model?

ltbi: 0.035 (should be 0.05)
atbi: 0.037 (should be 0.025)

I just tried it with a different set of parameters and got:

ltbi: 0.074 (should be 0.10)
atbi: 0.073 (should be 0.05)

I meant: what goes in as data to the fourth model? Depending on what the model is, you might just expect this kind of bias.

On second thought I’m pretty sure that’s what’s going on since the product of your parameters comes out to be about the same (for estimated and expected). Have you looked at plotting ltbi vs. atbi in pairs plots? You’ll probably see multiplicative non-identifiability.

This is my data:

    timeNoVax timeWithVax atbi    R
 1:        17           0    0 8726
 2:        16           0    0 8803
 3:        15           0    0 8923
 4:        14           0    0 9027
 5:        13           0    0 9146
 6:        12           0    0 9292
 7:        11           0    0 9357
 8:        10           0    0 9433
 9:         9           0    0 9580
10:         8           0    0 9644
11:         7           0    0 9683
12:         6           0    0 9757
13:         5           0    0 9833
14:         4           0    0 9897
15:         3           0    0 9923
16:         2           0    0 9961
17:         1           0    0 9987
18:         1           0    1  193
19:         2           0    1  398
20:         3           0    1  502
21:         4           0    1  662
22:         5           0    1  685
23:         6           0    1  741
24:         7           0    1  715
25:         8           0    1  788
26:         9           0    1  723
27:        10           0    1  722
28:        11           0    1  652
29:        12           0    1  621
30:        13           0    1  509
31:        14           0    1  460
32:        15           0    1  334
33:        16           0    1  210
34:        17           0    1  113
    timeNoVax timeWithVax atbi    R

So for people who did not get ATBI (atbi = 0) I have their time at risk (timeNoVax), and for people who did get ATBI (atbi = 1) I have the time at which they developed ATBI. R is the number of people in each row.
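To make the mapping from rows to likelihood explicit, here's a self-contained sketch (assuming each waiting time is geometric, as in my model) of how one row contributes to the log-likelihood:

```python
import math

# Both waiting times assumed geometric on {1, 2, ...}
def geom_pmf(k, p):
    return p * (1.0 - p) ** (k - 1)

def mix_pmf(n, p1, p2):
    # P(time-to-LTBI + time-to-ATBI = n), by discrete convolution
    return sum(geom_pmf(k, p1) * geom_pmf(n - k, p2) for k in range(1, n))

def mix_surv(n, p1, p2):
    # P(time-to-LTBI + time-to-ATBI > n)
    return 1.0 - sum(mix_pmf(m, p1, p2) for m in range(2, n + 1))

def row_loglik(time, atbi, R, p_ltbi, p_atbi):
    # Event rows (atbi = 1) use the PMF at the observed time;
    # censored rows (atbi = 0) use the survival function at the time at risk.
    term = mix_pmf(time, p_ltbi, p_atbi) if atbi == 1 else mix_surv(time, p_ltbi, p_atbi)
    return R * math.log(term)

# e.g. the first row above (timeNoVax = 17, atbi = 0, R = 8726):
ll = row_loglik(17, 0, 8726, 0.05, 0.025)
```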

Ok you just made me think of a good solution!

I actually don’t care about the probability of LTBI progressing to ATBI. I already know this from natural history models.

All I care about is estimating the risk of LTBI. So I can then hard-code in the ATBI part, to have an annual risk of 0.10 (from the literature), and then it estimates properly.

Looks like you were right about the identifiability issue.

Thank you! :-)

Great, glad it helped. If you want to go more Bayesian than just hardcoding the ATBI part, you could place a really strong prior on it (or use literature values as though it were a meta-analysis). Either way sounds good, but if you have a literature to draw from it's a perfect application.

Yep, a prior like normal(0.1,0.05) or something fit to data in the literature would be the way to go for the full Bayesian treatment of that conversion rate.

Thanks for the prior suggestion. I just tried it, and it gave me:

ltbi: 0.059 (should be 0.05)
atbi: 0.0894 (should be 0.10)

So the two estimates still seem to get dragged towards each other. I think for my purposes it is safer to hard-code it, as I then get better estimates of the LTBI annual risk, which is the key focus here.

(When I hardcode ATBI, I get LTBI risk = 0.05 as expected)

Thanks again!

When you hardcode the value, you will get “better estimates of the LTBI annual risk” only if the hard-coded value is actually exactly correct. If it’s not exactly correct you get biased values for LTBI that reflect this assumption about ATBI.

Choosing a range for ATBI that is consistent with the literature and broad enough to cover all values you are willing to consider will give you a wider spread of LTBI risk values which accurately reflects the uncertainty.

If you run your model with ATBI as a parameter with some tight prior, and you want to make statements about the case where ATBI=0.1 you can do this by simply taking a sub-sample of your Stan output where ATBI is in the range say 0.09 to 0.11 and describing the LTBI for this sub-sample. Taking a largish number of iterations will help you make this information more accurate.

So, in some sense, the prior over ATBI includes ATBI=0.1 as a sub-model, and you can get that information out of it if you simply ask Stan for a large enough sample, while at the same time you can compare this value to the range of values found when ATBI ranges over the full prior. To make this work you might well want say iter=10000 or so, so that there are several hundred samples in your sub-range.
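A minimal sketch of that subsetting idea (Python; `ltbi_draws` and `atbi_draws` stand in for vectors extracted from the Stan fit, and the fake draws below exist only to exercise the function — they are not real model output):

```python
import numpy as np

def ltbi_given_atbi_window(ltbi_draws, atbi_draws, lo=0.09, hi=0.11):
    # Keep only draws whose ATBI value falls in the window of interest,
    # then summarise the LTBI risk conditional on that sub-model.
    mask = (atbi_draws >= lo) & (atbi_draws <= hi)
    sub = ltbi_draws[mask]
    return sub.mean(), sub.size

# Fake draws mimicking the "only the product is identified" geometry
rng = np.random.default_rng(0)
atbi_draws = rng.normal(0.10, 0.02, size=40_000)
ltbi_draws = 0.005 / np.clip(atbi_draws, 0.01, None)
mean_ltbi, n_kept = ltbi_given_atbi_window(ltbi_draws, atbi_draws)
```

With iter set high enough that `n_kept` runs into the hundreds or thousands, this conditional summary behaves like a run with ATBI hard-coded inside the window.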

Hmm, I agree with you in principle, except that in my simulations it seems to drag ATBI towards LTBI (and vice versa), and I can see that it doesn't accurately reflect the uncertainty in this simulated data. So if I can't get it to be correct when I know the truth, it seems like a bad idea to run it on real data.

Yes, but as I'm seeing in my simulations, when I don't specify ATBI I get biased results anyway.

But you do bring up a good point, and I think it’s probably best if I run multiple models, with ATBI fixed at 0.08, 0.09, 0.1, 0.11, 0.12, and present all the results for each ATBI value. I think that is better than running one large simulation with ATBI allowed to vary and then taking subsamples. Because I know that a subsample with ATBI restricted from 0.09 to 0.11 will actually have an ATBI mean of 0.095 (not 0.10) as it is biased towards the LTBI value.

This is an area where it is hard to do things right, so your approach of hardcoding a range of values is appropriate. In my experience this problem is common in models with weak identifiability, and AFAIK there's no general solution.

Take your simulation with the range for ATBI specified via a tight prior, run it, and look at the LTBI estimates where ATBI falls in a very tight range around the correct value for your simulation, say 0.098 to 0.102. If you get the correct LTBI value when you hard-code ATBI, you'll also get the correct LTBI when you subset your draws to those with ATBI very close to the true value. The output of the Stan program with a prior is in essence a weighted combination of the outputs of Stan programs with individual hard-coded values, with the weights coming from the prior and determining how many draws to take from each run.

Of course you’ll need more draws to get more in your tight ranges, but it will take less time and make more sense to do one big run than run Stan in a loop. Also you may choose to use a uniform prior in a small range uniform(0.08,0.12) or something like that.

Your approach of hard coding several values and re-running does work ok in this type of case, but a Bayesian model with a real prior is automatically the kind of sensitivity analysis you’re talking about doing with hard coded values.

The other question of course is whether there is a more appropriate model which doesn’t have this identifiability problem… but I assume this requires significant substantive research.

In any case, your approach of doing several hard-coded runs isn't wrong per se; it just may be more trouble than it's worth compared to setting iter=30000 or so and walking off to get some coffee, so that one big run has enough points densely in each tiny range to pull out the results you want.
