How to specify likelihood when you don't have proper data points (but have bounds instead)

vmargato · May 10, 2019, 4:18pm

Say I have a set of data points that I think follow a lognormal distribution with parameters mu, sigma.
So far, so good. Only now, instead of actual data points (e.g., x[1] = 3, x[2] = 10), I only have lower and upper bounds for each data point (e.g., 1 < x[1] < 5, 6 < x[2] < 20). How can I specify that in Stan? The model below is almost, but not quite, correct:

data {
    int<lower=1> n;
    int<lower=0> lower_bound[n];
    int<lower=0> upper_bound[n];
}

parameters {
    real mu;
    real<lower=0> sigma;
}

model {
    for (i in 1:n) {
        target += lognormal_lcdf(upper_bound[i] | mu, sigma);
        target += lognormal_lccdf(lower_bound[i] | mu, sigma);
    }
    mu ~ normal(3, 0.5);
    sigma ~ lognormal(log(0.5), 0.4);
}

It’s not correct because:
p(x > LB & x < UB) = p(x > LB) * p(x < UB | x > LB)
which is generally (and, in this case, definitely) not equal to p(x > LB) * p(x < UB), which is what my model implies. If the “original” distribution is a lognormal, I think p(x < UB | x > LB) is basically a truncated (and of course renormalized) lognormal, right? It’d be truncated at LB. But I don’t know how to specify that, and a search for “truncated” in the documentation returns results about truncated data (i.e., data that is reported only if it falls within fixed bounds), which is not exactly what I have here.
Can anyone help? Thanks a lot!

jjramsey · May 10, 2019, 5:48pm

It’s hard to say without knowing more details. Some questions:

Questions about your data:
- Why do you only have upper and lower bounds?
- Are these results from a measurement (and if so, what are you measuring)?
- Are these bounds somehow centered around some quantity?

Questions about your model:
- Why do you think your data follows a lognormal distribution?
- Why do you think the mean trend of your data is a constant rather than a function of some other variables?

vmargato · May 13, 2019, 1:08pm

@jjramsey, thanks for your reply! I don’t think the answers to your questions are directly relevant to the question at hand. I thought more about this and finally got to the right answer:

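model {
    for (i in 1:n) {
        // Log-probability that x[i] falls in (lower_bound[i], upper_bound[i]).
        // weight[i] is not declared in the data block shown earlier and must be added there.
        target += weight[i] * log(lognormal_cdf(upper_bound[i], mu, sigma) - lognormal_cdf(lower_bound[i], mu, sigma));
    }
    mu ~ normal(3, 0.5);
    sigma ~ lognormal(log(0.5), 0.4);
}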
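
A short derivation of why the difference of CDFs is the right likelihood term. The symbols F (the lognormal CDF with parameters mu and sigma) and w_i (standing for weight[i]) are notation introduced here for illustration:

```latex
\begin{aligned}
P(LB < x < UB) &= P(x > LB)\, P(x < UB \mid x > LB) \\
               &= \bigl(1 - F(LB)\bigr)\,\frac{F(UB) - F(LB)}{1 - F(LB)} \\
               &= F(UB) - F(LB), \\[4pt]
\log L(\mu, \sigma) &= \sum_{i=1}^{n} w_i \,\log\bigl(F(UB_i) - F(LB_i)\bigr).
\end{aligned}
```

The truncated-lognormal factor is still there, but its normalizing constant 1 - F(LB) cancels against p(x > LB), so no explicit truncation is needed.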

By the way, my data is similar (in fact, structurally identical) to binned data, and I found out that someone had already come up with a solution essentially identical to mine: https://www.reddit.com/r/rstats/comments/b05su0/estimating_continuous_distribution_params_from/

potash · May 13, 2019, 9:30pm

FYI, this is more commonly called (interval) censoring and is covered in section 4.3 of the Stan User’s Guide. You may also be interested to know that brms can fit regression models with censored outcomes very easily, without writing your own Stan code, using the cens() function.
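
For anyone who wants to stay in plain Stan, here is a minimal sketch of the same interval-censored likelihood computed on the log scale. It is not taken from the posts above: it swaps log(cdf - cdf) for log_diff_exp applied to two lognormal_lcdf terms, which should behave better when both CDF values are tiny. The data declarations are assumptions (the bounds are declared as vectors, weight is declared as data because the model above uses it without showing a declaration), and the sketch assumes 0 < lower_bound[i] < upper_bound[i] for every i.

```stan
data {
  int<lower=1> n;
  vector<lower=0>[n] lower_bound;  // assumed strictly positive
  vector<lower=0>[n] upper_bound;  // assumed greater than lower_bound
  vector<lower=0>[n] weight;       // assumed per-interval weight (e.g. a bin count)
}

parameters {
  real mu;
  real<lower=0> sigma;
}

model {
  // Priors copied from the model above.
  mu ~ normal(3, 0.5);
  sigma ~ lognormal(log(0.5), 0.4);

  for (i in 1:n) {
    // log(F(upper) - F(lower)), computed from the two log-CDFs.
    target += weight[i]
              * log_diff_exp(lognormal_lcdf(upper_bound[i] | mu, sigma),
                             lognormal_lcdf(lower_bound[i] | mu, sigma));
  }
}
```

If every interval is observed exactly once, setting all weights to 1 recovers the unweighted likelihood.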
