How to specify likelihood when you don't have proper data points (but have bounds instead)

Say I have a set of data points that I think follow a lognormal distribution with parameters mu and sigma.
So far, so good. But instead of actual data points (e.g., x[1] = 3, x[2] = 10), I only have lower and upper bounds for each data point (e.g., 1 < x[1] < 5, 6 < x[2] < 20). How can I specify that in Stan? The model below is almost, but not quite, correct:

data {
    int <lower = 1> n;
    int <lower = 0> lower_bound [n];
    int <lower = 0> upper_bound [n];
}

parameters {
    real mu;
    real <lower = 0> sigma;
}

model {
    for (i in 1: n) {
        target += lognormal_lcdf (upper_bound [i] | mu, sigma);
        target += lognormal_lccdf (lower_bound [i] | mu, sigma);
    }
    mu ~ normal (3, 0.5);
    sigma ~ lognormal (log (0.5), 0.4);
}

It’s not correct because:
p(x > LB & x < UB) = p(x > LB) * p(x < UB | x > LB)
which is generally (and, in this case, definitely) not equal to p(x > LB) * p(x < UB), which is what my model implies. If the “original” distribution is a lognormal, I think p(x < UB | x > LB) is basically the CDF of a truncated (and of course renormalized) lognormal, right? It’d be truncated at LB. But I don’t know how to specify that, and a search for “truncated” in the documentation only turns up results on truncated data (i.e., data that is reported only if it falls within fixed bounds), which is not exactly what I have here.
Can anyone help? Thanks a lot!

It’s hard to say without knowing more details. Some questions:

  • Questions about your data:
    • Why do you only have upper and lower bounds?
    • Are these results from a measurement (and if so, what are you measuring)?
    • Are these bounds somehow centered around some quantity?
  • Questions about your model:
    • Why do you think your data follows a lognormal distribution?
    • Why do you think the mean trend of your data is a constant rather than a function of some other variables?

@jjramsey, thanks for your reply! I don’t think the answers to your questions are directly relevant to the question at hand. I thought more about this and finally got to the right answer:

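model {
    for (i in 1:n) {
        // Probability mass of the interval: F(upper_bound) - F(lower_bound).
        // weight[i] must also be declared in the data block.
        target += weight[i] * log(lognormal_cdf(upper_bound[i] | mu, sigma)
                                  - lognormal_cdf(lower_bound[i] | mu, sigma));
    }
    mu ~ normal(3, 0.5);
    sigma ~ lognormal(log(0.5), 0.4);
}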
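For reference, here is the likelihood this model block adds, written out and checked against the conditional decomposition from my first post. The notation is introduced here only for illustration: F is the lognormal CDF with parameters mu and sigma, Phi is the standard normal CDF, and w_i is the weight of interval i.

```latex
\begin{align*}
F(x) &= \Phi\!\left(\frac{\ln x - \mu}{\sigma}\right), \\
P(LB_i < x_i < UB_i)
  &= P(x_i > LB_i)\, P(x_i < UB_i \mid x_i > LB_i) \\
  &= \bigl(1 - F(LB_i)\bigr)\,\frac{F(UB_i) - F(LB_i)}{1 - F(LB_i)} \\
  &= F(UB_i) - F(LB_i), \\
\log L(\mu, \sigma)
  &= \sum_{i=1}^{n} w_i \,\log\bigl(F(UB_i) - F(LB_i)\bigr).
\end{align*}
```

So the truncated-lognormal term and the p(x > LB) term combine into a simple difference of CDFs, which is why no explicit truncation is needed in the model.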

By the way, my data is structurally identical to binned data, and I found that someone had already come up with essentially the same solution: https://www.reddit.com/r/rstats/comments/b05su0/estimating_continuous_distribution_params_from/

FYI this is more commonly called (interval) censoring and is covered in the Stan user’s guide section 4.3. Also you may be interested to know that you can fit regression models with censored outcomes very easily in brms without writing your own Stan code using the cens() function.
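In case it helps, below is a minimal sketch of the same interval-censored likelihood computed on the log scale with `log_diff_exp`, which may behave better numerically when the two CDF values are very close. Several things here are assumptions rather than anything stated in the thread: the `weight` declaration (a per-interval weight, guessed from the model above), the switch from `int` to `real` bounds, and the `array[n]` declaration syntax, which needs a reasonably recent Stan (older versions use the declaration style shown in the original post).

```stan
data {
  int<lower=1> n;
  array[n] real<lower=0> lower_bound;
  array[n] real<lower=0> upper_bound;  // each must exceed its lower_bound
  vector<lower=0>[n] weight;           // assumed: one weight per interval
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (i in 1:n) {
    // log(F(UB) - F(LB)), built from the two log-CDFs
    target += weight[i] * log_diff_exp(
      lognormal_lcdf(upper_bound[i] | mu, sigma),
      lognormal_lcdf(lower_bound[i] | mu, sigma));
  }
  // Priors copied from the model earlier in the thread
  mu ~ normal(3, 0.5);
  sigma ~ lognormal(log(0.5), 0.4);
}
```

This should give the same posterior as the `log(cdf - cdf)` version; the only intended difference is that the subtraction happens on the log scale.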
