Modelling dependent variable with some kind of rounding

Hey, I’m trying to model a relationship where the dependent variable (income) is discretized in the data.

My model:
y_{latent} = \alpha + \beta x
where y_{latent} is unobserved. Instead, y is observed and takes an integer value determined by the value of y_{latent}.

My problem:
The code is not compiling. It seems I’m missing something from the model, but I’m not sure what.

My code is here: https://gist.github.com/credpath/3e1a7125ce9a643a763a6f8a4aef758e


Check out the “Truncated or Censored Data” section of the manual (page 186 in the 2.17.0 manual). That’s what you’re looking for :P.

There’s also some threads around here where various people do stuff like this: https://discourse.mc-stan.org/t/gaussian-mixture-model-with-interval-censored-data-where-do-the-divergences-come-from

The key words you’re looking for are truncation/censored if you wanna search around. Hope that helps!

Thanks bbbales2!

I don’t have truncated data, I’m just trying to reverse engineer my categories/integers into real values in a probabilistic framework (to avoid imputing averages), then use those data to do a simple linear regression in a Bayesian way.

latent_y = alpha + beta*x; 

You cannot assign to parameters. Parameters are controlled by the inference mechanism.

latent_y ~ normal(mu, sigma); // income is log normal

This is not going to make latent_y log normal.
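(To illustrate the assignment point: a quantity computed from parameters can be declared and assigned in transformed parameters or as a local variable, just not in the parameters block. A minimal sketch with illustrative names:)

data {
  int<lower=1> N;
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
transformed parameters {
  // fine: a derived quantity, declared and assigned outside the parameters block
  vector[N] latent_mu = alpha + beta * x;
}
model {
  // ... likelihood terms would go here
}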

Hmm, I’m not familiar enough with what’s going on to parse that properly. But can you write out how the parameters generate the data? If you can do that, then you’ll have your model and we can see if it’s easy to code up in Stan.

For instance, for a super simple linear regression, you have a slope (a), an intercept (b), and a noise standard deviation (sigma).

You assume your measurements are normally distributed with standard deviation sigma about the line a * x + b, where x is your measured covariate.
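(For concreteness, a minimal Stan sketch of that plain regression; names are just illustrative:)

data {
  int<lower=1> N;
  vector[N] x;    // measured covariate
  vector[N] y;    // measurements
}
parameters {
  real a;               // slope
  real b;               // intercept
  real<lower=0> sigma;  // noise standard deviation
}
model {
  y ~ normal(a * x + b, sigma);  // measurements scatter around the line
}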

The usual truncation model is that the data was still normally distributed, but you were only able to measure up to some precision. You’re saying something else is happening at this point in your model? What happens after the linear regression that gives you integers?

Model: yLatent_i = \alpha + \beta x_i, where yLatent_i is unobserved; and
y_i = 1 if yLatent_i < \log(5000),
y_i = 2 if \log(5000) \leq yLatent_i < \log(7500), …,
y_i = 14 if \log(75000) \leq yLatent_i < \log(100000),
y_i = 15 if \log(100000) \leq yLatent_i, and y_i is observed \forall i.

In words:
So, I have income categories y_i \in \{1, \dotsc, 15\}, and I observe y_i \forall i.
For each value y_i, I know exactly which income interval it corresponds to. For example, if y_i = 1 then individual i has income less than 5000 USD.
Let yLatent_i be the logarithm of the income of individual i; I do not observe yLatent_i.
(We do know that log of income is normally distributed, hence yLatent \sim normal.)
For each i, I know the precise interval where yLatent_i is located. However, I don’t want to merely use averages in the regression, hence trying to set it up in this Bayesian framework.
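Putting those pieces together: with yLatent_i \sim normal(\alpha + \beta x_i, \sigma), \Phi the standard normal CDF, and cutpoints c_0 = -\infty < c_1 = \log(5000) < c_2 = \log(7500) < \dotsb < c_{14} = \log(100000) < c_{15} = \infty, the likelihood I’m after should be

P(y_i = k) = \Phi\left(\frac{c_k - (\alpha + \beta x_i)}{\sigma}\right) - \Phi\left(\frac{c_{k-1} - (\alpha + \beta x_i)}{\sigma}\right).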

I think I’m trying to do something simpler than truncation, although the solution evades me.

From what I’m getting, this is exactly the truncation use case.

You have, for instance, that

\log p(y = 1 | \mu, \sigma) = \text{log_normal_lcdf}(5000 | \mu, \sigma) etc…

So in your model section you’ll write:

for(n in 1:N) {
  if(y[n] == 1) {
    // log P(income < 5000); the lcdf already returns the log probability
    target += log_normal_lcdf(5000 | mu, sigma);
  } else if(y[n] == 2) {
    // log P(5000 <= income < 7500), computed stably on the log scale
    target += log_diff_exp(log_normal_lcdf(7500 | mu, sigma), log_normal_lcdf(5000 | mu, sigma));
  } else { // etc...
  }
}

Exactly like the uncensor_me.stan code you wrote, except you’re incrementing the log density of your model with this. That make sense at all?
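(A minimal sketch of how the whole thing could fit together, assuming the cutpoints 5000, 7500, …, 100000 are passed in as data; names are placeholders and priors are omitted:)

data {
  int<lower=1> N;
  vector[N] x;
  int<lower=1, upper=15> y[N];  // observed income category
  vector[14] cuts;              // upper bounds in dollars: 5000, 7500, ..., 100000
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    real mu_n = alpha + beta * x[n];  // mean of log income for individual n
    if (y[n] == 1) {
      target += log_normal_lcdf(cuts[1] | mu_n, sigma);    // income < 5000
    } else if (y[n] == 15) {
      target += log_normal_lccdf(cuts[14] | mu_n, sigma);  // income >= 100000
    } else {
      target += log_diff_exp(log_normal_lcdf(cuts[y[n]] | mu_n, sigma),
                             log_normal_lcdf(cuts[y[n] - 1] | mu_n, sigma));
    }
  }
}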

Thanks! I think I’ve got the correct code now: https://gist.github.com/credpath/6d4ce21bac3cf6d1b1dab267eed2120f

At least, now it compiles, but it seems like I have a problem with initialization, I keep getting:
“Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can’t start sampling from this initial value.”

Do I need to add the log Jacobian to the target here or something?

Using cdfs puts you well into the realm of finicky numerics.

Using large numbers in CDFs doubly so.

You should probably do some print debugging just to figure out exactly what’s blowing up.

For instance, just add the line:

print("mu = ", mu, ", sigma = ", sigma);

To your for loop, take a couple of the mus and sigmas from where you’re getting that error, and try to evaluate:

log_diff_exp(log_normal_lcdf(7500 | mu, sigma), log_normal_lcdf(5000 | mu, sigma));

in R (of course convert it to R first) and see what’s happening.

Things to try:

  1. Rescale your data so it’s around zero with standard deviation 1.
  2. Try Phi_approx (you’ll still have to normalize your stuff to work with it). It’s an approximation, but it’s probably good enough and it’s easier on the numerics; see the sketch below for one way that could look.
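(For instance, the middle categories could look roughly like this, reusing the data and parameters from the sketch above but with cuts now on the log-income scale, e.g. cuts[1] = log(5000); purely illustrative:)

model {
  for (n in 1:N) {
    real mu_n = alpha + beta * x[n];
    if (y[n] == 2) {
      real p_hi = Phi_approx((cuts[2] - mu_n) / sigma);  // approx. P(yLatent < log(7500))
      real p_lo = Phi_approx((cuts[1] - mu_n) / sigma);  // approx. P(yLatent < log(5000))
      target += log(p_hi - p_lo);
    }
    // ... other categories analogously; the end categories use Phi_approx and 1 - Phi_approx
  }
}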