Hey, I’m trying to model a relationship where the dependent variable (income) is discretized in the data.
My model: y_{latent} = \alpha + \beta*x
where y_{latent} is unobserved. Instead, y is observed and takes a known integer value determined by which interval y_{latent} falls in.
My problem:
The code is not compiling. It seems I’m missing something from the model, but I’m not sure what.
I don’t have truncated data; I’m just trying to reverse-engineer my categories/integers into real values in a probabilistic framework (to avoid imputing interval averages), and then use those data to do a simple linear regression in a Bayesian way.
You cannot assign to parameters. Parameters are controlled by the inference mechanism.
latent_y ~ normal(mu, sigma); // income is log normal
This is not going to make latent_y log normal.
Hmm, I’m not familiar enough with what’s going on to parse that properly. But can you write out how the parameters generate the data? If you can do that, then you’ll have your model and we can see if it’s easy to code up in Stan.
For instance, for a super simple linear regression, you have a slope (a) and intercept (b) and a noise (sigma).
You assume your measurements are normally distributed with standard deviation sigma about the line a * x + b, where x is your measured covariate.
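For reference, that simple linear regression, written out as a Stan program with the slope a, intercept b, and noise sigma described above (a minimal sketch; the data block names are my own):

```stan
data {
  int<lower=0> N;
  vector[N] x;   // measured covariate
  vector[N] y;   // measured outcome
}
parameters {
  real a;                // slope
  real b;                // intercept
  real<lower=0> sigma;   // noise standard deviation
}
model {
  // measurements normally distributed about the line a * x + b
  y ~ normal(a * x + b, sigma);
}
```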
The usual truncation model is that the data was still normally distributed, but you were only able to measure up to some precision. You’re saying something else is happening at this point in your model? What happens after the linear regression that gives you integers?
Model: yLatent_i = \alpha + \beta x_i, where yLatent_i is unobserved, and

y_i = 1 if yLatent_i < log(5000),
y_i = 2 if log(5000) \leq yLatent_i < log(7500),
…,
y_i = 14 if log(75000) \leq yLatent_i < log(100000),
y_i = 15 if log(100000) \leq yLatent_i,

and y_i is observed \forall i.
In words:
So, I have income categories y_i \in \{ 1, \dotsc, 15 \} , I observe y_i \quad \forall i .
For each value y_i, I know exactly which income interval it corresponds to. For example, if y_i = 1 then individual i has income less than 5000 USD.
Let yLatent_i be the logarithm of the income of individual i; I do not observe yLatent_i.
(We do know that log of income is normally distributed, hence yLatent \sim normal .)
For each i, I know the precise interval where yLatent_i is located. However, I don’t want to merely plug in interval averages for the regression, hence I’m trying to set it up in this Bayesian framework.
I think I’m trying to do something simpler than truncation, although the solution evades me.
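One way to code this interval-censored likelihood in Stan is to marginalize yLatent out entirely: for each observation, increment the log density by the log probability that a normal(alpha + beta * x, sigma) variate lands in the known interval. This is a sketch under my own naming assumptions (in particular, `cut` is a data vector holding the 14 log cutpoints log(5000), …, log(100000)):

```stan
data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=1, upper=15> y;   // observed income category
  vector[14] cut;                      // log cutpoints, increasing
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    real mu = alpha + beta * x[n];
    if (y[n] == 1) {
      // bottom category: P(yLatent < cut[1])
      target += normal_lcdf(cut[1] | mu, sigma);
    } else if (y[n] == 15) {
      // top category: P(yLatent >= cut[14])
      target += normal_lccdf(cut[14] | mu, sigma);
    } else {
      // interior category: P(cut[y-1] <= yLatent < cut[y]), on the log scale
      target += log_diff_exp(normal_lcdf(cut[y[n]] | mu, sigma),
                             normal_lcdf(cut[y[n] - 1] | mu, sigma));
    }
  }
}
```

Because yLatent never appears as a parameter, there is no assignment to parameters and no Jacobian issue; the model is just an ordinal regression with fixed, known cutpoints.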
Exactly like the uncensor_me.stan code you wrote, except you’re incrementing the log density of your model with this. Does that make sense at all?
At least, now it compiles, but it seems like I have a problem with initialization, I keep getting:
“Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can’t start sampling from this initial value.”
Do I need to add the log Jacobian to the target here or something?
Try computing the log probability in R (of course convert it to R first) and see what’s happening.
Things to try:
Rescale your data so it’s around zero with standard deviation 1.
Try Phi_approx (you’ll still have to normalize your data for it to work well). It’s an approximation, but it’s probably good enough and it’s easier on the numerics.
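Concretely, an interval probability computed with Phi_approx might look like the fragment below. The names `cut`, `mu`, and `sigma` are my own (cutpoint vector, linear predictor, and residual scale); this is a sketch for an interior category k, not a complete model:

```stan
// log P(y == k) for an interior category 2 <= k <= 14,
// using Phi_approx on standardized cutpoints
real lo = Phi_approx((cut[k - 1] - mu) / sigma);
real hi = Phi_approx((cut[k] - mu) / sigma);
target += log(hi - lo);
```

If hi and lo are very close (an observation far out in a tail), log(hi - lo) can still underflow, which is exactly the kind of thing that produces the log(0) initialization rejection above; rescaling the data as suggested keeps the standardized arguments in a numerically friendly range.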