Modelling dependent variable with some kind of rounding

Hey, I’m trying to model a relationship where the dependent variable (income) is discretized in the data.

My model:
y_{latent} = \alpha + \beta x
where y_{latent} is unobserved. Instead, y is observed and takes an integer value determined by the value of y_{latent}.

My problem:
The code is not compiling. It seems I’m missing something from the model, but I’m not sure what.

My code is here: https://gist.github.com/credpath/3e1a7125ce9a643a763a6f8a4aef758e


Check out the “Truncated or Censored Data” section of the manual (page 186 in the 2.17.0 manual). That’s what you’re looking for :P.

There’s also some threads around here where various people do stuff like this: https://discourse.mc-stan.org/t/gaussian-mixture-model-with-interval-censored-data-where-do-the-divergences-come-from

The key words you’re looking for are truncation/censored if you wanna search around. Hope that helps!

Thanks bbbales2!

I don’t have truncated data, I’m just trying to reverse engineer my categories/integers into real values in a probabilistic framework (to avoid imputing averages), then use those data to do a simple linear regression in a Bayesian way.

latent_y = alpha + beta*x; 

You cannot assign to parameters. Parameters are controlled by the inference mechanism.

latent_y ~ normal(mu, sigma); // income is log normal

This is not going to make latent_y log normal.
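(To illustrate the assignment point: a quantity computed from parameters can be declared and assigned in transformed parameters or as a local variable, just not in the parameters block. A minimal sketch with illustrative names:)

data {
  int<lower=1> N;
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
transformed parameters {
  // fine: a derived quantity, declared and assigned outside the parameters block
  vector[N] latent_mu = alpha + beta * x;
}
model {
  // ... likelihood terms would go here
}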

Hmm, I’m not familiar enough with what’s going on to parse that properly. But can you write out how the parameters generate the data? If you can do that, then you’ll have your model and we can see if it’s easy to code up in Stan.

For instance, for a super simple linear regression, you have a slope (a), an intercept (b), and a noise standard deviation (sigma).

You assume your measurements are normally distributed with standard deviation sigma about the line a * x + b, where x is your measured covariate.
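(For concreteness, a minimal Stan sketch of that plain regression; names are just illustrative:)

data {
  int<lower=1> N;
  vector[N] x;    // measured covariate
  vector[N] y;    // measurements
}
parameters {
  real a;               // slope
  real b;               // intercept
  real<lower=0> sigma;  // noise standard deviation
}
model {
  y ~ normal(a * x + b, sigma);  // measurements scatter around the line
}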

The usual truncation model is that the data was still normally distributed, but you were only able to measure up to some precision. You’re saying something else is happening at this point in your model? What happens after the linear regression that gives you integers?

Model: yLatent_i = \alpha + \beta x_i, where yLatent_i is unobserved; and
y_i = 1 if yLatent_i < \log(5000),
y_i = 2 if \log(5000) \leq yLatent_i < \log(7500), …,
y_i = 14 if \log(75000) \leq yLatent_i < \log(100000),
y_i = 15 if \log(100000) \leq yLatent_i, and y_i is observed \forall i.

In words:
So, I have income categories y_i \in \{1, \dotsc, 15\}, and I observe y_i \forall i.
For each value y_i, I know exactly which income interval it corresponds to. For example, if y_i = 1 then individual i has income less than 5000 USD.
Let yLatent_i be the logarithm of the income of individual i; I do not observe yLatent_i.
(We do know that log of income is normally distributed, hence yLatent \sim normal.)
For each i, I know the precise interval where yLatent_i is located. However, I don’t want to merely use averages in the regression, hence trying to set it up in this Bayesian framework.
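Putting those pieces together: with yLatent_i \sim normal(\alpha + \beta x_i, \sigma), \Phi the standard normal CDF, and cutpoints c_0 = -\infty < c_1 = \log(5000) < c_2 = \log(7500) < \dotsb < c_{14} = \log(100000) < c_{15} = \infty, the likelihood I’m after should be

P(y_i = k) = \Phi\left(\frac{c_k - (\alpha + \beta x_i)}{\sigma}\right) - \Phi\left(\frac{c_{k-1} - (\alpha + \beta x_i)}{\sigma}\right).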

I think I’m trying to do something simpler than truncation, although the solution evades me.

From what I’m getting, this is exactly the truncation use case.

You have, for instance, that

\log p(y = 1 | \mu, \sigma) = \text{log_normal_lcdf}(5000 | \mu, \sigma) etc…

So in your model section you’ll write:

for(n in 1:N) {
  if(y[n] == 1) {
    // log P(income < 5000); the lcdf already returns the log probability
    target += log_normal_lcdf(5000 | mu, sigma);
  } else if(y[n] == 2) {
    // log P(5000 <= income < 7500), computed stably on the log scale
    target += log_diff_exp(log_normal_lcdf(7500 | mu, sigma), log_normal_lcdf(5000 | mu, sigma));
  } else { // etc...
  }
}

Exactly like the uncensor_me.stan code you wrote, except you’re incrementing the log density of your model with this. That make sense at all?
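(A minimal sketch of how the whole thing could fit together, assuming the cutpoints 5000, 7500, …, 100000 are passed in as data; names are placeholders and priors are omitted:)

data {
  int<lower=1> N;
  vector[N] x;
  int<lower=1, upper=15> y[N];  // observed income category
  vector[14] cuts;              // upper bounds in dollars: 5000, 7500, ..., 100000
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    real mu_n = alpha + beta * x[n];  // mean of log income for individual n
    if (y[n] == 1) {
      target += log_normal_lcdf(cuts[1] | mu_n, sigma);    // income < 5000
    } else if (y[n] == 15) {
      target += log_normal_lccdf(cuts[14] | mu_n, sigma);  // income >= 100000
    } else {
      target += log_diff_exp(log_normal_lcdf(cuts[y[n]] | mu_n, sigma),
                             log_normal_lcdf(cuts[y[n] - 1] | mu_n, sigma));
    }
  }
}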

Thanks! I think I’ve got the correct code now: https://gist.github.com/credpath/6d4ce21bac3cf6d1b1dab267eed2120f

At least, now it compiles, but it seems like I have a problem with initialization, I keep getting:
“Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can’t start sampling from this initial value.”

Do I need to add the log Jacobian to the target here or something?

Using cdfs puts you well into the realm of finicky numerics.

Using large numbers in CDFs doubly so.

You should probably do some print debugging just to figure out exactly what’s blowing up.

For instance, just add the line:

print("mu = ", mu, ", sigma = ", sigma);

To your for loop, take a couple of the mus and sigmas from where you’re getting that error, and try to evaluate:

log_diff_exp(log_normal_lcdf(7500 | mu, sigma), log_normal_lcdf(5000 | mu, sigma));

in R (of course convert it to R first) and see what’s happening.

Things to try:

  1. Rescale your data so it’s around zero with standard deviation 1.
  2. Try Phi_approx (you’ll still have to normalize your stuff to work with it). It’s an approximation, but it’s probably good enough and it’s easier on the numerics; see the sketch below for one way that could look.
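(For instance, the middle categories could look roughly like this, reusing the data and parameters from the sketch above but with cuts now on the log-income scale, e.g. cuts[1] = log(5000); purely illustrative:)

model {
  for (n in 1:N) {
    real mu_n = alpha + beta * x[n];
    if (y[n] == 2) {
      real p_hi = Phi_approx((cuts[2] - mu_n) / sigma);  // approx. P(yLatent < log(7500))
      real p_lo = Phi_approx((cuts[1] - mu_n) / sigma);  // approx. P(yLatent < log(5000))
      target += log(p_hi - p_lo);
    }
    // ... other categories analogously; the end categories use Phi_approx and 1 - Phi_approx
  }
}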