Modeling noisy indicator of a ratio

I have a survey dataset where individuals i have various sources of non-wage income (Y_i), including unemployment benefits (U_i).

I observe Y_i but not U_i. However, the survey asks respondents for their main source of other income, so in principle I could construct a Boolean outcome u_i = 1_{\{ U_i/Y_i \ge 0.5 \}}.

But I want to treat this indicator as noisy, because I don’t trust respondents to answer it precisely. I thought of the following: let

r_i = U_i/Y_i \in [0,1]

be a latent variable, and define the mapping

w(y) = \frac{y / \sqrt{1 + y^2} + 1}{2}

which maps (-\infty, \infty) to (0, 1), then define

\Pr(u_i = 1; \kappa) = w(\mathrm{logit}(r_i) / \kappa)

This gives me a smooth approximation to the Heaviside step function on a finite domain, with the step at r_i = 0.5 and a sharper transition as \kappa \to 0.

(I have \dots/\kappa because I want to put a finite-domain prior on \kappa).

I can code this in Stan just fine (I have to special-case the edges for 0 and 1, but it works), but it seems to be a pretty ad hoc chain of transformations I just cobbled together, so I am wondering if there is something more canonical.
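
For anyone who wants to check the mapping numerically outside Stan, here is a minimal Python sketch (not the Stan code mentioned above). The names `logit`, `w`, and `prob_main_source` are made up for illustration, and the endpoint handling is just one assumed way to special-case r = 0 and r = 1.

```python
import math

def logit(r):
    """Log-odds of r in [0, 1]; the endpoints map to -inf and +inf."""
    if r <= 0.0:
        return -math.inf
    if r >= 1.0:
        return math.inf
    return math.log(r) - math.log1p(-r)

def w(y):
    """Map the real line to (0, 1) via (y / sqrt(1 + y^2) + 1) / 2."""
    if math.isinf(y):
        # inf / inf would give nan, so return the limiting values directly.
        return 1.0 if y > 0 else 0.0
    # hypot(1, y) equals sqrt(1 + y^2) without overflowing for large |y|.
    return (y / math.hypot(1.0, y) + 1.0) / 2.0

def prob_main_source(r, kappa):
    """Pr(u = 1) for latent ratio r and scale kappa > 0."""
    return w(logit(r) / kappa)

if __name__ == "__main__":
    for kappa in (1.0, 0.3, 0.1):
        row = [round(prob_main_source(r, kappa), 3) for r in (0.0, 0.3, 0.5, 0.7, 1.0)]
        print(kappa, row)
```

Smaller values of kappa push the curve toward a hard step at r = 0.5, which is the behavior described above.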

Can you use the inverse logit in place of w? Then kappa is just acting directly on the log-odds of r.
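
A rough Python sketch of what this looks like, in case it helps. The function names are hypothetical, and this is only a numerical check, not Stan code. Algebraically, inv_logit(logit(r) / kappa) works out to r^{1/\kappa} / (r^{1/\kappa} + (1 - r)^{1/\kappa}), so kappa = 1 returns r itself. In plain Python, `math.log(0)` raises an error, so the `logit` helper returns infinities at the endpoints; the inverse logit then handles those without any extra branch.

```python
import math

def logit(r):
    """Log-odds of r in [0, 1]; the endpoints map to -inf and +inf."""
    if r <= 0.0:
        return -math.inf
    if r >= 1.0:
        return math.inf
    return math.log(r) - math.log1p(-r)

def inv_logit(x):
    """Numerically stable logistic function; also valid for +/- infinity."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def prob_main_source(r, kappa):
    """Pr(u = 1) = inv_logit(logit(r) / kappa), with kappa > 0."""
    return inv_logit(logit(r) / kappa)

if __name__ == "__main__":
    for kappa in (1.0, 0.3, 0.1):
        row = [round(prob_main_source(r, kappa), 3) for r in (0.0, 0.3, 0.5, 0.7, 1.0)]
        print(kappa, row)
```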

That’s a great idea, thanks! Then I don’t have to special-case the edges in the code either. The corresponding plot is: [plot not shown]

Did the survey ask people if they earned more than half as much from unemployment benefits as they earned from non-wage income?

Or did you just ask them that and you’re somehow interested in this cutoff for some other reason? Do you need that indicator exactly or is it just a convenient summary?

I believe the advice that @andrewgelman usually provides in this situation is to just plot the raw fits, e.g., U vs. Y in a scatterplot with a line corresponding to the 0.5 ratio you care about.

Hi, I’m not following all the details here, but if I’m reading things right, my suggestion would just be to model the unobserved variable u. There’s no need for the logits or this other stuff, you can just do inference for u, get your posterior simulations, and then summarize however you want.
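
To make the "summarize however you want" step concrete, here is a minimal Python sketch. It assumes you already have posterior draws of the unobserved U_i for one respondent from whatever model you fit; the function name and the numbers below are placeholders for illustration, not results from any real fit.

```python
def prob_main_source_from_draws(u_draws, y):
    """Estimate Pr(U / Y >= 0.5) as the share of posterior draws of U that meet it."""
    if y <= 0:
        raise ValueError("y must be positive")
    if not u_draws:
        raise ValueError("need at least one posterior draw")
    hits = sum(1 for u in u_draws if u / y >= 0.5)
    return hits / len(u_draws)

if __name__ == "__main__":
    # Placeholder draws of U_i and an observed Y_i (made-up values).
    u_draws = [100.0, 250.0, 300.0, 400.0]
    y = 500.0
    print(prob_main_source_from_draws(u_draws, y))
```

The same draws can be summarized in other ways (for example, the posterior mean of U_i/Y_i) without changing the model.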