Is it ok to use the same independent outcome variable twice in a model?

What are the implications of using the same outcome variable in two likelihoods in the same framework? Is it ok to do this?

Here is a specific example:

model{
	y ~ bernoulli( (1-psi) * mu);
	y ~ bernoulli( (1-psi) * theta);

	u ~ bernoulli( theta);
}

The only way I have been able to make this work is by using y in two separate likelihoods. That model formulation recovers simulated parameters perfectly.

Doing this in a joint likelihood (mixture model, log_mix) results in biased estimates of mu. It only works when the likelihoods are independent.

Any help and background information would be much appreciated.

Thank you,

Benny

My hunch is that you will need to normalize the denominator of the likelihood for y. The total probability of y=0 + y=1 is not constant for different values of theta and mu. Eg if both equal 0.5 then the total probability is 0.5 (0…50.5 + 0.50.5), whereas the total probability is only 0.18 if one equals 0.1 and the other equals 0.9. But in both cases, y=0 and y=1 are given equal probabilities. This could make the model favor values that arbitrarily maximize that total probability rather than match the data.

It’s possible that tying theta to u, as you’ve done, solves this issue. But I’d want to confirm that

You could instead write a custom likelihood function that normalizes the denominator. So the probability y=1 would be (theta * mu) / (theta * mu + (1-theta) * (1-mu)). I’m not sure that would achieve your specific goals but it would “solve” the denominator issue.

Of course, any more details about your model and goals might reveal a more appropriate solution.

2 Likes

Can you write out how you simulate the data from the parameters? I don’t yet understand what your generative model is. If you can show a simulation, we can figure out how to recapitulate that generative process in Stan.

Yes absolutely, here is simulation code (R), and the model.
My actual model is more complicated and involves more components, but this captures the essence of my problem.

set.seed(2)
n = 1000
psi = 0.23    # false negative proportion
alpha = -0.5

X = rnorm(n)  
beta = 0.01
mu = 1/(1+exp(-(alpha + beta * X)))
y = rbinom(n, size = 1, prob = (1 - psi) * mu)

theta = 1/(1+exp(-(alpha)))
u = rbinom(n, 1, theta)
data {
              int N; 
              int y[N];
              int u[N];
              vector[N] X;
       }

       parameters {
              real alpha;  
              real beta;
              real<lower=0,upper=1> theta;  
              real<lower=0,upper=1> psi;
       }
       
       transformed parameters {
              vector<lower=0,upper=1>[N] mu;

              mu = inv_logit(alpha + X * beta);
       }

       model {
              alpha ~ normal(0,5);
              beta ~ normal(0,5); 
              theta ~ beta(1,1);
              psi ~ beta(1,1);
              
              y ~ bernoulli((1-psi) * mu);
              y ~ bernoulli((1-psi) * theta);
              
              u ~ bernoulli(theta);
       }

In your simulation, y is given as

I don’t understand why in your model you have the line y ~ bernoulli((1-psi) * theta); in addition to the line y ~ bernoulli((1-psi) * mu);.

Moreover, in your simulation theta is a deterministic transformation of alpha, and I don’t understand why in your model you declare alpha and theta both as parameters.

Is this the model you intend?

parameters {
  real alpha;
  real beta;
  real<lower = 0, upper = 1> psi;
}
transformed parameters {
  vector[N] mu = inv_logit(alpha + X * beta);
  real theta = 1/(1+exp(-(alpha)));
}
model {
// priors
  alpha ~ normal(0,5);
  beta ~ normal(0,5);
  psi ~ beta(1,1);
// likelihood
  y ~ bernoulli((1-psi) * mu);
  u ~ bernoulli(theta);
}
1 Like

That is exactly my initial reasoning when I started modeling this, but it turns out that this does not work.
Psi needs to estimated using the “correct probability”, which in this case is theta as informed by u (which is assumed not to have false negatives).
I only managed to do this by adding y ~ bernoulli((1-psi) * theta).

That correct estimate of psi is then used in y ~ bernoulli((1-psi) * mu) to ensure the appropriate correction.

Thanks for this @simonbrauer.
It sounds like what you’re saying might indeed be what is happening.

I am trying to understand how to try out your suggestion, to see if that solves the problem, but am not quite following, would it perhaps be possible to add some more detail on that? Would this result in similar estimates, but with a “corrected” likelihood that solves the problem of defining y twice?

I’m trying to make sense of this, but it’s definitely not what you have in your simulation.

From your verbal description, here is what I think I understand:

  • You have data u which are sampled from a binomial distribution with underlying success probability theta and no false negatives.
  • You have data y which are sampled from a bernoulli distribution with probability mu, but with probability psi a true one gets recorded as a zero.
  • The following two identities hold: mu = 1/(1+exp(-(alpha + beta * X))) and theta = 1/(1+exp(-(alpha)))

Is that right?

1 Like

Happy to clarify now that I’m at a computer. Though I’ll add that I came to the same conclusion as @jsocolar. I don’t understand how the data generating code and the model align, and so it doesn’t seem necessary to use the approach you’re describing (though I haven’t run your code, so maybe there’s something we’re missing).

(and, as a repeat of my caveat: I don’t know whether this would actually address your specific issue. Just that it would address the denominator issue)

model{
    y ~ bernoulli(mu * theta / (mu * theta + (1.0 - mu) * (1.0 - theta)) );
}
2 Likes

Your bullet point list is useful to see where I caused confusion!

The first two points are correct.
The third is where I made things more confusing instead of clarifying them…

Here is a better verbal description (and an updated simulation below):

  • y and u are sampled from the same population, with the same underlying success probability.
  • u has no false negatives and no covariate data.
  • y has false negatives and covariate data.
  • when disregarding the covariates, mu and theta should both be 1/(1+exp(-(alpha))).
  • when adding in covariates, the identities change to, and importantly the intercepts aren’t the same anymore: mu = 1/(1+exp(-(beta0 + beta * X))) and theta = 1/(1+exp(-(alpha))).
    However I did use the same alpha to simulate the data to keep it simple, but I see how that instead made it confusing.

Here is a version that hopefully clarifies it:

set.seed(2)
n = 1000
psi = 0.23    # false negative proportion
beta0 = -0.5
X = rnorm(n)
beta = 0.01

y = rbinom(n, size = 1, prob = (1 - psi) * 1/(1+exp(-(beta0 + beta * X))))

# calculate "true" simulated probability of success, taking into account beta * X
# a more thorough way to simulate this would be to first generate y values using only beta0, 
# and then simulating the X values, but I am trying to keep this code block concise.
p.alpha = mean(1/(1+exp(-(beta0 + beta * X))))
alpha = log(p.alpha / (1-p.alpha))  # added to make the link with the posted solution clear

u = rbinom(n, 1, 1/(1+exp(-alpha)))

Thanks again for the suggestion.
I tested it, and it seems like this is a clever alternative to using a mixture model approach, is that possible?
It gave me very similar results to:

target += log_mix(0.5, 
                            bernoulli_lpmf(y[i] | (1-psi) * mu),
                            bernoulli_lpmf(y[i] | (1-psi) * theta));

While this fixes the issue of defining y twice, it is unfortunately not what I need, because that approach does not correctly recover psi.

Got it! This is definitely clarifying. So here’s my new understanding of the problem:

  • We have response data u that enable estimation of an intercept term alpha in a straightforward way.

  • We have response data y that are sampled at random from the same population as u. (Crucially, the distribution of covariates associated with u must be the same as the distribution of covariates associated with y)

  • The response data y are contaminated with some unknown proportion of false negatives, completely at random with respect to the covariates conditional on the true outcome being positive.

  • We also have covariate data associated with y, and we want to estimate the covariate relationships, while accounting for false negatives.

The problem, and the reason this is so challenging and confusing, is that in the above list we made a pair of incompatible assumptions. First we assumed that the distribution of the counts of outcomes in u is binomial. Second, we assumed that u has some set of unknown covariates associated with it, which implies that the elements of u are not sampled from iid bernoullis and the sampling distribution of u is not binomial.

The formal way around this is to make some set of distributional assumptions about the covariates. A sketch of one possible way forward is below:

  • Assume that the distribution of X is Gaussian, with mean x_mu and sd x_sigma.
  • Include in your likelihood the statement X ~ normal(x_mu, x_sigma).
  • Construct a vector parameter X_unknown for the unknown covariate values associated with u
  • Include in your prior X_unknown ~ normal(x_mu, x_sigma)
  • In transformed parameters, include mu_unknown = inv_logit(alpha + X_unknown * beta);
  • In the likelihood, use y ~ bernoulli((1-psi) * mu); and u ~ bernoulli(mu_unknown);
2 Likes

I think we’re on the same page!
That’s a really interesting suggestion, I will give that a try and report back.

1 Like

That seems to do the trick, beautiful.
Fantastic, I’ve been struggling with this problem for months, and this means a lot for the progress of our work.
I am posting the updated Stan code here, and if I remember I will post a link to the paper when it’s out.

data {
              int N; 
              int y[N];
              int u[N];
              vector[N] X;
       }

       parameters {
              real alpha;  
              real beta;
              real<lower=0,upper=1> psi;
              vector[N] X_unknown;
              real X_mu;
              real X_sigma;
       }
       
       transformed parameters {
              vector<lower=0,upper=1>[N] mu;
              vector<lower=0,upper=1>[N] mu_unknown;

              mu = inv_logit(alpha + X * beta);
              mu_unknown = inv_logit(alpha + X_unknown * beta);
       }

       model {
              alpha ~ normal(0,5);
              beta ~ normal(0,5); 
              psi ~ beta(1,1);
              X_unknown ~ normal(X_mu, X_sigma);
              
              y ~ bernoulli( (1-psi) * mu);
              u ~ bernoulli(mu_unknown);
       }

One last question: could your suggested approach be described as integrating out the covariates?

Update:

@simonbrauer’s suggestion led me to find an alternative, simpler solution.
Adapting the likelihood function of a logistic regression into a joint likelihood that includes \theta allows estimation of all parameters: L(\mu_i, \theta | u, y) = y_i * ln((1-\psi)\mu * (1-\psi)\theta) + (1-y_i)*ln((1-(1-\psi)\mu)*(1-(1-\psi)\theta)))

Full model:

data {
              int N; 
              int y[N];
              int u[N];
              vector[N] X;
       }

       parameters {
              real alpha;  
              real beta;
              real<lower=0,upper=1> psi;
              real<lower=0,upper=1> theta;
       }
       
       transformed parameters {
              vector<lower=0,upper=1>[N] mu_psi;
              real<lower=0,upper=1> theta_psi = (1-psi) * theta;

              mu = (1-psi) * inv_logit(alpha + X * beta);
       }

       model {
              alpha ~ normal(0,5);
              beta ~ normal(0,5); 
              theta ~ beta(1,1);
              psi ~ beta(1,1);

              for(i in 1:N){
                 target += y[i] * log(mu_psi[i] * theta_psi) + (1-y[i]) * log( (1-mu_psi[i]) * (1-theta_psi));
              }
              u ~ bernoulli(theta);
       }

Just to clarify the logic of the three approaches (mixture, unadjusted double-likelihood, and adjusted double-likelihood).

A mixture model would be as if for each observed value y…

  1. Flip coin A, with probability p1 heads.
  2. If A is heads, flip coin B with probability p2 heads
  3. If A is tails, flip coin C with probability p3 heads

You only collect the result of B and C. You don’t know the result of A, and you don’t know whether you used coin B or C. The probability heads is p1 * p2 + (1-p1) * p3. The probability tails is 1 minus that. So the total sample space of observable outcome is 1.

Imagine instead that for each observation y you…

  1. Flip coin B with probability p2 of heads
  2. Flip coin C with probability p3 of heads

You have 4 possible outcomes (HH, HT, TH, TT) whose probabilities sum to 1. In your initial unadjusted double-liklihood scenario, it’s as if you could have observed HT or TH but you didn’t (all flips from coin B equal the corresponding flips from coin C). So in theory the total probability of all possible outcomes is 1 but in practice is much lower.

My adjusted approach is as if we uses the same sampling approach but decided that we would throw out all of the HT and TH outcomes and reflip until we got HH or TT. So the total probability of all possible outcomes is 1, but it is a different process than the mixture model.

@jsocolar’ s model treats the covariates as parameters to be estimated. If you wanted to integrate out the covariates, you would need to calculate the average value of mu_unknown given the distribution of covariates. See for example the SEM literature, where you can either directly estimate the values of latent variables (treating them as parameters, as here) or integrating over the assumed distributions of the latent variables to arrive at a multivariate normal distribution. In your case, it would be integrating over a normal distribution (of X) through the inverse-logit function.

This is still the unadjusted likelihood. So you’re implicitly assuming that you could observe “HT/TH” combinations but didn’t. I think @jsocolar 's suggestion more effectively gets at what you’re trying to do. You could try the marginalization approach described above, though I think you would need to use integrate_1d to get it. So the most expedient solution is to treat the missing covariates as parameters.

1 Like

For example, here’s the probability of “Success” (HH in my example) for the three different models, given different values of mu and theta (assuming equal mixture in the mixture model). Notice how in the mixture model, the probability of success is constant for a constant sum of mu + theta (from the upper-left corner to the bottom-right). In the adjusted double-likelihood, we get a curve because it’s now dependent on the product of mu and theta. The Unadjusted double-likelihood also has a curve, but it’s noticeable different.

We can look at the total probability of success + failure for the different models, as seen in the plot below. Here, mu is held constant at 0.25 and we adjust theta along the x-axis. The blue portion shows the probability of success while the red portion shows the probability of failure (where the two are stacked on top of one another). You can see that theta linearly affects the probability of success in the mixture model and a non-linear effect in the adjusted model. But when the denominator is unadjusted, the combination of success and failure does not sum to 1.

(but, of course, this is all just for fun and to address your original question. I don’t think any of these three are truly what you want given your data generating process.)

1 Like

Thanks for the quick lesson and illustrations @simonbrauer, that’s helpful (and generous).

One more question if you don’t mind…
In your first reply you said this:

It seems like the model may indeed be doing this, as using the unadjusted double likelihood approach has been working very well (while others like mixture model or adjusted likelihood haven’t). Based on what you say about wanting to confirm this, does this mean showing that the model recovers all parameters correctly, or would it need more than that, and is it nevertheless still improper to define y twice (or use that unadjusted likelihood)?

(I’m still trying to wrap my head around why parameter estimation works so well when using the unadjusted double likelihood, while if I understand correctly that is supposed to be an inappropriate approach.)

It’s not surprising why the mixture and adjusted models fit poorly, since they’re unrelated to your original data generation. As to the puzzle of why the unadjusted model fits well… My original hunch (which I lost while in the weeds exploring your original question) was that

  1. u ~ bernoulli(theta) leads to an unbiased estimate of theta
  2. y ~ bernoulli((1 - psi) * theta) leads to an unbiased estimate of psi, conditional on theta being unbiased
  3. y ~ bernoulli((1 - psi) * mu) leads to an unbiased estimate of mu, given psi is unbiased

So that might lead to unbiased estimates overall, even if the unadjusted double-likelihood is conceptually and statistically odd. From that perspective, it’s not surprising that you’d get unbiased results.

But I bet you’d also get results that do not have adequate coverage. Have you checked to see whether 95% of your (95%) credible intervals include the true values? The unadjusted model is as if you actually drew two y values per y. So an analogy would be if you ran a logistic regression and just doubled every observation. You’d still get unbiased results, but your intervals would be too narrow.

Edit: I’ve been running some simulations. From what I’m seeing, phi (and, to a lesser-extent, theta) don’t have sufficient coverage when the sample size is “small” (say, 60 or fewer). But I wasn’t seeing any immediate issues when using 1,000 observations. Other than sample size, I used the same parameters values as you specified, though used the integrate function to find the value of theta given the distribution of X.

inv_logit <- function(x){
  1 / (1 + exp(-x))
}

mean_logit <- function(alpha, beta, X_mu = 0, X_sigma = 1){
  # probability of y given X (inv_logit) weighted by likelihood of X (dnorm)
  f <- function(x){
    dnorm(x, X_mu, X_sigma) * inv_logit(alpha + beta * x)
  }

  # Integrate to find average probability of y given distribution of X
  integrate(f, -Inf, Inf)
}

generate_data <- function(n = 1000, psi = 0.23, alpha = -0.5, beta = 0.01, X_mu = 0, X_sd = 1){
  # Observations recorded without error (but without covariate data)
  theta = mean_logit(alpha, beta, X_mu, X_sd)$value
  u = rbinom(n, 1, theta)

  # Observations recorded with covariate data (but also with error)
  X = rnorm(n, X_mu, X_sd)
  mu = inv_logit(alpha + beta * X)
  y = rbinom(n, size = 1, prob = (1 - psi) * mu)

  list(
    params = list('n' = n,
                  'theta' = theta,
                  'psi' = psi,
                  'alpha' = alpha,
                  'beta' = beta,
                  'X_mu' = X_mu,
                  'X_sd' = X_sd),

    data = list(
      N = n,
      y = y,
      u = u,
      X = X)
  )
}
1 Like