Inferring a selection-distorted distribution [Rewritten with Stan code]

(Rewritten with Stan code for my attempted solution to help clarify what the problem is).

As part of a larger model I’m building I have an inference problem that I hope someone can help me with.

Background:

A substance X increases your risk of being involved in a traffic crash. We have a dataset of substance-positive drivers who have been involved in crashes, and we believe there are dose-response effects. The shape of the dose-response curve is “known” from other studies, and the effect at any dose is modelled as the product of a “max-effect” and an inverse-logit (logistic) function specifying the share of the full effect present at that dose. We also have evidence on the log average odds ratio of a crash for all positive drivers.

The dose-response effect means that high-dose individuals will be oversampled relative to their presence on the road. We want to infer the max-effect, as well as the parameters of the lognormal distribution of doses in the dose-positive population on the road (i.e. correcting for the oversampling of more impaired individuals into the crash sample).

Data to illustrate:

Using R code:

library(data.table)

set.seed(1)
max_effect <- log(3)

# Road population of substance-positive drivers
sim_data_original <- data.table(dose = rlnorm(1000, 1, 1))
# Share of the full effect present at each dose (inverse-logit shape)
sim_data_original[, effective_dose := exp(-2.85 + 1.4 * dose) / (1 + exp(-2.85 + 1.4 * dose))]
# Odds ratio of a crash at each dose
sim_data_original[, or := exp(effective_dose * max_effect)]
# Crash sample: drivers drawn with probability proportional to their OR
sim_data_selected <- sim_data_original[sample.int(nrow(sim_data_original), replace = TRUE, prob = or)]

Inference on max_effect parameter alone (which works):

Taking the parameters of the “dose-effect shape function” as known, we can recover the max_effect in a Stan script as follows:

data {
  int<lower=1> n_obs;
  vector<lower=0>[n_obs] dose;
  real mean_or_act;
}
parameters {
  real<lower=0> dose_coef;  // the max_effect parameter
}
transformed parameters {
  real implicit_log_or;
  vector[n_obs] sampling_weight;
  {
    vector[n_obs] effective_dose = inv_logit(-2.85 + 1.4 * dose);
    sampling_weight = exp(-dose_coef * effective_dose);
    implicit_log_or = log(n_obs / sum(sampling_weight));
  }
}
model {
  dose_coef ~ normal(1, 0.1);
  implicit_log_or ~ normal(log(mean_or_act), 0.2);
}

This consistently recovers the max_effect parameter, although Stan gives a warning (“Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.”). My logic here is that each crash observation has a known OR (given the parameter dose_coef and the known dose-response shape function), which we can call OR_n for n = 1, …, N. Weighting each crash observation by 1/OR_n adjusts for the oversampling: since OR_n/OR_n = 1, each reweighted observation counts as one “road” unit, so the implied average OR in the pre-crash road population is simply the number of observations divided by the sum of the weights (the sum of 1/OR_n over all observations). This is what we relate to the evidence on average crash ORs.
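The inverse-OR weighting identity can be checked numerically. Here is a quick Python sketch (not part of the thread's R/Stan code; all names are illustrative) that simulates a road population, draws a size-biased “crash” sample, and confirms that n / sum(1/OR_n) over the crash sample recovers the average OR on the road:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical road population of substance-positive drivers
dose = rng.lognormal(1.0, 1.0, size=100_000)
effective_dose = 1 / (1 + np.exp(-(-2.85 + 1.4 * dose)))
or_road = np.exp(effective_dose * np.log(3))   # max_effect = log(3)

# Crash sample: drivers drawn with probability proportional to their OR
idx = rng.choice(len(dose), size=50_000, p=or_road / or_road.sum())
or_crash = or_road[idx]

# Harmonic-mean identity: n / sum(1/OR_n) over the crash sample
# recovers the arithmetic mean OR in the road population
implied_mean_or = len(or_crash) / np.sum(1 / or_crash)
print(implied_mean_or, or_road.mean())
```

The two printed numbers agree up to Monte Carlo error, which is the identity the `implicit_log_or` line relies on.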

Inference on the pre-crash lognormal distribution alone (which works):

I also want to recover the parameters of the lognormal distribution on the road. My hunch was that this could be achieved by similarly weighting the lognormal distribution by the inverse of the individual ORs. Using the true OR values from the simulated data, this works using the following Stan code:

data {
  int<lower=1> n_obs;
  vector<lower=0>[n_obs] dose;
  vector[n_obs] or_act;
}
parameters {
  real<lower=0> dist_mu;
  real<lower=0> dist_sigma;
}
transformed parameters {
  real implicit_log_or;
  vector[n_obs] sampling_weight;
  sampling_weight = rep_vector(1.0, n_obs) ./ or_act;
  implicit_log_or = log(n_obs / sum(sampling_weight));
}
model {
  dist_mu ~ normal(0, 1);
  dist_sigma ~ normal(0, 1);

  for (i in 1:n_obs) {
    target += sampling_weight[i] * lognormal_lpdf(dose[i] | dist_mu, dist_sigma);
  }
}
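As a sanity check on the weighted-likelihood idea, here is a hedged Python sketch (mirroring the R simulation, with the true weights known; not part of the original code): self-normalised inverse-OR weighted moments of log(dose) in the crash sample recover the road-population lognormal parameters (mu = 1, sigma = 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated road population and size-biased crash sample (as in the R code)
dose = rng.lognormal(1.0, 1.0, size=200_000)
effective_dose = 1 / (1 + np.exp(-(-2.85 + 1.4 * dose)))
or_true = np.exp(effective_dose * np.log(3))
idx = rng.choice(len(dose), size=100_000, p=or_true / or_true.sum())
dose_crash, w = dose[idx], 1 / or_true[idx]   # inverse-OR weights

# Self-normalised weighted moments of log(dose) undo the size bias:
# E_crash[(1/OR) g] / E_crash[1/OR] = E_road[g]
log_d = np.log(dose_crash)
mu_hat = np.average(log_d, weights=w)
sigma_hat = np.sqrt(np.average((log_d - mu_hat) ** 2, weights=w))
print(mu_hat, sigma_hat)
```

Both estimates come out close to 1, consistent with the weighted lognormal_lpdf model working when the true ORs are supplied as data.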


Simultaneous inference on the max_effect and the pre-crash lognormal distribution (which does not work):

Putting both pieces together, I hoped to identify both the max_effect parameter and the underlying lognormal distributional parameters, but this does not hold. When the individual ORs from the simulated data are replaced by their estimated values (which are determined by the max_effect parameter I was able to infer above), neither the max_effect nor the lognormal parameters are correctly recovered.

data {
  int<lower=1> n_obs;
  vector<lower=0>[n_obs] dose;
  real mean_or_act;
}
parameters {
  real<lower=0> dist_mu;
  real<lower=0> dist_sigma;
  real<lower=0> dose_coef;
}
transformed parameters {
  real implicit_log_or;
  vector[n_obs] sampling_weight;
  {
    vector[n_obs] effective_dose = inv_logit(-2.85 + 1.4 * dose);
    sampling_weight = exp(-dose_coef * effective_dose);
    implicit_log_or = log(n_obs / sum(sampling_weight));
  }
}
model {
  dist_mu ~ normal(1, 0.5);
  dist_sigma ~ normal(1, 0.5);
  dose_coef ~ normal(1, 0.1);

  for (i in 1:n_obs) {
    target += sampling_weight[i] * lognormal_lpdf(dose[i] | dist_mu, dist_sigma);
  }

  implicit_log_or ~ normal(log(mean_or_act), 0.02);
}

This, however, does not work: not only does it return the wrong parameters for the lognormal distribution, it also throws off the estimate of the max_effect parameter. The max_effect estimate comes out too high, and the mean of the distribution too low.

I have the feeling that this problem should be pretty easy to solve but that there is something “obvious” I am missing. If anyone should see what it is or might be, please let me know.

It looks like you’re missing the Jacobian adjustment for implicit_log_or. This is a problem in the first example, too.

The scale of implicit_log_or might be a problem for adaptation and initialization depending on dose_coef and effective_dose and n_obs.

Thanks. I still haven’t fully understood this business with Jacobian adjustments, but in this case I thought an adjustment was unnecessary given the arguments in this brief blog post by Daniel Lakeland. Along those lines, I viewed the “implicit_log_or ~ normal(log(mean_or_act), 0.02)” line as a modification of the priors imposed directly on the parameters.

FWIW I gave up identifying the counterfactual “road population” distribution in this instance and proceeded without it, as it was not of primary interest in any case.

The explanation at the beginning of the reference manual section that details the actual transforms and Jacobians may help. And the first few are easy. Or you can think about how the lognormal is defined. If you try to write

parameters {
  real<lower = 0> y;
}
model {
  log(y) ~ normal(0, 1);
}

you don’t get a lognormal distribution on y without the Jacobian adjustment. The point is that if you declare a parameter and then impose a distribution on a nonlinear transform, you need to include the log absolute derivative of the transform (in general, for multivariate transforms, the log absolute determinant of the Jacobian).
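A quick numerical check of this (in Python, just to illustrate the change-of-variables argument): the density of y when log(y) ~ Normal(0, 1) is the normal density at log(y) times the Jacobian |d log(y)/dy| = 1/y, and dropping that 1/y factor is exactly what the naive sampling statement does:

```python
import math

# For y > 0 with z = log(y) ~ Normal(0, 1), change of variables gives
# p(y) = phi(log y) * |d log(y)/dy| = phi(log y) / y,
# i.e. the extra 1/y factor (log-Jacobian: -log y) is what the naive
# "log(y) ~ normal(0, 1)" statement omits.
def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def lognormal_pdf(y):
    return normal_pdf(math.log(y)) / y

y = 2.0
naive = normal_pdf(math.log(y))          # missing the Jacobian
correct = lognormal_pdf(y)               # includes the 1/y factor
print(naive, correct, naive / correct)   # ratio is exactly y
```

In Stan terms, adding `target += -log(y);` alongside the sampling statement would supply the missing log-Jacobian and make the two densities agree.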

I found it helps to understand these things by building up from simple examples.

Another one I like is thinking about a variable y ~ uniform(0, 1) for a probability and then a second variable logit(y) for the log odds. What’s the distribution of logit(y)? The way to compute that (and learn that it’s the logistic distribution) is through the Jacobian.

parameters {
  real logit_y;
}
model {
  inv_logit(logit_y) ~ uniform(0, 1);
}

will not work. You need to include the Jacobian for the inverse-logit transform, which works out to being a standard logistic distribution on logit_y.
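To see this numerically (a small Python check, not Stan): the Jacobian of the inverse-logit transform, inv_logit(x) * (1 - inv_logit(x)), is precisely the standard logistic density, which is the distribution the uniform statement alone fails to produce:

```python
import math

# Change of variables: if y ~ Uniform(0, 1) and x = logit(y), then
# p(x) = |d inv_logit(x)/dx| = inv_logit(x) * (1 - inv_logit(x)),
# which equals the standard logistic density exp(-x) / (1 + exp(-x))^2.
def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def logistic_pdf(x):
    return math.exp(-x) / (1 + math.exp(-x)) ** 2

x = 0.7
jacobian = inv_logit(x) * (1 - inv_logit(x))
print(jacobian, logistic_pdf(x))   # the two expressions agree
```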