Modeling censored data whose probability of observation is proportional to the observation value

I’m trying to model something that seems like it should be simple, but I can’t figure out the right way to specify it.

Let’s say the true data is exponentially distributed. However, only some of the data is observed, and the probability of observation is proportional to the observation value. The density of what is observed looks like the green dots in the plot below:

[Plot: the observed density (green dots) against the true exponential density (purple)]

I want to learn the underlying parameter of the purple exponential distribution.

I’m trying to convert this exposure into a probability with inv_logit, but I don’t think that preserves the proportionality between a value and its observation probability.

Any ideas? I’ve been struggling to understand this problem for a ridiculously long time and I would be grateful for any help!


data {
   int<lower=1> N; //Number of observations
   vector<lower=0>[N] y_obs; //Observations
}

parameters {
   real<lower=0> latent_mu;
}

transformed parameters {
   
   vector[N] proportional_observed;
   proportional_observed = exponential_lpdf(y_obs | latent_mu) * y_obs;
}

model {
   vector[N] p_observed;
   
   //PRIOR
   latent_mu ~ exponential(1);
   
   target += inv_logit(proportional_observed) + exponential_lpdf(y_obs | latent_mu);
}

I don’t think what you have written is specific enough to imply a corresponding Stan program.

What code would you use to simulate data from the prior predictive distribution?

Thanks Ben. This is how I generate the data in R. I guess my situation is called observer bias. I’m trying to deduce the actual_Y from the observed_Y.

# grid of x values
X <- seq(0, 10, by = 0.1)
# density of the true (latent) exponential distribution
actual_Y <- dexp(X, rate = 1)
# density after value-proportional selection
observed_Y <- dexp(X, rate = 1) * X
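
The snippet above gives density curves on a grid rather than actual draws. To simulate individual biased draws, one option (a sketch; the sample size and rejection scheme here are just illustrative) is rejection sampling with acceptance probability proportional to the value:

set.seed(1)
# draw from the true (latent) exponential distribution (illustrative sample size)
latent <- rexp(1e5, rate = 1)
# keep each draw with probability proportional to its value
accept_prob <- latent / max(latent)
observed <- latent[runif(length(latent)) < accept_prob]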

Hold on, I misunderstood what you meant by simulating the data. I’m trying to figure out some Stan code to accomplish this now.

OK, here is the Stan code that simulates the data. That’s a good exercise for clarifying the data generating process, thanks for the suggestion! The thing I’m still stuck on is the categorical_rng(normalized_latent_Y) call, because I don’t actually have these normalized_latent_Y values, so I don’t know how to properly normalize my data.

data {
   
   //NUMBER OF LATENT DATAPOINTS TO SIMULATE
   int<lower=1> N;
   //NUMBER OF OBSERVATIONS
   int<lower=1> O;
   
   //THE PARAMETER I WANT TO LEARN (THE EXPONENTIAL RATE)
   real<lower=0> latent_mu;
}

parameters {}
model {}

generated quantities {

   vector[N] latent_Y;
   vector[O] observed_Y;
   vector[N] normalized_latent_Y;

   
   //SIMULATE THE LATENT SAMPLES DRAWN FROM AN EXPONENTIAL DISTRIBUTION
   for (i in 1:N) {
      latent_Y[i] = exponential_rng(latent_mu);
   }
   
   // I DON'T KNOW HOW TO MODEL THIS IN STAN BECAUSE I DON'T ACTUALLY HAVE THESE LATENT VALUES
   normalized_latent_Y = latent_Y / sum(latent_Y);
   
   //GENERATE THE OBSERVATIONS
   for (i in 1:O) { 

      //PROBABILITY OF CHOOSING A DATAPOINT IS PROPORTIONAL TO ITS VALUE
      int observation = categorical_rng(normalized_latent_Y);
      
      //DRAW THE OBSERVED VALUE
      observed_Y[i] = latent_Y[observation];
      
   }
   
}
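
One thing worth noting about this generative process: as N grows, choosing a datapoint with probability proportional to its value becomes equivalent to drawing from the size-biased density

\pi'(x) \propto x \, \pi(x)

where \pi is the latent exponential density.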

Running the code in R

stan_data <- list(N=10000, O=1000, latent_mu=1.0)

fit <- stan(file = 'exposure_data_generation.stan', data = stan_data,
               iter = 1, chains = 1, algorithm = "Fixed_param")  

Visualizing the Stan simulated results

[Plot: the Stan-simulated latent and observed densities]

Yes, that does clarify things. So basically you have O known Y values (are they unique in practice?), N - O unknown Y values, and some selection mechanism by which O are revealed and N - O are not. For fixed N and O, I think you could do something like

data {
  int<lower=1> O;
  int<lower=O> N;
  vector<lower=0>[O] y_obs;
}
parameters {
  real<lower=0> mu;
  vector<lower=0>[N - O] y_miss;
}
model {
  vector[N] y = append_row(y_obs, y_miss);
  target += exponential_lpdf(y | inv(mu));
  
  // prior on mu

  // clogit for the selection mechanism
}

There are some functions for the clogit log-likelihood at [link], but it is going to be slow at best with such a large O and N.

Another way to incorporate continuous selection functions is to explicitly model the distribution induced by the selection. If \pi(x \mid \alpha) is the density of the original observation distribution and P(x, \beta) is the probability of being selected, then the density for the observations after selection is

\pi'(x) = \frac{ P(x, \beta) \, \pi(x \mid \alpha) }{ \int \mathrm{d} x' \, P(x', \beta) \, \pi(x' \mid \alpha) }

The computational challenge is computing the normalization constant when \alpha or \beta are not known.

Note that if P(x, \beta) is a step function at x_{0} then this reduces to the usual censored model,

\pi'(x) = \frac{ \pi(x \mid \alpha) }{ \int_{x_{0}}^{\infty} \mathrm{d} x' \, \pi(x' \mid \alpha) }.

Thanks Mike and Ben. For my problem the value of N is essentially infinite, so Mike’s approach with the normalizing constant is the right direction; sorry for not clarifying that earlier.

I figured out that there’s an analytical solution for the normalizing constant when the original observation distribution is exponential: the selection-adjusted density is a gamma distribution with shape parameter 2.
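
Writing out the algebra in Mike’s notation, with \pi(x \mid \lambda) = \lambda e^{-\lambda x} and P(x) \propto x, the normalizing constant is \int_{0}^{\infty} \mathrm{d} x \, x \, \lambda e^{-\lambda x} = 1 / \lambda, so

\pi'(x) = \lambda^2 x \, e^{-\lambda x},

which is exactly the gamma density with shape 2 and rate \lambda. So my whole Stan model becomes: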

model {
   
   //PRIOR
   latent_mu ~ exponential(1);
   
   //SELECTION-ADJUSTED LIKELIHOOD: GAMMA WITH SHAPE 2 AND RATE latent_mu
   target += gamma_lpdf(y_obs | 2, latent_mu);
}

and it fits the latent_mu perfectly.
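
A quick end-to-end check (a sketch; it assumes the model block above is combined with the data and parameters blocks from my first post and saved as selection_model.stan):

library(rstan)

set.seed(1)
# simulate latent draws with value-proportional selection (true rate = 1)
latent <- rexp(1e5, rate = 1)
keep <- runif(length(latent)) < latent / max(latent)
y_obs <- latent[keep]

# fit the gamma(2, rate) model to the biased observations
fit <- stan(file = "selection_model.stan",
            data = list(N = length(y_obs), y_obs = y_obs))
print(fit, pars = "latent_mu")  # posterior should concentrate near 1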

Let me know if I’m misunderstanding something here. Also, if you have any tips on doing this when the observation distribution is lognormal, that’s what I’m working on next.