Modeling censored data whose probability of observation is proportional to the observation value

I’m trying to model something that seems like it should be simple, but I can’t figure out the right way to specify it.

Let’s say the true data is exponentially distributed. However, only some of the data is observed, and the probability of observation is proportional to the observation value. The density of what is observed looks like the green dots in the plot below:

[Plot: the observed density (green dots) against the true exponential density (purple)]

I want to learn the underlying parameter of the purple exponential distribution.

I’m trying to convert this exposure into a probability with inv_logit, but I don’t think that preserves the proportionality between a value and its observation probability.

Any ideas? I’ve been struggling to understand this problem for a ridiculously long time and I would be grateful for any help!


data {
   int<lower=1> N; //Number of observations
   vector<lower=0>[N] y_obs; //Observations
}

parameters {
   real<lower=0> latent_mu;
}

transformed parameters {
   
   vector[N] proportional_observed;
   proportional_observed = exponential_lpdf(y_obs | latent_mu) * y_obs;
}

model {
   vector[N] p_observed;
   
   //PRIOR
   latent_mu ~ exponential(1);
   
   target += inv_logit(proportional_observed) + exponential_lpdf(y_obs | latent_mu);
}

I don’t think what you have written is specific enough to imply a corresponding Stan program.

What code would you use to simulate data from the prior predictive distribution?

Thanks Ben. This is how I generate the data in R. I guess my situation is called observer bias. I’m trying to deduce the actual_Y from the observed_Y.

# grid of x values
X <- seq(0, 10, by = 0.1)
# density of the true (latent) exponential distribution
actual_Y <- dexp(X, rate = 1)
# density after value-proportional selection
observed_Y <- dexp(X, rate = 1) * X
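
The snippet above gives density curves on a grid rather than actual draws. To simulate individual biased draws, one option (a sketch; the sample size and rejection scheme here are just illustrative) is rejection sampling with acceptance probability proportional to the value:

set.seed(1)
# draw from the true (latent) exponential distribution (illustrative sample size)
latent <- rexp(1e5, rate = 1)
# keep each draw with probability proportional to its value
accept_prob <- latent / max(latent)
observed <- latent[runif(length(latent)) < accept_prob]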

Hold on, I misunderstood what you meant by simulating the data. I’m trying to figure out some Stan code to accomplish this now.

OK, here is the Stan code that simulates the data. That’s a good exercise for clarifying the data generating process, thanks for the suggestion! The thing I’m still stuck on is the categorical_rng(normalized_latent_Y) call, because I don’t actually have these normalized_latent_Y values, so I don’t know how to properly normalize my data.

data {
   
   //NUMBER OF LATENT DATAPOINTS TO SIMULATE
   int<lower=1> N;
   //NUMBER OF OBSERVATIONS
   int<lower=1> O;
   
   //THE PARAMETER I WANT TO LEARN (THE EXPONENTIAL RATE)
   real<lower=0> latent_mu;
}

parameters {}
model {}

generated quantities {

   vector[N] latent_Y;
   vector[O] observed_Y;
   vector[N] normalized_latent_Y;

   
   //SIMULATE THE LATENT SAMPLES DRAWN FROM AN EXPONENTIAL DISTRIBUTION
   for (i in 1:N) {
      latent_Y[i] = exponential_rng(latent_mu);
   }
   
   // I DON'T KNOW HOW TO MODEL THIS IN STAN BECAUSE I DON'T ACTUALLY HAVE THESE LATENT VALUES
   normalized_latent_Y = latent_Y / sum(latent_Y);
   
   //GENERATE THE OBSERVATIONS
   for (i in 1:O) { 

      //PROBABILITY OF CHOOSING A DATAPOINT IS PROPORTIONAL TO ITS VALUE
      int observation = categorical_rng(normalized_latent_Y);
      
      //DRAW THE OBSERVED VALUE
      observed_Y[i] = latent_Y[observation];
      
   }
   
}
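
One thing worth noting about this generative process: as N grows, choosing a datapoint with probability proportional to its value becomes equivalent to drawing from the size-biased density

\pi'(x) \propto x \, \pi(x)

where \pi is the latent exponential density.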

Running the code in R

stan_data <- list(N=10000, O=1000, latent_mu=1.0)

fit <- stan(file = 'exposure_data_generation.stan', data = stan_data,
               iter = 1, chains = 1, algorithm = "Fixed_param")  

Visualizing the Stan simulated results

[Plot: the Stan-simulated latent and observed densities]

Yes, that does clarify things. So basically you have O known Y values (are they unique in practice?), N - O unknown Y values, and some selection mechanism by which O are revealed and N - O are not. For fixed N and O, I think you could do something like

data {
  int<lower=1> O;
  int<lower=O> N;
  vector<lower=0>[O] y_obs;
}
parameters {
  real<lower=0> mu;
  vector<lower=0>[N - O] y_miss;
}
model {
  vector[N] y = append_row(y_obs, y_miss);
  target += exponential_lpdf(y | inv(mu));
  
  // prior on mu

  // clogit for the selection mechanism
}

There are some functions for the clogit log-likelihood at [link], but it is going to be slow at best with such a large O and N.

Another way to incorporate continuous selection functions is to explicitly model the distribution induced by the selection. If \pi(x \mid \alpha) is the density of the original observation distribution and P(x, \beta) is the probability of being selected, then the density for the observations after selection is

\pi'(x) = \frac{ P(x, \beta) \, \pi(x \mid \alpha) }{ \int \mathrm{d} x' \, P(x', \beta) \, \pi(x' \mid \alpha) }

The computational challenge is computing the normalization constant when \alpha or \beta are not known.

Note that if P(x, \beta) is a step function at x_{0} then this reduces to the usual censored model,

\pi'(x) = \frac{ \pi(x \mid \alpha) }{ \int_{x_{0}}^{\infty} \mathrm{d} x' \, \pi(x' \mid \alpha) }.

Thanks Mike and Ben. For my problem the value of N is essentially infinite, so Mike’s approach with the normalizing constant is the right direction; sorry for not clarifying that earlier.

I figured out that there’s an analytical solution for the normalizing constant when the original observation distribution is exponential: the selection-adjusted density is a gamma distribution with shape parameter 2.
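
Writing out the algebra in Mike’s notation, with \pi(x \mid \lambda) = \lambda e^{-\lambda x} and P(x) \propto x, the normalizing constant is \int_{0}^{\infty} \mathrm{d} x \, x \, \lambda e^{-\lambda x} = 1 / \lambda, so

\pi'(x) = \lambda^2 x \, e^{-\lambda x},

which is exactly the gamma density with shape 2 and rate \lambda. So my whole Stan model becomes: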

model {
   
   //PRIOR
   latent_mu ~ exponential(1);
   
   //SELECTION-ADJUSTED LIKELIHOOD: GAMMA WITH SHAPE 2 AND RATE latent_mu
   target += gamma_lpdf(y_obs | 2, latent_mu);
}

and it fits the latent_mu perfectly.
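
A quick end-to-end check (a sketch; it assumes the model block above is combined with the data and parameters blocks from my first post and saved as selection_model.stan):

library(rstan)

set.seed(1)
# simulate latent draws with value-proportional selection (true rate = 1)
latent <- rexp(1e5, rate = 1)
keep <- runif(length(latent)) < latent / max(latent)
y_obs <- latent[keep]

# fit the gamma(2, rate) model to the biased observations
fit <- stan(file = "selection_model.stan",
            data = list(N = length(y_obs), y_obs = y_obs))
print(fit, pars = "latent_mu")  # posterior should concentrate near 1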

Let me know if I’m misunderstanding something here. Also, if you have any tips on doing this when the observation distribution is lognormal, that’s what I’m working on next.