I’m trying to model something that seems like it should be simple, but I can’t figure out the right way to specify it.

Let’s say the true data is exponentially distributed. However, only some of the data is observed, and the probability of a point being observed is proportional to its value. The density looks like the green dots in the plot below:

I want to learn the underlying parameter of the purple exponential distribution.

I’ve been trying to convert this exposure into a probability with inv_logit, but I don’t think that preserves the proportionality of the selection.

Any ideas? I’ve been struggling to understand this problem for a ridiculously long time and I would be grateful for any help!

OK, here is the Stan code that simulates the data. Writing it out was a good exercise for clarifying the data-generating process, thanks for the suggestion! The part I’m still stuck on is the categorical_rng(normalized_latent_Y) call: in the real problem I don’t actually have these normalized_latent_Y values, so I don’t know how to properly normalize my data.

data {
  // number of latent data points to simulate
  int N;
  // number of observations
  int O;
  // the parameter I want to learn
  real latent_mu;
}
parameters {}
model {}
generated quantities {
  vector[N] latent_Y;
  vector[O] observed_Y;
  vector[N] normalized_latent_Y;
  // simulate the latent samples, drawn from an exponential distribution
  for (i in 1:N) {
    latent_Y[i] = exponential_rng(latent_mu);
  }
  // I don't know how to model this in Stan because I don't actually have these latent values
  normalized_latent_Y = latent_Y / sum(latent_Y);
  // generate the observations
  for (i in 1:O) {
    // the probability of choosing an observation is proportional to its value
    int observation = categorical_rng(normalized_latent_Y);
    // record the observed value
    observed_Y[i] = latent_Y[observation];
  }
}

Running the code in R:

library(rstan)

stan_data <- list(N = 10000, O = 1000, latent_mu = 1.0)
fit <- stan(file = "exposure_data_generation.stan", data = stan_data,
            iter = 1, chains = 1, algorithm = "Fixed_param")

Yes, that does clarify things. So basically you have O known Y values (are they unique in practice?), N - O unknown Y values, and some selection mechanism by which O are revealed and N - O are not. For fixed N and O, I think you could do something like

data {
  int<lower=1> O;
  int<lower=O> N;
  vector<lower=0>[O] y_obs;
}
parameters {
  real<lower=0> mu;
  vector<lower=0>[N - O] y_miss;
}
model {
  vector[N] y = append_row(y_obs, y_miss);
  target += exponential_lpdf(y | inv(mu));
  // prior on mu
  // clogit for the selection mechanism
}

There are some functions for the clogit log-likelihood at

but it is going to be slow at best with such a large O and N.

Another way to incorporate continuous selection functions is to explicitly model the distribution induced by the selection. If \pi(x \mid \alpha) is the density of the original observation distribution and P(x, \beta) is the probability of being selected, then the density for the observations after selection is

\pi(x \mid \alpha, \beta) = \frac{\pi(x \mid \alpha) \, P(x, \beta)}{\int \pi(x' \mid \alpha) \, P(x', \beta) \, \mathrm{d}x'}.

Thanks Mike and Ben. For my problem the value of N is basically infinity, or close to it. So Mike’s approach with the normalizing constant is the right direction; sorry for not clarifying that before.

I figured out that there’s an analytical solution for the normalizing constant when the original observations are exponentially distributed: the observed values follow a gamma distribution with shape parameter 2 (and the same rate as the exponential). So my whole Stan model becomes:
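For what it’s worth, here is a minimal sketch of what that model could look like (the prior and the name lambda are my own additions; lambda is the rate of the underlying exponential). The shape-2 result follows from the selection formula: with \pi(x \mid \lambda) = \lambda e^{-\lambda x} and selection probability proportional to x, the selected density is proportional to x \lambda e^{-\lambda x}, and dividing by the normalizing constant \int_0^\infty x \lambda e^{-\lambda x} \, \mathrm{d}x = 1/\lambda gives \lambda^2 x e^{-\lambda x}, which is exactly a gamma density with shape 2 and rate \lambda.

```stan
data {
  int<lower=1> O;
  vector<lower=0>[O] y_obs;  // observed, size-biased values
}
parameters {
  real<lower=0> lambda;  // rate of the underlying exponential
}
model {
  // weakly informative prior -- an assumption, adjust for your problem
  lambda ~ exponential(1);
  // size-biased exponential observations have a gamma(2, lambda) density
  y_obs ~ gamma(2, lambda);
}
```

This treats each observation as an independent draw from the size-biased density, which matches the N-effectively-infinite regime you describe.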

Let me know if I’m misunderstanding something here. Also, if you have any tips on doing this when the observation distribution is lognormal, that’s what I’m working on next.
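On the lognormal case: the same size-biasing trick appears to work analytically there too. This is my own working, so worth double-checking, but if \pi(x) is a lognormal(\mu, \sigma) density and the selection probability is proportional to x, the normalizing constant is the lognormal mean E[X] = e^{\mu + \sigma^2 / 2}, and completing the square in \ln x gives

```latex
\frac{x \, \pi(x)}{E[X]}
  = \frac{1}{x \sigma \sqrt{2\pi}}
    \exp\!\left( -\frac{\left( \ln x - (\mu + \sigma^2) \right)^2}{2 \sigma^2} \right),
```

which is again a lognormal density, with the location shifted from \mu to \mu + \sigma^2 and the same \sigma. So the observed values could be modeled directly as lognormal(\mu + \sigma^2, \sigma).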