I am modeling income data from the US Census CPS ASEC survey as a two-component exponential/log-normal mixture model, where the bulk of the data (including the right tail) is in the log-normal component. The data are top-coded at 150000 and come with non-integer survey weights. The Stan model I’m using is:
data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  real y_obs[N_obs];
  real<lower=max(y_obs)> U; // top-coding (censoring) threshold
  vector<lower=0>[N_obs] weights; // survey weights
}
parameters {
  real<lower=U> y_cens[N_cens]; // latent values for the censored observations
  real<lower=0,upper=1> lambda; // mixing proportion of the exponential component
  real<lower=0> mu; // location of lognormal
  real<lower=0> sigma; // scale of lognormal
  real<lower=0> alpha; // rate (inverse scale) of exponential
}
model {
  for (n in 1:N_obs) {
    // weighted log density of the two-component mixture
    target += weights[n] * log_sum_exp(log(lambda) + exponential_lpdf(y_obs[n] | alpha),
                                       log1m(lambda) + lognormal_lpdf(y_obs[n] | mu, sigma));
  }
  y_cens ~ lognormal(mu, sigma);
}
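For reference, here is a sketch of the log target that the model block accumulates, written out as a formula. The notation is only for illustration: $w_n$ stands for weights[n], $y_n$ for y_obs[n], and $y^{\text{cens}}_m$ for y_cens[m]. No priors are declared, so they are implicitly flat over the declared bounds.

$$
\log p = \text{const} + \sum_{n=1}^{N_{\text{obs}}} w_n \log\Big[\lambda\,\text{Exponential}(y_n \mid \alpha) + (1-\lambda)\,\text{LogNormal}(y_n \mid \mu, \sigma)\Big] + \sum_{m=1}^{N_{\text{cens}}} \log \text{LogNormal}(y^{\text{cens}}_m \mid \mu, \sigma), \qquad y^{\text{cens}}_m \geq U
$$

Each latent censored value could also be integrated out analytically, since

$$
\int_U^\infty \text{LogNormal}(y \mid \mu, \sigma)\,dy = 1 - \Phi\!\left(\frac{\log U - \mu}{\sigma}\right),
$$

so after marginalizing, the censored block would contribute $N_{\text{cens}} \log\big[1 - \Phi\big((\log U - \mu)/\sigma\big)\big]$ to the target. In that form $N_{\text{cens}}$ appears only as a multiplier, just as the survey weights do in the first sum.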
The survey weight for income $\geq$ 150000 is 465.6025, which I’m rounding up to an integer to use as N_cens, so the data list is:
data_list = list(y_obs = WSAL_VAL, N_obs = length(WSAL_VAL), U = 150000, N_cens = 466, weights = MARSUPWT)
The results look as expected, but I’m curious whether anyone can suggest a better way to handle survey weights with censored data, or whether this is a reasonable approach. Thanks.