Hi All
I am a student in Simon Maskell’s (@s.maskell) research group and have been using Stan in a project with Evergreen Life. I am running into issues forming a model that is both able to recover parameters from simulated data and fit to real data without generating diagnostic errors.
To give some context: Evergreen Life created a health and wellness app and at the start of the pandemic, they used their app to ask users questions pertinent to the pandemic such as “are you experiencing symptoms of COVID today?”. Ideally, users of this app would respond to a questionnaire each day which would give us a complete timeline of when a user exhibited symptoms. Unfortunately, user responses are sporadic which means we instead consider bounds on when we think a user exhibited symptoms. For example, if a users response timeline was AXXXSSSSXXA - where A = asymptomatic, X = no response, S = symptomatic - the lower bound is the number of days between the first and last symptomatic response and the upper bound is the number of days between the first and last asymptomatic response. We then use the following Stan model to infer the parameters of the population from the extracted bounds.
data {
int<lower=0> N;
vector<lower=0>[N] lower_bound;
vector<lower=0>[N] upper_bound;
}
parameters {
real<lower=0> alpha;
real<lower=0> beta;
}
model {
alpha ~ exponential(1);
beta ~ exponential(1);
for (n in 1:N) {
target += log_diff_exp(gamma_lcdf(upper_bound[n] | alpha, beta), gamma_lcdf(lower_bound[n] | alpha, beta));
}
}
generated quantities {
real gamma_mu = alpha/beta;
real gamma_sigma = sqrt(alpha/(beta*beta));
}
When fitting to simulated data (data generation process described below), the model fails to recover parameters and yields diagnostic errors. When fitting to real data, the model yields diagnostic errors and sometimes fails to initialise. My suspicion is that the issues manifest from multimodality within the extracted bounds (users tend to fall into one of four categories: zero lower bound, infinite upper bound; zero lower bound, finite upper bound; non-zero lower bound, infinite upper bound; non-zero lower bound, finite upper bound. There is also a mix of individuals with “ordinary” and “long” COVID, and those that don’t have COVID at all). I also noticed that, when fitting to real data and analysing the trace plots, chains seem to get stuck in two distinct regions. Based on this, I tried to incorporate outlier detection into the model and tried to fit a mixture model but both of these models (listed below) fail to initialise.
// Outlier detection
data {
int<lower=0> N;
vector<lower=0>[N] lower_bound;
vector<lower=0>[N] upper_bound;
}
parameters {
real<lower=0> alpha;
real<lower=0> beta;
real<lower=0> lambda1;
real<lower=0> lambda2;
real<lower=0, upper=1> rho;
}
model {
alpha ~ exponential(1);
beta ~ exponential(1);
for (n in 1:N) {
real p1 = log_diff_exp(gamma_lcdf(upper_bound[n] | alpha, beta), gamma_lcdf(lower_bound[n] | alpha, beta));
real p2 = exponential_lpdf(upper_bound[n] | lambda1) + exponential_lpdf(lower_bound[n] | lambda2);
target += log_sum_exp(p1 + log(rho), p2 + log(1 - rho));
}
}
// Mixture model
data {
int<lower=0> N;
vector<lower=0>[N] lower_bound;
vector<lower=0>[N] upper_bound;
}
parameters {
real<lower=0> alpha[2];
real<lower=0> beta[2];
real<lower=0, upper=1> rho;
}
model {
alpha ~ exponential(1);
beta ~ exponential(1);
for (n in 1:N) {
target += log_mix(rho,
log_diff_exp(gamma_lcdf(upper_bound[n] | alpha[1], beta[1]), gamma_lcdf(lower_bound[n] | alpha[1], beta[1])),
log_diff_exp(gamma_lcdf(upper_bound[n] | alpha[2], beta[2]), gamma_lcdf(lower_bound[n] | alpha[2], beta[2])));
}
}
I have attached a set of simulated bounds to the post. To simulate the bounds, we assume that user interactions with the app follow a Poission process where the time between each report is exponentially distributed. The reports of users who regularly interact with the app follow exp(\lambda_{1}) whereas those who interact sporadically follow exp(\lambda_{2}). It is also assumed that the time between the user first interacting with the app and subsequently reporting that they are symptomatic is exponentially distributed, exp(\lambda_{3}). Finally, we assume that the duration of COVID follows a Gamma distribution with shape 4 and rate . The simulated response timeline, time to first symptomatic report and duration of COVID can then be used to determine a pair of lower and upper bounds.
I’d be very grateful for any suggestions on what to do to get the model working.
Cheers
Matthew
simulated_bounds.csv (14.4 KB)