Hi everyone,
I am working on a model with continuous parameters that creates a discrete “intermediate variable”, which causes some problems. I asked about this issue a while ago here: Slow sampling for phenological model, but since my questions are now more specific, I thought it better to open a new thread with better tags and a clearer problem description. Please let me know if that is all right or if I should have added my new questions to the old thread.
Here is a brief summary of the model:
The model predicts the day of year (an integer) on which a plant reaches a new stage in its development (such as getting leaves). The plant accumulates daily “development units” until a threshold is reached; this threshold is the parameter of interest. For example, if the cumulative development units first reach the threshold on day 122, the predicted event day is 122. The data are daily cumulative development units and the observed days of year on which a new stage in plant development is reached. There are several stations and years. In the end this will be a hierarchical model, but for now full pooling is also fine.
Here is a bad, small, unmarginalized example:
data {
  int<lower=1> J;                // Number of stations
  int<lower=1> K[J];             // Number of years per station
  int time_span;                 // Time span in the data (last year of any station - first year of any station)
  int<lower=0> N[J, time_span];  // Number of days observed per year per station
  real DU[366, J, time_span];    // Daily cumulative development units for each station-year
  int event_day[J, time_span];   // Observed event day for each station-year
}
parameters {
  real DU_threshold;
  real<lower=0> sigma;
}
model {
  DU_threshold ~ normal(1300, 40);
  sigma ~ lognormal(0, 1);
  for (j in 1:J) {
    for (k in 1:time_span) {
      if (N[j, k] > 0) {
        // predicted event day = first day on which the cumulative DU reaches the threshold
        real predicted_event_day = 0;
        for (i in 1:N[j, k]) {
          if (DU[i, j, k] >= DU_threshold && predicted_event_day == 0) {
            predicted_event_day = i;
          }
        }
        // -999 marks a missing observation for that station-year
        if (predicted_event_day > 0 && event_day[j, k] != -999) {
          event_day[j, k] ~ normal(predicted_event_day, sigma);
        }
      }
    }
  }
}
In my previous post, Aki pointed out that a way forward would be either to make the predicted event day continuous (which I did; it works and makes perfect sense to me) or to marginalize it out. The marginalization is what I want to try now. In all the examples I could find, it is a discrete parameter that gets marginalized out; my parameter DU_threshold, however, is continuous, but together with the development-unit data it creates the discrete predicted event day along the way.
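For reference, the continuous version I mean replaces the inner loop of the model above with linear interpolation between the last day below the threshold and the first day at or above it. This is only a simplified sketch of the idea, not my exact code:

// Continuous predicted event day via linear interpolation (sketch).
// Assumes the cumulative DU is non-decreasing and actually crosses the
// threshold within the observed days of this station-year.
real predicted_event_day = 0;
for (i in 2:N[j, k]) {
  if (predicted_event_day == 0
      && DU[i - 1, j, k] < DU_threshold
      && DU[i, j, k] >= DU_threshold) {
    // fraction of day i needed to reach the threshold
    predicted_event_day = (i - 1)
      + (DU_threshold - DU[i - 1, j, k]) / (DU[i, j, k] - DU[i - 1, j, k]);
  }
}
if (predicted_event_day > 0 && event_day[j, k] != -999) {
  event_day[j, k] ~ normal(predicted_event_day, sigma);
}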
For the marginalization, if I understand correctly, the idea is not to estimate one particular predicted event day but rather to calculate and sum up the probabilities of all possible values it can take (from 1 to 365), in light of the data and parameters. So I want to construct a log_likelihoods vector of length 365 and eventually use target += log_sum_exp(log_likelihoods). What I struggle with is how to incorporate the development-unit data and the DU_threshold parameter into this.
Here is a try:
data {
  int<lower=1> J;                // Number of stations
  int<lower=1> K[J];             // Number of years per station
  int time_span;                 // Time span in the data (last year of any station - first year of any station)
  int<lower=0> N[J, time_span];  // Number of days observed per year per station
  real DU[366, J, time_span];    // Daily cumulative development units for each station-year
  int event_day[J, time_span];   // Observed event day for each station-year
}
parameters {
  real<lower=100> DU_threshold;
  real<lower=0.5> sigma;
  real<lower=0.1> sigma2;
}
transformed parameters {
  vector[365] log_likelihoods = rep_vector(0, 365);
  for (j in 1:J) {
    for (k in 1:time_span) {
      // skip station-years with no data or a missing observation (coded -999)
      if (N[j, k] > 0 && event_day[j, k] != -999) {
        for (i in 1:365) {
          log_likelihoods[i] += normal_lpdf(event_day[j, k] | i, sigma)
                                + normal_lpdf(DU[i, j, k] | DU_threshold, sigma2);
        }
      }
    }
  }
}
model {
  DU_threshold ~ normal(1300, 20);
  sigma ~ normal(15, 5);
  sigma2 ~ normal(0, 5);
  target += log_sum_exp(log_likelihoods);
}
What I don’t like is the normal_lpdf(DU[i, j, k] | DU_threshold, sigma2) part. I somehow need to get the DU data and the DU_threshold parameter into log_likelihoods, but this approach does not seem right, and it also adds the sigma2 parameter, which I would prefer to avoid if possible. I would be grateful for any comments and advice. I guess this is a rather simple issue, but my mathematical training is unfortunately lacking.
A follow-up question: once the calculations are set up for all 365 possible values of predicted_event_day, does that open up the possibility of within-chain parallelization, since these 365 terms are independent? Is this a case where reduce_sum() can be used to spread the calculations across several cores?
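To make this part of the question concrete, I could also imagine slicing over stations rather than over the 365 days, roughly like the pattern below. This is only a rough sketch: station_index (an int array filled with 1..J in transformed data) is hypothetical, and the per-day term is just a placeholder for whatever the correct marginalization turns out to be.

functions {
  // Hypothetical partial_sum over stations; each station-year contributes
  // its own log_sum_exp over the 365 candidate event days.
  real partial_sum(int[] j_slice, int start, int end,
                   real[,,] DU, int[,] event_day, int[,] N,
                   int time_span, real DU_threshold, real sigma) {
    real lp = 0;
    for (n in 1:(end - start + 1)) {
      int j = j_slice[n];
      for (k in 1:time_span) {
        if (N[j, k] > 0 && event_day[j, k] != -999) {
          vector[365] lps;
          for (i in 1:365) {
            // placeholder per-day term; this is where the DU data and the
            // DU_threshold parameter would have to enter
            lps[i] = normal_lpdf(event_day[j, k] | i, sigma);
          }
          lp += log_sum_exp(lps);
        }
      }
    }
    return lp;
  }
}

and then in the model block:

model {
  // grainsize of 1 lets Stan decide how to split the stations across threads
  target += reduce_sum(partial_sum, station_index, 1,
                       DU, event_day, N, time_span, DU_threshold, sigma);
}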
Thank you!