Marginalizing out a discrete intermediate variable and the possibility of in-chain parallelization

Hi everyone,
I am working on a model with continuous parameters that creates a discrete “intermediate variable”, which causes some problems. I asked about this issue a while ago here: Slow sampling for phenological model. Since my questions are now more specific, I thought it better to open a new thread with better tags and a clearer problem description. Please let me know if that’s right or if I should have added my new questions to the old thread.
Here is a brief summary of the model:
The model predicts the day of year (integer) on which a plant reaches a new stage in its development (like getting leaves). The plant accumulates daily “development units” until a threshold is reached. This threshold is the parameter of interest. The data are daily values of the development unit and observed days of year when a new stage in plant development is reached. There are several stations and years. In the end this will be a hierarchical model but for now, full pooling is also fine.

Here is a bad, small, unmarginalized example:

data {
  int<lower=1> J;                       // Number of stations
  int<lower=1> K[J];                    // Number of years per station
  int time_span;                        // Time span in the data (last year of any station - first year of any station)
  int<lower=0> N[J, time_span];         // Number of days observed per year per station
  real DU[366, J, time_span];           // Daily cumulative development units for each station-year
  int event_day[J, time_span];          // Observed event day for each station-year
}
parameters {
  real DU_threshold;                            
  real<lower=0> sigma;                  	
}
model { 
  DU_threshold ~ normal(1300, 40);        
  sigma ~ lognormal(0, 1);    
  for (j in 1:J) {
    for (k in 1:time_span) { 
      if (N[j, k] > 0) {
        real predicted_event_day = 0;
        for (i in 1:N[j,k]) {
          if (DU[i,j,k] >= DU_threshold && predicted_event_day == 0) {
            predicted_event_day = i;
          }
        }
        if (predicted_event_day > 0 && event_day[j, k] != -999) {
          event_day[j, k] ~ normal(predicted_event_day, sigma);
        }
      }
    }
  }
}

In my previous post, Aki pointed out that a way forward would be either to make the predicted event day continuous (which I did; it works and makes perfect sense to me) or to marginalize it out. The marginalization is what I want to try now. In all the examples I could find, it is a discrete parameter that is marginalized out. My parameter DU_threshold, however, is continuous; it only creates the discrete predicted event day along the way, in combination with the development-unit data.
If I understand correctly, the idea is not to estimate one particular predicted event day but rather to calculate and sum up the probabilities of all possible values it can take (from 1 to 365), in the light of the data and parameters. So I want to construct a log_likelihoods vector of length 365 and eventually use target += log_sum_exp(log_likelihoods). I struggle with incorporating the development-unit data and the DU_threshold parameter into this.
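In symbols, and writing \theta for DU_threshold, I think what I am after for each station-year (j, k) is something like

\log p(\text{event day}_{jk} \mid \theta, \sigma) = \log \sum_{i=1}^{365} p(\text{predicted day} = i \mid \mathrm{DU}_{jk}, \theta)\; p(\text{event day}_{jk} \mid \text{predicted day} = i, \sigma)

but I am not sure how the first factor inside the sum should look.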
Here is a try:

data {
  int<lower=1> J;                       // Number of stations
  int<lower=1> K[J];                    // Number of years per station
  int time_span;                        // time span in the data (last year of any station - first year of any station)
  int<lower=0> N[J, time_span];         // Number of days observed per year per station
  real DU[366, J, time_span];           // Daily cumulative development units for each station-year
  int event_day[J, time_span];          // Observed event day for each station-year
}

parameters {
  real<lower=100> DU_threshold;                                        
  real<lower=0.5> sigma;    
  real<lower=0.1> sigma2;    
}
transformed parameters {
  vector[365] log_likelihoods = rep_vector(0, 365);
  for (j in 1:J) {
    for (k in 1:time_span) {
      if (N[j, k] > 0) {
        for (i in 1:365) {
          log_likelihoods[i] += normal_lpdf(event_day[j, k] | i, sigma)
                                + normal_lpdf(DU[i,j,k]|DU_threshold, sigma2);
        }
      }
    }
  }
}
model { 
  DU_threshold ~ normal(1300, 20);        
  sigma ~ normal(15, 5); 
  sigma2 ~ normal(0, 5); 
  target += log_sum_exp(log_likelihoods);
}  

What I don’t like is the + normal_lpdf(DU[i,j,k]|DU_threshold, sigma2) part. I somehow need to get the DU data and the DU_threshold parameter into log_likelihoods, but this approach does not seem very principled, and it adds the sigma2 parameter. I would prefer not to add a new parameter if possible. I would be grateful for any comments and advice. I guess this is a rather simple issue, but my mathematical training is unfortunately lacking.
A follow-up question: once the calculations are set up for all 365 possible predicted event days, does that open up the possibility of in-chain parallelization, since these 365 terms are independent? Is this a case where reduce_sum() can be used to spread the calculations across several cores?
Thank you!


Yes, you can use reduce_sum() on a partial-sum version of the innermost operation of your iterative loop structure, and you can spread it across multiple CPUs, assuming all of the CPUs are on the same node/machine.
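A rough, untested sketch of such a partial-sum function, assuming the valid station-year index pairs are collected into a hypothetical M x 2 integer array jk in transformed data (note this does a separate log_sum_exp per station-year, rather than the single shared log_likelihoods vector in your transformed parameters block):

functions {
  // Partial sum over a slice of station-year index pairs.
  // Each pair contributes its own log_sum_exp over the 365 candidate days.
  real partial_log_lik(int[,] jk_slice, int start, int end,
                       int[,] event_day, real[,,] DU,
                       real DU_threshold, real sigma, real sigma2) {
    real lp = 0;
    for (n in 1:(end - start + 1)) {
      int j = jk_slice[n, 1];
      int k = jk_slice[n, 2];
      vector[365] log_lik_day;
      for (i in 1:365) {
        log_lik_day[i] = normal_lpdf(event_day[j, k] | i, sigma)
                         + normal_lpdf(DU[i, j, k] | DU_threshold, sigma2);
      }
      lp += log_sum_exp(log_lik_day);
    }
    return lp;
  }
}

and then in the model block:

target += reduce_sum(partial_log_lik, jk, grainsize,
                     event_day, DU, DU_threshold, sigma, sigma2);

where grainsize is an integer you can tune (e.g. pass it in as data).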


Thank you Corey for taking the time to answer. Good to hear that you think reduce_sum() is applicable here. I will rewrite my code to include it and see what kind of gains this will bring. Do you have an opinion on the + normal_lpdf(DU[i,j,k]|DU_threshold, sigma2) part that I am not so sure about?

Have you looked at the latent change point section of the manual? Latent Discrete Parameters

That seems to be the most relevant to this model.


Hi Conor,
thanks for your suggestion. I looked at the examples in the manual, and I agree the change point model seems most similar. The difference, in my opinion, is that in the manual’s example the parameter itself is discrete and needs to be marginalized. In my model, however, the parameter is continuous, but together with the data ‘DU’ it forms the discrete ‘predicted event day’. I guess my problem is that I struggle with the actual math of the marginalization. In the change point example we marginalize s via:
p(D \mid e, l) = \sum_{s=1}^{T} p(s, D \mid e, l)
= \sum_{s=1}^{T} p(s)\, p(D \mid s, e, l)
= \sum_{s=1}^{T} \mathrm{uniform}(s \mid 1, T) \prod_{t=1}^{T} \mathrm{Poisson}(D_t \mid t < s \text{ ? } e : l)
For the change point model as described, with the explanations from the manual, this makes sense to me. However, when I try to apply it to my case I frankly get confused. I would be very grateful for any starting point on how to do this with my two data sets and the intermediate ‘predicted event day’. I am sure I did not do it correctly in the code I provided. I think no new parameter should appear, and I would like to derive a solution on paper first and then translate it to code, rather than translating some gut feeling into code, as I did here.
Maybe @avehtari can shed some light on this, since you originally suggested the marginalization. Sorry for pointing the finger at you; I’m just a little lost right now. If you don’t have time for this I completely understand, and if anybody else has an idea, please share it.
Thanks a lot, I promise I will work on my math skills :)

Hi @buf. I took another look at your problem and, yes, I think the latent change point model was not quite correct. It might be useful to add more information about the generating process, as the way it’s currently described seems a little confusing to me. In other words, I’m not sure there is a clear data-generating process. Is this an established model? If so, can you link to any literature?

If we think generatively, and you wanted to marginalize both the predicted event day and threshold value, then you’d start with a model such as:

e_i \sim \mathrm{Normal}(\mu_i, \sigma)\\
\mu_i \sim \mathrm{Categorical}(d_i > \gamma \text{ ? } \omega_{1} : \omega_{2})\\
\gamma \sim \mathrm{Normal}(0, 1)\\
\omega_{1:2} \sim \mathrm{Dirichlet}(\alpha)

e_i is the event day for station/year i, \mu_i is the latent discrete day variable you’re predicting, and that is itself a changepoint model that depends on d, the observed development unit, being greater than some unknown threshold \gamma. The joint density would be written/factored as:

p(e, \mu, \gamma, \omega) = p(e \mid \mu)\, p(\mu \mid \gamma, \omega)\, p(\gamma)\, p(\omega)

If we wanted to marginalize \mu, we’d need:

p(e, \gamma, \omega) = p(e \mid \gamma, \omega)\, p(\gamma)\, p(\omega) = p(\gamma)\, p(\omega) \sum_{k=1}^{K=365} p(\mu = k \mid \gamma, \omega)\, p(e \mid \mu = k, \sigma)

and you could follow the latent changepoint model example.
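In Stan, the sum for a single observation might look roughly like the following sketch (hypothetical names: e[i] for the observed event day, d[i] for the scalar development-unit value, omega1/omega2 for the two 365-simplexes):

{
  // marginalize the latent day mu for observation i inside the model block
  vector[365] lp;
  vector[365] log_omega = d[i] > gamma ? log(omega1) : log(omega2);
  for (k in 1:365) {
    lp[k] = log_omega[k] + normal_lpdf(e[i] | k, sigma);
  }
  target += log_sum_exp(lp);
}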

I’m not sure you’d want to start marginalizing more than that. I believe Aki was referring to marginalizing out the latent day variable, not the continuous threshold variable. If you wanted to have a go at marginalizing the threshold variable, you’d have to use the CDF or complementary CDF (CCDF) functions, I believe. Something like:

p(e, \omega) = p(e \mid \omega)\, p(\omega) = p(\omega) \int_{d}^{U} p(\gamma) \sum_{k=1}^{K=365} p(\mu = k \mid \gamma, \omega)\, p(e \mid \mu = k, \sigma)\, \mathrm{d}\gamma

I wrote this quickly, apologies, but you could look at how, e.g., censored variables are marginalized out of likelihoods in the Stan manual for some potential directions. Basically, you want a term that integrates (i.e. sums) over the potential values of \gamma, which for any particular d is the integral of \gamma’s density from d up to its maximum possible value U, i.e. a complementary CDF.
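For instance, the right-censoring device in the manual adds the log of exactly that kind of tail probability to the target, which Stan provides directly as a CCDF; a one-line sketch, where mu_gamma and sigma_gamma are hypothetical hyperparameters for \gamma’s prior:

// log probability that gamma lies above d under a normal prior
target += normal_lccdf(d | mu_gamma, sigma_gamma);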


Hi @cgoold thank you very much for taking the time to have a closer look! Yes, I don’t want to marginalize the threshold value, only the discrete predicted event day. Having
y_i \sim \mathrm{Categorical}(d_i > \gamma \text{ ? } \omega_1 : \omega_2)
and then
\omega_{1:2} \sim \mathrm{Dirichlet}(\alpha)
is a cool idea. I will take the weekend to figure out if this is actually what I’m looking for.
Regarding literature, this is probably the closest to what I want to do: https://www.sciencedirect.com/science/article/pii/S1161030116301800. From Table 1 it is evident that the development unit thresholds are also parameters to be estimated in the model, along with some parameters that govern the constitution of the development units. I think everything is nicely explained in the paper, and the observed event days (y_{jl}) in that study also follow a normal distribution centered on the predicted event days (f_i(\theta_{ij})). But the code does not seem to be available.

@buf Thanks. I skimmed through that paper, but honestly found their presentation of the model a little opaque overall. But, yes, I think the problem here is ultimately that their function f transforms the development units into days to heading deterministically, using some threshold from the phenology models. This doesn’t really give us a clear generative model from which we can marginalize out the discrete heading variable, IMO, and it is what leads to the discontinuities that Aki was referring to. To do that, we’d need to specify the days-to-heading variable as a random variable, as I did with \mu above.

Stepping back, the way I think about this problem is that we have some random variable X, the event day, that we are trying to predict from a phenology model’s output. X is a discrete random variable because, at least for our data collection purposes, it takes as outcomes the integers 1, 2, ..., 365. Therefore, we should probably start by assuming some discrete model of the data, such as:

x_{i} \sim \mathrm{Categorical}(\pi_{i})

where x_i are realizations of X, and \pi_{i} is the vector of per-station category probabilities, with element \pi_{ik} for possible event day k. We now need a way of transforming our phenology model results into these probabilities. For instance, we might assume something akin to an ordinal probit regression, such as:

\pi_{ik} = \Phi\left(\frac{\theta_{k} - \mu_i}{\sigma}\right) - \Phi\left(\frac{\theta_{k - 1} - \mu_i}{\sigma}\right)

where \Phi is the normal CDF, \theta_k are ordered cutpoints separating adjacent event days, \mu_i is the mean predicted day of heading on the latent, normally distributed scale underlying the possible event days, and \sigma is the latent standard deviation. \mu_i would be the output of the phenology model.
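A minimal sketch of that day probability in Stan, under the extra assumption (mine, not implied above) that the cutpoints \theta_k sit at half-integer day boundaries:

functions {
  // P(day) = Phi((day + 0.5 - mu) / sigma) - Phi((day - 0.5 - mu) / sigma),
  // computed on the log scale
  real ordinal_day_lpmf(int day, real mu, real sigma) {
    return log_diff_exp(normal_lcdf(day + 0.5 | mu, sigma),
                        normal_lcdf(day - 0.5 | mu, sigma));
  }
}

In the model block this could then be used as event_day[i] ~ ordinal_day(mu[i], sigma), with mu[i] coming from the phenology model.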

The phenology model itself would then have its own generative process, call it \mu_i = f(d_{ij}, \gamma), where d is the development unit and \gamma is the threshold to be estimated. This is the part that may be inefficient when f is equivalent to something like \mu_i = d_{ij} > \gamma \text{ ? } j : 0, where j indicates the day of year. Contrast this with something like an ordinary differential equation model, where f would be the ODE system, or even a joint time-to-event model, where the relationship between a longitudinal marker (the development units) and an event (the day to heading) uses continuous relationships (i.e. association structures).