Using loo-ic to justify a power prior a0 under a hierarchical normal model?

Hi,

I have searched the internet high and low for insight and found some examples, but my experience here is limited and I may be misinterpreting them. I greatly value and respect the expertise on this forum. Any thoughts on this use of loo-ic and on the way log_lik is generated would be greatly appreciated. I have some QUESTIONS below.

Background
Drug product stability with potency measurements over time: 3 batches of new data and 13 batches of historical data, the latter used to specify a power prior with coefficient a0. The stability profile for each batch is modeled as a straight line over time with a batch-specific intercept and slope. Batch slopes and intercepts are assumed independently normally distributed. We refer to this as a "random slope and intercept model". It is not much different from the simpler models often used with the radon example; what is different here is the use of a power prior.

Description of the Stan Model (given below):

  1. Hopefully the shape of the data is clear. 96 observations (y values, batch identifier, storage time t). The first 12 observations are from the 3 “new” batches. The next 84 observations are from 13 “historic/old” batches (the source of the power prior).
  2. At the population level there are 4 parameters: beta0 (intercept) and beta1 (slope) with respective SDs sigma0 and sigma1. At the batch level there are latent intercepts (b0) and slopes (b1) for each of the 16 batches. Measurements (y) are obtained with a residual SD of sigma. All the distributions are normal.
  3. Some very diffuse priors are assumed for the 5 parameters. These really have little information content. The bulk of prior information is meant to come from the “historical/old” batches expressed as a power prior likelihood discounted by a0 exponent (0 to 1).
  4. The goal is to determine/justify a value for a0 based on minimizing loo-ic.
  5. The likelihoods for the “new” data and “old” data are identical except that the “old” data likelihood is raised to the power a0 (equivalently, its log likelihood is multiplied by a0). Note that the “old” likelihood is really the power prior. QUESTION: Is this the proper way to express the likelihood?
  6. In order to obtain loo-ic (using the loo package) the generated quantities block is used. Notice that log_lik only makes use of the likelihood of the “new” data values. It seems that the “old” data should not be included here since the “old” data is really providing prior information. QUESTION: Is this a correct view?
  7. I have no experience with estimating the log_likelihood this way. QUESTION: Is the expression in the Stan model a correct way to get at the predictive reliability of the fitted model?
  8. I have verified that the model fits correctly when a0=1. What I find with this data set is that loo-ic decreases smoothly as a0 goes from 0 to 1. There may be a minimum slightly below 1. QUESTION: Am I correct that the model (i.e., choice of a0) giving the minimum loo-ic is “optimal” in a predictive sense?

Again, any criticisms, corrections, thoughts would be very much appreciated!!

Thank you,

Dave

data { 
  int <lower=1> n; // total number of batches (new+old = 16) 
  int <lower=1> N; // total number of y measurements (new+old = 96)
  int <lower=1> M; // number of new y measurements (=12 y observations)
  int <lower=1> m; // last new data batch (3 new batches listed first)
  int <lower=1,upper=n> batch[N]; // batch index (1...16)
  vector[N] t; // sampling times 
  vector[N] y; // measurements 
  real<lower=0,upper=1> a0; // power prior coefficient 
} 

parameters { 
  real beta0; // population mean intercept
  real beta1; // population mean slope
  vector[n] b0; // batch mean intercepts
  vector[n] b1; // batch mean slopes
  real<lower=0> sigma0; // inter-batch intercept SD
  real<lower=0> sigma1; // inter-batch slope SD
  real<lower=0> sigma; // measurement SD 
} 

model{ 
  vector[N] mu; // temporary mean 

  // Priors 
  
  beta0 ~ normal(2, 100); 
  beta1 ~ normal(0, 10); 
  sigma0 ~ student_t(2, 1, 5);
  sigma1 ~ student_t(2, 1, 5);
  sigma ~ student_t(2, 1, 5); 

  // Likelihood for new data
  for(i in 1:m){
    target += normal_lpdf(b0[i] | beta0,sigma0);
    target += normal_lpdf(b1[i] | beta1,sigma1);
  }
  for(k in 1:M){
    mu[k] = b0[batch[k]] + b1[batch[k]]*t[k];
    target += normal_lpdf(y[k] | mu[k], sigma );
  }

  // Power prior based on historical data
  // discounted based on a0
  for(i in (m+1):n){
    target += a0*normal_lpdf(b0[i] | beta0,sigma0);
    target += a0*normal_lpdf(b1[i] | beta1,sigma1);
  }
  for(k in (M+1):N){
    mu[k] = b0[batch[k]] + b1[batch[k]]*t[k];
    target += a0*normal_lpdf(y[k] | mu[k], sigma );
  }
}  
  
generated quantities {
  vector[M] log_lik; // calculate log-likelihood (new data values only)
  real mu_hat;

  for(k in 1:M){
    mu_hat = b0[batch[k]] + b1[batch[k]]*t[k]; // generate mu predicted value
    log_lik[k] = normal_lpdf(y[k] | mu_hat, sigma );
  }

} 

This is one for @avehtari, but I believe he’s on vacation or otherwise engaged at the moment.

It can help a lot here to model the batch intercepts and slopes as bivariate normal. If your covariates are all positive, for example, then you know there will be correlation between the slopes and intercepts. Everything can be random in Bayes, so we tend to just say “varying slopes and intercepts”.
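
For concreteness, here is a minimal sketch of that suggestion (my code, not Bob's), reusing the data block from the original model and dropping the a0 split for brevity; the names z, tau, and L_Omega are just placeholders. It puts an LKJ prior on the correlation between batch intercepts and slopes and uses a non-centered parameterization:

parameters {
  vector[2] beta;                   // population mean [intercept, slope]
  matrix[2, n] z;                   // standardized batch effects
  cholesky_factor_corr[2] L_Omega;  // Cholesky factor of intercept/slope correlation
  vector<lower=0>[2] tau;           // SDs of batch intercepts and slopes
  real<lower=0> sigma;              // measurement SD
}
transformed parameters {
  // column i holds [intercept, slope] for batch i
  matrix[2, n] b = rep_matrix(beta, n) + diag_pre_multiply(tau, L_Omega) * z;
}
model {
  beta ~ normal(0, 10);             // placeholder scale; set from the measurement units
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 1);
  for (k in 1:N)
    y[k] ~ normal(b[1, batch[k]] + b[2, batch[k]] * t[k], sigma);
  // with the power prior, split this loop at M and scale the old-data terms by a0 as before
}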

Very diffuse priors can be problematic and, if you don’t have much data, can come to dominate your inferences. I would recommend weakly informative priors that inform the scale of what you’re doing.
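
As an illustration only (my numbers, chosen just to show the idea; the actual scales have to come from the known potency units and time scale), weakly informative versions of the posted priors might look like:

  // hedged sketch: placeholders, not recommended values
  beta0 ~ normal(2, 5);          // was normal(2, 100)
  beta1 ~ normal(0, 1);          // was normal(0, 10)
  sigma0 ~ student_t(3, 0, 1);   // half-t centered at 0, was student_t(2, 1, 5)
  sigma1 ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 1);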

Why discount the old data? Is the idea that your model is misspecified and there’s some unmodeled lack of stationarity in the process? If so, you might want to consider moving to something like a time-series model.

This depends on what you’re trying to cross-validate. In an i.i.d. model, you can just leave one out and it makes sense. In something like what you’re talking about, it seems like there’s time-series information. In that case, it’s more standard to leave the future out rather than do random leave-one-out: if you have N observations, predict the first from the prior, the second from the first, and so on.
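
A sketch of how the leave-future-out idea could be wired into this model (my names, not part of the original post): pass the held-out future observations in as extra data, fit on everything up to a cut-off time, and evaluate their log predictive density in generated quantities. This assumes the future points come from batches already in the fit; for entirely new batches you would draw the batch effects from the population distribution instead.

// hypothetical additions to the data block:
//   int<lower=0> N_future;
//   int<lower=1, upper=n> batch_future[N_future];
//   vector[N_future] t_future;
//   vector[N_future] y_future;
generated quantities {
  vector[N_future] log_lik_future;
  for (k in 1:N_future)
    log_lik_future[k] = normal_lpdf(y_future[k]
                                    | b0[batch_future[k]] + b1[batch_future[k]] * t_future[k],
                                    sigma);
}

Refitting at each cut point gives exact leave-future-out; the Bürkner, Gabry, and Vehtari paper cited below describes a PSIS approximation that avoids most of the refits.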

I’m not sure what you mean here. The log likelihood is a function supplied by the user, not something that gets estimated.

I find it easier to work with maximizing the expected log predictive density (ELPD), which is -1/2 times the LOO information criterion (LOOIC). If a0 = 1 is the case where you don’t discount, then that result makes sense if the model’s well specified.

You have implicitly defined a centered parameterization. You will probably get better behavior with a non-centered parameterization, which you can achieve as follows:

vector<offset=beta0, multiplier=sigma0>[n] b0;
vector<offset=beta1, multiplier=sigma1>[n] b1;

You can vectorize the batch-level and observation-level loops to

target += normal_lpdf(b0 | beta0, sigma0);
target += normal_lpdf(b1 | beta1, sigma1);
target += normal_lpdf(y | b0[batch] + b1[batch] .* t, sigma);
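
The last line vectorizes over all N observations at once; to keep the new/old split and the a0 discount from the original model, the same idea works on slices (my addition, assuming the new observations occupy positions 1:M and the new batches positions 1:m, as in the posted data block):

// new data
target += normal_lpdf(b0[1:m] | beta0, sigma0);
target += normal_lpdf(b1[1:m] | beta1, sigma1);
target += normal_lpdf(y[1:M] | b0[batch[1:M]] + b1[batch[1:M]] .* t[1:M], sigma);

// historical data, discounted by a0
target += a0 * normal_lpdf(b0[(m + 1):n] | beta0, sigma0);
target += a0 * normal_lpdf(b1[(m + 1):n] | beta1, sigma1);
target += a0 * normal_lpdf(y[(M + 1):N]
                           | b0[batch[(M + 1):N]] + b1[batch[(M + 1):N]] .* t[(M + 1):N],
                           sigma);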

Oops, forgot to cite the leave-future-out paper:

Paul-Christian Bürkner, Jonah Gabry, and Aki Vehtari. 2020. Approximate leave-future-out cross-validation for Bayesian time series models. Journal of Statistical Computation and Simulation.

Thank you so much Bob for your helpful comments.

Re:“model them as bivariate normal”
Yes indeed. But our data are weak (often as few as 3 batches), and when we try the MVN model the MCMC has trouble. Still, this point is well taken.

Re: “weakly informative priors”
Good point. We will tighten our priors.

Re: “Why discount the old data?”
The old data come from different (but similar) products. Regulators (FDA) do not trust that the old data are 100% exchangeable with the new. We use scientific arguments to justify similarity, but the quantification is always subjective here, so we hedge by discounting. A time-series model is an interesting approach; I need to look into that.

Re: “This depends on what you’re trying to cross-validate”
This makes me think. Our process is longitudinal (independent measurements over time), not a time series. We want to use loo as a metric for how well we can predict the measurements on future batches of product A, given early measurements on a few batches of product A and much more prior information from many more batches of related products B, C, D, … I guess I need to know the proper way to construct log_lik so the resulting loo does what we want.

Please forgive the crude way I’m responding. Have not learned all the Forum tricks…

Re:“I’m not sure what you mean here [ how to estimate the log_lik].”
An attempt to clarify my ignorance: the likelihood in our model block includes data from our target batches (product A) as well as (discounted) data from prior batches (products B, C, D, …). But I felt the log_lik created in generated quantities should only include contributions from the product A data, since we want loo to measure how well we can predict future measurements for product A (not for B, C, D, …). Does that make sense?

Re: “I find it easier to work with maximizing expected log predictive density (ELPD)”
OK. We can switch to ELPD then. Thanks.

Re: “You will probably get better behavior with a non-centered parameterization”
Thank you for that and for the example code. I will try the non-centered parameterization. Currently we do have convergence issues (funnels), and maybe that will fix them. Very cool vectorization!! :-)

Re: “leave future out paper”
Thanks for the reference to this paper. I will read up on this and be better off!! Thanks again, Bob!!

Sorry for not seeing this sooner, but yes, you can treat the earlier data as part of the prior.

Thank you Bob. This forum has really been very helpful to me!!

Thanks, Dave