Why not put a prior on the entire model?

I was reading the post Yes, you can include prior information on quantities of interest, not just on parameters in your model (Statistical Modeling, Causal Inference, and Social Science), and it got me thinking: why not just put a prior on the entire model formula for the mean? For example, say you have count data and a Poisson model, and there is absolutely no way the mean count can be greater than 20 (i.e. it’s physically not within the realm of the natural world to have lambda > 20). Say your model looks something like this:

for (t in 1:T) {
      target += poisson_log_lpmf(y[t] | mu[t] + beta * x[t]); 
    }

where mu is a vector of parameters. Normally you put a prior on mu and a prior on beta. That’s fine. But given the information above, why not put a prior on lambda itself, say something like this:

for (t in 1:T) {
      target += normal_lpdf( exp(mu[t] + beta * x[t]) | 6, 5) - 1 * normal_lccdf(0 | 6, 5);
    }

where the density is truncated at zero on the lower end and puts only about 1% of its mass above 17.5 or so.
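As a quick check: under normal(6, 5), P(\lambda > 17.5) = 1 - \Phi((17.5 - 6)/5) = 1 - \Phi(2.3) \approx 0.011, and renormalizing after truncating at zero only bumps this to about 0.012, so roughly 1% of the prior mass sits above 17.5.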

I don’t really remember seeing anyone do this sort of thing. Why not? Why not do it all the time? Why not always put a prior on the formula for the mean so that it stays within the realm of possibility? It’s much easier to think about than all the individual parameters that make up the model formula.

My motivation for the question is that I am fitting a structural time series model to some count data (a local level model). I want to do something similar to what is discussed in CausalImpact (google.github.io) and in Combining spline and state space model (Modeling, The Stan Forums, mc-stan.org), except that the key difference is that I actually have predictors of interest (components of the intervention) in my model. To estimate the effect of the intervention, I make predictions from the full model and predictions from the model without the intervention predictors, and take the difference. Something like this in the generated quantities block:

vector[T] preds;
vector[T] preds_cf;
vector[T] intervention_effect;
real sum_effect;
for (t in 1:T) {
  preds[t] = poisson_log_rng(mu[t] + beta * x[t]);  // predictions from full model
  preds_cf[t] = poisson_log_rng(mu[t]);  // counterfactual predictions (i.e. the model without beta * x)
  intervention_effect[t] = preds_cf[t] - preds[t];  // effect of the intervention is the difference
}
sum_effect = sum(intervention_effect);  // full effect of the intervention, summing over all time points

Of course, if the intervention period is very long, the counterfactual predictions blow up to values well outside the plausible real-world range. I placed a prior on these counterfactual predictions, and that really seemed to do the trick. Like this in the model block:

for (t in 1:T-T0) {
  target += normal_lpdf(exp(mu[t+T0]) | 6, 5) - 1 * normal_lccdf(0 | 6, 5);  // T0 is the last timepoint before the intervention; note I leave the effect of x out of the counterfactual prediction prior
}

But then I got to thinking… am I really influencing the results here by only placing the prior on the counterfactual lambda? Maybe I need to place a prior on the full-model lambda as well.

for (t in 1:T) {
  target += normal_lpdf(exp(mu[t] + beta * x[t]) | 6, 5) - 1 * normal_lccdf(0 | 6, 5);  // prior on lambda
}
for (t in 1:T-T0) {
  target += normal_lpdf(exp(mu[t+T0]) | 6, 5) - 1 * normal_lccdf(0 | 6, 5);  // prior on the counterfactual lambda (effect of x left out)
}

Has anyone done this before? Thoughts?
@andrewgelman @avehtari @betanalpha


Cool. I don’t have an answer to your question, just a note that you should cache that normal_lccdf(0 | 6, 5) outside the loop.


Thanks for the correction! Could you explain, or provide the corrected version of the code for the prior? I’m not quite sure exactly what you mean…

Something like replacing

for (t in 1:T) {
      target += normal_lpdf( exp(mu[t] + beta * x[t]) | 6, 5) - 1 * normal_lccdf(0 | 6, 5);  //  prior on lambda
}

with

real normal_lccdf_cache = normal_lccdf(0 | 6, 5); // you can calculate this in transformed data
vector[T] lambda = exp(mu + beta * x);
target += normal_lpdf(lambda | 6, 5) - T * normal_lccdf_cache;  // prior on lambda

Thanks! Is this fundamentally different, or does it simply make it more efficient?

It’s the same, just more efficient.


You effectively have two prior models for \lambda here – one implied by the prior on \mu and \beta, and one directly specified on \lambda. There are a lot of ways to think about it; some of them might include:

  • You could keep it as is, and have your joint model decompose as p(Y \mid \lambda) \, p(\lambda) \, p(\lambda \mid \mu, \beta) \, p(\mu, \beta), the danger being that you can accidentally set too restrictive a prior for \lambda.
  • You could merge the models in some way, possibly by pooling the two marginal distributions for \lambda, expanding the joint model to somehow include both sources of information in a consistent way (possibly by building a hierarchical model).
  • Your knowledge about \lambda in this simple observation model (Poisson) directly relates to prior information about the distribution of Y. You could instead start from this prior knowledge and “reverse engineer” a prior for (\mu, \beta) that is consistent with it. Kinda like one would do when doing prior predictive checks, but you can get more precise if you wish (either via history matching or, for some shameless self-promotion, an idea I’ve been working on lately to do it with Bayesian optimisation).

Some reasons I’ve seen people object to this kind of thing include:

  • The model is no longer generative, nor does it possess a DAG representation; the former means you typically have to sample the prior using MCMC when doing a prior predictive check.
  • There are a lot of priors for \mu and \beta that are consistent with your prior knowledge for \lambda – there are many mappings from high dimensional space to low dimensional space, and not all of them will be consistent with the covariates/observations you have.
  • I get a feeling that the risk of overfitting / overconfidence is increased here, but I find it difficult to articulate exactly why. I’m not sure it’s over and above the usual risks/benefits of informative priors – more thinking required.

Thanks for the response and links!

Why is there the danger of it being too restrictive? In the example above, the prior pretty much covers any plausible value for lambda. Does having two prior models somehow make it more restrictive?

I have always felt this would be difficult if there are many predictors in the model and little prior knowledge about which ones will have much influence. Thanks for the links. I’ll check them out.

This is a good point for the general question I had.

For what it’s worth, in my actual data, the prior on lambda actually improved the fit according to LOO (although not by much) and perhaps more importantly, gave predictions for the counterfactual (i.e. predictions from the model without the intervention predictors) that actually made sense and weren’t wildly too large.

My main goal was just that I needed some way to put a prior on the generated quantities for the counterfactual of the structural time series model, and the only way I could think to do it was as described above. But after doing so, I have doubts about how that influences the coefficient estimates… I am not sure whether what I describe is really that different from what Gelman is saying in the blog post I linked to. And it led to the larger question of the prior on lambda (or the formula for the mean in any model, in general).

@hhau’s answer is excellent. I’ll add that if we have a wide prior on the coefficients but a narrow prior on the mean of the outcome, it can be bad in the case of collinearity: one coefficient can have a big magnitude and be positive while the coefficient of another, correlated predictor has a big magnitude and is negative, so the prior on the outcome is not really regularizing. Compare this to R2D2 priors, which also set a prior on the outcome, but there the corresponding joint prior is derived for the coefficients (and the residual, in the case of a normal model), and that joint prior regularizes the sum of the magnitudes (instead of the sum of the coefficients).
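As a sketch of that cancellation (a made-up two-predictor example): if x_1 \approx x_2 and the linear predictor is \beta_1 x_1 + \beta_2 x_2, then (\beta_1, \beta_2) = (c, -c) gives \beta_1 x_1 + \beta_2 x_2 \approx 0 for any magnitude c, so a narrow prior on the outcome mean is satisfied even when both coefficients are huge; a joint prior that regularizes |\beta_1| + |\beta_2| directly rules this out.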


Thanks, that makes sense! Good point. In my own use case, I also put pretty narrow priors, normal(0, 0.5), on the coefficients, because I was already worried about collinearity due to the nature of these predictors.

I guess these responses give some good insight into the general question about a prior on the sum of coefficients (i.e. the model formula for the mean). But specific to the case I give, where I place a prior on the sum of coefficients for the counterfactual scenario, is this really any different from what is suggested in Gelman’s blog post?

Yes, having multiple prior models makes the prior more restrictive. As a limiting example, think of applying the same standard normal prior twice to some quantity of interest. Although “both priors” cover the full range of values typical of a standard normal, the actual realized prior will be proportional to the product of the two standard normals, which is narrower. (It turns out that the product of two Gaussian densities is itself proportional to another Gaussian density, and when the two Gaussians are standard normals, their product is proportional to a Gaussian with mean zero and variance one half.)
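To spell that out: e^{-x^2/2} \cdot e^{-x^2/2} = e^{-x^2} = e^{-x^2 / (2 \cdot \frac{1}{2})}, which is exactly the kernel of a Gaussian with mean zero and variance 1/2 (standard deviation 1/\sqrt{2}).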


Ahh yes of course! That makes sense.

That makes it more difficult to implement priors on combinations of parameters in practice, although the idea still makes a lot of sense to me, at least as described in the blog post and some of the comments.

@dlakelan not sure if you still interact with this forum, but given your comments on Gelman’s blog post that I linked, I’d like to hear your thoughts on putting a prior on the counterfactual scenario as defined in my code. Others have already given great comments on the general question of why not to put a prior on the entire mean formula. Given the blog post, I’m still not sure what to make of putting it on the counterfactual mean, but if I am interpreting the blog post and your comments correctly, what I am doing by putting a prior on the counterfactual predictions like

for (t in 1:T-T0) {
  target += normal_lpdf(exp(mu[t+T0]) | 6, 5) - 1 * normal_lccdf(0 | 6, 5);  // T0 is the last timepoint before the intervention; note I leave the effect of x out of the counterfactual prediction prior
}

in my example in the OP is in line with what the blog post and your comments are saying. Essentially I am trying to “mask out” regions of space that are inappropriate based on generated quantities.
Based on the responses here, I have a good idea why we don’t put a prior on the entire mean formula, but I don’t think anyone has addressed whether a prior on the counterfactual is a good idea or not. Thanks!

If helpful, this is the full Stan code for my model for my actual data:

functions {
   int neg_binomial_2_log_safe_rng(real eta, real phi) {
    real gamma_rate = gamma_rng(phi, phi / exp(eta));
    if (gamma_rate > exp(20.7)) gamma_rate = exp(20.7); // Jonah Gabry thinks this is the max value before overflow but hasn't double checked
    return poisson_rng(gamma_rate);
  }
}
data {
  int<lower=0> T; //total time of time series
  int<lower=0> T0; //last timepoint prior to intervention
  int y[T];
  vector[T] x1;
  vector[T] x2;
  vector[T] x3;
  vector[T] x4;
  int x5[T];
  vector[T] x6;
  vector[T] x7;
  int<lower=1> N_month;
  int month[T];
  real TR;
  real T0R;
}
parameters {
  vector[T] mu_err;
  real<lower=0> s_level;
  real<lower=0> shape;
  real beta_x1;
  real beta_x2;
  real beta_x3;
  real beta_x4;
  real beta_x5;
  real beta_x6;
  real beta_x7;
  real<lower=0> sd_mo;
  vector[N_month] z_1;
}
transformed parameters {
  vector[T] mu;
  vector[T] beta_int;
  vector[N_month] r_1;
  mu[1] = mu_err[1];
  for (t in 2:T) {
    mu[t] = mu[t-1] + s_level * mu_err[t]; // local level model
  }
  for (t in 1:T) {
    beta_int[t] = beta_x1 * x1[t] + beta_x2 * x2[t] + beta_x3 * x3[t] + beta_x4 * x4[t]; // intervention effect
  }
  r_1 = sd_mo * z_1;
}
model {

  for (t in 1:T) {
      target += neg_binomial_2_log_lpmf(y[t] | mu[t] + r_1[month[t]] + beta_x5 * x5[t] + beta_x6 * x6[t] + beta_x7 * x7[t] + beta_int[t], shape); 
    }

  for (t in 1:T-T0) {
      target += normal_lpdf(mu[t+T0] + r_1[month[t+T0]] + beta_x5 * x5[t+T0] + beta_x6 * x6[t+T0] + beta_x7 * x7[t+T0] | 1.8, 1);  // prior on counterfactual (mean formula without beta_int)
    }

  target += gamma_lpdf(shape | 0.01, 0.01);
  target += normal_lpdf(sd_mo | 0, 0.15) - 1 * normal_lccdf(0 | 0, 0.15);
  target += normal_lpdf(s_level | 0, 0.1);
  target += normal_lpdf(beta_x1 | 0, 0.5);
  target += normal_lpdf(beta_x2 | 0, 0.5);
  target += normal_lpdf(beta_x3 | 0, 0.5);
  target += normal_lpdf(beta_x4 | 0, 0.5);
  target += normal_lpdf(beta_x5 | 0, 0.5);
  target += normal_lpdf(beta_x6 | 0, 0.5);
  target += normal_lpdf(beta_x7 | 0, 0.5);
  target += std_normal_lpdf(mu_err);
  target += std_normal_lpdf(z_1);
}
generated quantities {
  vector[T] preds;
  vector[T] preds_cf;
  vector[T] mus;
  vector[T] mus_cf;
  vector[T] intervention_effect;
  vector[T] intervention_effect_n;
  real int_saved;
  real int_saved_n;
    for (t in 1:T) {
      preds[t] = neg_binomial_2_log_safe_rng( mu[t] + r_1[month[t]] + beta_x5 * x5[t] + beta_x6 * x6[t] + beta_x7 * x7[t] + beta_int[t], shape);
      preds_cf[t] = neg_binomial_2_log_safe_rng( mu[t] + r_1[month[t]] + beta_x5 * x5[t] + beta_x6 * x6[t] + beta_x7 * x7[t], shape);
      mus[t] = exp( mu[t] + r_1[month[t]] + beta_x5 * x5[t] + beta_x6 * x6[t] + beta_x7 * x7[t] + beta_int[t]);
      mus_cf[t] = exp( mu[t] + r_1[month[t]] + beta_x5 * x5[t] + beta_x6 * x6[t] + beta_x7 * x7[t]);
      intervention_effect[t] = preds_cf[t] - preds[t];
      intervention_effect_n[t] = mus_cf[t] - mus[t];
      }
    int_saved = sum(intervention_effect);
    int_saved_n = sum(intervention_effect_n);
}
"

It would be worth comparing (plotting) the prior and posterior for the counterfactual under the model with the prior directly on \lambda. Abstractly, I would worry that the prior entirely dictates the posterior for the counterfactual. But overall I don’t immediately see why it isn’t a reasonable idea – you probably know something about the size of the effect of interest, which seems equivalent to knowing something about the counterfactual given the observed outcome/series.


Well dang, I’m having trouble with NaNs when sampling from the priors for this model, so bayesplot’s mcmc_areas isn’t working… in the meantime, here is a snip of a summary of the prior samples for the counterfactual mean at the last 10 timepoints:
[image: summary table of prior draws]
And below is the posterior from the model fit to the data:
[image: summary table of posterior draws]
It doesn’t seem like the prior is dictating the posterior for the counterfactual… the prior values look long-tailed (note they’re exponentiated, so on the actual outcome scale). The 5th and 9th columns are the 2.5% and 97.5% quantiles.
Not sure how helpful this is without plots, though… and not sure why there are so many NaNs…

Well, I don’t know the size of the effect of interest (the intervention, which is what I want to measure) versus the contributions from the other predictors. What I do know is a plausible range of predictions for both the counterfactual and the real scenario, which is a mean between 0 and 20 counts (and that is super generous; more likely 0 to 10) in either case.

Hi @jd_c I do get notifications when mentioned, but I’m not here on a regular basis.

In general, by adding statements like the target += normal_lpdf(exp(...) | 6, 5) lines above, you’re multiplying some other prior by a “distortion factor” or “mask”, as you say. This can work fine, but it doesn’t always work fine. Sometimes it can induce issues in sampling when the resulting manifold is weird.

In the general case, imagine you have some parameters a, b, c, d, and some formula for predictions f(a,b,c,d,x), where x is some observed covariate.

You can write a prior like:

p(a) p(b) p(c) p(d) q(f(a,b,c,d,x1)|a,b,c,d) q(f(a,b,c,d,x2) | a,b,c,d) ...

where x1, x2 are some “hypothetical data” you might imagine, and q is a non-negative bounded function that is peaked around the region where you think f should be for those x1, x2, …, xn values.

The q functions don’t need to be interpretable as probabilities; just think of them as a “distortion” of the density. So long as the entire expression is normalizable, the whole thing is a valid prior. It’s a prior with a generalized dependency structure: the product of the q values is the dependency structure.
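Here’s a minimal Stan sketch of this construction (illustrative only: the linear f(a, b, x) = a + b * x, the mask location/scale, and the name x_hyp are placeholders, not anything from this thread):

data {
  int<lower=1> N_hyp;
  vector[N_hyp] x_hyp;  // hypothetical covariate values where you have outcome-space knowledge
}
parameters {
  real a;
  real b;
}
model {
  // wide base priors, mostly to keep the parameters finite
  target += normal_lpdf(a | 0, 10);
  target += normal_lpdf(b | 0, 10);
  // q "masks": unnormalized factors peaked where you think
  // f(a, b, x) = a + b * x ought to be at each hypothetical x
  for (n in 1:N_hyp)
    target += normal_lpdf(a + b * x_hyp[n] | 6, 5);
}

Since the q factors only need to be bounded and non-negative, their normalization doesn’t matter, and running this with no likelihood statement draws from the implied prior, which is a convenient way to see what the masks actually do.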


Thanks @dlakelan that makes a lot of sense.

Thanks for the response then!

In this case, when I add this prior to the model, it runs faster and has better diagnostics. I think the prior actually helps with sampling.

I’m going to make some plots comparing prior and posterior as suggested by @hhau, and then think on this for a while and look at some of the links posted above. This is a rather expensive observational study, and I want to make sure I’m not overlooking something. As far as I can tell from the comments here, the model fit, and the estimates, this kind of prior makes a lot of sense. Somewhat annoyingly, it does change the uncertainty interval on the effect of the intervention, so that it crosses a ‘null’ boundary, and since we live in a world ridiculously fixated on ‘significance’ and hard cutoffs, I don’t want to look like I’m doing something untoward. It’s irritating that the difference between (-10, 200) and (10, 220), for example, is somehow magic in people’s eyes, especially when the intervals are already wide! My reason for starting down this road is that I simply find it difficult to trust models that make stupid predictions, and the best way I could think of to fix that was this kind of prior.

We often have most of our information in “outcome space” and when that’s the case, it makes good sense to put that information into the model. This is one good way to do that.

I have used this kind of thing multiple times when the parameters are coefficients of a basis expansion of a function and the information is about what the function values or derivatives should look like: you put a wide independent normal prior on the coefficients just to keep them from going to infinity, and then “mask out” any regions of parameter space that produce functions that behave “wrong” in outcome space. A rough sketch is below.
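For example, a rough Stan sketch of that pattern (illustrative only; the names B, dB, and the mask scales are placeholders, with a fixed basis matrix and its derivative matrix supplied as data):

data {
  int<lower=1> N_grid;
  int<lower=1> K;
  matrix[N_grid, K] B;   // basis functions evaluated on a grid of x values
  matrix[N_grid, K] dB;  // derivatives of the basis functions on the same grid
}
parameters {
  vector[K] w;  // basis coefficients
}
model {
  vector[N_grid] f = B * w;        // function values on the grid
  vector[N_grid] fprime = dB * w;  // derivative values on the grid
  target += normal_lpdf(w | 0, 10);  // wide independent prior, just to keep w finite
  target += normal_lpdf(f | 0, 3);   // mask: function values should stay in a plausible range
  // soft "increasing" mask: log Phi(f'/0.5) is near zero when the
  // derivative is positive and strongly penalized when it is negative
  target += normal_lcdf(fprime | 0, 0.5);
}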

It’s really important to plot the prior’s implications before accepting them, though. It’s not always obvious what prior you’re getting.


@dlakelan Yes this makes a lot of sense to me. It’s much easier to think about the outcome space.

Will do! This was suggested in another comment, and point well taken.
Thanks!