# Putting priors on transformed parameters

In a lot of situations it’s easier to put priors on parameter transformations than on the immediate model parameters. When one does so, one is always greeted with the Stan warning that one should include, “a target += statement with the log absolute determinant of the Jacobian of the transform.” Now I’ve read on other threads that this particular warning is sometimes warranted and sometimes not. My question is: when the warning is warranted, just HOW exactly does one augment the += statement? What does one write exactly? Here’s my example using logistic regression.

The following generates some dummy data in R where I’ve got one continuous (age) and one categorical (group) variable.

n <- 10; beta0 <- -1.6; beta1 <- 0.03; beta2 <- 1
age <- runif(n=n, min=18, max=60)
x2 <- runif(n=n, min=0, max=1)
p_true <- exp(beta0 + beta1*age + beta2*x2) / (1 + exp(beta0 + beta1*age + beta2*x2))  # inverse-logit; renamed from pi to avoid masking base R's pi constant
y <- rbinom(n=n, size=1, prob=p_true)
group <- as.integer(cut(x2, breaks=3))
mylist <- list(age=age,group=group,y=y)
mylist$N <- length(y)
mylist$N_group <- 3


And here’s the Stan code where I’m trying to model the Bernoulli outcome y. Note that I’m putting a somewhat loose prior on the immediate group parameter b but I’m putting rather strong priors on the three group proportions defined in the transformed parameters block. That is, I have a hard time thinking in terms of prior beliefs about parameters expressed inside a logistic transformation, let alone ones in concert with other variables inside a logistic model. It’s much easier to elicit priors regarding the summary proportions or shares. Bless Stan for the handy transformed parameters block that lets me do just that:

stan.code = '
data{
  int<lower=1> N;
  int<lower=1> N_group;
  int<lower=0, upper=1> y[N];
  int group[N];
  real<lower=0, upper=100> age[N];
}
parameters{
  real a;
  vector[N_group] b;
  real intercept;
}
transformed parameters {
  real<lower=0, upper=1> Group1Share;
  real<lower=0, upper=1> Group2Share;
  real<lower=0, upper=1> Group3Share;
  Group1Share = inv_logit(intercept + b[1]);
  Group2Share = inv_logit(intercept + b[2]);
  Group3Share = inv_logit(intercept + b[3]);
}
model{
  intercept ~ normal(0, 2);
  b ~ normal(0, 5);
  a ~ normal(0, 2);
  Group1Share ~ normal(.333, .1);
  Group2Share ~ normal(.333, .1);
  Group3Share ~ normal(.333, .1);
  for (i in 1:N) {
    y[i] ~ bernoulli_logit(intercept + a*age[i] + b[group[i]]);
  }
}
'


Now I can fit this model and examine the results with:

model = stan(model_code=stan.code, data=mylist, iter=2000, thin=10, chains=4)
traceplot(model)
pairs(model)
print(model, digits = 3)


Where Stan gives me the warning I’d expect:

Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.


Now in my case, I suspect the warning is warranted. (Am I right?) Here’s the result of a pairs plot:

It’s clear in the panels towards the top-left that the three immediate group parameters (b1, b2, b3) are almost perfectly inversely correlated with the intercept parameter. I may not be saying this quite right, but it appears that there’s some underdetermination going on. There are an infinite number of combinations wherein these 4 parameters can produce the same 3 prior share beliefs. I think this is what the Stan warning is trying to tell me. And according to this deeper warning explanation, I may not be sampling from the posterior density that I expect. By contrast, the pairs plot for the three b-parameters doesn’t look at all like the above if I drop the transformed parameters altogether.

But then again, the three resulting transformed parameter estimates that come from this model do exactly what I expect. They’re all around .33 and reflect my prior beliefs. All the other parameters make entirely intuitive sense. And the traceplot shows no cause for alarm. If it weren’t for the strong parameter correlations, I’d think that the model is just fine.

So, my questions are… First, is my understanding indeed correct? Is the Stan warning in this case valid, and am I not sampling the density I expect? Do I need to augment my target += statement in some way to properly sample from the posterior density reflecting my prior beliefs about the group proportions? Second, if so, what exactly should I append? For example, if I re-write my model block to use the following form (which right now produces the same results as above)…

vector[N] lps;
for (i in 1:N) {
  lps[i] = bernoulli_logit_lpmf(y[i] | intercept + a*age[i] + b[group[i]]);
}
target += sum(lps);


…Then what’s the syntax I should be adding to the target += statement above? Would I need to add the sum of the log derivatives of my transformed parameters?


The right-hand side of the target += must be the logarithm of the absolute value of the determinant of the Jacobian matrix of the mapping from the parameters to the transformed parameters. In your case, the Jacobian matrix is diagonal, so you can write it as the sum of the logarithms of the absolute values of the derivatives of the mapping from each parameter to the corresponding transformed parameter.
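In symbols, a sketch of the change-of-variables rule being described here, writing \eta_k = intercept + b_k and \theta_k = f(\eta_k) for the elementwise transform f:

```latex
\log p(\eta)
  = \sum_{k} \log p_{\theta}\!\left( f(\eta_k) \right)
  + \sum_{k} \log \left| f'(\eta_k) \right|
```

The second sum is the log absolute determinant of the diagonal Jacobian; for f = inv_logit, f'(x) = f(x)(1 - f(x)), which is the standard logistic density.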

However, you are making things more difficult for yourself by putting normal priors on the Group[1-3]Share transformed parameters, because the normal distribution puts mass outside the (0,1) interval that the Group[1-3]Share transformed parameters are constrained to lie in. You could truncate, but let’s assume instead that you put a beta prior on the Group[1-3]Share transformed parameters. Then, note that the derivative of the inv_logit transformation is the standard logistic PDF and its logarithm is implemented by the logistic_lpdf function in Stan. So, you would have

target += logistic_lpdf(intercept + b | 0, 1);


You should be able to work this out for yourself in R:

foo <- function(x, shape1, shape2) dbeta(plogis(x), shape1, shape2) * dlogis(x)
integrate(foo, lower = -Inf, upper = Inf, shape1 = sqrt(2), shape2 = pi) # 1.00000
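The same sanity check can be run without R. Below is a stdlib-only Python sketch (hand-written `dbeta` and `dlogis`, and a trapezoidal grid standing in for R's `integrate`), confirming that the beta density composed with inv_logit, times the Jacobian, integrates to 1 over the real line:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def dlogis(x):
    # standard logistic density: the derivative of inv_logit
    p = inv_logit(x)
    return p * (1.0 - p)

def dbeta(p, shape1, shape2):
    # beta density via the gamma function
    B = math.gamma(shape1) * math.gamma(shape2) / math.gamma(shape1 + shape2)
    return p ** (shape1 - 1) * (1.0 - p) ** (shape2 - 1) / B

def foo(x, shape1, shape2):
    # beta prior evaluated at inv_logit(x), times the Jacobian |d inv_logit / dx|
    return dbeta(inv_logit(x), shape1, shape2) * dlogis(x)

# trapezoidal integration over a wide grid; tails are negligible beyond +/- 30
lo, hi, n = -30.0, 30.0, 60_000
h = (hi - lo) / n
total = 0.5 * (foo(lo, math.sqrt(2), math.pi) + foo(hi, math.sqrt(2), math.pi))
total += sum(foo(lo + i * h, math.sqrt(2), math.pi) for i in range(1, n))
total *= h
print(total)  # close to 1, confirming it is a proper density on the unconstrained scale
```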


Thank you tons, Ben. Great suggestion on using beta priors instead of normal. I can easily replace the priors for the above transformed parameters as follows:

Group1Share ~ beta(50, 100);
Group2Share ~ beta(50, 100);
Group3Share ~ beta(50, 100);


As for the next part, I may muff this up but let me try to re-phrase what you wrote to make sure I understand. If I’ve got prior beliefs about transformations involving the direct model parameters (e.g., b1, b2, b3), then I need to map or re-express those beliefs in a new function involving only the direct model parameters. I need to take the logged derivative of this new function and add it to the target log density of my model. In my case, Stan’s logistic_lpdf (equivalent to dlogis(..., log = TRUE) in R) is the correct new function. You wrote foo and integrate() as tools to help with the mapping.

Assuming I’ve not totally misunderstood the above, I find that a logistic_lpdf with location and scale parameters -.69649 and .10, respectively, yields share estimates closely centered around .33, much like beta(50, 100), and ought to work. If so, does the following model statement now correctly sample what I’m intending?

model{
  vector[N] lps;
  intercept ~ normal(0, 2);
  b ~ normal(0, 5);
  a ~ normal(0, 2);
  Group1Share ~ beta(50, 100);
  Group2Share ~ beta(50, 100);
  Group3Share ~ beta(50, 100);
  for (i in 1:N) {
    lps[i] = bernoulli_logit_lpmf(y[i] | intercept + a*age[i] + b[group[i]]);
  }
  target += sum(lps) +
            logistic_lpdf(intercept + b | -.69649, .10);
}


The above at least runs well and generates intuitive parameters. In fact, it almost runs too well. The new parameters are all very similar to those from my original model at the top, and the lp__ value is only slightly higher (i.e., slightly less negative). It almost makes it seem like the correction doesn’t do very much. But then again, this is a toy example with just 10 observations and/or I may have misunderstood your explanation.

Thanks in advance for any more guidance.

If you are transforming intercept + b[i] with inv_logit, which is equivalent to logistic_cdf(intercept + b[i] | 0, 1), then your log-Jacobian is the sum of three terms, each the logarithm of the derivative of that transformation, which in Stan is logistic_lpdf(intercept + b | 0, 1).

Also, a beta prior with shapes 50 and 100 has a mean of 1/3 but way less dispersion than your original normal priors with mean 1/3 and standard deviation 1/10. A beta with shapes of 7 and 14 would be a lot closer.
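This moment comparison is easy to verify with the closed-form beta moments, mean a/(a+b) and sd sqrt(ab / ((a+b)^2 (a+b+1))). A quick stdlib sketch (`beta_moments` is just a helper name for illustration):

```python
import math

def beta_moments(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

print(beta_moments(50, 100))  # mean 1/3, sd about 0.038 -- much tighter than 0.1
print(beta_moments(7, 14))    # mean 1/3, sd about 0.100 -- matches normal(1/3, 0.1) spread
```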

Also, you can do the likelihood in one line

target += bernoulli_logit_lpmf(y | intercept + a * age + b[group]);


I’m sorry if I’m sounding dense here, Ben, but just to be sure…

1. When you wrote logistic_lpdf(intercept + b | 0, 1), that’s just an example, right? One needs to replace the location=0 and scale=1 parameters with values mapped from one’s prior transformed beliefs, yes? In my case I wound up with logistic_lpdf(intercept + b | -.69649, .10). I chose these values because my prior belief on inv_logit(intercept + b) is beta(50, 100), which has a mean and sd of .333 and .04. In addition, the values .333 and .04 are (very roughly) the mean and sd of the derivative of inv_logit with location and scale parameters -.69649 and .10. …Or have I got the logic behind this mapping idea and the Jacobian all goofed up?

2. In the model block one needs to include in the target accumulator both the likelihood AND the above lpdf, correct? So, in my case my modeling block needs to include not just what you last wrote above but should have the two lines:

target += bernoulli_logit_lpmf(y | (intercept + a*age + b[group]));
target += logistic_lpdf(intercept + b | -.69649, .10);
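As a numeric sanity check on the mapping in point 1 (a stdlib Python sketch; -.69649 is the value used in the posts above, and the exact logit of 1/3 is -log 2, about -0.6931):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

print(inv_logit(-0.69649))  # about 0.3326, i.e. close to the intended prior mean share of 1/3
print(logit(1.0 / 3.0))     # the exact location for a mean share of 1/3
```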

There’s a chapter in the manual with examples on how to deal with Jacobians for non-linear transforms and when you’ll need them.

Yes.

Yes, what you’re doing is reasonable for mapping. You don’t need a Jacobian if you generate an unconstrained parameter and then constrain it. It’s when you put a distribution on a transformed parameter that you’ll need the Jacobian.

You want to include both prior and likelihood if that’s what you’re asking. And any terms for Jacobian corrections (but it looks like you won’t need any going this way around).


This is a fascinating discussion. I have a need to specify normal priors with varying \sigma for a user-given number c (including zero) of linear combinations of parameters in the model. Perhaps this is best handled by having the user (in R) specify a c \times p matrix of contrast coefficients, c \geq 0, p= number of model regression coefficients and intercepts and to pass this matrix to Stan along with c and p. Has anyone written Stan code like this that I can benefit from? It’s over my head in Stan programming.


Something like this?

data {
int          p;      // number of coefficients
int          c;      // number of combinations
matrix[c, p] C;      // combinations
vector[c]    mus;    // prior means
vector[c]    sigmas; // prior std. devs
}

parameters {
vector[p]    beta;
}

model {
C*beta ~ normal(mus, sigmas);
}


This puts an independent normal prior on each component of C\beta, which are the specified contrasts. The operation on the lhs of \sim is linear in each \beta_i so no Jacobian adjustment is necessary.
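A sketch of why no adjustment is needed: every partial derivative of the map is a constant, so any log-absolute-determinant term built from them is constant in \beta and merely shifts target by a constant, which sampling ignores:

```latex
\theta = C\beta
\quad\Longrightarrow\quad
\frac{\partial \theta_i}{\partial \beta_j} = C_{ij}
\quad \text{(free of } \beta \text{)}
```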


Wow is it really that easy? Thanks!

The one trick I’ve had to do is to keep a subset of the parameters outside of the QR decomposition of the design matrix, so that priors can be placed on those individual parameters rather than on the linear combinations of them that the QR rotation induces through the collinearities. I assume there is no easy way around that.

When you specify priors on contrasts of parameters, I assume these priors are “added” to the priors specified for the individual parameters, and that the priors for the two should not be in some kind of major conflict. Any more insight about that would be welcomed.

I am not sure what you mean regarding the QR decomposition. Can you describe your issue and how you want to use Stan in more detail?

Since the priors on contrasts involve multiple parameters simultaneously, it’s best to think in terms of the joint prior on the whole parameter vector. This prior is the product of the densities that you specify with all your prior statements. So it would indeed be an issue if that was zero everywhere, or otherwise degenerate. I would also be careful that the priors you put on the parameters individually and the priors on contrasts represent separate domain knowledge. Otherwise the combined prior may be too strong.


Thanks for the nice words and for this great help. The description about the multiple priors makes sense. Regarding QR, I use a standard Stan approach for accelerating sampling by getting rid of collinearities up front. The user specifies priors on the original parameters, not those implied by QR. I do the QR in R (and pass the result to Stan) so that I can keep some of the predictors outside of the QR process; this partial QR lets priors be specified for a subset of the parameters on the original scale. I reverse the QR in R after the posterior draws are created. See the selectedQr function in https://github.com/harrelfe/rmsb/blob/master/R/stanMisc.r .

This makes sense. In this situation I see no issue, because the held-out parameters do not enter the contrasts. You can specify any prior you want on them separately and there can be no conflict.

In this case I would hold out parameters because they do enter contrasts. I want priors to be for contrasts on the original pre-QR parameter space.

Marek, my current model block looks like

model {
target += log_lik;
target += normal_ldf(beta | 0, sds);
...
}


How would I bring in the prior for C*beta in this form? Is it as easy as target += normal_ldf(C * beta | mus, sds) ?

Yes, only it should be normal_lpdf, not normal_ldf. You can mix the ~ and the target += syntax by the way, so you can have

model {
target += log_lik;
beta ~ normal(0, sds);
C*beta ~ normal(mus, sds);
}


if you prefer.


Thanks a bunch Marek.

I have implemented priors on contrasts in the upcoming new release of the rmsb package, in a way that makes it easy for the user to specify the contrasts. It’s written up here: Regression Modeling Strategies, 2. General Aspects of Fitting Regression Models
