ICDF and Jacobian

Dear Stan veterans,

I would like to understand Jacobians: when they are required, and when I should worry about them even though Stan may not complain. Here’s a super simple example of a prior defined via a built-in distribution and then again via its ICDF.

I read the manual and the user’s guide, and I am still a little unsure that I fully understand when Jacobians are not required. One thing I want to confirm with you is that the example below is not such a case.

library(rstan)
library(magrittr)

stanmodelcode1 <- "
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0.00001> Theta;
}
model {
  Theta ~ rayleigh(0.1);
  for (n in 1:N)
    y[n] ~ exponential(Theta);
}
generated quantities{
  real y_pred;
  y_pred = exponential_rng(Theta);
}
"

stanmodelcode2 <- "
functions{
 real rayleigh_icdf(real u, real s){
  return s*sqrt(-2*log1m(u));
 }
}
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> u;
}
transformed parameters{
  real QTheta = rayleigh_icdf(u, 0.1);
}
model {
  u ~ uniform(0,1);
  for (n in 1:N)
    y[n] ~ exponential(QTheta);
}
generated quantities{
  real y_pred;
  y_pred = exponential_rng(QTheta);
}
"
claims.obs <- c(100, 950, 450)
N <- length(claims.obs)
dat <- list(N = N, y = claims.obs); 

# fit the model with regular prior
stanDso1 <- stan_model(model_code = stanmodelcode1) 
fit1 <- sampling(stanDso1, data = dat, iter = 10000, warmup=2000) 
fit1
draws1 <- posterior::as_draws_array(fit1) %>% 
  posterior::subset_draws(variable="Theta")

# fit the model with ICDF prior
stanDso2 <- stan_model(model_code = stanmodelcode2)
fit2 <- sampling(stanDso2, data = dat, iter = 10000, warmup=2000) 
fit2
draws2 <- posterior::as_draws_array(fit2) %>% 
    posterior::subset_draws(variable="QTheta")

posterior::bind_draws(draws1, draws2) %>% 
 # bayesplot::mcmc_dens()
  posterior::as_draws_df() %>% 
  tidyr::pivot_longer(-c(.iteration, .chain, .draw)) %>% 
  ggplot()+
  geom_density(aes(x=value, color=name, fill=name), alpha=0.2)+
  theme_minimal()

[posterior density plot comparing Theta and QTheta]

UPDATE: ggplot shows the distributions not to be very different, so I take my words back.

Posterior distributions seem to be different, so my rule of thumb “if it can be listed in transformed parameters it requires the adjustment” seems to be acceptable?

Those posteriors look (at least potentially) extremely similar by eye. One just happened to sample a bit further out into the extremely long right-hand tail. As long as the implementation of the ICDF is correct, this should not require a Jacobian adjustment.

The rule of thumb is not whether or not a variable is in transformed parameters; it is whether a parameter defined in either transformed parameters or locally in the model block shows up on the left-hand side of a sampling statement (or equivalent target += statement) that is intended to have the effect of sampling the quantity on the left-hand side from the distribution on the right-hand side. Such a model requires a Jacobian adjustment when the transform is nonlinear (more precisely, such a model always requires a Jacobian adjustment, but if the transform is linear the Jacobian determinant is constant and can safely be ignored by Stan’s sampler).
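
As a minimal sketch of that pattern (a hypothetical toy model, not this thread’s example): a lognormal prior expressed by putting a sampling statement on a nonlinear transform of a declared parameter, together with the required log-absolute-derivative term.

parameters {
  real<lower=0> sigma;
}
transformed parameters {
  real log_sigma = log(sigma); // nonlinear transform of a parameter
}
model {
  // transformed parameter on the left-hand side of a sampling statement:
  log_sigma ~ normal(0, 1);
  // Jacobian adjustment: log |d(log_sigma)/d(sigma)| = -log(sigma)
  target += -log(sigma);
}

Without the target += -log(sigma) line, sigma would not end up with the intended lognormal(0, 1) prior.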

For more on Jacobian adjustments, some good resources beyond the manual and the user’s guide include this very gentle introduction:
https://jsocolar.github.io/jacobians/
and some of the links therein, especially including this overview of probability theory:
https://betanalpha.github.io/assets/case_studies/probability_theory.html

Thank you for your very thorough reply. I read your blogpost and it was one of the most enlightening explanations I could find. Sorry for pushing the argument. I logically understand I am wrong, but the intuition evades me. I guess I was still confused about the following:

In the first model, if we rewrite it in the target+= format, my Rayleigh prior makes a contribution to the likelihood:

model {
  target+= rayleigh_lpdf(Theta | 0.1);
  for (n in 1:N)
    target+=exponential_lpdf(y[n] | Theta);
}

In the second model, since the prior is now uniform, the contribution of the prior to the total likelihood is a constant 0, so the prior contributes nothing. Yes, the values of QTheta are different from the values of Theta, but that should not affect how the total likelihood is distributed between the data and the prior (or should it?).

model {
  target+= uniform_lpdf(u | 0,1);
  for (n in 1:N)
    target+=exponential_lpdf(y[n] | QTheta);
}

I am, at the end of the day, changing the left side of the ~: it is Theta in the first model and u in the second. I guess I expected something like the following to balance out the transformation:

model {
  target+= uniform_lpdf(u | 0,1); // equals 0
  target+= something_based_on_rayleigh_icdf();
  for (n in 1:N)
    target+=exponential_lpdf(y[n] | QTheta);
}

And if my ICDF transformation does not in fact change the “left side of the tilde”, then what you’re saying is that the only way I would require a Jacobian in this example is if my transformation affected the data, y[n]? [1]

I have made an example of that as well, and again the posteriors look very similar. Here’s my Rayleigh/Exponential model where the likelihood is DQF-defined (i.e. f(Q(u))-defined). The DQF is the density quantile function, the reciprocal of the first derivative of the ICDF [2].

stanmodelcode3 <- '
functions{
  real exponential_ldqf_lpdf(real p, real Theta){
   return log(Theta)+log1m(p);
  }
}
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0.00001> Theta;
}
transformed parameters{
  real u[N];
  for (n in 1:N){
   u[n] = exponential_cdf(y[n], Theta);
  }
}
model {
  target += rayleigh_lpdf(Theta | 0.1);
  for (n in 1:N)
    target += exponential_ldqf_lpdf( u[n] | Theta);
}
generated quantities{
  real y_pred;
  y_pred = exponential_rng(Theta);
}
'
init_f <- function() list(Theta = runif(1, 0.001, 0.005))

# fit the model with DQF likelihood
stanDso3 <- stan_model(model_code = stanmodelcode3)
fit3 <- sampling(stanDso3, data = dat, iter = 10000, warmup=2000, 
                 init=init_f, control=list(adapt_delta=0.9)) 
pairs(fit3)
d_idx <- rstan::get_divergent_iterations(fit3)
# share of divergent
mean(d_idx)
# [1] 0

# extract divergent iterations
fit3 %>% as.matrix() %>% .[d_idx,]

draws3 <- posterior::as_draws_array(fit3) %>% 
  posterior::subset_draws(variable="Theta") %>% 
  posterior::rename_variables("PTheta"="Theta")

#useless for comparing densities
#posterior::bind_draws(draws1, draws3) %>% 
#  bayesplot::mcmc_dens()

library(ggplot2)
posterior::bind_draws(draws1, draws3) %>% 
  posterior::as_draws_df() %>% 
  tidyr::pivot_longer(-c(.iteration, .chain, .draw)) %>% 
  ggplot()+
  geom_density(aes(x=value, color=name, fill=name), alpha=0.2)+
  theme_minimal()

I just want to be dead sure I am not missing any Jacobians here. Sorry, folks, for picking your brains like that, but isn’t that what the forums are for? ))

Footnotes:

[1] I know I could transform Theta as well, but let’s set it aside for now.
[2] Here’s a bit of math to support my quantile likelihood:
F^{-1}(u)=Q(u)=x, so I can substitute Q(u) for x in the likelihood, where x is my data
f(x)=f(Q(u))=\frac{dF(Q(u))}{dQ(u)} = \frac{dF(Q(u))/du}{dQ(u)/du}=\frac{dF(F^{-1}(u))/du}{q(u)}=\frac{du/du}{q(u)}=[q(u)]^{-1}

Exponential QF: Q(u)=-\ln(1-u)/\Theta
Exponential QDF: q(u)=Q'(u)=1/(\Theta(1-u))
Exponential DQF: f(Q(u))=[q(u)]^{-1}=\Theta(1-u)
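
(A quick numeric sanity check, with numbers of my own choosing rather than anything from the data above: take \Theta=2 and u=0.5. Then Q(0.5)=\ln(2)/2\approx 0.347, so f(Q(0.5))=2e^{-2\cdot 0.347}=2\cdot 0.5=1, which matches \Theta(1-u)=2\cdot 0.5=1.)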

I see why this is confusing, but we can simplify the problem further by ignoring the target+=exponential_lpdf(y[n] | Theta) part. At the end of the day, I think you are still underestimating the power of the parameters block.
So our models are:

parameters {
  real<lower=0> Theta;
}
model {
  Theta ~ rayleigh(0.1);
}

And the second model:

functions{
 real rayleigh_icdf(real u, real s){
  return s*sqrt(-2*log1m(u));
 }
}
parameters {
  real<lower=0, upper=1> u;
}
transformed parameters{
  real QTheta = rayleigh_icdf(u, 0.1);
}
model {
  u ~ uniform(0,1);
}

Choosing a uniform prior on u contributes nothing to the likelihood parameterized in terms of u. But this is the key point: a uniform likelihood in terms of u is a nonuniform likelihood in terms of QTheta. The uniform prior on u absolutely “contributes to a likelihood” over QTheta. This is the crucial role of the parameters block (i.e. the role of choosing the parameterization).
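
To make that concrete with a change-of-variables check (my notation: F_R and f_R are the Rayleigh CDF and PDF with scale 0.1, and u = F_R(QTheta)):

p_{QTheta}(\theta) = p_u\big(F_R(\theta)\big)\left|\frac{dF_R(\theta)}{d\theta}\right| = 1 \cdot f_R(\theta)

so the uniform prior on u induces exactly a Rayleigh(0.1) prior on QTheta, which is why the two formulations agree without any explicit Jacobian.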

Here’s where we can see exactly why it matters whether a transformed quantity appears on the left-hand side of a sampling statement. Suppose we omit any prior declaration on u (which leaves us with the implied uniform(0,1) prior due to the constraints on u), and then we write QTheta ~ rayleigh(0.1). Now we’ve got a problem which we have to solve with a Jacobian adjustment, because the pdf associated with rayleigh(0.1) is being applied to QTheta, but the prior on QTheta implied by the uniform prior on u is not flat to begin with. Note that the derivative of a CDF is exactly the PDF, so it’s no shock that the Jacobian adjustment here will exactly offset the sampling statement.
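
Written out in the same notation, with q_R = dQ_R/du the Rayleigh quantile density:

\log f_R(Q_R(u)) + \log q_R(u) = \log\big(f_R(Q_R(u))\,q_R(u)\big) = \log\frac{d}{du}F_R(Q_R(u)) = \log\frac{du}{du} = 0

so the rayleigh(0.1) sampling statement plus its Jacobian term contribute a constant zero to the target, exactly matching the uniform-prior-on-u model.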

Does that help?

Thank you, Jacob!

I think this is helpful and encouraging. I found this quote from the manual particularly illuminating:

A transformation samples a parameter, then transforms it, whereas a change of variables transforms a parameter, then samples it. Only the latter requires a Jacobian adjustment.

Let me follow through the example, as you suggested, and look at the transformation: QTheta=Q_{rayleigh}(u)=s \sqrt{-2\ln(1-u)}

This is a non-linear transformation, which we can observe from a plot similar to the one you made in your blog post.

The Jacobian adjustment is the log of the absolute derivative of the transform, which in this case is

J(Q(u))=\ln \left| \frac{d}{du}\left(s \sqrt{-2\ln(1-u)}\right)\right|

Since the transform is a QF, its derivative is \frac{dQ(u)}{du}, which is the Rayleigh QDF, denoted q(u). The Rayleigh QDF is

q_{rayleigh}(u)=\frac{s}{\sqrt{2}\sqrt{-\ln(1-u)}(1-u)}
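
(Chain-rule check that this is indeed the derivative of the ICDF: \frac{d}{du}\,s\sqrt{-2\ln(1-u)} = s\cdot\frac{1}{2\sqrt{-2\ln(1-u)}}\cdot\frac{2}{1-u} = \frac{s}{(1-u)\sqrt{-2\ln(1-u)}} = \frac{s}{\sqrt{2}\,\sqrt{-\ln(1-u)}\,(1-u)}.)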

So now we will repeat our experiment with the Jacobian

stanmodelcode2j <- "
functions{
 real rayleigh_icdf(real u, real s){
  return s*sqrt(-2*log1m(u));
 }
 real rayleigh_qdf_pdf(real u, real s){
  return s/(sqrt(2)*sqrt(-log1m(u))*(1-u));
 }
} // end of functions block
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> u;
}
transformed parameters{
  real QTheta = rayleigh_icdf(u, 0.1);
}
model {
  // u ~ uniform(0,1) // now implied
  QTheta ~ rayleigh(0.1);
  target+=log(abs(rayleigh_qdf_pdf(u, 0.1)));
  for (n in 1:N)
    y[n] ~ exponential(QTheta);
}
generated quantities{
  real y_pred;
  y_pred = exponential_rng(QTheta);
}
"
# fit the model with ICDF prior and Jacobian adjustment
stanDso2j <- stan_model(model_code = stanmodelcode2j) 
fit2j <- sampling(stanDso2j, data = dat, iter = 10000, warmup=2000) 
fit2j
draws2j <- posterior::as_draws_array(fit2j) %>% 
  posterior::subset_draws(variable="QTheta") %>% 
  posterior::rename_variables("QJTheta"="QTheta")

posterior::bind_draws(draws1, draws2j) %>% 
  #bayesplot::mcmc_dens()
  posterior::as_draws_df() %>% 
  tidyr::pivot_longer(-c(.iteration, .chain, .draw)) %>% 
  ggplot()+
  geom_density(aes(x=value, color=name, fill=name), alpha=0.2)+
  theme_minimal()

Same result. Brilliant! So when we use the ICDF as a transform, we need to add its [log] derivative, the [log] QDF, to the target. This makes sense and aligns with my intuition.

Thank you! I think the issue is closed (for now ;) )

Jacob, could I please ask you to look at my previous example, where I perform a transformation of the data?

parameters {
  real<lower=0.00001> Theta;
}
transformed parameters{
  real u[N];
  for (n in 1:N){
   u[n] = exponential_cdf(y[n], Theta);
  }
}
model {
  target += rayleigh_lpdf(Theta | 0.1);
  for (n in 1:N)
    target += exponential_ldqf_lpdf( u[n] | Theta);
}

Here, it seems I could write my model (with the transformation) as follows:

\Theta \sim \text{Rayleigh}(0.1)
u = F_{exp}(y\mid\Theta) \sim [q_{exp}(u\mid\Theta)]^{-1}

This would imply that I should add the following Jacobian:

J(F_{exp}(y))=\log\left|\frac{dF_{exp}(y)}{dy}\right|=\log f_{exp}(y)

But that does not seem to be required (given that the posteriors in that example match, and when I do in fact add target += exponential_lpdf(y[n] | Theta); I get really funky results). Could you comment on this example? Why is a Jacobian NOT required if I am replacing data (my change of variables transforms data into probabilities)?

The data are constants, not parameters, and transformations of data are also constants. The derivative of a constant with respect to anything is zero.

If you want to play with treating the outcomes as random, you can do so. Omit the data declaration, and instead declare y to be a parameter (as in, sample y from the prior). If you don’t handle the Jacobians properly, you’ll sample from the wrong prior.
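
For instance (a toy sketch of my own, with the rate supplied as data just to isolate the point): if y is declared as a parameter and we intend u = F_exp(y | Theta) to be uniform, the nonlinear transform from y to u now needs its Jacobian, log |du/dy| = exponential_lpdf(y | Theta). Dropping that term leaves a flat (improper) prior on y instead of exponential(Theta).

data {
  real<lower=0> Theta; // fixed rate, passed as data in this toy example
}
parameters {
  real<lower=0> y; // the "outcome" is now a parameter
}
transformed parameters {
  real<lower=0, upper=1> u = exponential_cdf(y, Theta);
}
model {
  u ~ uniform(0, 1); // contributes a constant
  // Jacobian for the nonlinear transform y -> u:
  // log |du/dy| = log f_exp(y | Theta)
  target += exponential_lpdf(y | Theta);
}

With the Jacobian term the model samples y from its exponential(Theta) prior, exactly as y ~ exponential(Theta) would.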

I feel stupid, but relieved. Thank you for your patience with my ignorance, sir!

Don’t feel stupid. This stuff is hard!!