What does `jacobian=TRUE` do in `model$optimize(data = data_list, jacobian = TRUE)`?

This is my Stan file:

data {
  int<lower=0> N; // number of observations
  array[N] int<lower=0, upper=1> Y; // vector of binary observations
}
parameters {
  real<lower=0, upper=1> theta; // probability of success
}
model {
  theta ~ beta(1, 1); // prior
  Y ~ bernoulli(theta); // observation model / likelihood
}

My data and optimization

Y=c(rep(0,9), 1)
data_list = list(N = length(Y), Y = Y)
model01$optimize(data = data_list)

produces the following result:

Initial log joint probability = -6.41748 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
       6      -3.25083     0.0035622   5.48885e-05           1           1        9    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.2 seconds.

 variable estimate
    lp__     -3.25
    theta     0.10

This gives the MLE solution.

With the option jacobian = TRUE, the result is different:

model01$optimize(data = data_list, jacobian = TRUE)
Initial log joint probability = -10.9632 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
       6      -5.40673   0.000314983      1.72e-06           1           1        9    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.3 seconds.

 variable estimate
    lp__     -5.41
    theta     0.17

Now theta is 0.17, which is definitely not the MLE (the MLE is 0.1, as shown above).

Apparently, the two optimizations are not the same due to the flag jacobian = TRUE.

Can anyone help me understand why this happens?

  • What is the interpretation of the result from optimize(jacobian = TRUE)?
    The mathematics behind this option would also be very helpful.

The explanation below from the manual page did not help me much:


In cmdstanr, the optimize() method is used for Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation. The jacobian=TRUE argument plays a crucial role when dealing with constrained parameters.

When jacobian=TRUE, the optimization process includes the Jacobian adjustment for constrained variables. This adjustment is necessary because the transformation of variables can affect the density of the posterior distribution. Including the Jacobian ensures that the optimization accounts for this change of variables, leading to a more accurate MAP estimate


On the contrary, when the prior is not uniform as shown below:

data {
  int<lower=0> N; // number of observations
  array[N] int<lower=0, upper=1> Y; // vector of binary observations
}
parameters {
  real<lower=0, upper=1> theta; // probability of success
}
model {
  theta ~ beta(10, 1); // prior
  Y ~ bernoulli(theta); // observation model / likelihood
}

The two theta estimates from optimize with and without jacobian=TRUE are the same (even though the lp__ values differ):

stanfile = "bernoulli.stan"
model = cmdstan_model(stanfile)
data_list = list(N=11, Y=c(rep(0,10), 1))
model$optimize(data = data_list)
Initial log joint probability = -13.9528 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
       3      -13.8629   0.000429068   1.60942e-06           1           1        6    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.2 seconds.

 variable estimate
    lp__    -13.86
    theta     0.50
model$optimize(data = data_list, jacobian = TRUE)
Initial log joint probability = -16.4593 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
       4      -15.2492   0.000309459   1.13377e-08           1           1        7    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.2 seconds.

 variable estimate
    lp__    -15.25
    theta     0.50

Very curious!
Thanks.

The CmdStan user guide (as opposed to the CmdStanR vignette you linked to) has more details. If I understand this section on Jacobian adjustments correctly, jacobian = FALSE should return maximum likelihood estimates whereas jacobian = TRUE should return the maximum a posteriori estimates.

Haven’t looked in detail at your model and code to see if that lines up with your output, but that’s the short answer.

Although I must say I’m also a bit confused about the terminology used to describe the jacobian = FALSE case:

returns the (regularized) maximum likelihood estimate (MLE), the value which maximizes the likelihood of the data given the parameters, (including prior terms).


Yeah I think that’s the correct interpretation.

Yes, the explanation in the manual is quite clear.
But when the prior is a uniform distribution, the MLE must be the same as the MAP estimate in theory.

Here, the simulation gave different outputs, which is why the question was posted.

Yes, understood. I was also puzzled by your different outputs with a uniform prior.
I know that theoretically your beta(1, 1) prior is identical to a uniform(0, 1) prior when your parameter theta is constrained to lie on the \left[0, 1\right] interval, so indeed your output is surprising to me.
Do you get the same output when you use theta ~ uniform(0, 1); instead, or when you leave that prior statement out altogether (i.e., an implicitly uniform prior density)?

See also this case study Laplace method and Jacobian of parameter transformation


This is not true if the parameters are constrained and our model’s defined over unconstrained parameters for sampling or for optimization. That’s why you’re getting results that look counterintuitive to you.

If I have a very simple model like this:

data {
  vector[100] y;
}
parameters {
  real mu;
}
model {
  y ~ normal(mu, 1);
}

I can calculate an MLE and it will be the same as the MAP estimate. So setting jacobian=TRUE or jacobian=FALSE has no effect because there are no constrained parameters, hence the log Jacobian is 0.

But if I change the model to include a constrained parameter as follows,

data {
  vector[100] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}

then this is no longer true. When I transform sigma from an unconstrained parameter to one constrained to be greater than zero, we add the Jacobian adjustment to the log density so that we get an improper uniform distribution for sigma over (0, \infty). Then Stan samples on the unconstrained space where sigma has been mapped via logarithm to unconstrained support on \mathbb{R}. We do this so that our sampling and optimization algorithms can work in unconstrained space.

Specifically, what we get will be equivalent to this in terms of what Stan samples.

data {
  vector[100] y;
}
parameters {
  real mu;
  real sigma_unc;
}
transformed parameters {
  real<lower=0> sigma = exp(sigma_unc);
}
model {
  y ~ normal(mu, sigma);
  target += sigma_unc;  // Jacobian adjustment for exp() constraint
}

Stan will sample over mu, sigma_unc in both of the previous two programs. In the first, it’s implicit.

The MAP solution will leave the Jacobian adjustment in, because that’s the distribution we care about. The MLE drops the Jacobian adjustment because the frequentists aren’t treating this as a distribution over parameters (because that’s not a thing if you’re a frequentist, unless you are careful to call the parameters “missing data”). This all carries over if you add priors—they act as penalty terms for a penalized MLE.
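
If it helps to see numbers, here is a rough base-R sketch of those two objectives for the normal/sigma model above (just an illustration with a made-up lp helper, not what Stan runs internally): dropping the Jacobian term recovers the usual 1/n maximum likelihood estimate of sigma, while keeping it gives the mode of the unconstrained-space density, which for this model works out to the n - 1 denominator.

```r
# Rough base-R analogue of the two objectives (illustration only; not what Stan runs internally).
set.seed(1)
y <- rnorm(100)

lp <- function(par, jacobian) {
  mu <- par[1]
  sigma_unc <- par[2]
  sigma <- exp(sigma_unc)
  ll <- sum(dnorm(y, mu, sigma, log = TRUE))
  if (jacobian) ll <- ll + sigma_unc   # log |d sigma / d sigma_unc| = sigma_unc
  ll
}

# maximize (fnscale = -1); report sigma on the constrained scale
exp(optim(c(0, 0), lp, jacobian = FALSE, control = list(fnscale = -1))$par[2])
# ~ sqrt(mean((y - mean(y))^2)), the MLE
exp(optim(c(0, 0), lp, jacobian = TRUE,  control = list(fnscale = -1))$par[2])
# ~ sd(y), i.e. the n - 1 denominator, the mode of the unconstrained-space density
```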

The beta case you gave is like the second. When we have a variable constrained <lower=0, upper=1>, we apply a logit transform to map it bijectively to \mathbb{R}, so the inverse constraining transform is inverse-logit, and that has a log Jacobian adjustment of \log \left( \textrm{logit}^{-1}(\theta^\textrm{unc}) \cdot \left(1 - \textrm{logit}^{-1}(\theta^\textrm{unc})\right) \right), so that’s the difference you’re going to see per iteration, where \theta^\textrm{unc} = \textrm{logit}(\theta).
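
To connect this back to the numbers in your output, here is a minimal base-R sketch (again just an illustration, with hand-rolled inv_logit and lp helpers, not cmdstanr) of the two objectives for your beta(1, 1)/bernoulli model with one success in ten trials. Without the Jacobian the optimum is the MLE 0.1; with it the optimum is (k + 1)/(n + 2) = 2/12 ≈ 0.17, which is exactly what jacobian = TRUE reported.

```r
# Minimal sketch of the two objectives on the unconstrained (logit) scale; base R only.
y <- c(rep(0, 9), 1)
inv_logit <- function(u) 1 / (1 + exp(-u))

lp <- function(u, jacobian) {
  theta <- inv_logit(u)
  ll <- sum(dbinom(y, size = 1, prob = theta, log = TRUE))  # likelihood; the beta(1, 1) prior adds 0
  if (jacobian) ll <- ll + log(theta) + log(1 - theta)      # log |d theta / d u| for the logit transform
  ll
}

inv_logit(optimize(lp, c(-10, 10), maximum = TRUE, jacobian = FALSE)$maximum)  # 0.10, the MLE
inv_logit(optimize(lp, c(-10, 10), maximum = TRUE, jacobian = TRUE)$maximum)   # 0.167, what jacobian = TRUE returns
```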

If Stan was doing its optimization over the constrained space, you’d find the same thing. So if you write this in Stan, which I would not recommend,

```stan
data {
  vector[100] y;
}
parameters {
  real mu;
  real sigma;
}
model {
  y ~ normal(mu, sigma);
}
```

Then you’ll find that either things fail if you initialize sigma < 0, or, if you’re careful with initialization, you get the same answer with jacobian=TRUE and jacobian=FALSE.


While I understand the Jacobian adjustment in theory, I failed to understand the Jacobian and MLE references in the CmdStan and cmdstanr documentation, as discussed in the context of Laplace sampling and optimization. To me, MLE vs. MAP refers to explicit priors, not to something created by reparametrization, and I get confused.

I would refer to jacobian=TRUE as MAP and a Gaussian approximation in the original space, and jacobian=FALSE as MAP and a Gaussian approximation in the transformed unconstrained space. And then say nothing about MLE at all. (At least for me, that raises the question “what, does it leave my explicit priors out of the model?”)

Does this make sense? Or maybe I still get something wrong in the substance?

<lower=0> is a good example: doing MAP in the transformed space without an explicit prior actually creates a pretty natural prior on the original scale, while MAP on the original scale (constrained space) for such a parameter sounds a bit dangerous, at least for typical scale parameters. I at least start my models without priors and often optimize at first, if possible, and have relied on those implicit priors without always realizing it.

You are getting something wrong. Did you look at the case study Laplace method and Jacobian of parameter transformation? Stan’s normal approximation is always in the unconstrained space.


Now I have looked at it; knowing it exists helped. Thanks!

Indeed, the polarity of jacobian seems to be reversed from what I thought, for one gets optima in the constrained space with FALSE.

I’d still drop the MLE/MAP discussion from the documentation. The case study is very helpful; I’d refer to that, or take the relevant parts. (“Thus if someone cares about the posterior mode in the constrained space, but is doing the optimization in unconstrained space they still need jacobian=FALSE.”; the CmdStanR method $laplace() uses jacobian=TRUE because that gives the right normal approximation, but then optimizes in the unconstrained space.)

Doing the normal approximation in the unconstrained space makes sense. (Not important here, but if with the wrong normal there’s more than the fact that posteriors just tend to be more normal without constraints, then that is really confusing.)

Is this a reasonable summary in the case of optimization? (and no priors etc.)
jacobian = FALSE: optima calculated in the unconstrained space (then transformed back)
jacobian = TRUE: optima calculated in the constrained space (not strictly true, but that’s what we get)

EDIT: No. The other way around.

No — this I learned from Aki now!

jacobian=FALSE gives the optimum (MAP) in the constrained space. jacobian=TRUE gives the optimum (MAP) in the unconstrained space. (This is what the case study says anyway.)

Both are calculated in the unconstrained space, as this is what Stan does. And what is prior and what is likelihood in your model, if such identities of the terms of the log-posterior are even defined, doesn’t matter.

Theoretically the target probability measure (posterior) is the same in both spaces, but practicalities like gradients are with densities, not measures. And the base measure (implicit Lebesgue measure, constant density), against which you define your densities in Stan, changes due to the parameter transformation, and you have to compensate for that with the Jacobian. Then one can somehow conceptualize the polarity of the Jacobian binary option in two ways, of which the one used in Stan options seems to be the unintuitive one for many (incl. me, and you).


Maybe it will help to work out the simplest possible example explicitly.

parameters {
  real<lower=0> A;
}
model {
  A ~ lognormal(5, 2);
}

In the constrained space, the density for a > 0 is

p_A(a) = \text{lognormal}(a \mid 5, 2).

The traditional maximum likelihood estimate is

a^* = \text{argmax}_a \ \text{lognormal}(a \mid 5, 2) = \exp(5 - 2^2) = \exp(1) = e.

Now what we’re going to do in Stan is log-transform A to B = \log A and we’ll actually sample from p_B. To work out the density of B, we use the change of variables formula,

p_B(b) = p_A(\log^{-1}(b)) \cdot \left| \frac{d}{d b}\log^{-1}(b) \right| = \text{lognormal}(\exp(b) \mid 5, 2) \cdot \exp(b),

using the fact that \log^{-1}(b) = \exp(b).

If we drop the Jacobian adjustment, which is here just the absolute derivative of the constraining transform, we get the behavior for jacobian=FALSE, and optimize

b^* = \text{argmax}_b \ \text{lognormal}(\exp(b) \mid 5, 2).

In this case, we recover b^* = 1 because \exp(1) = e. So when we inverse transform b^*, we get a^* = \exp(b^*) = e. That’s how we get maximum likelihood estimates for constrained parameters with Stan.

If we include the Jacobian, we have the case jacobian=TRUE, and we get the estimate

\widehat{b} = \text{argmax}_b \ p_B(b) = \text{argmax}_b \ \text{lognormal}(\exp(b) \mid 5, 2) \cdot \exp(b).

Here, \widehat{b} \neq 1, and thus the answer differs from the result if we drop the Jacobian. We call this the max a posteriori (MAP) estimate simply because p_B is the density we’re targeting for Bayesian sampling.
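
As a quick numerical check of those two optima (plain base R, helper names made up just for this sketch): the objective without the Jacobian peaks at b^* = 1, and the one with the Jacobian peaks at \widehat{b} = 5, since p_B(b) works out to \text{normal}(b \mid 5, 2).

```r
# Numerical check of the two optima above (base R; function names are just for this sketch).
lp_no_jac   <- function(b) dlnorm(exp(b), meanlog = 5, sdlog = 2, log = TRUE)  # log lognormal(exp(b) | 5, 2)
lp_with_jac <- function(b) lp_no_jac(b) + b                                    # plus log |d/db exp(b)| = b

optimize(lp_no_jac,   c(-20, 30), maximum = TRUE)$maximum  # ~1, so a* = exp(1) = e
optimize(lp_with_jac, c(-20, 30), maximum = TRUE)$maximum  # ~5, the mode of p_B = normal(b | 5, 2)
```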

[edit: made parameters to normal consistent]


Just a note, the order of the Lognormal parameters is inconsistent. I suppose it should be 5, 2 everywhere.

Thanks Bob, from that it’s clear why jacobian=F means constrained. Without ever calculating anything, it feels like the adjustment should be there when you “operate in the transformed space”, for it is … transformed.

Thanks—I edited to be consistent. I meant to use ~~2, 5~~ 5, 2 everywhere so the expectation is easy. [edit: @StaffanBetner was right—I did mean 5, 2]

Yes, that’s why we called it the “MAP” estimate. But if you want a traditional MLE in the constrained space, you have to turn off the adjustment. The problem is just that nothing distributes through non-linear transforms.