What does `jacobian = TRUE` do in `model$optimize(data = data_list, jacobian = TRUE)`?

This is my Stan file:

```stan
data {
  int<lower=0> N; // number of observations
  array[N] int<lower=0, upper=1> Y; // vector of binary observations
}
parameters {
  real<lower=0, upper=1> theta; // probability of success
}
model {
  theta ~ beta(1, 1); // prior
  Y ~ bernoulli(theta); // observation model / likelihood
}
```

My data and optimization

```r
Y = c(rep(0, 9), 1)
data_list = list(N = length(Y), Y = Y)
model01$optimize(data = data_list)
```

produces the following result:

```
Initial log joint probability = -6.41748
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
       6      -3.25083     0.0035622   5.48885e-05           1           1        9
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
Finished in 0.2 seconds.

 variable estimate
    lp__     -3.25
    theta     0.10
```

This gives the MLE solution.
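With one success in ten trials, this matches the analytic MLE, \hat{\theta}_\textrm{MLE} = 1/10 = 0.1.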

With the option jacobian = TRUE, the result is different:

```
model01$optimize(data = data_list, jacobian = TRUE)
Initial log joint probability = -10.9632
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
       6      -5.40673   0.000314983      1.72e-06           1           1        9
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
Finished in 0.3 seconds.

 variable estimate
    lp__     -5.41
    theta     0.17
```

Now theta is 0.17, which is definitely not the MLE (the MLE of 0.1 was shown above).

Apparently, the two optimizations are not the same because of the flag jacobian = TRUE.

Can anyone help me understand why this happens?

  • What is the interpretation of the result from optimize(jacobian = TRUE)? The mathematics behind this option would also be very helpful.

The explanation below, from the manual page, did not help me much:


In cmdstanr, the optimize() method is used for Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation. The jacobian=TRUE argument plays a crucial role when dealing with constrained parameters.

When jacobian=TRUE, the optimization process includes the Jacobian adjustment for constrained variables. This adjustment is necessary because the transformation of variables can affect the density of the posterior distribution. Including the Jacobian ensures that the optimization accounts for this change of variables, leading to a more accurate MAP estimate.


In contrast, when the prior is not uniform, as shown below:

```stan
data {
  int<lower=0> N; // number of observations
  array[N] int<lower=0, upper=1> Y; // vector of binary observations
}
parameters {
  real<lower=0, upper=1> theta; // probability of success
}
model {
  theta ~ beta(10, 1); // prior
  Y ~ bernoulli(theta); // observation model / likelihood
}
```
the two theta estimates from optimize() with and without jacobian = TRUE are the same (even though the lp__ values differ):

stanfile = "bernoulli.stan"
model = cmdstan_model(stanfile)
data_list = list(N=11, Y=c(rep(0,10), 1))
model$optimize(data = data_list)
Initial log joint probability = -13.9528 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
       3      -13.8629   0.000429068   1.60942e-06           1           1        6    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.2 seconds.

 variable estimate
    lp__    -13.86
    theta     0.50
model$optimize(data = data_list, jacobian = TRUE)
Initial log joint probability = -16.4593 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
       4      -15.2492   0.000309459   1.13377e-08           1           1        7    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.2 seconds.

 variable estimate
    lp__    -15.25
    theta     0.50

Very curious!
Thanks.

The CmdStan user guide (as opposed to the CmdStanR vignette you linked to) has more details. If I understand this section on Jacobian adjustments correctly, jacobian = FALSE should return the maximum likelihood estimate (MLE), whereas jacobian = TRUE should return the maximum a posteriori (MAP) estimate.

Haven’t looked in detail at your model and code to see if that lines up with your output, but that’s the short answer.

Although I must say I’m also a bit confused about the terminology used to describe the jacobian = FALSE case:

returns the (regularized) maximum likelihood estimate (MLE), the value which maximizes the likelihood of the data given the parameters, (including prior terms).


Yeah I think that’s the correct interpretation.

Yes, the explanation in the manual is quite clear.
But when the prior is a uniform distribution, the MLE must in theory be the same as the MAP estimate.

Here, the simulation gave different outputs, which is why the question was posted.

Yes, understood. I was also puzzled by your different outputs with a uniform prior.
I know that theoretically your beta(1, 1) prior is identical to a uniform(0, 1) prior when your parameter theta is constrained to lie on the \left[0, 1\right] interval, so indeed your output is surprising to me.
Do you get the same output when you use theta ~ uniform(0, 1); instead, or when you leave that prior statement out altogether (i.e., an implicitly uniform prior density)?

See also this case study: Laplace method and Jacobian of parameter transformation.


This is not true if the parameters are constrained and our model is defined over unconstrained parameters for sampling or optimization. That's why you're getting results that look counterintuitive.

If I have a very simple model like this:

```stan
data {
  vector[100] y;
}
parameters {
  real mu;
}
model {
  y ~ normal(mu, 1);
}
```

I can calculate an MLE and it will be the same as the MAP estimate. Setting jacobian = TRUE or jacobian = FALSE has no effect because there are no constrained parameters, hence the log Jacobian is 0.

But if I change the model to include a constrained parameter as follows,

```stan
data {
  vector[100] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
```

then this is no longer true. When sigma is transformed from an unconstrained parameter to one constrained to be greater than zero, Stan adds the Jacobian adjustment to the log density so that we get an improper uniform distribution for sigma over (0, \infty). Stan then samples on the unconstrained space, where sigma has been mapped via the logarithm to unconstrained support on \mathbb{R}. We do this so that our sampling and optimization algorithms can work in unconstrained space.

Specifically, what we get is equivalent to the following program in terms of what Stan samples.

```stan
data {
  vector[100] y;
}
parameters {
  real mu;
  real sigma_unc;
}
transformed parameters {
  real<lower=0> sigma = exp(sigma_unc);
}
model {
  y ~ normal(mu, sigma);
  target += sigma_unc;  // Jacobian adjustment for exp() constraint
}
```

Stan will sample over mu and sigma_unc in both of the previous two programs; in the first, the unconstrained parameterization is implicit.

The MAP solution will leave the Jacobian adjustment in, because that’s the distribution we care about. The MLE drops the Jacobian adjustment because the frequentists aren’t treating this as a distribution over parameters (because that’s not a thing if you’re a frequentist, unless you are careful to call the parameters “missing data”). This all carries over if you add priors—they act as penalty terms for a penalized MLE.

The beta case you gave is like the second. When we have a variable constrained <lower=0, upper=1>, we apply a logit transform to map it bijectively to \mathbb{R}, so the inverse constraining transform is inverse-logit, and that has a log Jacobian adjustment of \log\left[\textrm{logit}^{-1}(\theta^\textrm{unc}) \cdot \left(1 - \textrm{logit}^{-1}(\theta^\textrm{unc})\right)\right], where \theta^\textrm{unc} = \textrm{logit}(\theta); that's the difference between the two targets at every iteration.
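To see that this accounts for the numbers above, here is a minimal sketch in plain R (no Stan involved), assuming the data from your first example (one success in ten trials) and the beta(1, 1) prior; the names inv_logit and log_target are just illustrative.

```r
# Minimal numerical check in plain R, assuming the data from the original post:
# one success in ten trials with a beta(1, 1) prior.
inv_logit <- function(u) 1 / (1 + exp(-u))

# Log density on the unconstrained scale u = logit(theta);
# `with_jacobian` toggles the log |d theta / d u| = log(theta) + log(1 - theta) term.
log_target <- function(u, y, with_jacobian) {
  theta <- inv_logit(u)
  lp <- sum(dbinom(y, size = 1, prob = theta, log = TRUE)) +  # Bernoulli likelihood
        dbeta(theta, 1, 1, log = TRUE)                        # beta(1, 1) prior
  if (with_jacobian) lp <- lp + log(theta) + log(1 - theta)   # Jacobian adjustment
  lp
}

y <- c(rep(0, 9), 1)
mle <- optimize(log_target, c(-10, 10), y = y, with_jacobian = FALSE, maximum = TRUE)
map <- optimize(log_target, c(-10, 10), y = y, with_jacobian = TRUE,  maximum = TRUE)

inv_logit(mle$maximum)  # ~0.10, matches jacobian = FALSE
inv_logit(map$maximum)  # ~0.167, matches the 0.17 from jacobian = TRUE
```

Without the adjustment the target is \log\theta + 9\log(1-\theta), maximized at \theta = 1/10 = 0.1; with it, the target picks up \log\theta + \log(1-\theta) and becomes 2\log\theta + 10\log(1-\theta), maximized at \theta = 2/12 = 1/6 \approx 0.17. The corresponding maxima, -3.25 and -5.41, match the lp__ values you reported.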

If Stan were doing its optimization directly over the constrained space, the two settings would give the same answer. So if you write this in Stan, which I would not recommend,

```stan
data {
  vector[100] y;
}
parameters {
  real mu;
  real sigma;
}
model {
  y ~ normal(mu, sigma);
}
```

Then you'll find that things either fail if you initialize sigma < 0, or, if you're careful with initialization, you get the same answer with jacobian = TRUE and jacobian = FALSE.
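The same arithmetic explains why your beta(10, 1) example agreed (a quick check on the numbers from your post): with one success in eleven trials, the target without the adjustment is, up to a constant, 9\log\theta + \log\theta + 10\log(1-\theta) = 10\log\theta + 10\log(1-\theta), maximized at \theta = 10/20 = 0.5; adding the Jacobian term \log\theta + \log(1-\theta) gives 11\log\theta + 11\log(1-\theta), also maximized at \theta = 11/22 = 0.5. In general, a\log\theta + b\log(1-\theta) peaks at \theta = a/(a+b), so the two settings coincide there only because your data and prior make a = b in both cases; the lp__ values, 20\log 0.5 = -13.86 and 22\log 0.5 = -15.25, still differ by the Jacobian term, matching your output.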
