Discrepancy between Stan and TFP log_prob

As a part of the stanc3 TFP backend tests, I’m trying to compare the log prob evaluated by simple Stan programs and their generated TFP counterparts. The workflow, in general, is:

  • Generate samples and lp__ from a simple Stan program
  • Generate the corresponding TFP model
  • Evaluate log_prob of the samples according to the TFP model (with the necessary inverse_log_det correction, roughly as sketched below) and compare
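
For concreteness, a rough sketch of the last step (K, eta, and the unconstrained draw are placeholders here, and I’m assuming tfb.CorrelationCholesky is the bijector paired with tfd.CholeskyLKJ in the generated model):

  # Rough sketch, not the actual test harness.
  import numpy as np
  import tensorflow_probability as tfp

  tfd, tfb = tfp.distributions, tfp.bijectors

  K, eta = 3, 2.0  # placeholder size and concentration
  dist = tfd.CholeskyLKJ(dimension=K, concentration=eta)
  bij = tfb.CorrelationCholesky()

  # Unconstrained draw of length K*(K-1)/2 (in the real test this comes from Stan).
  y = np.random.randn(K * (K - 1) // 2).astype(np.float32)
  L = bij.forward(y)  # constrained Cholesky factor of a correlation matrix

  # TFP log density plus the log|det Jacobian| of the constraining transform;
  # equivalently, dist.log_prob(L) - bij.inverse_log_det_jacobian(L, event_ndims=2).
  # This is the quantity compared against Stan's lp__ (up to constants).
  lp = dist.log_prob(L) + bij.forward_log_det_jacobian(y, event_ndims=1)
  print(lp.numpy())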

This “checks out” for all distributions currently supported by the TFP backend, except for lkj_corr_cholesky_lpdf. We’ve started a discussion on the TFP issue tracker, and the latest “suspect” is a difference in the bijector implementation.

Is there any way to separately save the contribution of lkj_corr_cholesky_lpdf and the bijector in Stan, so we can somehow compare each separately?

Any other thoughts on this would be appreciated.


Sorry for asking, but is the bijector correction done outside of lkj_corr_cholesky_lpdf?

Would it work if you printed both the input and output inside that function? You would need to edit the Stan source, add some couts, and then parse the output.

In the log_prob function in C++, there’s a jacobian template parameter. Setting it to true includes the Jacobian correction for the (inverse) transforms and setting it to false turns them off.

Thanks. I assume there’s no way to do this in R/Py/CmdStan? I don’t have much C++ experience, but I’ll try…

@Bob_Carpenter - any similar example you can point me to? This is probably trivial, but I’ve tried to understand how to call the lkj_corr_cholesky log_prob function and got lost somewhere in the repo…

EDIT - AFAIU, calling lkj_corr_cholesky_lpdf will compute the log_prob without the jacobian correction, right?

Bob, I looked into this and I’m also a bit confused about how the Jacobian gets called in the code. In a generic Stan model we’ll have something like the below defined in the model class

  template <bool propto__, bool jacobian__, typename T__>
  T__ log_prob(std::vector<T__> &params_r__, std::vector<int> &params_i__,
               std::ostream *pstream__ = 0) const {
    typedef T__ local_scalar_t__;
    T__ lp__(0.0);
    stan::math::accumulator<T__> lp_accum__;
    static const char *function__ = "test_model_namespace::log_prob";
    (void)function__; // suppress unused var warning

    stan::io::reader<local_scalar_t__> in__(params_r__, params_i__);

    try {
      current_statement__ = 1;
      validate_non_negative_index("b", "K", K);
      Eigen::Matrix<local_scalar_t__, -1, 1> b;
      b = Eigen::Matrix<local_scalar_t__, -1, 1>(K);

      current_statement__ = 1;
      b = in__.vector(K);
      local_scalar_t__ Intercept;

      current_statement__ = 2;
      Intercept = in__.scalar();
      {
        current_statement__ = 3;
        lp_accum__.add(student_t_lpdf<false>(Intercept, 3, 0, 10));
        current_statement__ = 4;
        lp_accum__.add(bernoulli_logit_glm_lpmf<false>(Y_opencl__, X_opencl__,
                                                       Intercept, b));
      }
    } catch (const std::exception &e) {
      stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
    }
    lp_accum__.add(lp__);
    return lp_accum__.sum();
  } // log_prob()

Where the whole model inherits from model_base_crtp<test_model>. For the Jacobian case (i.e. log_prob<false, true>(...)), the base class exposes a separate log_prob_jacobian method, but in model_base_crtp that’s defined as

  inline double log_prob_jacobian(std::vector<double>& theta,
                                  std::vector<int>& theta_i,
                                  std::ostream* msgs) const override {
    return static_cast<const M*>(this)->template log_prob<false, true>(
        theta, theta_i, msgs);
  }

So it’s just calling the above function, but the function above doesn’t do anything with jacobian__ so I’m not following how it is used.

@Adam_Haber I think for your use case, instead of dealing with the above, take the existing lkj_corr_test and write one that checks over the support and logs the values somewhere. You can also modify the C++ to write a log of the intermediate values. Then you can do the same thing for TensorFlow and compare them.


Thanks! This is indeed due to differences in the bijectors; Stan’s lkj_corr_cholesky_log and TFP’s CholeskyLKJ.log_prob give exactly the same results.

This is just a log density function, so there’s no change of variables.

As @stevebronder points out, the template parameters for propto and Jacobian are on the log_prob function generated for a model. The Jacobian template parameter controls whether the Jacobian adjustment’s included or not. These are Jacobians that result from constrained parameters. So if you declare a parameter as a covariance matrix or Cholesky factor of a correlation matrix, then you’ll see where the Jacobian is either included or not at the log_prob level.

Some additional information - I’ve compared Stan and TFP’s bijectors:

The TFP implementation is quite simple - they plug the unconstrained vector into the lower triangular part of an identity matrix of the right size, and then normalize the rows. Stan’s implementation first transforms the unconstrained vector by applying tanh element-wise, then plugs it into the lower triangular part of an all-zeros matrix and does the row normalization “on the fly”. What is the motivation behind Stan’s implementation (which naively seems more complicated)?
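
For reference, here’s a rough numpy sketch of the two constructions as I understand them (just my reading of the algorithms, not either library’s actual code):

  import numpy as np

  def tfp_style_constrain(y, K):
      # Fill the strict lower triangle of an identity matrix with y,
      # then normalize each row to unit length.
      L = np.eye(K)
      L[np.tril_indices(K, k=-1)] = y
      return L / np.linalg.norm(L, axis=1, keepdims=True)

  def stan_style_constrain(y, K):
      # tanh the unconstrained values to get canonical partial correlations,
      # then build each row so it ends up with unit length "on the fly".
      z = np.tanh(y)
      L = np.zeros((K, K))
      L[0, 0] = 1.0
      idx = 0
      for i in range(1, K):
          sum_sqs = 0.0
          for j in range(i):
              L[i, j] = z[idx] * np.sqrt(1.0 - sum_sqs)
              sum_sqs += L[i, j] ** 2
              idx += 1
          L[i, i] = np.sqrt(1.0 - sum_sqs)
      return L

  K = 4
  y = np.random.randn(K * (K - 1) // 2)
  for constrain in (tfp_style_constrain, stan_style_constrain):
      L = constrain(y, K)
      # Both produce a valid Cholesky factor of a correlation matrix
      # (unit-length rows, positive diagonal), but the same y maps to
      # different matrices, so the Jacobian corrections differ too.
      assert np.allclose(np.linalg.norm(L, axis=1), 1.0)
      assert np.all(np.diag(L) > 0)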

For covariance, a Cholesky factor is just a lower-triangular matrix with strictly positive diagonal elements. So that bijection’s easy: we just log transform the positive diagonal values to unconstrained.

For correlation matrices, a Cholesky factor is all of the above, but in addition to having a positive element on the diagonal, each row must be a unit vector (Euclidean length 1), which is where the extra constraints come from. tanh maps smoothly to (-1, 1), and each value is then smoothly scaled so that the diagonal element can be set to the right positive value to make its row a unit vector.

I don’t know what TFP’s doing, but it’s not enough to just normalize unconstrained values to have unit length and flip the sign on the last element to positive. That’s not bijective, so there’s no well-defined Jacobian.

There’s no sign flipping - the “pre-normalization” diagonal element is 1, so it stays positive after normalization as well (since normalization just divides the entire row by a positive number). And whatever the off-diagonal elements of the i-th row were, they’re all going to be between -1 and 1 after normalization, so all the “requirements” are covered… no? I’m obviously not a TFP developer, this is just my understanding of their implementation.

Cool. With one of the elements fixed, everything else gets identified.

So if I understand correctly, they start with a lower-triangular matrix that has 1s on the diagonal, and then map each row

x -> x / sqrt(sum(square(x)))

That’s got to be cheaper to evaluate than tanh both for the function and the derivative.

I was thinking about our unit_vector type, where we can’t do that, because we can’t constrain the sign on the final element.

@bgoodri and @betanalpha — any opinion on whether it’s worth trying this alternative transform?

Also, can we do the same thing for a simplex starting with positive values and a fixed 0 in the last position and then just do softmax:

x -> exp(x) / sum(exp(x)).
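
A minimal sketch of what I mean, assuming the free values are K-1 unconstrained reals with a fixed 0 appended before the softmax:

  import numpy as np

  def simplex_constrain_softmax(y):
      # Append a fixed 0, then softmax; shifting by the max doesn't change
      # the result (softmax is shift-invariant) and keeps exp() stable.
      z = np.append(y, 0.0)
      z -= np.max(z)
      e = np.exp(z)
      return e / np.sum(e)

  x = simplex_constrain_softmax(np.random.randn(4))
  assert np.all(x > 0) and np.isclose(np.sum(x), 1.0)  # 5 positive values summing to 1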

That’s less clearly a win computationally.


The challenge here is going to be quantifying not just the cost of the log Jacobian determinant and its gradient, but also how the transform influences the total target density function on the unconstrained space. While the tanh is more expensive, my (very weak) intuition is that it might flatten out the density on the unconstrained space and make things easier to navigate, especially with the LKJ density function defining the nominal target.

As usual the answer will be “what increases effective sample size per computational cost in circumstances where the convergence is sufficiently well-behaved that effective sample size is well-defined”.