Saturating the max tree depth for every sample in an extremely simple model

Hi,

I’ve been working on adapting Stan to work with a C++ framework I have to use to generate the likelihoods for fitting my data. Mostly based on reverse-engineering the output from a test model I wrote and ran with CmdStan (and on examining CmdStan’s command.hpp), I’ve derived a new class from stan::model::prob_grad that implements (amongst other things) log_prob(), as well as a stan::callbacks::writer derivative to store the samples that come out in a format I can work with. I’m running NUTS with adaptation via a call to stan::services::sample::hmc_nuts_diag_e_adapt(). (I’d be happy to give more technical details if necessary, but I didn’t want to clutter the original post, and I don’t think they’re needed for this discussion.) This setup at least appears to run Stan successfully.
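
For orientation only, here’s a stripped-down sketch of the member at the heart of this – the templated log_prob the services code expects. The class name and the body are placeholders, not my actual code:

#include <stan/model/prob_grad.hpp>
#include <ostream>
#include <vector>

class MyModel : public stan::model::prob_grad {
 public:
  MyModel() : stan::model::prob_grad(1) {}  // one unconstrained parameter

  template <bool propto__, bool jacobian__, typename T__>
  T__ log_prob(std::vector<T__>& params_r__, std::vector<int>& params_i__,
               std::ostream* msgs__ = 0) const {
    T__ lp(0.0);
    // ... call into the external framework to accumulate the log likelihood ...
    return lp;
  }
  // (other model-concept methods omitted here)
};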

Now I’m trying to test whether the fitting is actually working using an extremely simple model before I trust it to do anything sensible elsewhere. I’ve run up against a feature I didn’t expect and don’t understand. My model is a quadratic function with a single parameter, y = a x^2, and my likelihood is a unit normal centered on the true value a_0, with y evaluated at only a single data point, x = 1, so that the likelihood reduces to L(a) = N(a - a_0 | 0, 1). For the sake of simplicity, I’m assuming the prior on a is flat, so it contributes nothing to the total LL. This makes log_prob, as a function of a, look like the plot below (in this example a_0 = 1.8, selected at random):

When I run Stan, it’s able to find the correct value a_0 with no trouble, but no matter how many samples I run in warmup (I’ve tried as many as 10K) or how large I set the max tree depth (I’ve gone as high as 17), literally 100% of my post-warmup samples saturate the max tree depth.

I’ve inserted lots of debugging code to ensure that the return value of log_prob is literally what I plot above, so I know Stan’s getting the right LL values. And the LL surface obviously doesn’t have any strange features for the sampler to get stuck in. I’ve read what I can find in the Stan manual and online (including the page about warnings) and I can’t work out whether I should be concerned.

I’m a bit worried, though, that this might indicate I have a problem in my implementation, and I especially worry that sampling won’t work correctly when I try a more complex model. (I’ve experimented with more complex models and seen the same behavior, along with samples that don’t match my expected posterior surface, but with those it’s less clear whether the model or my Stan usage is at fault.)

Any insight – suggestions for things to check, further reading in the Stan manual or elsewhere that I might have missed, assurance that everything is ok and I don’t need to worry, etc. – would be extremely welcome. Thanks!

I think we’ll need some more technical detail to help you out. Have you written the equivalent program in the Stan language to see whether it shows the same problem? If it doesn’t have that property, then there’s a disconnect between what you think you’ve implemented directly in C++ and what’s being generated from the Stan language.

If it does have the same problem, then it’s something else.

I’m not exactly sure what your example model is… does it look like…

data {
  real x;
  real y;
}
parameters {
  real a;
}
model {
  target += normal_lpdf(a | 1.8, 1);
}

I don’t know how you’re "find[ing] the correct value of a_0" when your only parameter is a?

Maybe if you clarified a bit, I’d be able to help more.

Hi,

Thanks for offering to help. Yeah, you’re basically right. My model has undergone a couple of different revisions as I’ve repeatedly simplified it down to try to understand what’s going on. I think the original I began with would be equivalent to:

data {
  int<lower=0> N;
  real x[N];
  real y[N];
}
parameters {
  real a;
}
model {
  for (n in 1:N)
    target += normal_lpdf(y[n] | a * square(x[n]), 1);
}

with the following data:

N <- 5
x <- c(1, 2, 3, 4, 5)
y <- c(1.8, 7.2, 16.2, 28.8, 45.0)

(which is just y=a_0 x^2 with a_0=1.8 as noted in the previous post.)

I eventually simplified it so that I was only using N=1, which means x^2 = 1 for that single point, hence what I wrote in the first post. Sorry it wasn’t clearer.

I ran the Stan program above in PyStan and it worked fine: no divergences, no saturated tree depth, nothing. It finds a=1.8 as well:
[PyStan posterior plot for a, centered at 1.8]

So clearly there’s something wrong with the way I had it implemented. I guess I’ll look at the autogenerated C++ code for the model again, but in the meantime any other advice for things to look at would probably get me there faster. :)

Thanks!

Also, I’m happy to give more details on my implementation if there are particular things you’d be interested in reading. Unfortunately because it’s part of a larger framework the entire code is way too bulky to copy-paste here. But I can extract pieces as needed.

One more thing I have noticed. In the PyStan version with the model above, I note:

Gradient evaluation took 6e-06 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
Adjust your expectations accordingly!

whereas in my C++-adapted version I have

Gradient evaluation took 0 seconds
1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Adjust your expectations accordingly!

My impression is that this is due to the use of autodiff in my C++ version vs. the explicitly calculated gradients for normal_lpdf in the PyStan one, but I haven’t been able to confirm that by reading the Stan manual or googling. If that’s not right, I’d be interested to know whether it indicates a possible problem, and what it might imply.

Update: I tried using so-called “test gradient mode” and it looks like there’s a problem for sure:

TEST GRADIENT MODE

 Log probability=-244.45

 param idx           value           model     finite diff           error
         0             2.5               0          -685.3           685.3

but I don’t know where that means I should look next…

I don’t know either. Is the C++ code somewhere public? If it is, I could take a quick look. If not, then you’re on your own. It’s super-hard to debug without even seeing any of the code.

Hi,

Yeah, I understand. I was kind of hoping that somebody would just recognize the symptoms and be able to give me a conceptual answer that points me in the right direction. But I do appreciate your effort nonetheless.

Unfortunately the code isn’t public. I’m not trying to be cagey, but it’s part of a large framework that I don’t have the rights to ‘declassify’ and even if I could my guess is that I’d be wasting your time digging through a class hierarchy that’s largely not relevant to the problem.

I’m going to try to write up a simpler test case that exhibits the same behavior which I can share here so that it’s straightforward to look at. Hopefully that won’t take me more than a day or two.

Thanks for sticking with me.

@jwolcott, no prob!

Yeah… that’s usually the right way to go about it. Unfortunately, this problem is in the details. Just based on what output you’ve shown, it looks like the autodiff isn’t actually working. Maybe trace through and make sure everything is using the correct types?

Understood.

Great. If you look in Stan’s unit tests, there are simple tests that demonstrate calling the different functions. Maybe that’ll clue you in on what’s happening?


Actually… that might be the conceptual hint I was looking for. I just realized that I’m discarding the stan::math::var-ness of the parameters that get passed into log_prob(): I call stan::math::var::val() on them in order to pass them into the library function I need to use to calculate the likelihood. (Whew.) I’d written this code ages ago and forgotten about it until you just mentioned the types. Obviously that prevents autodiff from tracing into it.
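
To make sure I’ve understood the failure mode, here’s a tiny self-contained sketch of the pattern (the “library” call is a stand-in I made up, using the toy log density from this thread):

#include <stan/math.hpp>
#include <iostream>

// Hypothetical stand-in for the black-box library call; it only accepts doubles.
double external_library_loglike(double a) { return -0.5 * (a - 1.8) * (a - 1.8); }

// BROKEN pattern: stripping the var down to a double severs the autodiff tape,
// so the value is right but the gradient comes back as zero.
stan::math::var broken_log_prob(const stan::math::var& a) {
  double ll = external_library_loglike(a.val());  // gradient information lost here
  return stan::math::var(ll);                     // a constant as far as AD is concerned
}

// Templated pattern: the same arithmetic written so T can be double or var.
template <typename T>
T working_log_prob(const T& a) {
  return -0.5 * (a - 1.8) * (a - 1.8);
}

int main() {
  stan::math::var a1 = 2.5;
  stan::math::var lp1 = broken_log_prob(a1);
  lp1.grad();
  std::cout << "broken gradient:  " << a1.adj() << "\n";   // prints 0

  stan::math::var a2 = 2.5;
  stan::math::var lp2 = working_log_prob(a2);
  lp2.grad();
  std::cout << "working gradient: " << a2.adj() << "\n";   // prints -0.7
}

The broken version is exactly the zero-gradient behavior the test-gradient output showed.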

So… I guess I have a conceptual followup, then. Is there a recipe somewhere for handling the case where I have a “black box” library function that I can’t rewrite to work with Stan vars? Unfortunately my likelihood isn’t always a closed-form mathematical function of the parameters – the library function I have to use basically does a lookup in a precalculated likelihood stored as a histogram. (If the answer to this question depends on the details of the implementation, I’ll proceed with my simplified test case and we can talk about it then.)

There are some circumstances where I could conceivably calculate the gradients myself analytically. But unfortunately that doesn’t work for all my use cases.

(edit: the “black box” function isn’t a blocking conceptual problem for the test case I showed early in this thread, but it is for the whole use case I’m working towards and thus the implementation I’m using.)

This is probably not going to work. You need something that produces well behaved gradients.

Well… they are. As a point of fact, it’s not actually a histogram that’s used for the lookups; it’s a third-order polynomial spline built from the histogram, so it’s both continuous and differentiable… But the point still stands that I have a “black box” that I can’t pass a stan::math::var to.

Yup, no matter what you do, something has to produce those gradients. It could be a polynomial you fit to the black box, it could be analytical, or you could template the entire black box so that our AD works with it. The fact is, something still has to produce them, and they have to be well behaved enough not to throw off the HMC.
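
For the “provide the gradients yourself” route, stan::math::precomputed_gradients lets you attach externally computed partials to the autodiff tape. A minimal sketch, with the black-box calls stubbed out (those two functions are placeholders, not your library):

#include <stan/math.hpp>
#include <iostream>
#include <vector>

// Stand-ins for the black-box library: value and gradient of the log likelihood,
// all in plain doubles (here just the toy density -0.5 * (a - 1.8)^2).
double blackbox_loglike(const std::vector<double>& theta) {
  return -0.5 * (theta[0] - 1.8) * (theta[0] - 1.8);
}
std::vector<double> blackbox_loglike_grad(const std::vector<double>& theta) {
  return {-(theta[0] - 1.8)};  // analytic here; could be spline- or FD-based
}

// Wrap the double-only black box so its result participates in reverse-mode AD.
stan::math::var blackbox_lp(const std::vector<stan::math::var>& theta) {
  std::vector<double> theta_d(theta.size());
  for (size_t i = 0; i < theta.size(); ++i)
    theta_d[i] = theta[i].val();
  double value = blackbox_loglike(theta_d);
  std::vector<double> partials = blackbox_loglike_grad(theta_d);
  return stan::math::precomputed_gradients(value, theta, partials);
}

int main() {
  std::vector<stan::math::var> theta{2.5};
  stan::math::var lp = blackbox_lp(theta);
  lp.grad();
  std::cout << theta[0].adj() << "\n";  // -(2.5 - 1.8) = -0.7
}

The partials you hand it can come from an analytic formula, from differentiating the spline, or from finite differences – but, as above, they have to be accurate enough not to throw off the HMC.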

So, I’ve managed to hack my way just deep enough into the framework that I was able to get autodiff to work in my test scenario. That immediately fixed the problem – I don’t come anywhere close to saturating the tree depth, I get a sensible distribution of a values around a_0, etc. So, thank you to @syclik for pointing me in the right direction!

Now, while I’m pondering whether I can get in far enough to make autodiff work for my general case, I’m wondering about something: is it crazy to consider hooking up the finite-differences calculation used in the “test gradients” mode to serve as the “explicit calculation” of the gradients for my “black box” calculator? I know it’s supposed to be slower, but is it orders of magnitude slower in a typical problem? Or does it vary so much from problem to problem that you can’t give me a useful rule of thumb? (I might be willing to pay the penalty in order not to have to hack in even deeper, depending on how much penalty I’d expect to pay.)

Thanks!

Glad you were able to figure it out.

Not crazy. But, you have to understand the limitations of finite differences to know whether that’s going to work. Autodiff is precise up to machine tolerance (if implemented correctly). Finite differences, however, doesn’t have that property and it’s not hard to end up with gradients that are very far from the true gradients. (It depends on what order method you use.)

And that all feeds into the HMC machinery. In some preliminary simulations, it looks like single-precision isn’t good enough for HMC.

If you had N parameters, more specifically N unconstrained parameters, then the time to compute all gradients using reverse mode is O(1) with respect to the time to compute the function without the gradients.

With finite differences, using a two point method, you’re looking at O(N).
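
To make the counting concrete, a two-point (central) difference gradient looks roughly like this – 2N evaluations of the log density per gradient, with the step size h trading truncation error against round-off (the toy density here is just a placeholder):

#include <functional>
#include <iostream>
#include <vector>

// Central (two-point) finite differences: 2 * N evaluations of f per gradient,
// versus the single reverse-mode sweep autodiff needs.
std::vector<double> finite_diff_grad(
    const std::function<double(const std::vector<double>&)>& f,
    std::vector<double> theta,
    double h = 1e-6) {   // step size: truncation error vs. round-off
  std::vector<double> grad(theta.size());
  for (size_t i = 0; i < theta.size(); ++i) {
    const double orig = theta[i];
    theta[i] = orig + h;
    const double fp = f(theta);
    theta[i] = orig - h;
    const double fm = f(theta);
    theta[i] = orig;
    grad[i] = (fp - fm) / (2.0 * h);
  }
  return grad;
}

int main() {
  // Toy log density from this thread: -0.5 * (a - 1.8)^2, true gradient -(a - 1.8).
  auto lp = [](const std::vector<double>& t) { return -0.5 * (t[0] - 1.8) * (t[0] - 1.8); };
  std::cout << finite_diff_grad(lp, {2.5})[0] << "\n";  // approximately -0.7
}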

What’s faster will depend on the implementation and the constant out in front. Stan is really fast for autodiff; see the Math autodiff paper for a comparison against the implementation of the function without gradients, an autodiff version, and against other systems: [1509.07164] The Stan Math Library: Reverse-Mode Automatic Differentiation in C++

Essentially, yes. You also have to account for the accuracy of the method and how it plays with HMC.


This is all really good food for thought, thanks.

I think I’ve gotten what I need from this thread so I’ll mark it ‘solved’.

Thanks again for all your help.
