The consequences of approximate gradients

Hi Folks,

I’m taking a look at a project I have had on the backburner for a while now, and am running into the same issues that I did back then.

The gist is that I have interfaced Stan with a finite element PDE solver written in Fortran (not by me), that simulates data, and the derivative of the data w.r.t. the parameters. The parameters are material properties for the cells in an initial mesh that the user specifies, but for each calculation, the code does a series of mesh refinements for the actual field simulations.

I normally use the original standalone Fortran program for point estimates with meshes that have thousands of parameters (using a built-in optimizer), so the ideal scenario would be to obtain parameter distributions using Stan models that are analogous to that objective function (or not; one thought was to estimate parameters for a small number of basis functions instead of the entire mesh, but I guess I haven’t gotten that far yet).

The issue is that, when running CmdStan in diagnostic mode, there’s a significant error between model gradient, and the finite difference approx. that Stan calculates. I’ve confirmed that the Fortran-to-Stan interface I wrote is getting the same gradient as when I just directly plug the params into the standalone Fortran program, so I’m somewhat convinced at this point that the error is in the approximate gradient. I haven’t determined how much error (or instability?) the adaptive mesh refinement scheme could be introducing to the FD gradient approx. The errors are pretty significant, ranging from ~0.1 to thousands.

I’m only starting to test this, but so far with small problems (< 10 params) using simulated data, I can still recover the true param vals. using the Stan optimization routines. My concern though (without knowing NUTS very well) is how will the gradient error affect the NUTS sampler, and if this basically means that NUTS is a no-go for this.

1 Like

My experience is that it’s hard to tell how much a faulty gradient would affect the mixing, and it’s always depends on the model.

In general it’s worth investigating if AMR gradient is off by thousands, especially when some FEM solver’s a posteriori error estimator is based on \mathcal{H}^1. Is it possible to fix your solver’s AMR depth so you can do a convergence test on derivative g vs mesh size h?

I assume AMR=Auto Mesh Refinement?

The code takes the target error tolerance, and the max # of refinements as, as user parameters, so at the least I can bound it (not sure if this is what you mean?) There are also a host of other refinement options hidden in the code (most of which aren’t really documented), so I’m holding out hope that somewhere in there is a way to balance speed and accuracy.

EDIT: Just tried a mesh w/ ~200 params (still pretty sparse for this type of problem), and the decrease in gradient error was pretty dramatic. Still getting a couple of params w/ error in the thousands, but looks like most of them are below 200 now (this was also w/ max refinement steps = 1).

My bad, there are way too many acronyms flying on discourse. AMR=adaptive mesh refinement.

Not knowing your model, I can only guess that there might be some hot spots in the field with large gradient. Depends on a posteriori error estimation, AMR may or may not refine these spots. A coarse mesh doesn’t see these gradient, and you see reduced error against FD results. But as mesh refines to similar scale of the hot spot size, gradient at a node in the hot area might be polluted by elements around it. The treatment is always problem dependent, but if you are confident on both PDE solver and FD solution, one option is to ignore some(but not all) of the nodes/elements with large gradient.

See for some discussion about what happens when there are errors in the gradient calculation. If you can compute the log target density sufficiently accurately then the most clear indication that gradient error is big enough to be problematic will be the average acceptance probability dropping no matter how you tune the integrator step size. This will often cause the step size adaptation to fail, so you would want to look for the step size being tuned to extremely low values and the average acceptance probability staying below 0.8.

If you the errors afflicting the gradients also afflict the function evaluation itself then you won’t be able to compute the transitions probabilities the dynamic HMC implementation that drives Stan and it’s not clear how the problems will manifest.

Also keep in mind that finite difference is itself an approximation and can be easily biased by poor floating point arithmetic so it may be worth spending a bit of time working out your own finite difference calculation with a carefully chosen epsilon.

1 Like

Hmm, do you mean design the mesh so that the “hot spots” basically have larger/more blocky elements than the rest of the mesh? I think the challenging part there, is that once I run this thing on real data, then a-priori I have little to no knowledge of what the (spatial) distribution of the parameters should be. One thought though, is that I could obtain point estimates (either using the built in optimizer or Stan’s), and then do a “manual” mesh coarsening around the regions where the model looks like it’s trying to fill things in.

Thanks for the resource. Something like this is what I was worried about. I suspect that running the model will end up taking days (a 2 parameter “test run” w/ only 100 warmup and 100 saved samples, using 40 cores on a cluster already took ~10 hours), and that at that point debugging these kinds of problems will basically be futile.

If I can just get point estimates using Stan though, I think that’ll already be a pretty big win, since being able to specify statistical models is IMO going to be a huge win compared to being locked into (basically inflexible) built in optimizer. Even if I can’t get full bayesian estimates, being able to do things like specify custom data distributions in a hierarchical model, or use different priors (compared to the one regularizing functional in the built-in optimizer), would be useful.

The “data”/responses returned by the FEM code is also a nonlinear function of the parameters, so I wonder how much error could be simply due to that.

One thought that comes to mind, is that I could test this on meshes that have horizontal symmetry (i.e. 1D models). For 1D models, I have a closed form solution for the gradients, so I could compare that to the FEM solutions to try to get a feel for how bad it is. Not sure how well that result will extrapolate though (because of the way the boundary conditions are handled in the FEM code, I suspect that it will do suprisingly well on the 1D models…)

This can also be achieved by playing with refinement depth and tolerance (always fun).

To large PDE models, MCMC is rarely applicable without some surrogate/reduced-order modeling in-between. If with 2 parameters the ballpark nb. of elements is also 2, then it’d be unrealistic to apply the method to actual applications.: the finer the mesh, the closer the inference gets to its nonparametric nature, since your parameter is a discretized field function.

I don’t have the AMR diagnostic output on hand, but the number of elements in the 2 parameter model for the actual forward calculation was >= ~400. Each forward call actually winds up being a barrage of forward calculations on a number of meshes (I think it’s one per data point), each of which does it’s own AMR, so it gets really hard to slog through the AMR diagnostic output, but >= 400 seemed to be what I was seeing.

My knowledge of FEM methods is pretty limited (I’ve only written a few toy ones for my own edification), but based on my limited experience, this particular code is really really damned fast. Regardless, I think you’re right, that for anything realistic I’ll have to explore representing simplifying the parameter space somehow (if I ever get that far with this).