# Logspline density estimation

#1

Hi there,

I’m playing with the idea of logspline density estimation in Stan. See these notes (http://www.econ.uiuc.edu/~roger/courses/574/lectures/L18.pdf) for a reasonably non-technical explanation. The idea is that given a sample X = \left(X_1, \ldots, X_n\right), we estimate their density f by exponentiating a linear combination of B-spline basis functions:

f(x) = \frac{\exp\left(\sum_{j=1}^P \theta_j B_j(x)\right)}{c(\theta)},

with normalizing constant c(\theta) = \int \exp\left(\sum_{j=1}^P \theta_j B_j(t)\right)\mathrm{d}t. Presumably the Stan code would be something like this:

```stan
data {
  int n;           // size of X-sample
  int P;           // dimensionality of spline basis
  matrix[n, P] B;  // basis function evaluations at sample X-points
}

parameters {
  vector[P] theta; // spline coefficients
  ...              // other smoothing stuff
}

model {
  // log_c = log c(theta) would have to be computed numerically
  target += B * theta - log_c; // subtracting log_c from each of the
                               // n rows gives the -n * log c(theta) term
  ...                          // suitable priors for theta and other smoothing stuff
}
```
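
For reference, the target increment above corresponds to the log-likelihood

\log L(\theta) = \sum_{i=1}^n \sum_{j=1}^P \theta_j B_j(X_i) - n\log c(\theta),

where the first term is the sum of the entries of B\theta, and subtracting \log c(\theta) once per observation supplies the -n\log c(\theta) term.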


The challenge is with \log(c(\theta)). Although it is a “normalizing constant” for the density, it is not constant with respect to the parameters \theta and so I think it must be included in the log posterior. However, it doesn’t have a closed form - presumably it should therefore be estimated numerically, but I’m not sure if there’s a way to do this efficiently. In particular I imagine the autodiff of the numerical approximation would be a mess.

Does anyone have any experience with this kind of model in Stan - or any kind of density estimation involving some normalizing function? Any insight is greatly appreciated.

#2

> I think it must be included in the log posterior.

Correct.

> I imagine the autodiff of the numerical approximation would be a mess.

It is, but it can work surprisingly well haha.

The easiest way to do this robustly for now is probably with the ODE integrators: https://mc-stan.org/docs/2_18/functions-reference/functions-ode-solver.html
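
As a rough sketch of what that could look like: treat C(t) = \int_{t_{lo}}^{t} \exp(\theta^\top B(s))\,\mathrm{d}s as the state of a one-dimensional ODE and integrate it across the support, so C(t_hi) = c(\theta). Everything below is illustrative: `t_lo`, `t_hi`, and `knots` are assumed to be declared as data, and `basis_eval` is a hypothetical user-written helper (body omitted) that evaluates the P basis functions at a point.

```stan
functions {
  // Hypothetical helper (definition omitted): evaluate the P B-spline
  // basis functions at t from the knot sequence, e.g. via Cox-de Boor.
  vector basis_eval(real t, real[] knots, int P);

  // ODE right-hand side: dC/dt = exp(sum_j theta_j B_j(t)),
  // so starting from C(t_lo) = 0 gives C(t_hi) = c(theta).
  real[] c_rhs(real t, real[] C, real[] theta, real[] knots, int[] x_i) {
    real dCdt[1];
    dCdt[1] = exp(dot_product(to_vector(theta),
                              basis_eval(t, knots, x_i[1])));
    return dCdt;
  }
}
transformed parameters {
  real log_c;
  {
    real C[1, 1] = integrate_ode_rk45(c_rhs, rep_array(0.0, 1),
                                      t_lo, rep_array(t_hi, 1),
                                      to_array_1d(theta), knots, { P });
    log_c = log(C[1, 1]);
  }
}
```

The autodiff then flows through the ODE solution, so no hand-derived gradient of the quadrature is needed.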

There’s a 1D quadrature thing that’d be better for this, but it’s been stalled in pull requests since I haven’t pushed it forward (sorry :D!).

You might get some mileage out of coding up a quadrature manually. If you do, it would probably be worth computing an error estimate in generated quantities and tracking it: double the number of quadrature nodes (or similar) and compare that estimate against the one you use in your model.
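
A sketch of that idea with a fixed grid: since the basis matrix at the quadrature nodes doesn't depend on \theta, it can be precomputed and passed in as data, along with a second matrix on a finer grid for the check. All the names here (`Bq`, `log_w`, etc.) are made up for illustration, and the weights would come from whatever rule you pick (Simpson's, say), computed outside Stan.

```stan
data {
  int n;                 // size of X-sample
  int P;                 // dimensionality of spline basis
  matrix[n, P] B;        // basis at the sample points
  int K;                 // number of quadrature nodes
  matrix[K, P] Bq;       // basis at the quadrature nodes
  vector[K] log_w;       // log quadrature weights (e.g. Simpson's rule)
  matrix[2 * K, P] Bq2;  // basis on a grid with twice the nodes
  vector[2 * K] log_w2;  // log weights for the finer grid
}
parameters {
  vector[P] theta;       // spline coefficients
}
transformed parameters {
  // log c(theta), kept in log space for numerical stability:
  // log sum_k w_k exp(B(t_k)' theta) = log_sum_exp(Bq * theta + log_w)
  real log_c = log_sum_exp(Bq * theta + log_w);
}
model {
  target += B * theta - log_c; // log-likelihood
  theta ~ normal(0, 1);        // placeholder prior
}
generated quantities {
  // Same quadrature on the finer grid; the gap between the two
  // estimates is a cheap per-draw error diagnostic to monitor.
  real log_c_fine = log_sum_exp(Bq2 * theta + log_w2);
  real log_c_err = fabs(log_c_fine - log_c);
}
```

If `log_c_err` stays negligible across draws, the coarse grid is fine; if not, refine it.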

#3

Thanks Ben!

I’ll give the ODE method a try…although my use case actually involves estimating a lot of densities, so it may become infeasible after all.

Is there a built-in way to get an error estimate from the RKF45 scheme? Since the integral is really “part of the likelihood”, I suppose that a numerical approximation would result in some kind of “approximate posterior” (in addition to the usual MCMC uncertainty).

#4

> Is there a built-in way to get an error estimate from the RKF45 scheme?

The ODE integrator functions have tolerance arguments that you can use.

> I suppose that a numerical approximation would result in some kind of “approximate posterior”

Yeah, but just keep the tolerance low enough so that you don’t have to worry about this stuff :D.

@maxbiostat is working through some custom C++ code that uses integrals and such in a model over here: Problem implementing Zipf (power law) distribution -- external C++ . You might be able to use that here.

It’d probably be faster than the ODE integrator for this. That Boost integration code is the same stuff being used in the as-yet-unfinished 1D integrator.
