Good example of PR with function, documentation and gradients

There are a few functions I use regularly that would probably be useful additions to Stan. I’d like to contribute these functions and their gradients. I’ve read Mitzi’s guide to doing so here: but find it more a developer reference than tutorial for new developers. I was hoping someone (@mitzimorris or @syclik or @Bob_Carpenter perhaps) could point me towards an old (best practice) PR where someone contributed a new function and its analytical gradients?


If you can comment on what’s missing we could make a separate tutorial (yes that’s the sound of me volunteering). That might be an effective way of meeting your goal.

You might also want to say what these are since not all suggested additions end up being accepted.

Sure. Below is a description cc @seantalts

Aggregate random coefficients logit is a common discrete choice model used for supply-demand modeling when you observe aggregate choices. The most socially useful applications of the model is by justice departments, analyzing the competitive consequences of firm mergers.

On the demand side of the model it uses a simulated likelihood setup: individual i in market t choses between j\in 1:J alternatives described by an observed choice attribute vector of length P, X_{jt} and unobserved scalar \xi_{jt}–potentially related to X through some model. Each potential choice provides the decisionmaker with utility:

u_{ijt} = X_{jt} \beta_{i} + \xi_{jt} + \epsilon_{ijt}

where \epsilon \sim \text{Gumbel()}. The distribution of preferences \beta_{i} is Gaussian with a correlation matrix \Omega, scale \tau, and Cholesky factor of the correlation matrix L_{\Omega}. If people make the choice that maximizes their utility, then their choice probability is softmax

p_{ijt} = \frac{\exp(X_{jt} \beta_{i} + \xi_{jt} )}{\sum_{k = 1}^{J}\exp(X_{kt} \beta_{i} + \xi_{kt})}

Now if we observed individual choices and \xi was unrelated to X, we already have our likelihood and we’re done. But we observe aggregate choices only—counts of choices within each market. So we simulate the likelihood by first drawing some standard normal vectors, one for each simulated decision-maker:

z_{i}\sim \text{Multi normal}(0, I_{P})

Then simulate each person’s choice probability:

\beta_{i} = \hat{\beta}_{(P\times 1)} + \text{diag}(\tau)_{P\times P} L_{\Omega(P\times P)} z_{i(P\times 1)}
P_{i(J\times 1)} = \text{softmax}(X_{t(J\times P)}\beta_{i(P\times 1)} + \xi_{t(J\times 1)})

We then simply do Monte Carlo integration average across all the simulated decisionmakers

s_{jt} = \frac{1}{I}\sum_{i} P_{ji}

Which gives us the generative model for our choice counts:

y_{t} \sim \text{multinomial}(s_{t}, N)

The function I’d be adding is the one that gives us s_{j} given X, z and \xi, as well as the gradients (described on p6 here: Currently autodiff is quite slow (I presume because of the monte carlo integration step).


I think this is a pretty clean example PR of adding a function and its gradients:

The wiki page you linked to is a community maintained guide that we hope to keep up to date for exactly this request, though it’s become long enough and includes enough material that maybe we should have a shorter wiki page that just leads someone through a simple example.

Thanks Sean!

Yup—lots of different things people might want to do. It’s hard to write one-size-fits-all doc.

The example @seantalts linked was for a new density function with vectorization, so it’s pretty complicated. It involves the operands_and_partials abstraction. The upside is that it produces performant code at all levels of autodiff and also allows efficient vectorization. This is probably closest to what you want to do.

At the lowest end is a simple function that uses the precomputed gradients vari. Beyond that would be writing a custom vari class. An example of a customvari for a scalar function is

At an intermediate level would be matrix derivatives. At the base, those might use adj_jac_multiply or they could write their own custom vari with interactions with the various stacks to deal with virtual vs. precompiled function calls.

Do you have working code you could share where the problem is autodiff speed? That’s helpful to triangulate along with the math.

I think you can seriously speed up your case by using map_rect. If you do the MC by-patient integration in that function (do the sharding per subject or per groups of subjects), then the map_rect will internally build an AD tape for each subject only and so to say compress the AD tape immediately when evaluating the contributions from a given subject.

It won’t be nice to code - sorry for that - but I am relatively sure that this problem will benefit a lot from map_rect.

Thanks for this! Will attempt this weekend and let you know how it goes.