Good example of PR with function, documentation and gradients

James_Savage · April 26, 2019, 2:38pm

There are a few functions I use regularly that would probably be useful additions to Stan. I’d like to contribute these functions and their gradients. I’ve read Mitzi’s guide to doing so here: https://github.com/stan-dev/stan/wiki/Contributing-New-Functions-to-Stan but find it more a developer reference than tutorial for new developers. I was hoping someone (@mitzimorris or @syclik or @Bob_Carpenter perhaps) could point me towards an old (best practice) PR where someone contributed a new function and its analytical gradients?

sakrejda · April 26, 2019, 3:13pm

If you can comment on what’s missing we could make a separate tutorial (yes that’s the sound of me volunteering). That might be an effective way of meeting your goal.

You might also want to say what these are since not all suggested additions end up being accepted.

James_Savage · April 26, 2019, 3:52pm

Sure. Below is a description cc @seantalts

Aggregate random coefficients logit is a common discrete choice model used for supply-demand modeling when you observe aggregate choices. The most socially useful applications of the model is by justice departments, analyzing the competitive consequences of firm mergers.

On the demand side of the model it uses a simulated likelihood setup: individual i in market t choses between j\in 1:J alternatives described by an observed choice attribute vector of length P, X_{jt} and unobserved scalar \xi_{jt}–potentially related to X through some model. Each potential choice provides the decisionmaker with utility:

u_{ijt} = X_{jt} \beta_{i} + \xi_{jt} + \epsilon_{ijt}

where \epsilon \sim \text{Gumbel()}. The distribution of preferences \beta_{i} is Gaussian with a correlation matrix \Omega, scale \tau, and Cholesky factor of the correlation matrix L_{\Omega}. If people make the choice that maximizes their utility, then their choice probability is softmax

p_{ijt} = \frac{\exp(X_{jt} \beta_{i} + \xi_{jt} )}{\sum_{k = 1}^{J}\exp(X_{kt} \beta_{i} + \xi_{kt})}

Now if we observed individual choices and \xi was unrelated to X, we already have our likelihood and we’re done. But we observe aggregate choices only—counts of choices within each market. So we simulate the likelihood by first drawing some standard normal vectors, one for each simulated decision-maker:

z_{i}\sim \text{Multi normal}(0, I_{P})

Then simulate each person’s choice probability:

\beta_{i} = \hat{\beta}_{(P\times 1)} + \text{diag}(\tau)_{P\times P} L_{\Omega(P\times P)} z_{i(P\times 1)}

P_{i(J\times 1)} = \text{softmax}(X_{t(J\times P)}\beta_{i(P\times 1)} + \xi_{t(J\times 1)})

We then simply ~~do Monte Carlo integration~~ average across all the simulated decisionmakers

s_{jt} = \frac{1}{I}\sum_{i} P_{ji}

Which gives us the generative model for our choice counts:

y_{t} \sim \text{multinomial}(s_{t}, N)

The function I’d be adding is the one that gives us s_{j} given X, z and \xi, as well as the gradients (described on p6 here: http://faculty.wcas.northwestern.edu/~ane686/supplements/Ras_guide_appendix.pdf). Currently autodiff is quite slow (I presume because of the monte carlo integration step).

seantalts · April 26, 2019, 5:34pm

I think this is a pretty clean example PR of adding a function and its gradients: https://github.com/stan-dev/math/pull/658

The wiki page you linked to is a community maintained guide that we hope to keep up to date for exactly this request, though it’s become long enough and includes enough material that maybe we should have a shorter wiki page that just leads someone through a simple example.

James_Savage · April 26, 2019, 5:52pm

Thanks Sean!

Bob_Carpenter · April 27, 2019, 4:16am

Yup—lots of different things people might want to do. It’s hard to write one-size-fits-all doc.

The example @seantalts linked was for a new density function with vectorization, so it’s pretty complicated. It involves the operands_and_partials abstraction. The upside is that it produces performant code at all levels of autodiff and also allows efficient vectorization. This is probably closest to what you want to do.

At the lowest end is a simple function that uses the precomputed gradients vari. Beyond that would be writing a custom vari class. An example of a customvari for a scalar function is https://github.com/stan-dev/math/pull/856.

At an intermediate level would be matrix derivatives. At the base, those might use adj_jac_multiply or they could write their own custom vari with interactions with the various stacks to deal with virtual vs. precompiled function calls.

Bob_Carpenter · April 27, 2019, 4:19am

Do you have working code you could share where the problem is autodiff speed? That’s helpful to triangulate along with the math.

wds15 · April 29, 2019, 10:56am

I think you can seriously speed up your case by using map_rect. If you do the MC by-patient integration in that function (do the sharding per subject or per groups of subjects), then the map_rect will internally build an AD tape for each subject only and so to say compress the AD tape immediately when evaluating the contributions from a given subject.

It won’t be nice to code - sorry for that - but I am relatively sure that this problem will benefit a lot from map_rect.

James_Savage · May 11, 2019, 2:58am

Thanks for this! Will attempt this weekend and let you know how it goes.

Topic		Replies	Views
Helpful function repository General	11	927	January 14, 2021
Case study: Extreme value analysis and user defined probability functions in Stan Modeling	4	2293	November 15, 2017
Need for occasional consulting on stan model development Jobs	0	525	March 2, 2020
Tutorial: Using R's Distribution Functions in Stan Models (external C++) General	2	178	July 31, 2024
Stan reading material General	6	535	March 7, 2023

Good example of PR with function, documentation and gradients

Related topics