I was discussing an issue with a friend yesterday. He had some biological data where he was interested in several thousand genes across several batches of experiments. Of course, each batch has its own nuisance parameters: temperature varies, enzyme activity varies, different people handle the samples in slightly different incidental ways, and so on.

When you try to fit these nuisance parameters simultaneously with a model for the real effects on the thousands of genes of interest, the fit can be challenging, leading to tiny step sizes, divergences, and so on.

Now, his data include many additional genes, and he has observed that the nuisance parameter estimates come out essentially the same regardless of which subset of genes he uses, as long as the subset isn't too small (i.e., not just 1 or 2 genes; with a random sample of 50 or 100 genes he gets pretty much the same nuisance values).

So the idea we had was this: randomly select a moderate number of genes from among those he isn't interested in, fit a model to just that subset, obtain the posterior distribution for the nuisance parameters, and then propagate that posterior into the "real" model on the thousands of genes of interest as a highly informed prior on the nuisance parameters, hopefully improving the convergence of the sampler.
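Concretely, the "propagate" step could be as simple as collapsing the pilot run's draws into a location and scale per nuisance parameter. A minimal numpy sketch, with simulated draws standing in for the real pilot-run output (the array shapes and names here are placeholders, not anything from his actual models):

```python
import numpy as np

# Hypothetical pilot-run output: rows are posterior draws,
# columns are nuisance parameters (say, one per batch).
rng = np.random.default_rng(0)
draws = rng.normal(loc=[0.5, -1.2, 0.3], scale=0.1, size=(4000, 3))

# Summarize each nuisance parameter by its posterior mean and sd;
# these become the prior location/scale passed as data to the second run.
prior_mean = draws.mean(axis=0)
prior_sd = draws.std(axis=0, ddof=1)
```

Note that per-parameter means and sds throw away any posterior correlation between the nuisance parameters; if the pilot posterior is strongly correlated, the empirical covariance of the draws would be the thing to carry forward instead.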

So, supposing that you had samples for the nuisance parameters from a first small run, how might you use them in a second Stan run?

One thought I had was to define the prior in the "real" model as a fairly vague base prior combined with a tight normal "likelihood" centered on summaries of the samples from the previous run, which you pass in as data. Thoughts on how to make this whole idea work well would be welcome.
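One way that informed prior might look in the second model is to pass the pilot-run summaries in through the data block. A rough Stan sketch, where the names (`nuisance`, `nuisance_prior_mean`, and so on) are placeholders rather than anything from the actual models:

```stan
data {
  int<lower=1> B;                        // number of batches
  vector[B] nuisance_prior_mean;         // posterior means from the pilot run
  vector<lower=0>[B] nuisance_prior_sd;  // posterior sds from the pilot run
  // ... data for the genes of interest ...
}
parameters {
  vector[B] nuisance;                    // batch-level nuisance parameters
  // ... real effects of interest ...
}
model {
  // Informed prior: centered on the pilot run's posterior.
  nuisance ~ normal(nuisance_prior_mean, nuisance_prior_sd);
  // ... likelihood for the genes of interest ...
}
```

If the nuisance posterior from the pilot run is correlated across batches, `multi_normal` with the empirical covariance of the draws would be a natural extension of the same idea.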