Nuisance parameters prior from posterior

I was discussing an issue with a friend yesterday, and he had some bio-data where he was interested in several thousand genes across several batches of experiments. Of course, each batch has specific nuisance parameters associated with it: temperature varies, enzyme activity varies, different people did different incidental things to the samples etc.

When you try to fit these nuisance parameters simultaneously with a model for thousands of real effects on the thousands of genes of interest… it can be challenging, and lead to tiny step sizes, divergences, etc.

Now, with his data, he also has additional genes, and he has observed that the nuisance parameters tend to come out the same regardless of which subset of genes he uses, as long as the subset isn’t too small (i.e. not just 1 or 2 genes, but with a random sample of 50 or 100 genes he gets pretty much the same nuisance values).

So the idea we had was to randomly select a moderate number of genes from the ones he isn’t interested in, fit a model to this, get the posterior distribution for the nuisance parameters, and then try to propagate those nuisance parameters into his “real” model on thousands of genes of interest so that the prior for the nuisance parameters is highly informed, thereby hopefully improving the convergence of the sampler.

So, supposing that you had samples for the nuisance parameters from a first small run, how might you use them in a second Stan run?

One thought I had was to define the prior in the “real” model in terms of a very basic base prior, and then a tight, normal-distribution-based “likelihood” over the samples from the previous run, which you then pass in as data; a rough sketch is below. Thoughts on how to make this whole idea work well would be welcome.
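To make that concrete, here’s a rough sketch of the kind of thing I’m imagining (the names, the base prior, and the tightness tau are all placeholders, not a worked-out model):

data {
  int<lower=1> K;                  // number of nuisance parameters (e.g. 10)
  int<lower=1> S;                  // number of posterior draws kept from the small run
  array[S] vector[K] prior_draws;  // those draws, passed in as data
  real<lower=0> tau;               // how tightly the draws should pin the nuisance parameters
  // ... data for the thousands of genes of interest ...
}
parameters {
  vector[K] nuisance;
  // ... gene-level parameters of interest ...
}
model {
  // very basic base prior
  nuisance ~ normal(0, 10);
  // tight normal "likelihood" with the previous run's draws playing the role of observations
  for (s in 1:S)
    prior_draws[s] ~ normal(nuisance, tau);
  // ... likelihood for the genes of interest ...
}

One thing I’m already unsure about: treating each of the S draws as an independent “observation” makes the implied prior on nuisance tighter as S grows, so S and tau would presumably have to be chosen together.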

not just 1 or 2 genes, but with a random sample of 50 or 100 genes he gets pretty much the same nuisance values

Hierarchical modeling seems appropriate here.

So the idea we had was to randomly select a moderate number of genes from the ones he isn’t interested in, fit a model to this, get the posterior distribution for the nuisance parameters, and then try to propagate those nuisance parameters into his “real” model on thousands of genes of interest so that the prior for the nuisance parameters is highly informed, thereby hopefully improving the convergence of the sampler.

It should be relatively easy to test whether highly informed priors improve the inference speed, I think. Just run the slow fit, get the posteriors, and then use those to estimate really accurate priors. If the model then runs super fast, maybe getting a better guess is the way to go. If it’s still slow, it’s probably worth looking elsewhere.

How strongly does the data inform the posteriors? Are they super vague (because then the hierarchical stuff might help)? Or does the data make them super tight?

So, supposing that you had samples for the nuisance parameters from a first small run, how might you use them in a second Stan run?

I don’t think there’s any way to break inferences into pieces. All the parameters inform all the other ones, and it takes all the data to do that.

I think this conversation comes up when folks ask about having one part of a model inform another, but not the reverse (maybe one part is easy to compute, but another part is hard, or something). This is the “cut” thing in JAGS/BUGS, apparently. I’m not familiar with it myself, other than that I wanted it once, looked around, tried some stuff, and then realized the story was more complicated than I thought and I should probably not do it :P. (There’s a cool presentation somewhere about how JAGS/BUGS isn’t really doing the “right” thing either; I can find it if you want.)

If you Google around for that stuff you can find Andrew’s comments on it and such. There’s some stuff on the old mailing list too.

One way to get around this a little is multiple imputation. In this context, that means you randomly select sets of parameters from previous fits, run a bunch of different fits with them as fixed data, and mix it all together again (rough sketch below). It corresponds to integrating the parameters out manually, I think. I’m not super familiar with when this is a good thing or a bad thing, but usually there are downsides to this stuff, so it’d be worth digging around. Anyway, that’s the name so you know what to look for ^^. Hope that helps!
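To sketch the mechanics, a per-imputation model could look something like this (the data structure is entirely made up; the point is just that the nuisance values enter as fixed data rather than as parameters):

data {
  int<lower=1> K;             // number of nuisance parameters
  int<lower=1> G;             // number of genes of interest
  matrix[G, K] x;             // made-up batch covariates
  vector[K] nuisance_fixed;   // one posterior draw of the nuisance parameters, passed as data
  vector[G] y;                // made-up measurements, one per gene
}
parameters {
  vector[G] effect;           // gene-level effects of interest
  real<lower=0> sigma;
}
model {
  effect ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // nuisance_fixed is a constant here, not a parameter
  y ~ normal(effect + x * nuisance_fixed, sigma);
}

You’d run that once per retained draw of nuisance_fixed and then pool the posterior draws across the fits.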

Yeah, in this case it’s not a cut that we want so much as a fit which only has to estimate a few tens of gene-level parameters. As the dimensions get high, with a vague prior on the nuisance parameters, the probability that the sampler gets stuck or has problems on at least one chain during warmup seems to approach 1, but with a small number of total parameters it’s generally well behaved. So doing the run on genes we’re not interested in to get “pre-estimates” of the nuisance parameters, and then using those estimates for a prior on a big run for the many, many genes we ARE interested in, will hopefully keep the sampler from failing to warm up properly.

I do think we already know that tighter priors on the nuisance parameters help (maybe they don’t solve all the problems, but they at least reduce some of them); we just want these priors to be informed by some extraneous data we happen to have before running the big inference on the main stuff of interest.

The biggest question is how to use samples from the first small run within the code for the second run to create a prior. Suppose there are maybe 10 nuisance parameters: 10D kernel density estimation seems like it would be a nightmare… we could just create, say, normal approximations using the mean and sd of each parameter separately, but that loses all the correlation information from the first run (and I’m not sure whether that’s a problem here). I suppose with only 10 nuisance parameters we could do a full multi-normal estimate and just hand the mean vector and covariance matrix to the final run as data… but that doesn’t seem to scale so well to cases with more parameters.
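For the multi-normal option, I’m picturing something like this, where mu_hat and Sigma_hat would just be the sample mean and covariance of the first run’s draws, computed outside Stan and passed in as data (names are placeholders):

data {
  int<lower=1> K;            // number of nuisance parameters (say 10)
  vector[K] mu_hat;          // mean of the nuisance draws from the small run
  cov_matrix[K] Sigma_hat;   // covariance of those draws
  // ... data for the genes of interest ...
}
parameters {
  vector[K] nuisance;
  // ... thousands of gene-level parameters ...
}
model {
  // informed prior that keeps the correlations from the first run
  nuisance ~ multi_normal(mu_hat, Sigma_hat);
  // ... likelihood for the genes of interest ...
}

If the dimension grew, the Cholesky factor of Sigma_hat could be precomputed and multi_normal_cholesky used instead, but for 10 parameters the plain version seems fine.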

Technically, if you estimate from N = 100, you can use that as a prior in estimating the remaining N’ - N parameters and get the same answer. The computation will probably be better behaved if the N = 100 case controls the computations for the remaining parameters. You can do this exactly if things are conjugate.
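Spelling that out (with theta standing for the shared nuisance parameters, and the two batches of genes assumed conditionally independent given theta):

$$
p(\theta \mid y_{1:N'}) \propto p(y_{(N+1):N'} \mid \theta)\, p(y_{1:N} \mid \theta)\, p(\theta) \propto p(y_{(N+1):N'} \mid \theta)\, p(\theta \mid y_{1:N}),
$$

so the first-stage posterior can serve as the second-stage prior and give the same answer, provided it is carried over exactly rather than approximated.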

Or you can just run the small fit and then plug those values in as constants if they really are well identified.

Otherwise, I don’t see how to do much more other than formulate a better prior for the larger run.

Right, assuming it’s not conjugate, and so you’d like to use Stan’s output from a small run to inform the big run, the question comes down to something like: what is the best way to tell Stan your prior if all you have are samples from it?

Treat it as data. But this then requires you to model it. For example, if y_prior are draws from the prior and y is your data (also assumed to be drawn from the prior), and your prior is normal(mu, sigma), then it’s just

y_prior ~ normal(mu, sigma);
y ~ normal(mu, sigma);

It essentially uses the posterior p(mu, sigma | y_prior) as the prior for mu and sigma when estimating y.
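A self-contained version of that toy example (the variable names are just placeholders):

data {
  int<lower=1> S;
  vector[S] y_prior;   // draws from the earlier run, treated as data
  int<lower=1> N;
  vector[N] y;         // the new data
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_prior ~ normal(mu, sigma);   // the prior draws inform mu and sigma
  y ~ normal(mu, sigma);         // so p(mu, sigma | y_prior) effectively acts as their prior
}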

Yeah, that’s not quite what I’m thinking. Suppose I have, say, a 10-dimensional set of nuisance parameters involved in a complex multi-thousand-dimensional problem. I get some small subset of data, fit a hierarchical model with just a few hundred parameters, and get posterior samples of the nuisance parameters. They live on a probably correlated manifold in 10D space.

Now I have a big set of data, and I want the 10D samples of the nuisance parameters to help me define my prior, so that when I run my multi-thousand-dimensional model it uses a reasonably informed prior, thereby avoiding some pathologies that would come about with a less informed one.

One thing I could do, of course, would be something like a 10D independent normal with mean and sd equal to the marginal mean and sd of the samples… but this loses all the dependency structure…

The other thing I could imagine would be to define a density using a 10-dimensional kernel density estimate and a few hundred of the samples… that would probably be computationally expensive to calculate and would depend heavily on the choice of kernel in 10 dimensions!
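Just to be explicit about what I mean, something along these lines, where the centers are a few hundred retained draws and h is the bandwidth I’d worry about choosing (all names are placeholders):

data {
  int<lower=1> K;              // dimension of the nuisance vector (say 10)
  int<lower=1> S;              // number of draws from the small run kept as kernel centers
  array[S] vector[K] centers;  // those draws
  real<lower=0> h;             // kernel bandwidth, shared across dimensions
  // ... data for the genes of interest ...
}
parameters {
  vector[K] nuisance;
  // ... gene-level parameters ...
}
model {
  // KDE prior: equal-weight mixture of Gaussian kernels centered on the draws
  {
    vector[S] lp;
    for (s in 1:S)
      lp[s] = normal_lpdf(nuisance | centers[s], h);
    target += log_sum_exp(lp) - log(S);
  }
  // ... likelihood for the genes of interest ...
}

With a few hundred centers that’s a few hundred normal_lpdf evaluations inside a log_sum_exp at every gradient evaluation, which is why it feels expensive.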

Beyond that I’m not sure what seems good. I think the indication is that this is not really a solved problem, so there’s no good known default idea ;-)