Avoiding two-stage sampling from a distribution

In my problem I have a number of input parameters lambda = (lambda1, lambda2, …, lambdak), and I have defined a univariate output function, Q(lambda), which is in general a nonlinear, black-box, deterministic function of the inputs. I have a desired target output distribution, p(Q|data), which results from fitting a parametric distribution to some data, and I want to produce an input distribution that gives rise to it. Since the function is many-to-one, there are many possible input distributions that could do this, so we need to specify a prior distribution, p(lambda), over the inputs to define a unique posterior. In this circumstance the ‘posterior’ distribution over inputs is given by

p(lambda|data) = p(lambda) * p(Q(lambda)|data) / p(Q(lambda)),

where p(Q(lambda)) is the output density that would result from our choice of priors (a sort of prior predictive distribution for Q), evaluated at Q(lambda).
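As a quick check that this reweighting really does produce the target: writing the prior predictive density as p(q) = \int \delta(q - Q(\lambda))\, p(\lambda)\, d\lambda, the ratio p(Q(lambda)|data)/p(Q(lambda)) is constant on each level set Q(lambda) = q, so

\int \delta(q - Q(\lambda))\, p(\lambda \mid \mathrm{data})\, d\lambda
  = \frac{p(q \mid \mathrm{data})}{p(q)} \int \delta(q - Q(\lambda))\, p(\lambda)\, d\lambda
  = p(q \mid \mathrm{data}),

i.e. the posterior over inputs pushes forward to exactly the fitted output distribution.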

To sample from the input posterior distribution, p(lambda|data), I have been proceeding in two stages. In the first, I independently sample inputs, lambda, from the priors and fit a kernel density estimator to the resulting samples of Q(lambda). I then use this estimate, pHat(Q(lambda)), in a second stage, where I run MCMC on the RHS of the above expression to yield samples from the posterior, p(lambda|data).
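For concreteness, here is a toy sketch of the two-stage scheme in Python – the quadratic Q, the standard normal priors, and the N(2, 0.5) target are all made up purely for illustration:

import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)

def Q(lam):
    # toy stand-in for the black-box, many-to-one map
    return lam[0]**2 + lam[1]**2

prior = norm(0.0, 1.0)    # independent N(0, 1) priors on each lambda_i (made up)
target = norm(2.0, 0.5)   # fitted p(Q|data) (made up)

# Stage 1: push prior draws through Q and fit a KDE to get pHat(Q(lambda))
prior_draws = prior.rvs(size=(10000, 2), random_state=rng)
p_hat = gaussian_kde([Q(lam) for lam in prior_draws])

# Stage 2: random-walk Metropolis on p(lambda) * p(Q(lambda)|data) / pHat(Q(lambda))
def log_post(lam):
    q = Q(lam)
    return prior.logpdf(lam).sum() + target.logpdf(q) - np.log(p_hat(q)[0])

lam, chain = np.zeros(2), []
for _ in range(20000):
    prop = lam + 0.3 * rng.standard_normal(2)
    if np.log(rng.uniform()) < log_post(prop) - log_post(lam):
        lam = prop
    chain.append(lam.copy())

The KDE error in pHat is, of course, carried into the second stage, which is part of what makes me suspicious of the two-stage approach.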

This appears to me to be a case of a doubly intractable distribution, which poses issues for MCMC. The oddity of this problem, though, is that I can independently sample from all elements on the RHS: p(lambda), p(Q(lambda)), and p(Q|data). However, I only have a functional form for p(lambda) and p(Q|data), which is why I do the first-stage sampling: to estimate p(Q(lambda)).

I can’t help thinking that there may be a one-step sampling method here. Does anyone have an idea as to how I might do this?

Best,

Ben

I have some questions! I apologize if you’ve answered some of them already but this seems like a tricky inference so I wanted to be clear.

What is your data like? And what are these lambdas?

Is X a length-N vector of real numbers? Or is it discrete? Does it have an upper and lower bound? (If we’re going to try to put a distribution on it, we should know these things.)

Lambda is a vector of length K. Is there one Lambda_i for each X_i (in which case K == N)? Or is every X_i determined by the same set of K lambdas (and so we’re probably going to make an independent and identically distributed assumption)? Or does each X_i have its own length-K vector lambda (in which case Lambda is an N x K matrix)?

Is this at all interpretable as a regression?

Like, is this something like:

X ~ f(lambda)

Except our knowledge of lambda is noisy (so we’ll do inferences there too) and we don’t know quite what ~ means?

And do you by any chance have plots of X? Or lambda? Or just any plots that describe the data? Might make it easier to think about.

Hi Ben,

Thanks for your responses. The research problem I have is quite general. I am trying to develop a method to do parameter inversion for deterministic systems, where the map from the inputs to the output is many-to-one. For example, imagine an SIR ODE model which has a number of input parameters (lambda1, lambda2, …, lambdak): the death rates of the different compartments, the disease transmissibility, etc. Here our output summary statistic might consist of measurements of the susceptible population size at a certain point in time. Rather than deal with the data directly, we fit a distribution (for example, a normal) to the sample of summary statistics. The sample of summary statistics can be of any size – the only point is that we use it to fit a given distribution. So here we might find that our output distribution can be approximated by,

S(10) ~ N(mu,sigma),

where (mu, sigma) are estimated from our sample of measurements of the susceptible population size, S, at time t = 10. In the formulae in my original question, Q = S(10) for this example.
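For this example the forward map might be coded up roughly as follows (a sketch in Python – the initial conditions, the restriction to two parameters, and the parameter values are placeholders):

from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta, gamma):
    # standard SIR dynamics; beta = transmission rate, gamma = recovery rate
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

def Q(lam):
    # the output summary statistic: susceptible fraction at t = 10
    beta, gamma = lam
    sol = solve_ivp(sir_rhs, (0.0, 10.0), [0.99, 0.01, 0.0],
                    args=(beta, gamma), dense_output=True)
    return sol.sol(10.0)[0]   # S(10)

print(Q([1.5, 0.25]))   # one deterministic forward evaluation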

In the problem I am describing the uncertainty does not come from measurement error; the system is deterministic. It comes from the fact that the input-to-output map is many-to-one: many inputs could cause a given output value, and in turn many input distributions could cause a given output distribution. To make the problem identified, you need to specify a prior distribution over inputs, from which you obtain a unique posterior input distribution.

So here we end up with a distribution over input parameters for the SIR model that could cause a given distribution of S(10) measurements that is consistent with our prior parameter beliefs. The difficulty with sampling from this distribution is that a functional form of p(S(10)) – i.e. a sort of prior predictive distribution – is not known. This means that my current method for sampling from p(lambda|data) involves a first stage in which I estimate a functional form for p(S(10)) by independently sampling from the prior distributions of the various inputs. I can then use this estimated distribution to do MCMC in a second step. The reason for asking this question was that I am not sure whether this two-step approach is best; I suspect there is a one-step MCMC-or-otherwise sampling approach that is better.

Note that the SIR example is only one particular case of the more general problem on which I am working. In general the best thing to assume is that the output Q is a black-box function of lambda. I am not sure whether thinking about it as a type of regression is right: the function can be extremely nonlinear, and it may be impossible to write down a parametric form for it.

Anyway, hope that helps a bit. Sorry, I am aware that this is a quite general query, but I just wondered if anyone had any bright ideas!

Another way to view the problem is that whilst Q = f(lambda) is deterministic, due to the degeneracy of the function we can regard the inverse lambda ~ f^-1(Q) as stochastic.
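For instance, with a made-up Q(lambda) = lambda1^2 + lambda2^2, the preimage of an output q is the whole circle of radius sqrt(q), so a draw from the ‘stochastic inverse’ is just a draw from that level set:

import numpy as np

rng = np.random.default_rng(0)

def sample_inverse(q, n=5):
    # the preimage of q under Q(lambda) = lambda1^2 + lambda2^2 is a circle of
    # radius sqrt(q); every angle gives a distinct lambda with the same output
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.sqrt(q) * np.column_stack([np.cos(theta), np.sin(theta)])

print(sample_inverse(2.0))   # five different lambdas, all mapping to Q = 2.0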

I’ve thought about this and I don’t think I have great answers for you but I’ll give yah comments anyway cause this is the internet so why not haha.

“Rather than deal with the data directly, we fit a distribution”

I’m not convinced this is an advantage.

Is this the inference you’re trying to do?

X ~ approx_pdf()
Q(lambda) ~ approx_pdf()

Where you fit some sorta empirical distribution to X, and then say that Q(lambda) should also follow that distribution? The issue with this is that if you want to talk about probabilities of values of Q and lambda, then you’re going to have to know how the density transforms between lambda and Q.
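To spell that out (just the standard change-of-variables fact): in one dimension, for a smooth map f,

p_Q(q) = \sum_{\lambda^* :\, f(\lambda^*) = q} \frac{p_\lambda(\lambda^*)}{\lvert f'(\lambda^*) \rvert},

and for a many-to-one map in higher dimensions the sum becomes an integral over the level set f^{-1}(q) – which is exactly the p(Q(lambda)) term you’re estimating by simulation in your first stage.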

“So here we end up with a distribution over input parameters for the SIR model that could cause a given distribution of S(10) measurements that is consistent with our prior parameter beliefs.”

If you want to talk in terms of densities, this is a neat book on how to do forward models in terms of densities (using approximate expansions): http://press.princeton.edu/titles/9229.html

It’s really short and relatively easy to read. The first chapter is free online and does a good job outlining the book. There’s a section at the end (chapter 8) that talks about Bayesian inference.

“In the problem I am describing the uncertainty does not come from measurement error; the system is deterministic.”

No measurement error? System deterministic and we know it exactly? Biology ODEs? Dubious! Not saying it wouldn’t work, or that those things don’t apply in certain situations, but the probability is high that if we were talking about a specific problem I’d argue with you a bunch about this :P. It ruffles my feathers at least, haha.

Good luck!