Bayesian Inference with free optimisation parameters

Hello Stan folks!

I’m facing a specific problem within Bayesian Inference (BI), and I’m hoping to find out whether or not there is a way to solve this problem using the tools provided by Stan. It is the first time I’m working with BI numerically, so please bear with me.

The problem is this: A certain experiment can have four different outcomes. For each outcome, there is an associated likelihood function, which is given by an exact analytical expression. Evaluating each of these likelihoods is not computationally cumbersome. In addition, the likelihood functions depend on external, real parameters. In each step of the inference, these parameters should be optimised according to some general principle (e.g., minimisation of the variance of the parameters we seek to estimate).

Since the posterior should, asymptotically, not depend too much on the chosen prior, I am probably going to go with a Gaussian prior, parametrised by some mean and variance.

My question, albeit a fairly general one, is: can I do this in Stan?

Do the four likelihoods depend on the same parameter? It’s probably easier if you can write a simplified version of your model (the maths, that is). I understand that you might not be at liberty to share the exact details, so a simplified, bare-bones version should do.

Yes, all the likelihood functions depend on the same parameters. Thus, in each step of the inference, the likelihood should be updated depending on the outcome of the experiment. Also, before each experiment, the real parameters should be optimised. Hence, what is needed in each step is the numerical evaluation of the marginal distribution as well as the optimisation of the parameters. So what I have to work with is the following:

P(\theta \mid X ) = \frac{P(X \mid \theta; a,b) P(\theta)}{\int P(X \mid \theta; a,b) P(\theta) d\theta}
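
To make the structure concrete, here is a bare-bones Stan-style sketch. The softmax/categorical likelihood below is only a placeholder for my actual analytical expressions, and the names N, X, a, b are just illustrative:

```stan
data {
  int<lower=1> N;                 // number of experiments performed so far
  int<lower=1, upper=4> X[N];     // observed outcome of experiment n
  vector[N] a;                    // design parameter a used in experiment n
  vector[N] b;                    // design parameter b used in experiment n
}
parameters {
  real theta;                     // the quantity I actually want to estimate
}
model {
  theta ~ normal(0, 1);           // Gaussian prior; mean and sd to be chosen
  for (n in 1:N) {
    // placeholder for the exact analytical outcome probabilities
    vector[4] p = softmax([a[n] * theta, b[n] * theta,
                           -a[n] * theta, -b[n] * theta]');
    X[n] ~ categorical(p);
  }
}
```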

I hope this is sufficient; let me know otherwise.

OK, we’re slowly getting there. So you want to optimise a and b at every step of the MCMC, conditional on the sampled \theta?

Indeed, the goal is to estimate \theta. Given some prior distribution, the following should more or less happen in each step:

  • Determine a and b (via numerical optimisation, e.g., maximum entropy gain or similar)
  • Perform experiment
  • Update the likelihood
  • Calculate posterior

Can it be done with Stan?

It’s still not clear to me why you would like to do that, but given some conditions, it might be possible to do what you want. If you can write the gradients \frac{\partial P }{\partial a} and \frac{\partial P}{\partial b} in closed form, you can use Stan’s algebraic solver to find the \hat{a} and \hat{b} that solve \frac{\partial P}{\partial a} = \frac{\partial P }{\partial b} = 0.
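
Something like this untested sketch, where I’ve made up a toy quadratic objective so that the gradient has an obvious root; you’d swap in your closed-form \frac{\partial P}{\partial a} and \frac{\partial P}{\partial b}:

```stan
functions {
  // Gradient of a made-up objective f(a, b) = -(a - c[1])^2 - (b - c[2])^2;
  // its root is the maximiser of f. Replace the two lines below with your
  // closed-form dP/da and dP/db.
  vector grad_f(vector ab, vector c, real[] x_r, int[] x_i) {
    vector[2] g;
    g[1] = -2 * (ab[1] - c[1]);   // dP/da
    g[2] = -2 * (ab[2] - c[2]);   // dP/db
    return g;
  }
}
data {
  real c1;   // whatever fixed quantities your objective depends on
  real c2;
}
transformed data {
  vector[2] c = [c1, c2]';
  real x_r[0];
  int x_i[0];
  vector[2] ab_guess = [0, 0]';
  // ab_hat[1] is a-hat, ab_hat[2] is b-hat, computed once before sampling starts
  vector[2] ab_hat = algebra_solver(grad_f, ab_guess, c, x_r, x_i);
}
model {
  // prior on theta and the four-outcome likelihood (using ab_hat) go here
}
```

The point being that ab_hat behaves like data: it’s computed once, before sampling, and never sees \theta.
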
@bbbales2, am I seeing things here or is this on the right track?

The reason for the optimisation step is that these parameters provide ‘feedback’ to each and every experiment being performed. Hence, when optimised, they provide the optimal experimental setup so that the actual unknown parameters \theta can be determined more efficiently.

Perhaps I can phrase it like this: is it possible, within the Stan package, to numerically optimise functions, e.g., using golden-section search or some other standard method?

I am trying to be as clear as possible, but the numerics of BI is really new to me. Thanks.

EDIT: It appears to me that having likelihood functions depend on additional (fixed or otherwise) parameters should not be unusual or strange. However, the situation could perhaps be resolved if one views them as just additional (random) parameters that ought to be estimated. This, however, does not seem to be the best solution, since these parameters are not truly random.

Stan has an algebraic solver, which you can exploit to find the pair a, b that maximise a given function (of \theta, I’m assuming). Also,

P(\theta \mid X ) = \frac{P(X \mid \theta; a,b) P(\theta)}{\int P(X \mid \theta; a,b) P(\theta) d\theta}

should really be

P(\theta \mid X; a, b ) = \frac{P(X \mid \theta; a,b) P(\theta)}{\int P(X \mid \theta; a,b) P(\theta) d\theta}

since you’re not marginalising over a and b.

I still don’t fully understand what you’re trying to do and suspect strongly there might be some misunderstanding about what Bayesian inference is. I recommend you write down what your experiments are, what the data are and then what you expect out as inferences.

Think of it like this: if the experiments have already been performed, then a and b have already been optimised and what you should actually do is plug in the values of a and b for each data point. If you don’t know a and b, you’re in hot water, because the optimal a and b seem to depend on \theta, which you also don’t know – and on X, which you do know, but still. So even if you optimise to find \hat{a} and \hat{b} at every iteration, all you’re doing is a very fancy parameter transformation; your prior on \theta will induce a prior on a and b. Ditto for the posterior.

Thanks for pointing out the missing a and b.

Okay, then I will look into that solver. Thanks again for your expertise!

It isn’t unusual. Just think of estimating the mean of a Gaussian when you know the variance. What is unusual, though, is having these “fixed” parameters depend on the unknowns (\theta in your case). If a and b don’t depend on \theta, it’s all rosy. If you can write the gradient of a “nice” function f(X; a, b) with respect to a and b, you should be able to use the algebraic solver to get your answer - and chuck it in the transformed data block. But, again, if a and b depend on \theta, they are random under the Bayesian paradigm, since a prior measure \Pi(\theta) will push a measure onto a, b.

Let me clarify: the parameters ‘a’ and ‘b’ do not depend on theta - they are not random variables.

Now: Indeed, the likelihood functions are given as analytical expressions and so I can differentiate them, no problem. What remains is to sort out whether the algebraic solver can help me. Usually, algebraic solvers are used in software such as Mathematica to solve (algebraic) equations. However, I do not intend to work with algebraic expressions here - in order to calculate the statistical properties of \theta, what I need is probably samples as per the MCMC technique - are you saying that the algebraic solver can somehow aid me here? If so, is there an example of this somewhere you can point to so I can verify this claim (i.e., a problem where a minimisation problem is solved using the algebraic solver provided by Stan)?

This is my suspicion: I don’t think calculating the gradient of the likelihood and subsequent use of the algebraic solver will help me - what I need is a numerical solver. If Stan does not provide this, that’s alright - I can use the native stuff that comes with, e.g., Python. Moreover, using the gradients will probably induce unwanted numerical noise in the solutions, something I would like to avoid.

Lastly, considering the fact that my data arrives in a sequential fashion (with optimisation of the parameters ‘a’ and ‘b’ in each step of the sequence), i.e., not all at the same time, I am starting to wonder whether or not I need to use sequential MCMC. As far as I can tell this is not yet provided by Stan - maybe there is a workaround somehow. Can you point me in any direction w.r.t. this? Thanks again.

If you have a function f(X; a, b) for which you want to find \hat{a}, \hat{b} that maximise it, then, if f is well-behaved, you can write the equations \frac{\partial f}{\partial a} = 0 and \frac{\partial f}{\partial b} = 0 and use the algebraic solver to find those roots (numerically; don’t be fooled by the “algebraic” in the name).

I couldn’t find a minimisation example, but this should get you started with solving equations numerically via the algebraic solver. You can also code up a Newton solver in the functions block if you want.
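
For instance, a minimal hand-rolled Newton iteration could look like the following (toy quadratic objective again; f_prime, f_second and newton_max are made-up names, to be replaced with your own derivatives):

```stan
functions {
  // Toy example: maximise f(a) = -(a - c)^2 by solving f'(a) = 0 with Newton.
  // Swap in your own first and second derivatives.
  real f_prime(real a, real c) {
    return -2 * (a - c);
  }
  real f_second(real a, real c) {
    return -2;
  }
  real newton_max(real a0, real c, int iters) {
    real a = a0;
    for (i in 1:iters)
      a = a - f_prime(a, c) / f_second(a, c);
    return a;
  }
}
transformed data {
  real a_hat = newton_max(0, 3.5, 20);   // converges to 3.5 here
}
model {
  // rest of the model goes here
}
```

Since everything here is data-only, it all runs once before sampling starts.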

This is why I have been urging you to carefully consider what your data is and what inferences you want out of this exercise. If a and b don’t depend on \theta, you could always use any external routine in Python or R or whatever to get the “optimal” a and b and then pass them as data to Stan. What I discussed above is an alternative to externally deriving a and b: you’d find them conditional on X and then write them in the transformed data block, whereas if you determine them externally, you’d chuck them in the data block.

It depends on what your goals are, really. If you need to learn about \theta “on-line”, i.e., as data arrives, that’s more complicated. But if you don’t, just accumulate the data, derive the a’s and b’s and pass them as data.
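
If you do end up needing the on-line flavour, one approximate workaround (not exact sequential inference; I’m assuming the posterior for \theta stays roughly Gaussian, which also matches the Gaussian prior you mentioned) is to summarise each fit’s posterior by its mean and standard deviation and pass those in as the prior for the next batch. A sketch, reusing the hypothetical data layout from the earlier sketch:

```stan
data {
  int<lower=1> N;                 // experiments in the current batch
  int<lower=1, upper=4> X[N];
  vector[N] a;                    // design values actually used, passed as data
  vector[N] b;
  real prior_mean;                // posterior mean of theta from the previous fit
  real<lower=0> prior_sd;         // posterior sd of theta from the previous fit
}
parameters {
  real theta;
}
model {
  theta ~ normal(prior_mean, prior_sd);
  // ... same four-outcome likelihood as in the earlier sketch ...
}
```

For the very first batch you’d just plug in the mean and sd of your original prior.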

If you can provide more detail, I can try and help a bit more.

@aleli, I have been following along and think it might be helpful for everyone if you could clarify some things.

If I understand your problem correctly, your workflow is something like the following.

  1. Design an experiment with a1 & b1 as parameters and collect data
  2. Based on the data, estimate theta
  3. Use an optimiser with theta as input to derive optimal a2 and b2
  4. Collect new data in an experiment based on a2 and b2
  5. Go to 2.

Is this what is going on or am I still misunderstanding your problem?