# Help needed- Model Averaging in Hierarchical Model

Hi,
I’m trying to work out a way to efficiently do model comparison for very different models, please see below
I am currently attempting to create a model that comes in the form of a hierarchical regression. Lines are fit to a set of data, and the slopes of each lines are assumed to be drawn from some common distribution.

Unfortunately, each “line” in my dataset is not necessarily linear and can be produced by a number of different generative models which contain no information about the line slope. One such model is a model where circles or ellipses are fit to the data instead of lines. In this case, the only shared parameter between the two models is the standard deviation of the noise on the data, \sigma. Effectively the metric of comparison between the two models is something like the goodness of total least squares fit between them (which is similar to the Akaike Weight).

I am trying to come up with a scheme where the probability of a dataset being linear is determined and then this is taken into account at the higher stage of the hierarchy. Because the data can be generated by the line OR circle model, then the likelihood should just be the sum of their likelihoods, i.e. if we have a line slope \theta then we can get the posterior probability:

P(\theta|y) \propto P(\theta)(P(y|\theta,M_{line})+P(\ y|\theta,M_{circ}))

If we try and fit all the parameters in the model at once, then we almost never achieve convergence because each model represents a different mode. However, if we use the marginalized equation above it becomes clear that P(y|\theta,M_{circ}) is a constant, so if we can find the value of this constant then we can incorporate this into our model. I was wondering if there was a computationally efficient way of computing this (effectively you’d want to be computing P(y|M_{circ})) for each y). I have a fairly efficient way of doing the fits for each model (takes around ~2s for 10000 draws of each) in Stan. Would something like an Akaike weighting scheme work here, given that generally in the past I’ve found WAIC/LOO-CV estimates to be fairly unstable each time the model is run? Perhaps using AIC would be the most stable weighting system in this case?

Does anyone on this forum have any suggestions on what the best metric to tackle a problem like this would be?

Hey it’d be useful if you share your Stan code with perhaps some example data. I personally find it fairly difficult to understand what you mean.

If the same data is generated by two models the sum of their likelihoods IMHO isn’t an obviously meaningful quantity. If it was a mixture model and part of the data was generated by each model then that would make sense and it would need a mixing proportion as well as priors.

Ok, let’s take the simple version before the model hierarchy. We have two data vectors, let’s call them y and z. Let’s go over a few cases:
Here we have data where y and z have been created from gaussian noise about a line. The parameter of interest in this case is the slope of the line, which should be hierarchical for many of these experiments, e.g. for y_k Here we have data where y and z have been created from displacing points from a circle. Although we’re interested in the slope and intercept of this line, calculating it for these data will yield a biased result. Here we have data where y and z have also been created from displacing points from a circle segment, however as the circle radius increases, it becomes harder to distinguish this from a line, and a line is generally a more parsimonious model. In the case of M_{circ}, I want to be sampling slopes from the prior as I am currently unsure of how to get the relevant parameter from the circle (It’s not necessarily the ratio of the tangents with the lines z=0,y=0).

My stan models for fitting both of these cases work well, but getting weighted samples from both is where I’m having my headache.

I think the sum of the likelihoods * priors is meaningful in this case. This comes from a Bayesian Model Averaging framework, (see e.g. https://www.stat.colostate.edu/~jah/papers/statsci.pdf , if you follow through equations 1 and 2 then you get something proportional to the sum of the likelihoods*priors provided you have equal priors for each model). Alternatively, I can compute this by computing the evidence for each model directly, or some kind of Akaike weight for an information criterion.

As I’ll be computing them multiple times, I want to be able to find a way of computing these weights that is a). computationally fast and b). does not give significantly different results each time I run the program.

This may be wrong, but I’d find it surprising: If two models are independent and fitted in the same Stan code; it’s my understanding that they should fit just the same as if they were fitted separately. Meanwhile both models add to the target variable, which by your description would make this setup akin to model averaging given that they had the same priors.

I guess the artefact of averaging are the samples, and averaged samples can be collected from the generated quantities block. The generated quantities block does not affect the model fitting; so it the model fits, you can generate quantities from it. If I remember right the averaging can be done using something like the log likelihood ratio of the models but I don’t remember so well.

Since the _rng functions cant be used in the model block I’m not sure this would help you feed the results up a hierarchy if it was all part of the same mega model.

It sounds like you might be fitting your models using just one chain if you get convergence but different results each time your run them. This hints at non identifiability of your models. To see this, run your model with 3-4 chains and observe the Rhat and neff values in the output.

Apologies if I’ve misunderstood you again, @andre.pfeuffer may be able to help more here. I haven’t used Bayesian model averaging before.

Can you share the reasoning behind your derivation?

I think I see where the source of confusion can be here.

Say I have a set of K models, then my posterior probability of a value \theta given some data Y is given by:

P(\theta|Y)=\sum_{i=1}^K P(\theta|Y,M_i)P(M_i)

where
P(M_i)=\frac{P(Y|M_i)}{\sum_{i=1}^K P(Y|M_i)}

and
P(\theta|Y,M_i)= \frac{P(Y|\theta , M_i)P(\theta | M_i)}{P(Y|M_i)}

We can do some cancelling from these last 3 equations and get that

P(\theta|Y) \propto \sum_{i=1}^K P(Y|\theta , M_i)P(\theta | M_i)

I was considering a case in which \theta is a ‘dummy variable’ actually doesn’t exist in the likelihood function for all i>1. This is because if one of the other models fits very well, we know something is wrong, so we shouldn’t be fitting \theta using this experiment. When i>1, the likelihood reduces to a constant which is effectively the value you get by marginalizing out the likelihoods of all the other parameters. Then I consider the prior for that model to be the same, so the likelihood effectively works out as the sum of the likelihoods, due to the shared prior. Because:

P(\theta|Y) \propto P(\theta) \sum_{i=1}^K P(Y|\theta , M_i)

However, if you make all the variables in the model ‘dummy variables’ so they’re shared between both models, it’s hard to jump from one model to the other (only narrow regions of parameter space in which they connect). You also aren’t really accounting for the complexity in different models, so this is really just a metric of goodness of fit, which will always be better with more complicated models.

So, I can use the model evidence or some kind of Akaike type weighting scheme to find the average value of \theta for each experiment. I guess this works, the difficulty will be calculating P(Y|M_i) for each model.
That’s fine, but then what if we have some set of experiments, each with \theta_j that is controlled by some hyperparameter \phi.
P(\phi,\theta_j|Y_j)=P(Y_j|\theta_j)P(\theta_j|\phi)P(\phi)

But what I will have for each \theta_j is an expression for P(\theta_j |Y_j), Not P(Y_j|\theta_j). Because I’m assuming that all models have the same P(\theta_j) , I think that we can obtain a form for P(\theta_j|Y_j) from the sum of likelihoods? Perhaps this is not the case or a good assumption. But because of the model averaging I have to compute P(Y_j|M_{ij}) for at least one model, and then weights for every other model.

Any suggestions on a scheme that would allow me to do this? I might reasonably have to do this for ~N =20 experiments and perhaps up to K = 3-4 models for a single \phi so it’s imperative that I at least get something that will give similar results each time it’s run, which is why I’m thinking about going with Akaike weighting.

1 Like

I need to think about this a bit more. In the mean time, you might want to check Testing hypotheses via a mixture estimation model.

How about instead of model averaging have the both models in the same model via continuous model expansion? If you parameterize the circle model with inverse radius, then inverse radius = 0 is the linear model. This way you would have just one model and you can examine the posterior of the inverse radius. Would that be ok?

Can you elaborate what do you mean by unstable? AIC is certainly more stable, but also likely to have bias.

How about instead of model averaging have the both models in the same model via continuous model expansion? If you parameterize the circle model with inverse radius, then inverse radius = 0 is the linear model. This way you would have just one model and you can examine the posterior of the inverse radius. Would that be ok?

This would work from the perspective of the radius I think, the problem I have is that fundamentally under this model it’s hard to describe the parameter of interest, that is the slope of the line.

If we have a set of K points then we can model the data as being something like:
x_k=x_c+r\cos(\omega_k),y_k=y_c+r\sin(\omega_k)

The uncertainty in the ‘slope parameter’ for the circle model would probably look something like the range of tangents to the circle constrained by the data, i.e. a wrapped uniform distribution between \omega_0+90^{\circ} and \omega_K+90^{\circ}. When the circle radius is infinite, this collapses to an infinitessimally small range, and for large circle radius, there is a large uncertainty on this parameter as desired. However, it’s hard to envisage or formalize how this works when the slope angle parameter is also controlled by a hyperparameter for multiple experiments. This is because the slope angle parameter isn’t informing the data in a clear way, it’s not the mean parameter of a distribution, just… somewhere within the uniform. Any ideas?