Compute likelihood ratio for model comparison

I want to compute likelihood ratio of two generalized linear models where one model is nested: say Model 1 has three parameters m0, m1 and m2 (so y=f(m0+m1*x1+m2*x2)) while Model 2 is restricted such that m2=m3=0. I have several ideas how to compute LR and I want some feedback which of these is correct and what are their advantages and disadvantages.

Idea 1: Compile and sample the two models separately and obtain log-likelihood of each model with get_lp(). Then compute the likelihood ratio outside of STAN.

Idea 2: Savage-Dickey density ratio. Evaluate only Model 1. Compute LR=p(m1=0,m2=0|D)/p(m1=0,m2=0).

Idea 3: Introduce model switching indicator variable. Say:

y=f(m0+h*m1*x1+h*m2*x2) h~ Bernoulli(p)

Then compute LR=(1-E[p])/E[p] where E is expectation.

In addition, I see many people here nowadays are evaluating the model likelihood in the generated quantities block. Why is that?

Based on several ideas you listed it seems to me that you are not so much interested in computing the likelihood ratio, but comparing these models with some approach which is not necessary LR.

This approach is likely to favor the more complex model.

This seem to require marginal likelihood, which is difficult to compute.

You can’t use discrete variable in Stan, but you could use continuous h and then the model would correspond to having a joint prior on m1 and m2, which is sensible in this case. If m0, m1 and m2 are orthogonal then you could look at the posterior of hm1 and hm2.

To compute leave-one-out cross-validation


1 Like

Dear Prof. Vehtari, thanks for Your ideas.

I explored all of the above ideas and wanted to report what I got so far.

Idea 1: Strongly prefers the simple model. My guess is that the likelihood of the more complex models includes the likelihood of all three parameters (vs. likelihood from just one parameter) so the likelihood of the complex model is always smaller.

Idea 3: I followed this suggestion by Bob on how to implement the model comparison as a mixture model. The comparison slightly favors the simple model with mean mixing coefficient at ~0.3 and the estimate has very range. In a fake data simulation, the estimate of the mixing coefficient was insensitive to the changes in the amount of evidence, so I abandoned this idea.

Idea 2: This worked really well. Modeling the parameter distribution did not work. Then I just estimated a model without prior on m1 and m2 and estimated the joint density of m1 and m2 at 0 by applying kernel density estimation algorithm to the samples of m1 and m2. Again, did fake data simulation and so far the model comparison works as expected.

p(y | theta) is the likelihood (as a function of theta)—parameters have priors p(theta) and posteriors p(theta | y).