I’m teaching myself about
log_mix() by trying to write a mixture model that compares quadratic and linear regressions. In the cases I’ve tested, the model seems to do a good job of identifying the correct data-generating process and estimating the parameters for that GDP. But with a quadratic DGP the estimates for the linear model are bad. Really bad. Really really bad.
In the case in the attached example,
lm() estimates the intercept as 24 (SE 2.5) and the slope as 1.5 (.08); this fits the data reasonably well considering it’s the wrong GDP. Stan’s point estimates for the linear side of the mixture (the
_lin parameters) are 35 (SE 13) and 1 (.15).
#> Inference for Stan model: square.
#> 2 chains, each with iter=2000; warmup=1000; thin=1;
#> post-warmup draws per chain=1000, total post-warmup draws=2000.
#> mean se_mean sd 2.5% 25% 50% 75% 97.5%
#> lambda 0.51 0.45 0.45 0.01 0.05 0.61 0.97 0.99
#> mu_lin 34.65 12.67 17.63 15.02 23.23 26.44 43.48 79.89
#> beta_lin 0.99 0.15 3.08 -7.72 0.82 1.44 1.56 7.60
#> mu_sqr 28.93 21.64 24.55 4.63 9.28 13.70 47.65 85.09
#> beta_sqr 1.41 2.27 4.00 -9.01 -0.27 3.04 3.31 7.48
#> gamma_sqr -0.06 0.03 0.73 -1.83 -0.04 -0.03 0.00 1.58
#> sigma_lin 13.24 2.93 117.75 0.67 4.74 8.07 9.16 40.97
#> sigma_sqr 17.65 5.42 139.43 0.45 4.65 5.50 6.49 71.78
#> lp__ -177.78 10.96 11.10 -192.60 -188.27 -179.79 -166.89 -162.40
#> n_eff Rhat
#> lambda 1 11.90
#> mu_lin 2 1.32
#> beta_lin 419 1.02
#> mu_sqr 1 1.83
#> beta_sqr 3 1.15
#> gamma_sqr 527 1.00
#> sigma_lin 1613 1.00
#> sigma_sqr 663 1.00
#> lp__ 1 5.16s
In addition, the plot at the end of the script and shown below (NB this chunk depends on the tidyverse) shows the posterior samples for the fitted values (the \hat y's). The fitted curves for the quadratic regression are in blue; they all look good. The linear model’s curves are in red; many have basically nothing to do with the data.
square.R (1.5 KB)
square.stan (1.3 KB)
I just realized that the effective sample sizes and R-hats are also really bad across the board. I reverted the Stan code to an earlier version, which uses the same standard deviation \sigma for both the linear and quadratic regression. The ESSes are still a little low, but much better than the version in the OP. The estimates for the quadratic regression are good; those for the linear regression are still bad; and the final plot is qualitatively the same as the version in the OP.
Inference for Stan model: square.
2 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
lambda 0.06 0.01 0.05 0.00 0.02 0.04 0.07 0.21 95
mu_lin 45.63 1.28 20.59 4.90 31.20 45.54 59.34 85.88 260
beta_lin 0.72 0.43 4.26 -8.12 -1.47 0.83 2.63 9.93 99
mu_sqr 9.45 0.19 2.40 4.94 7.84 9.39 10.97 14.11 154
beta_sqr 3.16 0.02 0.21 2.75 3.03 3.18 3.30 3.61 176
gamma_sqr -0.03 0.00 0.00 -0.04 -0.04 -0.03 -0.03 -0.02 199
sigma 5.64 0.04 0.62 4.61 5.21 5.61 5.96 7.22 274
lp__ -167.74 0.29 2.44 -173.45 -169.02 -167.43 -166.09 -163.56 70
square.stan (1.2 KB)
square.R (1.5 KB)
Drive-by comment because I am in the right timezone:
- Did you get warnings about divergences?
- These mixture models can be poorly identified. I think the original model with different sigma’s shows signs of this problem. The true DGP looks fairly linear for x < 35, so I wonder whether non-identifiability is a problem here.
- I understand the goal is to understand
log_mix() better so the following might not be relevant. I think if your goal would be to figure out whether the DGP is linear or quadratic, it might be better to estimate both models separately and compare them with
The two-sigmas model had warnings about divergences, but not the model with a single, shared sigma.
Maybe I don’t understand how non-identifiability could be manifesting here. If both component models were reasonable fits, wouldn’t lambda (the mixture parameter) remain near 0.5 and the components would contain their respective best estimates? Instead it looks like the whole model is identifying the true DGP pretty well, with lambda near 0 and good estimates for the true quadratic DGP.
More drive-by commentary:
- When the quadratic term is exactly zero the two models are completely identical, and hence unidentifiable. I would suspect the model is very poorly identified in this neighbourhood.
- Having a common sigma makes very little sense, as adding an extra polynomial term adds a lot of flexibility to the model and the value of sigma needs to shrink considerably. Taking a punt here, if we fix sigma to the value we get under the linear model and try and add a quadratic term, the posterior for the quadratic term is likely to be close to zero because the model is attributing all remaining variation to noise. The value for sigma would need to decrease in order for the posterior for the quadratic term to move away from zero.
- I would be looking at the
pairs() plots for the models, you don’t have a lot of parameters and they can tell you a lot about where your model is misbehaving.
- Also I’ve never seen regression model selection framed as a mixture model before, not to say it cannot be framed like this, but I’m not sure it make sense here? Mixtures, to me at least, are more commonly employed when your data are produced from more than one data generating process. Here, we have only one data generating process, the quadratic. Maybe it would make sense if for some values of x the y's were generated from a linear polynomial, and then for a separate region of x values the y's were generated from a quadratic. Combined with changing \lambda to also be a function of x, this could maybe make sense? I could be talking absolute rubbish though.
- One way to check all of these things is to run both prior and posterior predictive checks,in order to see what possible data sets your model can generate, as well as what data sets your posterior distributions can generate. Both of these can then be compared to your data set at hand to see if the model is feasible.
After some further experimentation,
- The priors on the linear model need to be flatter than those on the quadratic model. Small changes to the quadratic DGR can lead to large changes in the best frequentist linear model fit, especially as the true DGP becomes “more quadratic.”
- Using two sigma terms pretty consistently produces worse samples, in terms of ESS, R-hat, degenerate posteriors, etc. ¯\_(ツ)_/¯
- The sampler seed can make a huge difference in the quality of the samplers.
- The sampling problems get worse, not better, as the true DGP becomes “more quadratic.” That seems to support my impression that identifiability isn’t the issue.
Since there were questions about why I’d use mixtures for this, let me reframe the issue. I’m interested in doing Bayesian model selection between these two regression models. In terms of a mixture model, the mixture parameter gives the probability distribution over the two regressions. Apparently I’m not the first person to think this way. (That earlier thread also reports sampling problems, and also didn’t seem to get any resolution.) This approach also seems to be in line with Andrew Gelman’s standard advice, namely, to fit one big multilevel model that contains all your various hypotheses or comparisons.
So is there some other multilevel approach to Bayesian model selection in Stan that I’ve just missed?
The simplest way to do Gelmanian full modelling is to center
x so that x_center and x_center^2 are roughly orthogonal and just run the quadratic model. The coefficient on x_center^2 will give you an idea of how quadratic the relation is.
loo approach to model comparison still seems a more reliable route to me. The mixture model approach is more akin to the Bayes Factor approach and assumes that you have the right model as a possibility (which is the case in your simulation but may not be true in the actual use case).
loo allows that you do not have the right model.
I am still not convinced you have no identification problems. See also http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html. To identify sampling problems in general you probably want to have a look at the relation between parameters. Michael does this in the case study with mu1-mu2 scatterplots.