Evaluating fit logistic regression models


I have implemented logistic regression with GP in cmdstan. What is the best way to evaluate fit? For normal models I am using residuals and p-value for \chi^2 discrepancy. Is there an equivalent of residuals, \chi^2 for logistic regression? Or should I use a more advanced techniques?

Thanks for any suggestions.

The loo package is your friend. It has a build-in compare function to estimate which model
performs best. However, take a look at neff/Rhat, divergent transitions, traceplots, pairs plots and
new fancy plots, if everything looks good.


Thanks a lot. What do you mean by new fancy plots?

My first step would be to assess that fit for some model is good enough (currently I have only one model that makes sense). Traditional folks are used to R^2 graph. Is there some equivalent in the logistic regression case?

My next step would be to add more covariates and use compare as you have suggested.

There are lots of misinterpretation with p-value, ref. to the blog of Prof. Gelman.
Especially, I doubt it makes sense in the Bayesian world, but I don’t want to open the “box of Pandora”.

With fancy plots I mean things like Rank histogram plots. Aki did and does a lot of good things.
In my small world, I mostly look if the parameters looks sound, doing boxplots and rethink
my problem, the diagnostics, my data and my priors. Changing my perspective, learn
about the data, the priors, and then improve my model(s). Eg. Having a logistic regression,
but also can use a beta logistics model. Thinking about what is the difference? What is
the multivariate variation (-> Dirichlet). Can it be justified? How that will change the looic diagnostic?

1 Like

See videos and case studies for model assessment and selection at https://avehtari.github.io/modelselection/
For logistic regression, see diabetes case study.


Many thanks to both. mcmc_trace and mcmc_rank_hist along with Rhat, n_eff show good convergence with no divergencies. However loo is not so happy. Possibly there are not enough factors?

    Computed from 4000 by 633 log-likelihood matrix
             Estimate   SE
    elpd_loo   -282.7  9.3
    p_loo        75.8  3.3
    looic       565.4 18.6
    Monte Carlo SE of elpd_loo is NA.
    Pareto k diagnostic values:
                           Count Pct.    Min. n_eff
    (-Inf, 0.5] (good)     586   92.6%   1324      
    (0.5, 0.7]  (ok)        43    6.8%   329       
    (0.7, 1]    (bad)        4    0.6%   526       
    (1, Inf)    (very bad)   0    0.0%   <NA>  

What does it mean and how to rectify it? Please note that the logistic regression is with GPs. Also, posterior classification accuracy is 0.89 and posterior balanced classification accuracy is 0.55, LOO classification accuracy is 0.94, LOO balanced classification accuracy is 0.55,

1 Like

loo output looks quite good and better than I expected for GP. If you want to be more certain, you could try running loo exactly for those 4 observations with higher k-values.

The plot looks also good.

It is good that numbers make sense. In parallel I would like to try lgpr.

  1. From tutorial I wasn’t able to grasp how to handle different number of time points for each individual. Is there an example I could use?

  2. Also, I am missing some theory for lgpr. Is there a reference? Is y~binomial(f)? Does it handle interactions up to 2? What about sex*location?

  3. With respect to my model I went as far as to draw calibration plot. For a single covariate the calibration plot doesn’t look too interesting. Idea is to get all bars on the diagonal. Right? What does it mean?

3a. I am adding covariates one by one. Calibration plot gets more interesting. Is there any better approach? BTW, alpha and rho in the trace plot are the GP parameters. I have GP1 for t=0,2,4 and GP1+GP2 for t=1,3,5.

  1. What is the meaning of classification accuracy and balanced classification accuracy?

  2. I would like to add random effect on each individual in my model? Is there an example I could use? I am OK with mixed effect models but never did for GPs. Should I expect better predictions?

No. Bars are intervals with some %, and thus on average that % of the intervals would overlap with the diagonal.

Better than interesting? I don’t understand what you mean.

See e.g. https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

See lgpr or brms

Depends on what you know about the phenomenon you are modeling

I compared LonGP https://www.nature.com/articles/s41467-019-09785-8.pdf and lgpr in https://jtimonen.github.io/lgpr-usage/basic.html and have several questions.

  1. It seems in LonGP you have used one GP for id and another GP for id and age. In lgpr you have used only GP for id and age. Is there a reason why GP for id is not useful and you dropped it?
  2. In LonGP you have used categorical kernel (which looks like identity matrix) but in lgpr you used zero-sum kernel. Is there any reason why since problems look similar?

I don’t remember the details for this, but I guess it’s identifiability issue.

This detail I do remember. We realized later that zero-sum kernel is better choice as it reduces identifiability problems, but the difference is small.

Thank you. Did you try to drop f^{(1)}(id,age) and leave just f=f^{(2)}(age)+...? Was there big difference? I am referencing the problem in lgpr tutorial. By using the intuition on mixed models it seems that it is just dropping individual specific age random effect.

In lgpr man pages https://rdrr.io/github/jtimonen/lgpr/man/ I wasn’t able to find how to specify different number of time points for each individual. In the simulate_data function t_data is common time axis. Is it true that the assumption is that all individuals have the same number of measurements?

Is it ACC and BACC? I am trying to understand the logic in https://avehtari.github.io/modelselection/diabetes.html, sec 4.3. When for each draw I calculate ACC[draw] and BACC[draw] and take an average I get different values than when using pred. As I understand preds is a matrix with rows representing each draw. I calculated true positives (TP) and true negatives (TN) for each draw as:

  for (draw in 1:nrow(preds)) {
    for (n in 1:ncol(preds)) {
      if (preds[draw, n] == y[n] & y[n] == HEALTHY) TP[draw] <- TP[draw] + 1
      if (preds[draw, n] == y[n] & y[n] == SICK) TN[draw] <- TN[draw] + 1

and ACC,BACC as:

  for (draw in 1:nrow(preds)) {
    ACC[draw] <- (TP[draw] + TN[draw]) / length(y)
    BACC[draw] <- (TP[draw] / length(y[y == HEALTHY]) + TN[draw] / length(y[y == SICK])) / 2

The values are less optimistic than given by:

# posterior classification accuracy

Which way is correct?