How to evaluate a Poisson/NB CAR/IAR model


#1

Dear community,
I am trying to fit a Spatial regression model describing the crimes in various areas of Boston. I have two alternative models and I have to choose one of them.

The first model M_1 is a Negative Binomial:

Y=NB(\alpha + X\beta + \phi, \gamma)

where X are the 4 covariates, \alpha is the intercept, \gamma is the overdispersion of the Negative Binomial, and \phi is the random effect component modeled with CAR (https://github.com/stan-dev/example-models/tree/IAR-case-study). The CAR component models the spatial effect (e.g. near areas have similar crime level).

I have another alternative model M_2 that is:

Y=NB(\alpha + X\beta, \gamma)

without the CAR component. Although simpler, it might not be correct, as the errors could be spatially correlated.

I observed that the WAIC, PSIS-LOO of M_2 are much lower than M_1: 5400 vs 8200 for PSIS-LOO. Can I trust this evaluation? Is this evaluation biased due to spatial correlation? Should I always prefer a CAR model vs a normal one?

M_1

UserWarning: For one or more samples the posterior variance of the log predictive
    densities exceeds 0.4. This could be indication of WAIC starting to fail see
    http://arxiv.org/abs/1507.04544 for details

          waic     waic_se       p_waic  warning
 0  8070.291496  789.179451  1170.563822        1

 UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for
    one or more samples. You should consider using a more robust model, this is because
    importance sampling is less likely to work well if the marginal posterior and LOO posterior
    are very different. This is more likely to happen with a non-robust model and highly
    influential observations.
   influential observations."""

            loo      loo_se        p_loo  warning
 0  7993.776744  549.149153  1132.306446        1

M_2

UserWarning: For one or more samples the posterior variance of the log predictive
    densities exceeds 0.4. This could be indication of WAIC starting to fail see
    http://arxiv.org/abs/1507.04544 for details
           waic   waic_se    p_waic  warning
 0  5472.280139  80.66075  16.79452        1
            loo     loo_se      p_loo  warning
 0  5472.624518  80.731235  16.966709        0

thanks


#2

Don’t think so, no. But check this paper out.

The thing about being able to trust these things is that the diagnostics need to be clean. See this case study for an example of how to evaluate the associated diagnostics. @paul.buerkner and @avehtari can advise further.


#3

It looks like there is a problem with M1 (the model including the CAR component).
Perhaps the model did not converge or you computed the log-likelihood values incorrectly?


#4

Comparison with LOO (or waic if you insist) is fine if you are fine with the leave-one-out prediction task. For spatial data sometimes people want instead predict for certain region and then you would need to take into account the spatial correlation. This is easier to understand in the case of time series with correlated residuals, and if you want to predict future, then you need to switch from LOO to leave-future-out-cross-validation.

See A quick note what I infer from p_loo and Pareto k values
and then tell how many Pareto k values are larger than 0.7 and how many parameters do you have in M_1. My guess is that you spatial prior is very weak, but with the above information I can be more certain.

LOO diagnostic is more reliable than waic diagnostic, and since there are no warnings from loo diagnostic for M_2, LOO is working just fine for that model.


#5

Good point. I found a problem in the likelihood, but this does not solve the problem. Moreover, I saw that sometimes the chain does not converge

Good point. I found a problem with the likelihood, but this does not solve the problem. I here put the code of the two models. For the covariates, I use the QR decomposition (https://mc-stan.org/users/documentation/case-studies/qr_regression.html) to reduce the collinearity between variabiles.

M1

functions {
  /**
  * Return the log probability of a proper conditional autoregressive (CAR) prior 
  * with a sparse representation for the adjacency matrix
  *
  * @param phi Vector containing the parameters with a CAR prior
  * @param tau Precision parameter for the CAR prior (real)
  * @param alpha Dependence (usually spatial) parameter for the CAR prior (real)
  * @param W_sparse Sparse representation of adjacency matrix (int array)
  * @param n Length of phi (int)
  * @param W_n Number of adjacent pairs (int)
  * @param D_sparse Number of neighbors for each location (vector)
  * @param lambda Eigenvalues of D^{-1/2}*W*D^{-1/2} (vector)
  *
  * @return Log probability density of CAR prior up to additive constant
  */
  real sparse_car_lpdf(vector phi, real tau, real alpha, 
    int[,] W_sparse, vector D_sparse, vector lambda, int n, int W_n) {
      row_vector[n] phit_D; // phi' * D
      row_vector[n] phit_W; // phi' * W
      vector[n] ldet_terms;
    
      phit_D = (phi .* D_sparse)';
      phit_W = rep_row_vector(0, n);
      for (i in 1:W_n) {
        phit_W[W_sparse[i, 1]] += phi[W_sparse[i, 2]];
        phit_W[W_sparse[i, 2]] += phi[W_sparse[i, 1]];
      }
    
      for (i in 1:n) ldet_terms[i] = log1m(alpha * lambda[i]);
      return 0.5 * (n * log(tau)
                    + sum(ldet_terms)
                    - tau * (phit_D * phi - alpha * (phit_W * phi)));
  }
}
data {
  int<lower = 1> n;
  int<lower = 1> p;
  matrix[p,n] X;
  int<lower = 0> y[n];
  matrix<lower = 0, upper = 1>[n, n] W; // adjacency matrix
  int W_n;                // number of adjacent region pairs
}

transformed data {
  int W_sparse[W_n, 2];   // adjacency pairs
  vector[n] D_sparse;     // diagonal of D (number of neigbors for each site)
  vector[n] lambda;       // eigenvalues of invsqrtD * W * invsqrtD
  matrix[n, p] Q = qr_Q(X')[, 1:p] * n;
  matrix[p, p] R = qr_R(X')[1:p, ] / n;
  matrix[p, p] R_inv = inverse(R);
  
  { // generate sparse representation for W
  int counter;
  counter = 1;
  // loop over upper triangular part of W to identify neighbor pairs
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        if (W[i, j] == 1) {
          W_sparse[counter, 1] = i;
          W_sparse[counter, 2] = j;
          counter += 1;
        }
      }
    }
  }
  for (i in 1:n) D_sparse[i] = sum(W[i]);
  {
    vector[n] invsqrtD;  
    for (i in 1:n) {
      invsqrtD[i] = 1 / sqrt(D_sparse[i]);
    }
    lambda = eigenvalues_sym(quad_form(W, diag_matrix(invsqrtD)));
  }
 


}

parameters {
  real beta0;            // intercept
  vector[p] b_tilde;
  vector[n] phi_unscaled;
  real<lower = 0> tau;
  real<lower = 0, upper = 0.99> alpha;
  real<lower=0.0000001> phi2; // neg. binomial dispersion parameter
}

transformed parameters {
  vector[p] b = R_inv * b_tilde;
  vector[n] phi; // brute force centering
  phi = phi_unscaled - mean(phi_unscaled);
}

model {
  phi_unscaled ~ sparse_car(tau, alpha, W_sparse, D_sparse, lambda, n, W_n);
  beta0 ~ normal(0.0, 1.0);
  b ~ student_t(3, 0, 1);
  alpha ~ beta(0.5, 0.5);
  tau ~ gamma(2, 2);
  y ~ neg_binomial_2_log(beta0 + Q * b_tilde + phi, phi2);
  phi2 ~ cauchy(0, 3);
}

generated quantities {
    vector[n] log_lik;
    for (i in 1:n)
         log_lik[i] = neg_binomial_2_log_lpmf(y[i] | beta0 + Q[i] * b_tilde + phi[i], phi2);
}

M2 is similar, just without all the spatial component.

UserWarning: For one or more samples the posterior
variance of the log predictive
        densities exceeds 0.4. This could be indication of WAIC starting to fail see
        http://arxiv.org/abs/1507.04544 for details

  """
         waic    waic_se      p_waic  warning
0  4040.15879  47.877265  283.883447        1
RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial)
UserWarning: Estimated shape parameter of Pareto di
stribution is greater than 0.7 for
        one or more samples. You should consider using a more robust model, this is because
        importance sampling is less likely to work well if the marginal posterior and LOO posterior
        are very different. This is more likely to happen with a non-robust model and highly
        influential observations.
  influential observations."""
           loo     loo_se       p_loo  warning
0  4415.544767  53.055198  471.576436        1

#6

Comparison with LOO (or waic if you insist) is fine if you are fine with the leave-one-out prediction task. For spatial data sometimes people want instead predict for certain region and then you would need to take into account the spatial correlation. This is easier to understand in the case of time series with correlated residuals, and if you want to predict future, then you need to switch from LOO to leave-future-out-cross-validation.

My objective here is to describe and predict crime. In the former task I observe the beta coefficients and I explain what is going on in the observed city (e.g. crime happens more in places with more poverty etc). In the latter task I answer to the question “if the neighborhood or the city changes, how generalizable is my model?”. A measure of goodness of fit for the prediction task would be PSIS-LOO, while I’m not still sure about the description task’s metric.

Thus, if I want to predict crime for a certain area, and I know the surrounding areas, I could use the spatial component. Do I need to? For this reason, I wanted to compare the models.

LOO diagnostic is more reliable than waic diagnostic, and since there are no warnings from loo diagnostic for M2, LOO is working just fine for that model.

Indeed. Thus, In that example I can assume that the parameters added by the spatial component are not important after all, are they?

See A quick note what I infer from p_loo and Pareto k values
and then tell how many Pareto k values are larger than 0.7 and how many parameters do you have in M1. My guess is that you spatial prior is very weak, but with the above information I can be more certain.

I observed a problem in the likelihood. I corrected it. In the new model, as before, the number of rows n=617, thus the parameters are p=n (CAR effects) +5+3 (CAR + NB). 551 parameters are higher than 0.7


#7

Yes, in this case loo is sensible. Now your results don’t yet say that you can’t learn from the neigbors, now that indicate that your spatial prior is very weak and this can be also due to a misspecification.

  • You mentioned that you fixed a likelihood, did you fix it for M_2, too? I didn’t see new loo results
  • As you are not showing the output directly from loo(), it’s not certain whether you are using log densities or negative log densities (and maybe even mutliplied by 2, ew)
  • What are the number of parameters for M_1? You say the difference would be just that M_2 has also the spatial component? The loo result you previously reported has p_loo=17, which is more than 5+3 above.

  • In your new result p_waic is about half of the total number of parameters, which matches well the analytic results in in http://dx.doi.org/10.1007/s11222-013-9416-2 and experimental results in http://link.springer.com/article/10.1007/s11222-016-9696-4 that waic fails if model is complex (or prior is weak)

  • In your new results, even if there are many high Pareto-k’s, p_loo is closer to the total number of parameters, indicating that spatial prior is really weak and each spatial effect is almost independent from the others. It is also likely that p_loo is also underestimated as there are so many high k’s. In this it’s likely just that since the prior is weak, each observation is highly influential for the local parameter and full posterior and leave-one-out posterior are so different that importance-sampling fails.


#8

You mentioned that you fixed a likelihood, did you fix it for M2, too? I didn’t see new loo results

The likelihood was correct (implemented as the M1 model, but without spatial component)

As you are not showing the output directly from loo(), it’s not certain whether you are using log densities or negative log densities (and maybe even mutliplied by 2, ew)

If your question is about implementation, I’m using arviz’s implementation of loo (https://arviz-devs.github.io/arviz/generated/arviz.loo.html#arviz.loo) and I checked the weights on your request with a negative log likelihood. (https://arviz-devs.github.io/arviz/generated/arviz.psislw.html#arviz.psislw)

What are the number of parameters for M1? You say the difference would be just that M2 has also the spatial component? The loo result you previously reported has p_loo=17, which is more than 5+3 above.

Yeah sorry, I tested a lot of models. I correct myself with 10 covariates + 1 intercept + 1 NB overdispersion. I do not count the transformed parameters (QR decomposition).

In your new result p_waic is about half of the total number of parameters, which matches well the analytic results in in http://dx.doi.org/10.1007/s11222-013-9416-2 and experimental results in http://link.springer.com/article/10.1007/s11222-016-9696-4 that waic fails if model is complex (or prior is weak)

Thanks

In your new results, even if there are many high Pareto-k’s, p_loo is closer to the total number of parameters, indicating that spatial prior is really weak and each spatial effect is almost independent from the others. It is also likely that p_loo is also underestimated as there are so many high k’s. In this it’s likely just that since the prior is weak, each observation is highly influential for the local parameter and full posterior and leave-one-out posterior are so different that importance-sampling fails.

The CAR model is, in the end, a multivariate normal prior centered at zero with a variance that depends on neighbors and a tau parameter. I tried a very “aggressive” prior removing tau. I have this:

WARNING:pystan:Maximum (flat) parameter count (1000) exceeded: skipping diagnostic tests for n_eff and Rhat.
To run all diagnostics call pystan.check_hmc_diagnostics(fit)
{'n_eff': True, 'Rhat': True, 'divergence': True, 'treedepth': True, 'energy': True}
UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for
        one or more samples. You should consider using a more robust model, this is because
        importance sampling is less likely to work well if the marginal posterior and LOO posterior
        are very different. This is more likely to happen with a non-robust model and highly
        influential observations.
  influential observations."""
           loo     loo_se     p_loo  warning
0  5391.762288  82.144443  90.71397        1

With just 5 high Pareto-ks. With tau = half-normal(1):

UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for
        one or more samples. You should consider using a more robust model, this is because
        importance sampling is less likely to work well if the marginal posterior and LOO posterior
        are very different. This is more likely to happen with a non-robust model and highly
        influential observations.
  influential observations."""
           loo     loo_se       p_loo  warning
0  5297.269333  80.484596  294.159892        1

With 138 high Pareto ks.

Thanks


#9

No my question is whether the scale in the value for loo (e.g. 4415.544767) is in log density scale, negative log density scale or negative log density scale multiplied by two (sometimes called deviance scale). The links you give refer to a paper which uses log density scale, but looking at the values I suspect you are using deviance scale, but it’s impossible to be certain especially since you all the links you site refer to the use of log density scale.

That’s 12, but p_loo is 17, which indicates outliers (see the post I mentioned before)

No need to force the prior to be stronger. It is possible that there is no spatial effect which could be learned from this data.


#10

No my question is whether the scale in the value for loo (e.g. 4415.544767) is in log density scale, negative log density scale or negative log density scale multiplied by two (sometimes called deviance scale). The links you give refer to a paper which uses log density scale, but looking at the values I suspect you are using deviance scale, but it’s impossible to be certain especially since you all the links you site refer to the use of log density scale.

I don’t really know how to answer the question… I looked at the implementation and they do this:

loo_lppd_i = -2 * _logsumexp(log_weights, axis=0)
loo_lppd = loo_lppd_i.sum()

No need to force the prior to be stronger. It is possible that there is no spatial effect which could be learned from this data.

I don’t think so, as M_1 has the errors (y- average(yhat)) that are spatially correlated (Moran’s I 0.04 with p-value <0.01) while M_2 does not (Moran’s I -0.002 with p-value=0.72). I don’t really understand how to evaluate a model like this as I have to choose between having auto-correlated residuals or using WAIC/PSIS-LOO that fail.

Thanks


#11

That’s a good answer. The code has -2, but they don’t document it properly. I reported this to arviz developers.

ok

I want to emphasize that in this case it’s possible that the spatial model is fine, and the problem is just that waic and PSIS-LOO fail for certain type of models. Refit’s in LOO would require 551 refits, which sounds a lot. You could use k-fold-CV, with k<<551.


#12

That’s a good answer. The code has -2, but they don’t document it properly. I reported this to arviz developers.

Aha thanks

I want to emphasize that in this case it’s possible that the spatial model is fine, and the problem is just that waic and PSIS-LOO fail for certain type of models. Refit’s in LOO would require 551 refits, which sounds a lot. You could use k-fold-CV, with k<<551.

OMG. Ok, I’ll try!!


#13

@avehtari I’m trying to model my problem as a k-fold-CV but I’m stucked because the estimated CAR parameters (phi), which are random effects, are dependent on the areas. Thus, I am not sure on how to proceed. Can you help me?

I even tried to evaluate the model through cross-validation using for M_1:

log_lik_cv[i] = neg_binomial_2_log_lpmf(y[i] | beta0 + Q[i] * b_tilde, phi2)

instead of

log_lik_cv[i] = neg_binomial_2_log_lpmf(y[i] | beta0 + Q[i] * b_tilde + phi[i], phi2)

Assuming to have no clue on the spatial correlation of hold-out data. However, I think it might be considered wrong, as it would be like a different model


#14

Your model is factorizable, but the prior structure problem is same as in section “Exact LOO-CV with re-fitting” in Leave-one-out cross-validation for non-factorizable models vignette. “The solution is to model y_i as a missing observation and estimate it along with all of the other model parameters.” Instead of exact LOO-CV, you are trying to do exact k-fold-CV, so you can use the same approach.