About model comparison

I am comparing models using the metrics DIC, WAIC, and LOOIC, and also calculating MAE, RMSE, MAPE, NMAE, and NRMSE for the models' predictions to evaluate forecast accuracy. The results are as follows.


The model with the lowest DIC, WAIC, and LOOIC is 1-4, but the model with the lowest MAE, RMSE, NMAE, and NRMSE on the predicted values is 1-8, and model 1-7 has the lowest MAPE.
Since the models with the lowest DIC, WAIC, and LOOIC did not produce the most accurate forecasts, I am confused about which metric to rely on when comparing models.

The metrics you cite all measure different things, at least the ones I know; I'm not familiar with four of them (LOOIC, MAPE, NMAE, NRMSE). Are you trying to evaluate point estimates or posterior predictive inference? Are you trying to measure performance on the training data, as DIC and WAIC do, or held-out performance, as LOO does (though still within the data set)? Do you care about probabilistic predictions and calibration (e.g., providing an interval estimate with nominal coverage and measuring whether it attains that coverage), or just about how close a point estimate is?

  • All the *IC values look suspiciously big. How many observations and parameters do you have?
  • DIC can simply be ignored.
  • The very big differences between WAIC and LOOIC indicate a mistake in the computation, or very unreliable estimates and badly misspecified models, and you should have gotten warning messages. How did you compute these? Can you also show the diagnostic messages? What is the model?
  • Did you compute MAE, RMSE, MAPE, NMAE, and NRMSE with or without cross-validation? If they were computed without cross-validation, they are not that good for model selection and can have an arbitrary ranking compared to cross-validated results (see the sketch below).
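
A cross-validated version of these point-forecast metrics would refit the model without the evaluation period and score only held-out observations. Here is a minimal leave-future-out sketch in R; split_data, forecast_means, and observed_values are hypothetical placeholder helpers standing in for however you prepare your data:

# refit without the last n_test years, then score forecasts for those years
train <- split_data(full_data, drop_last_years = n_test)  # hypothetical helper
fit   <- mod$sample(data = train)                         # cmdstanr model object
y_hat <- forecast_means(fit, horizon = n_test)            # hypothetical: posterior-mean point forecasts
y_obs <- observed_values(full_data, last_years = n_test)  # hypothetical: held-out observations
mae   <- mean(abs(y_obs - y_hat))
rmse  <- sqrt(mean((y_obs - y_hat)^2))
mape  <- mean(abs((y_obs - y_hat) / y_obs))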

I have 13,818 observations arranged in arrays according to the framework age (21) × period (7) × gender (2) × region (47); I vectorized them when importing them into Stan. With regard to the number of parameters, each model contains at least 94 (47 × 2) vector parameters of length 21 and 47 vector parameters of length 7, and more for the more complex models (2 to 4 times more).
The code to calculate WAIC is as follows:

D1 <- -2 * sum(log(colMeans(exp(log_lik))))  # -2 * lppd from the draws-by-observations log_lik matrix
p_waic <- sum(apply(log_lik, 2, var))        # p_waic: sum of pointwise posterior variances
WAIC <- D1 + 2 * p_waic                      # WAIC = -2 * (lppd - p_waic)

The calculation of LOOIC uses the function included in cmdstanr:

fit$loo()

This is a model for mortality forecasting, based on the following two papers: “Mortality density forecasts: An analysis of six stochastic mortality models” and “Bayesian model averaging for mortality forecasting using leave-future-out validation”. I am using Stan to implement several models from these two papers and trying to make improvements (e.g., modifying the structure of the parameters, adding gender-region factors).
loo did pop up a warning; I am still in the early stages of modelling and am still trying to work this out.

Computed from 4000 by 13818 log-likelihood matrix

         Estimate    SE
elpd_loo -75520.9 297.9
p_loo      6720.2 112.1
looic    151041.8 595.8
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     10973 79.4%   0         
 (0.5, 0.7]   (ok)        1629 11.8%   0         
   (0.7, 1]   (bad)        889  6.4%   0         
   (1, Inf)   (very bad)   327  2.4%   0         
See help('pareto-k-diagnostic') for details.
Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

Regarding iterations, I tried setting iter to 5000, and after about 30 hours of computing the output could not be read or written successfully because of a lack of memory. At the moment I can only set iter to 1000 for a preliminary attempt.
I did not cross-validate when calculating MAE etc. I'm not quite sure whether data with the structure I'm currently using (an age-period-gender-prefecture array) is suitable for cross-validation.

As for the reason for the excessive number of bad values in the LOO diagnostics, one possibility I can think of is that the high age groups are affecting the model fit, since they have a much smaller population size and a much larger mortality rate than the overall level. I am trying to remove this part of the data and refit the model.

I tried to start again with something relatively simple. I now build the model using data containing a total of 3636 observations (101 age categories from 0 to 100 × 36 time periods from 1975 to 2010). The two best-performing models have the following loo diagnostics (to save time and memory I only set iter and warmup to 5000 each and did not increase adapt_delta or max_treedepth; the sampler reported that 4 of 20,000 (0.0%) transitions ended with a divergence and 19,967 of 20,000 (100.0%) transitions hit the maximum treedepth limit of 10):
Model A:

Computed from 20000 by 3636 log-likelihood matrix

         Estimate    SE
elpd_loo -23399.7 151.6
p_loo      1145.7  39.1
looic     46799.4 303.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     3428  94.3%   1         
 (0.5, 0.7]   (ok)        126   3.5%   0         
   (0.7, 1]   (bad)        67   1.8%   0         
   (1, Inf)   (very bad)   15   0.4%   0         
See help('pareto-k-diagnostic') for details.

Model B:

Computed from 4000 by 3636 log-likelihood matrix

         Estimate    SE
elpd_loo -23273.1 147.2
p_loo      1095.5  36.3
looic     46546.2 294.4
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     3438  94.6%   200       
 (0.5, 0.7]   (ok)        143   3.9%   82        
   (0.7, 1]   (bad)        43   1.2%   11        
   (1, Inf)   (very bad)   12   0.3%   2         
See help('pareto-k-diagnostic') for details.

Later I will remove the youngest and oldest age groups from this data and refit the models.

Does this mean 94 × 21 + 47 × 7 = 2303 parameters?

OK, that explains why you didn't get warnings for WAIC (like you would have received if using the waic function from the loo package). Anyway, WAIC gives you nothing extra if you use loo, and as loo is better, you can drop WAIC.
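
For example, a sketch assuming the log-likelihood draws can be extracted from your cmdstanr fit as a draws × observations matrix:

library(loo)
ll <- fit$draws("log_lik", format = "matrix")  # draws x observations
waic(ll)  # same estimate as the manual formula, plus p_waic diagnostic warnings
# loo() with relative efficiencies; 4 chains assumed here
loo(ll, r_eff = relative_eff(exp(ll), chain_id = rep(1:4, each = nrow(ll) / 4)))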

You can do what the diagnostic recommends, i.e., help('pareto-k-diagnostic'), or
see the loo package glossary (loo-glossary) and the Cross-validation FAQ for further information on how to interpret the diagnostics.
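
For the high-khat observations specifically, the usual options are moment matching, refitting without each problematic observation, or K-fold-CV. A one-line sketch, assuming a cmdstanr version whose $loo() method supports moment matching (otherwise see loo::loo_moment_match()):

fit$loo(moment_match = TRUE)  # adjusts the draws for high-k observations without a refit per observation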

I see that p_loo is 6720, which is much larger than the number of parameters you mentioned, and there are many very high khats, so it seems the model is badly misspecified.

1000 iterations is just fine for initial experimentation. Increasing the number of iterations is not going to make this problem go away.

That then explains the discrepancy between the criteria. The model is quite flexible, and likely misspecified and overfitting.

loo is cross-validation (and waic makes the same assumptions as loo).
See 'When is cross-validation valid?' in the Cross-validation FAQ.

Starting again with something relatively simple is always a good idea when having problems.

Those sampler settings are fine for these experiments.

Is 3636 also the number of parameters? I'm asking so that I can compare p_loo to the actual number of parameters. If p_loo is less than the total number of parameters but still relatively high, it is also possible that your model is just too flexible (instead of being misspecified). See, e.g., what happens when there is one parameter per observation in the Roaches cross-validation demo.

2303 is the minimum number of parameters across all the models; the one shown here has about 4654 parameters, which is still far less than 6720. I will try to solve this problem.

May I ask the reason for this, and for dropping WAIC?

3636 is the number of data points: a matrix of 101 ages × 36 years. The number of parameters for both models is 497. It's my fault for not studying this carefully, but I wanted to ask whether I also need to count the hyperparameters. For example, when the parameter ks has the following structure:

 target += normal_lpdf(ks | alpha, sigma_k);
 target += normal_lpdf(alpha[2:T] | alpha[1:(T-1)] + delta[2:T], sigmak_a);
 target += normal_lpdf(delta[2:T] | delta[1:(T-1)], sigmak_d);

Do I need to also count alpha, delta, and all three scale parameters? If so, then model A has 891 parameters, model B has 597, and the very first model probably has 7912.

I forgot to add the links to the FAQ for this. See 'What is the relationship between AIC, DIC, WAIC and LOO-CV?' and 'How are LOO and WAIC related?' and the references mentioned there.

You can count everything that you declare in the parameters block. If you share the code and the dimensions of the declared parameters, I can do the counting, too. Seeing the model code could also help me suggest how to modify the model.

Thank you!
Here are my three models.
The model proposed at the beginning of this post (applied to the age-period-gender-area data); there are a total of 14,900 parameters here:

functions {
  real partial_sum_lpdf(array[] real logm,
                        int start, int end,
                        array[] int death) {
    return poisson_log_lupmf(death[start:end]| logm);
  }
  /**
  * Return the log probability of a proper conditional autoregressive (CAR) prior 
  * with a sparse representation for the adjacency matrix
  *
  * @param phi Vector containing the parameters with a CAR prior
  * @param tau Precision parameter for the CAR prior (real)
  * @param alpha Dependence (usually spatial) parameter for the CAR prior (real)
  * @param W_sparse Sparse representation of adjacency matrix (int array)
  * @param n Length of phi (int)
  * @param W_n Number of adjacent pairs (int)
  * @param D_sparse Number of neighbors for each location (vector)
  * @param lambda Eigenvalues of D^{-1/2}*W*D^{-1/2} (vector)
  *
  * @return Log probability density of CAR prior up to additive constant
  */
  real sparse_car_lpdf(vector phi, real tau, real alpha, 
    array[,] int W_sparse, vector D_sparse, vector lambda, int n, int W_n) {
      row_vector[n] phit_D; // phi' * D
      row_vector[n] phit_W; // phi' * W
      vector[n] ldet_terms;
    
      phit_D = (phi .* D_sparse)';
      phit_W = rep_row_vector(0, n);
      for (i in 1:W_n) {
        phit_W[W_sparse[i, 1]] = phit_W[W_sparse[i, 1]] + phi[W_sparse[i, 2]];
        phit_W[W_sparse[i, 2]] = phit_W[W_sparse[i, 2]] + phi[W_sparse[i, 1]];
      }
    
      for (i in 1:n) ldet_terms[i] = log1m(alpha * lambda[i]);
      return 0.5 * (n * log(tau)
                    + sum(ldet_terms)
                    - tau * (phit_D * phi - alpha * (phit_W * phi)));
  }
}
data {
  int<lower=0> T;                  // number of years = 7 
  int<lower=0> A;                  // number of age categories = 21
  int<lower=0> P;                  // number of area categories = 47
  int<lower=0> Tf;                 // number of forecast years = 2
  array[A*T*P*2] int<lower=0> death;        // deaths
  array[A*T*P*2] real<lower=0> expos;        // exposures
  int n;    // number of iszero regions = 1
  matrix<lower = 0, upper = 1>[(P-n),(P-n)] W;
  array[n] int<lower=1> iszero;
  int W_n;                // number of adjacent region pairs
}
transformed data {
  array[A*T*P*2] real loge = log(expos);        // log exposures
  int AT = A*T;
  int ATG = AT*2;
  int ATGP = ATG*P;
  array[W_n, 2] int W_sparse;   // adjacency pairs
  vector[P-n] D_sparse;     // diagonal of D (number of neighbors for each site)
  vector[P-n] lambda;       // eigenvalues of invsqrtD * W * invsqrtD 
  
  { // generate sparse representation for W
  int counter;
  counter = 1;
  // loop over upper triangular part of W to identify neighbor pairs
    for (i in 1:(P-n-1)) {
      for (j in (i + 1):(P-n)) {
        if (W[i, j] == 1) {
          W_sparse[counter, 1] = i;
          W_sparse[counter, 2] = j;
          counter = counter + 1;
        }
      }
    }
  }
  for (i in 1:(P-n)) D_sparse[i] = sum(W[i]);
  {
    vector[(P-n)] invsqrtD;  
    for (i in 1:(P-n)) {
      invsqrtD[i] = 1 / sqrt(D_sparse[i]);
    }
    lambda = eigenvalues_sym(quad_form(W, diag_matrix(invsqrtD)));
  }
  
}
parameters {
  array[2,P] vector[A] agp;          // alpha_x
  simplex[A] b1;                   // beta_x, strictly positive and sums to 1
  simplex[A] b2;
  array[2,P] simplex[A] bgp1;
  array[2,P] simplex[A] bgp2;
  vector[T] ks;                    // vector of k_t (period effects, before centering)
  array[2,P] vector[T] kgps; 
  vector[T+A-2] gs;                // vector of g_c (cohort effects, before centering)
  array[2,P] vector[T+A-2] ggps; 
  vector[T] alpha_k;
  vector[T] delta_k;
  vector[T+A-2] alpha_g;
  vector[T+A-2] delta_g;
  array[2,P] vector[T] alpha_kgp;
  array[2,P] vector[T] delta_kgp;
  array[2,P] vector[T+A-2] alpha_ggp;
  array[2,P] vector[T+A-2] delta_ggp;
  real<lower = 0> sigma_k;          // standard deviation of the random walk
  array[2] vector<lower = 0>[P] sigma_kgp;
  real<lower = 0> sigma_g;          // standard deviation of the random walk
  array[2] vector<lower = 0>[P] sigma_ggp;
  real<lower=0> sigmak_a;
  real<lower=0> sigmak_d;
  real<lower=0> sigmag_a;
  real<lower=0> sigmag_d;
  array[2] vector<lower = 0>[P] sigmakgp_a;
  array[2] vector<lower = 0>[P] sigmakgp_d;
  array[2] vector<lower = 0>[P] sigmaggp_a;
  array[2] vector<lower = 0>[P] sigmaggp_d;
  vector[P-n] thetas;               // spatial effects without zero
  real<lower = 0> tau;
  real<lower = 0, upper = 1> alpha;
}
transformed parameters  {
  vector[T] k; 
  array[2,P] vector[T] kgp;
  real km;
  array[2,P] real kgpm;
  vector[T+A-2] g; 
  array[2,P] vector[T+A-2] ggp;
  real gm;
  array[2,P] real ggpm;
  array[A*T*P*2] real logm;
  vector[P] theta;// spatial effects with zero
  
  km = mean(ks);
  k = ks - km;
  gm = mean(gs);
  g = gs - gm;
  for(s in 1:2) for(p in 1:P){
    kgpm[s,p] = mean(kgps[s,p]);
    kgp[s,p] = kgps[s,p] - kgpm[s,p];
    ggpm[s,p] = mean(ggps[s,p]);
    ggp[s,p] = ggps[s,p] - ggpm[s,p];
  }
  theta[1:(P-n)] = thetas;
  theta[P] = 0;
  for (p in 1:P) for (s in 1:2) for (j in 1:T) for (i in 1:A) {
    if(i == 1){
      logm[ATG*(p-1) + AT*(s-1) + A*(j-1) + i] = loge[ATG*(p-1) + AT*(s-1) + A*(j-1) + i] + 
                                                 b1[i] * k[j] + b2[i] * g[j-i+A-1] +
                                                 agp[s,p][i] + bgp1[s,p][i] * kgp[s,p][j] + bgp2[s,p][i] * ggp[s,p][j-i+A-1] +
                                                 theta[p];
    }
    else{
      logm[ATG*(p-1) + AT*(s-1) + A*(j-1) + i] = loge[ATG*(p-1) + AT*(s-1) + A*(j-1) + i] + 
                                                 b1[i] * k[j] + b2[i] * g[j-i+A] +
                                                 agp[s,p][i] + bgp1[s,p][i] * kgp[s,p][j] + bgp2[s,p][i] * ggp[s,p][j-i+A] +
                                                 theta[p];
    }
  }
}
model {
  int grainsize = ATGP%/%4;
  target += reduce_sum(partial_sum_lupdf, logm, grainsize, death);
  // k[1:2], g[1:2] ~ improper flat priors; or whatever
  target += normal_lupdf(ks|alpha_k, sigma_k);
  target += normal_lupdf(alpha_k[2:T]|alpha_k[1:(T-1)] + delta_k[2:T],sigmak_a);
  target += normal_lupdf(delta_k[2:T]|delta_k[1:(T-1)], sigmak_d);
  target += normal_lupdf(gs|alpha_g,sigma_g);
  target += normal_lupdf(alpha_g[2:(T+A-2)]|alpha_g[1:(T+A-3)] + delta_g[2:(T+A-2)], sigmag_a);
  target += normal_lupdf(delta_g[2:(T+A-2)]|delta_g[1:(T+A-3)], sigmag_d);
  target += dirichlet_lupdf(b1|rep_vector(1, A));
  target += dirichlet_lupdf(b2|rep_vector(1, A));
  target += student_t_lupdf(sigma_k|4, 0, 1);
  target += student_t_lupdf(sigma_g|4, 0, 1);
  target += student_t_lupdf(sigmak_a|4, 0, 1);
  target += student_t_lupdf(sigmak_d|4, 0, 1);
  target += student_t_lupdf(sigmag_a|4, 0, 1);
  target += student_t_lupdf(sigmag_d|4, 0, 1);
  for(s in 1:2) for(p in 1:P){
    target += normal_lupdf(kgps[s,p]|alpha_kgp[s,p], sigma_kgp[s][p]);
    target += normal_lupdf(alpha_kgp[s,p][2:T]|alpha_kgp[s,p][1:(T-1)] + delta_kgp[s,p][2:T],sigmakgp_a[s][p]);
    target += normal_lupdf(delta_kgp[s,p][2:T]|delta_kgp[s,p][1:(T-1)], sigmakgp_d[s][p]);
    target += normal_lupdf(ggps[s,p]|alpha_ggp[s,p],sigma_ggp[s][p]);
    target += normal_lupdf(alpha_ggp[s,p][2:(T+A-2)]|alpha_ggp[s,p][1:(T+A-3)] + delta_ggp[s,p][2:(T+A-2)], sigmaggp_a[s][p]);
    target += normal_lupdf(delta_ggp[s,p][2:(T+A-2)]|delta_ggp[s,p][1:(T+A-3)], sigmaggp_d[s][p]);
    target += normal_lupdf(agp[s,p]|0,10);
    target += dirichlet_lupdf(bgp1[s,p]|rep_vector(1, A));
    target += dirichlet_lupdf(bgp2[s,p]|rep_vector(1, A));
  }
  for(s in 1:2){
    target += student_t_lupdf(sigma_kgp[s]|4, 0, 1);
    target += student_t_lupdf(sigma_ggp[s]|4, 0, 1);
    target += student_t_lupdf(sigmakgp_a[s]|4, 0, 1);
    target += student_t_lupdf(sigmakgp_d[s]|4, 0, 1);
    target += student_t_lupdf(sigmaggp_a[s]|4, 0, 1);
    target += student_t_lupdf(sigmaggp_d[s]|4, 0, 1);
  }
  target += sparse_car_lpdf(thetas| tau, alpha, W_sparse, D_sparse, lambda, P-n, W_n);
  target += gamma_lupdf(tau|2,2);
}
generated quantities {
  vector[Tf] k_p;
  vector[Tf] alpha_k_p;
  vector[Tf] delta_k_p;
  array[2,P] vector[Tf] kgp_p;
  array[2,P] vector[Tf] alpha_kgp_p;
  array[2,P] vector[Tf] delta_kgp_p;
  vector[Tf] g_p;
  vector[Tf] alpha_g_p;
  vector[Tf] delta_g_p;
  array[2,P] vector[Tf] ggp_p;
  array[2,P] vector[Tf] alpha_ggp_p;
  array[2,P] vector[Tf] delta_ggp_p;
  vector[T+Tf+A-2] g_f;
  array[2,P] vector[T+Tf+A-2] ggp_f;
  vector[A*Tf*2*P] mfore; // predicted death rates
  vector[ATGP] log_lik;
  array[ATGP] real m;
  int pos1 = 1;
  
  delta_k_p[1] = delta_k[T] + sigmak_d * normal_rng(0,1);
  alpha_k_p[1] = alpha_k[T] + delta_k_p[1] + sigmak_a * normal_rng(0,1);
  for (t in 2:Tf){
    delta_k_p[t] = delta_k_p[t-1] + sigmak_d * normal_rng(0,1);
    alpha_k_p[t] = alpha_k_p[t-1] + delta_k_p[t] + sigmak_a * normal_rng(0,1);
  }
  for (t in 1:Tf){k_p[t] = alpha_k_p[t] - km + sigma_k * normal_rng(0,1);}
    delta_g_p[1] = delta_g[T+A-2] + sigmag_d * normal_rng(0,1);
    alpha_g_p[1] = alpha_g[T+A-2] + delta_g_p[1] + sigmag_a * normal_rng(0,1);
  for (t in 2:Tf){
    delta_g_p[t] = delta_g_p[t-1] + sigmag_d * normal_rng(0,1);
    alpha_g_p[t] = alpha_g_p[t-1] + delta_g_p[t] + sigmag_a * normal_rng(0,1);
  }
  for (t in 1:Tf){g_p[t] = alpha_g_p[t] - gm + sigma_g * normal_rng(0,1);}
  g_f = append_row(g, g_p);
  for(s in 1:2) for(p in 1:P){
    delta_kgp_p[s,p][1] = delta_kgp[s,p][T] + sigmakgp_d[s][p] * normal_rng(0,1);
    alpha_kgp_p[s,p][1] = alpha_kgp[s,p][T] + delta_kgp_p[s,p][1] + sigmakgp_a[s][p] * normal_rng(0,1);
    for (t in 2:Tf){
      delta_kgp_p[s,p][t] = delta_kgp_p[s,p][t-1] + sigmakgp_d[s,p] * normal_rng(0,1);
      alpha_kgp_p[s,p][t] = alpha_kgp_p[s,p][t-1] + delta_kgp_p[s,p][t] + sigmakgp_a[s][p] * normal_rng(0,1);
    }
    for (t in 1:Tf){kgp_p[s,p][t] = alpha_kgp_p[s,p][t] - kgpm[s,p] + sigma_kgp[s][p] * normal_rng(0,1);}
      delta_ggp_p[s,p][1] = delta_ggp[s,p][T+A-2] + sigmaggp_d[s][p] * normal_rng(0,1);
      alpha_ggp_p[s,p][1] = alpha_ggp[s,p][T+A-2] + delta_ggp_p[s,p][1] + sigmaggp_a[s][p] * normal_rng(0,1);
    for (t in 2:Tf){
      delta_ggp_p[s,p][t] = delta_ggp_p[s,p][t-1] + sigmaggp_d[s][p] * normal_rng(0,1);
      alpha_ggp_p[s,p][t] = alpha_ggp_p[s,p][t-1] + delta_ggp_p[s,p][t] + sigmaggp_a[s][p] * normal_rng(0,1);
    }
    for (t in 1:Tf){ggp_p[s,p][t] = alpha_ggp_p[s,p][t] - ggpm[s,p] + sigma_ggp[s][p] * normal_rng(0,1);}
    ggp_f[s,p] = append_row(ggp[s,p], ggp_p[s,p]);
  }
  for (p in 1:P) for (s in 1:2) for (j in 1:Tf) for (i in 1:A) {
    if(i == 1){
      mfore[pos1] = exp(b1[i] * k_p[j] + b2[i] * g_f[T+j-i+A-1] +
                  agp[s,p][i] + bgp1[s,p][i] * kgp_p[s,p][j] + bgp2[s,p][i] * ggp_f[s,p][T+j-i+A-1] +
                  theta[p]);
    }
    else{
      mfore[pos1] = exp(b1[i] * k_p[j] + b2[i] * g_f[T+j-i+A] +
                  agp[s,p][i] + bgp1[s,p][i] * kgp_p[s,p][j] + bgp2[s,p][i] * ggp_f[s,p][T+j-i+A] +
                  theta[p]);
    }
    pos1 += 1;              
  }
  
  for(i in 1:ATGP){
    m[i] = exp(logm[i] - loge[i]);
    log_lik[i] = poisson_log_lpmf(death[i]|logm[i]);
  }
}

The function sparse_car_lpdf in the model is taken from this page: Exact sparse CAR models in Stan.

Model A (applied to the age-period data); there are a total of 825 parameters here:

functions {
  real partial_sum_lpdf(array[] real logm,
                        int start, int end,
                        array[] int death) {
    return poisson_log_lupmf(death[start:end] | logm);
  }
}
data {
  int<lower=0> T;                  // number of years = 36
  int<lower=0> A;                  // number of age categories = 101
  int<lower=0> Tf;                 // number of forecast years = 10
  array[A*T] int<lower=0> death;        // deaths
  array[A*T] real<lower=0> expos;        // exposures
}
transformed data {
  array[A*T] real loge = log(expos);        // log exposures
}
parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b1;                   // beta_x, strictly positive and sums to 1
  simplex[A] b2;
  vector[T] ks;                 // vector of k_t (without first element)
  vector[T+A-1] gs;
  vector[T] alpha_k;
  vector[T] delta_k;
  vector[T+A-1] alpha_g;
  vector[T+A-1] delta_g;
  real<lower = 0> sigma_k;          // standard deviation of the random walk
  real<lower = 0> sigma_g;
  real<lower=0> sigmak_a;
  real<lower=0> sigmak_d;
  real<lower=0> sigmag_a;
  real<lower=0> sigmag_d;
}
transformed parameters  {
  vector[T] k;   
  vector[T+A-1] g;
  array[A*T] real logm;
  real km;
  real gm;
  
  km = mean(ks);
  k = ks - km;
  gm = mean(gs);
  g = gs - gm;
  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = loge[A*(j-1)+i] + a[i] + b1[i] * k[j] + b2[i] * g[j-i+A];
  }
}
model {
  int grainsize = A*T%/%3;
  target += reduce_sum(partial_sum_lpdf, logm, grainsize, death);
  target += normal_lpdf(ks|alpha_k,sigma_k);
  target += normal_lpdf(alpha_k[1]|delta_k[1],sigmak_a);
  target += normal_lpdf(alpha_k[2:T]|alpha_k[1:(T-1)]+delta_k[2:T],sigmak_a);
  target += normal_lpdf(delta_k[1]|0,sigmak_d);
  target += normal_lpdf(delta_k[2:T]|delta_k[1:(T-1)],sigmak_d);
  target += normal_lpdf(gs|alpha_g,sigma_g);
  target += normal_lpdf(alpha_g[1]|delta_g[1],sigmag_a);
  target += normal_lpdf(alpha_g[2:(T+A-1)]|alpha_g[1:(T+A-2)]+delta_g[2:(T+A-1)],sigmag_a);
  target += normal_lpdf(delta_g[1]|0,sigmag_d);
  target += normal_lpdf(delta_g[2:(T+A-1)]|delta_g[1:(T+A-2)],sigmag_d);
  target += normal_lpdf(a|0,10);
  target += dirichlet_lpdf(b1|rep_vector(1, A));
  target += dirichlet_lpdf(b2|rep_vector(1, A));
  target += student_t_lpdf(sigma_k|4, 0, 1);
  target += student_t_lpdf(sigma_g|4, 0, 1);
  target += student_t_lpdf(sigmak_a|4, 0, 1);
  target += student_t_lpdf(sigmak_d|4, 0, 1);
  target += student_t_lpdf(sigmag_a|4, 0, 1);
  target += student_t_lpdf(sigmag_d|4, 0, 1);
}
generated quantities {
  vector[Tf] k_p;
  vector[Tf] alpha_k_p;
  vector[Tf] delta_k_p;
  vector[Tf] g_p;
  vector[T+Tf+A-1] g_f;
  vector[Tf] alpha_g_p;
  vector[Tf] delta_g_p;
  vector[A*Tf] mfore; // predicted death rates
  vector[A*T] log_lik;
  array[A*T] real m;
  int pos1 = 1;
  delta_k_p[1] = delta_k[T] + sigmak_d * normal_rng(0,1);
  alpha_k_p[1] = alpha_k[T] + delta_k_p[1] + sigmak_a * normal_rng(0,1);
  for (t in 2:Tf){
    delta_k_p[t] = delta_k_p[t-1] + sigmak_d * normal_rng(0,1);
    alpha_k_p[t] = alpha_k_p[t-1] + delta_k_p[t] + sigmak_a * normal_rng(0,1);
  }
  for (t in 1:Tf){k_p[t] = alpha_k_p[t] - km + sigma_k * normal_rng(0,1);}
  delta_g_p[1] = delta_g[T+A-1] + sigmag_d * normal_rng(0,1);
  alpha_g_p[1] = alpha_g[T+A-1] + delta_g_p[1] + sigmag_a * normal_rng(0,1);
  for (t in 2:Tf){
    delta_g_p[t] = delta_g_p[t-1] + sigmag_d * normal_rng(0,1);
    alpha_g_p[t] = alpha_g_p[t-1] + delta_g_p[t] + sigmag_a * normal_rng(0,1);
  }
  for (t in 1:Tf){g_p[t] = alpha_g_p[t] - gm + sigma_g * normal_rng(0,1);}
  g_f = append_row(g, g_p);
  for (t in 1:Tf) for (i in 1:A) {
    mfore[pos1] = exp(a[i] + b1[i] * k_p[t] + b2[i] * g_f[T+t-i+A]);
    pos1 += 1;
  }
  for(i in 1:A*T){
    m[i] = exp(logm[i] - loge[i]);
    log_lik[i] = poisson_log_lpmf(death[i]|logm[i]);
  }
}
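
For reference, this count comes from adding up everything declared in the parameters block, counting each simplex as A elements: a (101) + b1 (101) + b2 (101) + ks (36) + gs (136) + alpha_k (36) + delta_k (36) + alpha_g (136) + delta_g (136) + 6 scale parameters = 825. The same convention applied to model B below gives 303 + 36 + 136 + 1 (c) + 1 (psi) + 36 + 36 + 4 scale parameters = 553.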

Model B (applied to the age-period data); there are a total of 553 parameters here:

functions {
  real partial_sum_lpdf(array[] real logm,
                        int start, int end,
                        array[] int death) {
    return poisson_log_lupmf(death[start:end] | logm);
  }
}
data {
  int<lower=0> T;                  // number of years = 36
  int<lower=0> A;                  // number of age categories = 101
  int<lower=0> Tf;                 // number of forecast years = 10
  array[A*T] int<lower=0> death;        // deaths
  array[A*T] real<lower=0> expos;        // exposures
}
transformed data {
  array[A*T] real loge = log(expos);        // log exposures
}
parameters {
  vector[A] a;                    // alpha_x
  simplex[A] b1;                   // beta_x, strictly positive and sums to 1
  simplex[A] b2;
  vector[T] ks;                 // vector of k_t (without first element)
  vector[T+A-1] gs;
  real c;
  real<lower = -1, upper = 1> psi;
  vector[T] alpha_k;
  vector[T] delta_k;
  real<lower = 0> sigma_k;          // standard deviation of the random walk
  real<lower = 0> sigma_g;
  real<lower=0> sigmak_a;
  real<lower=0> sigmak_d;

}
transformed parameters  {
  vector[T] k;   
  vector[T+A-1] g;
  array[A*T] real logm;
  real km;
  real gm;
  
  km = mean(ks);
  k = ks - km;
  gm = mean(gs);
  g = gs - gm;
  for (j in 1:T) for (i in 1:A) {
    logm[A*(j-1)+i] = loge[A*(j-1)+i] + a[i] + b1[i] * k[j] + b2[i] * g[j-i+A];
  }
}
model {
  int grainsize = A*T%/%3;
  target += reduce_sum(partial_sum_lpdf, logm, grainsize, death);
  target += normal_lpdf(ks|alpha_k,sigma_k);
  target += normal_lpdf(alpha_k[1]|delta_k[1],sigmak_a);
  target += normal_lpdf(alpha_k[2:T]|alpha_k[1:(T-1)]+delta_k[2:T],sigmak_a);
  target += normal_lpdf(delta_k[1]|0,sigmak_d);
  target += normal_lpdf(delta_k[2:T]|delta_k[1:(T-1)],sigmak_d);
  target += normal_lpdf(gs[3:(T+A-1)]|c + gs[2:(T+A-2)] + psi * (gs[2:(T+A-2)] - gs[1:(T+A-3)] - c),sigma_g);
  target += normal_lpdf(a|0,10);
  target += normal_lpdf(c|0,10);
  target += normal_lpdf(psi|0,1);
  target += dirichlet_lpdf(b1|rep_vector(1, A));
  target += dirichlet_lpdf(b2|rep_vector(1, A));
  target += student_t_lpdf(sigma_k|4, 0, 1);
  target += student_t_lpdf(sigma_g|4, 0, 1);
  target += student_t_lpdf(sigmak_a|4, 0, 1);
  target += student_t_lpdf(sigmak_d|4, 0, 1);
}
generated quantities {
  vector[Tf] k_p;
  vector[Tf] alpha_k_p;
  vector[Tf] delta_k_p;
  vector[Tf] g_p;
  vector[T+Tf+A-1] g_f;
  vector[A*Tf] mfore; // predicted death rates
  vector[A*T] log_lik;
  array[A*T] real m;
  int pos1 = 1;
  delta_k_p[1] = delta_k[T] + sigmak_d * normal_rng(0,1);
  alpha_k_p[1] = alpha_k[T] + delta_k_p[1] + sigmak_a * normal_rng(0,1);
  for (t in 2:Tf){
    delta_k_p[t] = delta_k_p[t-1] + sigmak_d * normal_rng(0,1);
    alpha_k_p[t] = alpha_k_p[t-1] + delta_k_p[t] + sigmak_a * normal_rng(0,1);
  }
  for (t in 1:Tf){k_p[t] = alpha_k_p[t] - km + sigma_k * normal_rng(0,1);}
  g_p[1] = c + g[T+A-1] + psi * (g[T+A-1] - g[T+A-2] - c) + sigma_g * normal_rng(0,1);
  g_p[2] = c + g_p[1] + psi * (g_p[1] - g[T+A-1] - c) + sigma_g * normal_rng(0,1);
  for (t in 3:Tf){g_p[t] = c + g_p[t-1] + psi * (g_p[t-1] - g_p[t-2] - c) + sigma_g * normal_rng(0,1);}
  g_f = append_row(g, g_p);
  for (t in 1:Tf) for (i in 1:A) {
    mfore[pos1] = exp(a[i] + b1[i] * k_p[t] + b2[i] * g_f[T+t-i+A]);
    pos1 += 1;
  }
  for(i in 1:A*T){
    m[i] = exp(logm[i] - loge[i]);
    log_lik[i] = poisson_log_lpmf(death[i]|logm[i]);
  }
}

I have tried to streamline the data by cutting the original 101 age categories (0 to 100) down to 66 age categories (15 to 80 years).
The models now have the following loo diagnostics and look much better.
Model A now has 615 parameters.

Computed from 20000 by 2376 log-likelihood matrix

         Estimate    SE
elpd_loo -14904.5  86.0
p_loo       728.6  28.9
looic     29809.0 172.1
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     2192  92.3%   2         
 (0.5, 0.7]   (ok)        118   5.0%   1         
   (0.7, 1]   (bad)        46   1.9%   0         
   (1, Inf)   (very bad)   20   0.8%   0         
See help('pareto-k-diagnostic') for details.

Model B now has 413 parameters.

Computed from 20000 by 2376 log-likelihood matrix

         Estimate    SE
elpd_loo -14846.1  82.8
p_loo       639.9  22.7
looic     29692.3 165.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     2277  95.8%   91        
 (0.5, 0.7]   (ok)         69   2.9%   58        
   (0.7, 1]   (bad)        28   1.2%   2         
   (1, Inf)   (very bad)    2   0.1%   2         
See help('pareto-k-diagnostic') for details.

What if you instead tried

neg_binomial_2_lpmf(death[start:end] | exp(logm), phi)

You need to make some changes to pass both logm and phi, and declare phi as a parameter. The negative binomial model is an overdispersed version of the Poisson and should be more robust.
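
A minimal sketch of the changed pieces, using the neg_binomial_2_log parameterization so logm can be passed without exponentiating; the prior on phi is an assumed placeholder:

functions {
  real partial_sum_lpdf(array[] real logm,
                        int start, int end,
                        array[] int death,
                        real phi) {
    // neg_binomial_2_log takes the log mean directly, so no exp() is needed
    return neg_binomial_2_log_lupmf(death[start:end] | logm, phi);
  }
}
parameters {
  // ... existing parameters, plus:
  real<lower=0> phi;   // overdispersion: larger phi -> closer to Poisson
}
model {
  target += reduce_sum(partial_sum_lupdf, logm, grainsize, death, phi);
  target += gamma_lupdf(phi | 2, 0.1);   // assumed placeholder prior; adjust as needed
  // ... rest of the model unchanged
}
// in generated quantities, the pointwise log-likelihood becomes
//   log_lik[i] = neg_binomial_2_log_lpmf(death[i] | logm[i], phi);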

Thanks for helping. I’ll try it.