Stacking weights optimization error

Optimization error when calculating stacking (logarithmic rule) weights as:

“Error in optim(theta.old, fun, gradient, control = control, method = method, :
initial value in ‘vmmin’ is not finite”

I am aware a K-fold validation is needed. I want to first understand why stacking weights doesn’t work.

[loo points.rdata|attachment](upload://j5NEYDCZQX1Fhsb1ckPev961Sls.rdata) (703 Bytes)

PBMA_wts <- pseudobma_weights(loo_points_mtx,BB=F)
PBMA_wts_BB <- pseudobma_weights(loo_points_mtx,BB=T)
LogStacking_wts <- stacking_weights(loo_points_mtx)

By the way I also calculated information criteria for comparison including AIC and WAIC (which are not ideal)

[IC list.rdata|attachment](upload://5CSlpsEidekKu4KwnIzfqAuGE85.rdata) (129 Bytes)
  # AIC weights
  if(abs(min(AIC_list)) >= abs(max(AIC_list))){
    AIC_Z <- min(AIC_list)
  } else {
    AIC_Z <- max(AIC_list)
  }
  AIC_wts <- exp(AIC_list - AIC_Z) / sum(exp(AIC_list - AIC_Z))

  # WAIC weights
  if(abs(min(unlist(WAICValue_list))) >= abs(max(unlist(WAICValue_list)))){
    WAIC_Z <- min(unlist(WAICValue_list))
  } else {
    WAIC_Z <- max(unlist(WAICValue_list))
  }
  WAIC_wts <- exp(unlist(WAICValue_list)-WAIC_Z)/sum(exp(unlist(WAICValue_list)-WAIC_Z))

The error occurs in the following part of the code:

   w <- constrOptim(                        // from loo github
      theta = rep(1 / K, K - 1),
      f = negative_log_score_loo,
      grad = gradient,
      ui = ui,
      ci = ci,
      method = optim_method,
      control = optim_control
    )$par

where

K <- 2           // number of models        
N <- 36        // number of data points
loo_point <- matrix(                             
c(493.6934,366.5908,464.6874,367.3387,3.233788,30.40792,135.3944,-3.441301,2.23516,1856.88,-146.4352,-242.303,124.8462,20.50234,21.82202,359.6379,257.8549,42.67879,313.4528,278.1136,218.155,168.2236,52.72648,93.40489,8497.56,5049.595,3791.679,6629.044,5873.091,-1117.002,-1338.822,-1584.049,34.91057,-64.57774,-25.43519,-1.728413,532.8422,385.5359,494.7422,389.4739,7.721791,30.49167,143.952,-4.274733,1.025613,1932.87,-267.3354,-584.8296,139.897,22.95893,-8.303333,383.7568,273.5287,44.33569,330.5231,289.4175,249.0978,136.6463,31.03426,-195.0715,8440.698,5788.364,3510.647,5327.009,10466.88,535.6504,150.5,-2252.704,-69.03277,-51.78461,-9.175409,-44.37744),
    nrow = N, ncol = K
)              // unique data for replication
elpd_loo = c(31023.97,36552.71 )

    negative_log_score_loo <- function(w) {            // from loo github
      # objective function: log score
      stopifnot(length(w) == K - 1)
      w_full <- c(w, 1 - sum(w))
      sum <- 0
      for (i in 1:N) {
        sum <- sum + log(exp(lpd_point[i, ]) %*% w_full)
      }
      return(-as.numeric(sum))
    }
    
    gradient <- function(w) {
      # gradient of the objective function
      stopifnot(length(w) == K - 1)
      w_full <- c(w, 1 - sum(w))
      grad <- rep(0, K - 1)
      for (k in 1:(K - 1)) {
        for (i in 1:N) {
          grad[k] <- grad[k] +
            (exp_lpd_point[i, k] - exp_lpd_point[i, K]) / (exp_lpd_point[i,]  %*% w_full)
        }
      }
      return(-grad)
    }
    
    ui <- rbind(rep(-1, K - 1), diag(K - 1))  # K-1 simplex constraint matrix
    ci <- c(-1, rep(0, K - 1))

Can someone explain to me what this error mean? Thanks a lot!

I tried running your example, but it’s missing something and I get errors.

However, I can see that there is exp(lpd_point[i, ]) and your loo_point (should it be lpd_point?) has so big values that exp produces Infs. Usually shifting lpd values to have maximum 0 before exponentiating helps (if densities are needed only upo to proportionality), but now the range of the values is such that only one value is then non-zero and there is not much to optimize.

2 Likes

Thanks!

Yes studying the source code from your github I also realized the optimization error was caused by exp(lpd_point) so that exp produces Infs.
I tried to deduct the max of lpd_point before taking exp (which was used in calculating my AIC and WAIC weights). It was indeed not useful as you suggested. Denoting transformed lpd_point as lpd_point_Z, then exp(lpd_point_Z) consists of one 1 and all other 0s. Hence the optimization failed.

I am exploring ways to reduce the range of my log_lik values for the following reasons:

  1. the large magnitude of log_lik values caused me trouble in the calculation of AIC and WAIC although I circulated around it by transforming the max to 0 before taking exp. If I get to improve log_lik, I can improve my WAIC weights as well.
  2. lpd_point was calculated using log_lik. If I get to reduce the range of my log_lik, I may be able to improve my lpd_point as well.
  3. I have some initial thoughts this may be something to do with my sampling likelihood function. I am working on summarized structure of data in contrast to the usual individual structure. Only summary statistics of the data were available. For example, I have a continuous predictor of three values, the means and standard deviations of depend variable are available at each of the three levels. Assuming a normal distribution, I used the density function from eq 9 on page 4 of Crump, K. S. (1995). Calculation of benchmark doses from continuous data. Risk Analysis, 15(1), 79-89. https://doi.org/10.1111/j.1539-6924.1995.tb00095.x

I am not sure which part of my example was missing. Do you load my data before running the model weights function?

TL;DR: transform lpd values works. But instead of deducting global maximum, deduct pointwise maximum.

The optimization error was caused by the large range of lpd values.
Transforming lpd values by deducting by its max before taking exp usually helps. However in this case after transformation there is “only one non-zero value of 1. There is not much to optimize” as stated by @avehtari
The source code of stacking weights and pseudo BMA weights revealed that the optimization was performed pointwise if I understand correctly. Therefore, a simple solution would be to transform lpd values not by the global max but pointwise max. This was implemented in R as:

  lpd_point_Z <- lpd_point - apply(lpd_point,1,max)

Using transformed lpd_point_Z, the optimization was able to converge.

Similarly, this pointwise transformation can be applied to the calculation AIC and WAIC. Taking the example of AIC.

log_lik_mtx <- extract_log_lik(stanfit, merge_chains = T)
AIC <- apply(log_lik_mtx,1,sum) - Npars         ## Npars is number of parameters in the model

Pointwise AICs are obtained for each model.

AIC_mtx <- cbind(AIC_m1, AIC_m2,...)

Then transformed by deducting pointwise maximum

  AIC_Z <- AIC_mtx - apply(AIC_mtx ,1,max)

The mean of pointwise AIC weights are summarized as the estimate:

  AIC_wts <- apply(exp(AIC_Z) / apply(exp(AIC_Z),1,sum),2,mean)

In my case, the pointwise transformation allows stacking weights to be calculated. It also enhances accuracy of WAIC weights. However, I did not observe enhancement in AIC, nor any performance boost.

1 Like