Problem running loo_subsample with variational inference and cmdstanr

I am trying to run loo_subsample for a model that I fitted with cmdstanr using variational inference.

data {
  int<lower=0> N;       // number of observations
  int<lower=0> year[N]; // year hierarchy level
  int<lower=0> day[N];  // day hierarchy level
  int<lower=0> hour[N]; // hour hierarchy level
  
  int<lower=0> num_years;
  int<lower=0> num_days;
  int<lower=0> num_hours;
  
  int<lower=0> score[N];      // post scores
}

parameters {
  vector<lower=0>[num_years] year_mu;
  vector<lower=0>[num_years] year_phi;
  vector<lower=0>[num_days] days_mu;
  vector<lower=0>[num_days] days_phi;
  vector<lower=0>[num_hours] hours_mu;
  vector<lower=0>[num_hours] hours_phi;
}

model {
  year_mu ~ normal(0, 10);
  year_phi ~ normal(0, 10);
  days_mu ~ normal(0, 10);
  days_phi ~ normal(0, 10);
  hours_mu ~ normal(0, 10);
  hours_phi ~ normal(0, 10);
  
  score ~ neg_binomial_2_log(
    year_mu[year] + days_mu[day] + hours_mu[hour],
    year_phi[year] + days_phi[day] + hours_phi[hour]
  );
}

The model fits fine, and a test run using loo_i works as well. But when I try to run loo_subsample, the draws argument passed to my likelihood function has the wrong dimensions. Instead of the whole draws matrix (which is what gets passed when I use loo_i), loo_subsample passes a single row of that matrix, which I verified by printing its dimensions. Here is the code I use for running loo:

data <- list(score = hn_posts$score, 
             N = length(hn_posts$score),
             year = as.integer(as.factor(hn_posts$year)),
             day = as.integer(hn_posts$weekday),
             hour = hn_posts$hour,
             num_years = n_distinct(hn_posts$year),
             num_days = n_distinct(hn_posts$weekday),
             num_hours = n_distinct(hn_posts$hour))

hierarchical_fit <- hierarchical_model$variational(data = data, output_samples = 5000, algorithm = "fullrank")

hierarchical_draws <- hierarchical_fit$draws()

llfun_hierarchical <- function(data_i, draws, log = TRUE) {
  year_mu <- glue("year_mu[{data_i$year}]")
  year_phi <- glue("year_phi[{data_i$year}]")
  days_mu <- glue("days_mu[{data_i$day}]")
  days_phi <- glue("days_phi[{data_i$day}]")
  hours_mu <- glue("hours_mu[{data_i$hour}]")
  hours_phi <- glue("hours_phi[{data_i$hour}]")
  
  print(dim(draws))
  print(draws[, c(year_mu, days_mu, hours_mu)])
  mu_sum <- rowSums(
    draws[, c(year_mu, days_mu, hours_mu)]
  )
  
  phi_sum <- rowSums(
    draws[, c(year_phi, days_phi, hours_phi)]
  )
  
  dnbinom(data_i$score, mu = mu_sum, size = phi_sum, log = log)
}

hierarchical_params <- as_draws_matrix(hierarchical_draws)

# this runs fine
#loo_i(1, llfun_hierarchical, data = as.data.frame(data), draws = hierarchical_params)

# but this fails: dim(draws) inside llfun_hierarchical is (1 x number of parameters) instead of (n x number of parameters)
loo_subsample(llfun_hierarchical, 
              data = as.data.frame(data), 
              draws = hierarchical_params, 
              observations = 10, 
              cores = 1)

I use the latest master builds of CmdStan and cmdstanr, and the stable release of loo.

It seems that when using variational inference or other posterior approximation methods you should use loo_approximate_posterior instead. The following code seems to work ok:

log_p <- hierarchical_fit$lp()
log_g <- hierarchical_fit$lp_approx()
loo_approximate_posterior(llfun_hierarchical,
                          data = as.data.frame(data), 
                          draws = hierarchical_params, 
                          log_p = log_p,
                          log_g = log_g,
                          cores = 15)

However, I find this function a lot slower to run. Is it supposed to be performant on large datasets? I have 200k data points.

Strange, but it seems to run forever even on 100 data points:

loo_hierarchical <- loo_approximate_posterior(llfun_hierarchical,
                          data = as.data.frame(data)[0:100,], 
                          draws = hierarchical_params, 
                          log_p = log_p,
                          log_g = log_g,
                          cores = 25)

Tagging @mans_magnusson, our expert on loo_approximate_posterior(). Mans, if possible can you take a look at this when you have a chance?

Sure, seems strange. Do you have a reproducible example with this problem that I could take a look at?

A short note: approximate-posterior loo only handles the problem of using approximate posteriors (such as VI). The subsampling part handles scaling to large data.

Is there any way I can use subsampling and an approximate posterior at the same time? It does not seem to be covered by the docs.

As for the example, I can send you my notebook; it uses a publicly available dataset.

I thought I had combined them both in the vignette, but now I'm not sure.

Ideally, put a minimal working code snippet here so I can check. Going through a whole notebook might get messy.


Also, in the vignette here you can find info on how to combine the two approaches.
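Roughly, you give loo_subsample() the same log_p and log_g vectors you would give loo_approximate_posterior(). A rough sketch, reusing the objects from your posts above (see the vignette for the exact details):

# subsampling + approximate-posterior correction in one call (sketch)
loo_subsample(llfun_hierarchical,
              data = as.data.frame(data),
              draws = hierarchical_params,
              log_p = log_p,    # log density of the target posterior for each draw
              log_g = log_g,    # log density of the approximation for each draw
              observations = 100)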

A heads-up: your posterior might be quite badly approximated with VI, for example if you use mean-field, and then you will get a lot of bad Pareto k-hat values. The easiest solution is then probably to use HMC (and maybe loo subsampling).
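Something along these lines (just a sketch; it reuses the model and data objects from your first post, and the names hierarchical_fit_hmc and hmc_params are only illustrative):

# fit the same model with HMC via cmdstanr instead of variational inference
hierarchical_fit_hmc <- hierarchical_model$sample(data = data, chains = 4, parallel_chains = 4)
hmc_params <- posterior::as_draws_matrix(hierarchical_fit_hmc$draws())

# with exact MCMC draws, plain subsampling loo is enough (no log_p/log_g correction)
loo_subsample(llfun_hierarchical,
              data = as.data.frame(data),
              draws = hmc_params,
              observations = 100)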


Sure, here is a standalone example:

library(loo)

llfun_test <- function(data_i, draws) {
  print(dim(draws))
}

loo_subsample(llfun_test,
              data = as.data.frame(matrix(rnorm(36),nrow=6)),
              draws = matrix(rnorm(36),nrow=6),
              log_p = rep(1, 100),
              log_g = rep(1, 100),
              observations = 100)

As an output, I get the following:

[1] 1 6
[1] 1 6
[1] 1 6
[1] 1 6
[1] 1 6
[1] 1 6

But I should get a 6x6 matrix for draws, right?

About combining both approaches: thanks a lot, I had missed that part of the docs.

Thanks!

In the example the loglik function just prints the dimensions? It needs to return the log-likelihood. See the vignette for an example.

No problem, here is another example. I just stripped the previous one down to demonstrate the problem in the easiest way.

library(loo)
llfun_test <- function(data_i, draws) {
  print(dim(draws))
  dnorm(data_i$V1, 0, 1, log = TRUE)  # return the log-likelihood
}

loo_subsample(llfun_test,
              data = as.data.frame(matrix(rnorm(1000), nrow=100)),
              draws = matrix(rnorm(1000), nrow=100),
              log_p = rnorm(10),
              log_g = rnorm(10),
              observations = 50)

This gives me the following output:

[1]  1 10
[1]  1 10
[1]  1 10
...
(the same "[1]  1 10" line is printed 100 times in total, once per observation)
[1] 100  10
Error in ap_psis.matrix(log_ratios = -ll_i, log_p = log_p, log_g = log_g,  : 
  Assertion on 'log_p' failed: Must have length 1, but has length 10.

I have restarted the R server just in case, but I can still reproduce this behavior. The log_p error at the end may be because I generate synthetic data in this example, so it might not be related to the root cause. The draws dimension problem you see here also reproduces for me on real data.

Alright. Just so I understand: what do you think is the problem here? I thought it was that it took a long time?

Without testing the code: loo_subsample approximates the lpd_i for all observations using a point estimate of the parameters (you can choose other alternatives, but this is the fastest). Then it uses the whole posterior (all draws) for the subsample. So I would expect you to get the "1 10" output 100 times and then one "100 10" at the end. Or am I missing something here?
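Schematically, something like this (a self-contained toy sketch mirroring your second example, not the actual implementation; all the object names here are only illustrative):

# toy objects: S draws of P parameters, N observations
S <- 100; P <- 10; N <- 100
draws <- matrix(rnorm(S * P), nrow = S)
data  <- as.data.frame(matrix(rnorm(N * P), nrow = N))
llfun <- function(data_i, draws) dnorm(data_i$V1, mean = rowMeans(draws), sd = 1, log = TRUE)

# stage 1: a 1 x P point estimate (e.g. the posterior mean) is used for every observation
point_est <- matrix(colMeans(draws), nrow = 1)
lpd_approx <- vapply(seq_len(N), function(i)
  llfun(data_i = data[i, , drop = FALSE], draws = point_est), numeric(1))

# stage 2: the full S x P draws matrix is used, but only for the subsampled observations
idx <- sample.int(N, size = 10)
ll_sub <- sapply(idx, function(i)
  llfun(data_i = data[i, , drop = FALSE], draws = draws))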

See the AISTATS paper referenced in the vignette for details.


Oh, so that is the expected way for it to work. I think that should be mentioned in the docs.

draws: An object containing the posterior draws for any parameters needed to compute the pointwise log-likelihood. Unlike data, which is indexed by observation, for each observation the entire object draws will be passed to the draws argument of the log-likelihood function.

This is mentioned in the loo_approximate_posterior documentation.

Yes, sorry for this. Would it be OK for you to file this as a GitHub issue in the loo package?

Sure: Confusing documentation for loo_subsampling when using approximate posterior · Issue #170 · stan-dev/loo · GitHub. Thanks a lot for devoting your time to this. I love how active the Stan community is. And thanks for the paper reference.


@mans_magnusson Performance still seems to be an issue, though. Now I see that loo_subsample utilizes only 1 core instead of 25. Should we continue this discussion here, or should I open a new topic?

I have assembled a minimal example to reproduce the performance issue.

Please download the following file (it contains the R objects necessary to run the example): loo_subsample_example.RData - Google Drive

The code is

library(glue)
library(loo)
library(posterior)  # for as_draws_matrix()

load('loo_subsample_example.RData')

llfun_hierarchical <- function(data_i, draws, log = TRUE) {
  year_mu <- glue("year_mu[{data_i$year}]")
  year_phi <- glue("year_phi[{data_i$year}]")
  days_mu <- glue("days_mu[{data_i$day}]")
  days_phi <- glue("days_phi[{data_i$day}]")
  hours_mu <- glue("hours_mu[{data_i$hour}]")
  hours_phi <- glue("hours_phi[{data_i$hour}]")
  
  #print(dim(draws))
  #print(draws[, c(year_mu, days_mu, hours_mu)])
  mu_sum <- rowSums(
    as.matrix(draws[, c(year_mu, days_mu, hours_mu)])
  )
  
  phi_sum <- rowSums(
    as.matrix(draws[, c(year_phi, days_phi, hours_phi)])
  )
  
  dnbinom(data_i$score, mu = mu_sum, size = phi_sum, log = log)
}

hierarchical_params <- as_draws_matrix(hierarchical_draws)

#loo_i(1, llfun_hierarchical, data = as.data.frame(data), draws = hierarchical_params)
log_p <- hierarchical_fit$lp()
log_g <- hierarchical_fit$lp_approx()

loo_hierarchical <- loo_subsample(llfun_hierarchical,
                          data = as.data.frame(data), 
                          draws = hierarchical_params, 
                          log_p = log_p,
                          log_g = log_g,
                          observations = 100,
                          cores = 25)

UPD: answer to the next two posts below by @mans_magnusson

Sorry, I have reached the limit on the maximum number of replies here, so I can't write more messages for the next 9 hours. Seems to be a rather strange restriction :(
I did not profile the code yet; I will look in that direction. The loglik computation seems to be the first bottleneck suspect. Is the likelihood computation parallelized? If not, then we might have 200k iterations running in sequence.

UPD 2
Just ran the profiler. glue takes some time, but I don't think that is the major issue. Maybe the problem is in the lapply? I think we should be able to parallelize this part to remove the bottleneck.

UPD 3
Here is the relevant code: loo/loo_subsample.R at e197aa7c5ac56881ad67dae3f90479af96d83c0a · stan-dev/loo · GitHub. I think it should be parallelizable without major issues. I see that parLapply is used elsewhere in the package, and that might be a suitable drop-in replacement.

UPD 4
Indeed, this was the problem. I looked at how parallel computation was implemented elsewhere in the package, swapped lapply for mclapply, and the function finished in a few seconds. I might work on a PR for this. Here is a short example of the fix in loo_subsample.R:

# Compute the lpds in parallel across observations (mclapply from the parallel package)
lpds <- unlist(parallel::mclapply(X = seq_len(N), mc.cores = cores, FUN = function(i) {
  ll_i <- .llfun(data_i = data[i, , drop = FALSE], draws = point_est)
  ll_i <- as.vector(ll_i)
  lpd_i <- logMeanExp(ll_i)
  lpd_i
}))
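One caveat with mclapply: it relies on forking, which is not available on Windows, so parLapply (already used elsewhere in the package, as noted above) might be the safer choice for a PR.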

UPD 5

Sent a pull request here: Added parallel likelihood computation by kdubovikov · Pull Request #171 · stan-dev/loo · GitHub

Hi!

Yes, let's keep the discussion here. Without looking into the code, it might be that not all parts are parallelized yet, hence the performance issue. Although I need to check the code to be sure; I'm currently busy and cannot check this right away. If it is urgent, you could run debugonce() in R and see where it gets "stuck" with one core.
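For example (just a sketch, reusing the objects from your posts above):

# flag loo_subsample() so the next call drops into the browser,
# then step through to see which part takes the time
debugonce(loo_subsample)
loo_subsample(llfun_hierarchical,
              data = as.data.frame(data),
              draws = hierarchical_params,
              log_p = log_p, log_g = log_g,
              observations = 100, cores = 1)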

Is missing parallelism the only reason for the low performance?

Another hypothesis is that the performance issue comes from the glue package, since the loglik function is evaluated for all observations.

Hence, you could try to create the whole dataset first with glue and then just access the data in the llfun. Just a quick thought.
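Something like this (an untested sketch reusing your column-name scheme; the data_df object and the *_col column names are only illustrative, and paste0 is used here, although glue_data() would work the same way):

# build the parameter column names once, store them in the data,
# and let the likelihood function just look them up
data_df <- as.data.frame(data)
data_df$year_mu_col   <- paste0("year_mu[",   data_df$year, "]")
data_df$year_phi_col  <- paste0("year_phi[",  data_df$year, "]")
data_df$days_mu_col   <- paste0("days_mu[",   data_df$day,  "]")
data_df$days_phi_col  <- paste0("days_phi[",  data_df$day,  "]")
data_df$hours_mu_col  <- paste0("hours_mu[",  data_df$hour, "]")
data_df$hours_phi_col <- paste0("hours_phi[", data_df$hour, "]")

llfun_hierarchical <- function(data_i, draws, log = TRUE) {
  mu_sum  <- rowSums(as.matrix(draws[, c(data_i$year_mu_col, data_i$days_mu_col, data_i$hours_mu_col)]))
  phi_sum <- rowSums(as.matrix(draws[, c(data_i$year_phi_col, data_i$days_phi_col, data_i$hours_phi_col)]))
  dnbinom(data_i$score, mu = mu_sum, size = phi_sum, log = log)
}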

Also, have you tried to profile the code to see where the performance issue comes from?