Running speed - phylogenetic models

Hi all,

I’m having trouble with the running speed of my phylogenetic models and I wondered if people had some advice. The models I’m running are for 9835 species, so they are expected to take a very long time, but unfortunately I’m only allowed 72 hours on my university cluster in one go.

For anyone that doesn’t really use phylo-models, the main speed decrease comes from using a covariance matrix for the random species level effect. For a few species (~1000) it only takes about 30 mins, but often with the full 10k half the models don’t finish in the allotted 72 hours. I have to run a sample of 50 trees, and repeat for sensitivity analysis as well.

Even with 10,000 iterations I’m still getting Rhat values of 1.01, normally for the group level effect, or else I’d just run for less iterations.

So my question: is there anything I can do to speed up this model?

  1. Already run in parallel with 2 chains, 25 threads per chain so 50 cores total (benchmarking showed even more cores would’ve been quicker but resources are limited).
  2. Already randomised data so that each chunk should be similar processing speed (I hope).
  3. Using decomp = “QR”, which does it speed it up a bit.
  4. Init = 0 seems to work better than random.
  5. All variables are centered and scaled to the same units already.
  6. Priors are weakly informative. I didn’t want to set N(0,1) because I get an effect size of ~3 for one trait. However I get the feeling that with this much data the priors probably won’t affect the posterior. Plus with this much data I doubt they would speed it up, but I did put a stronger prior on the group level effect.
  7. pp_check is really good fit, so seems like models are parametrized correctly.
  8. I know using “threshold = equidistant” slows down my model, but it fits better if I do.

This is perhaps a rogue suggestion, but could I try and decrease the max_treedepth? I know increasing it slows down sampling speed, but my models are converging. Or should I try and increase tree_depth, but then have less iterations? Same applies for adapt_delta. Benchmarking speed of a sample shows they speed up models, but also looked like convergence was slightly worse.

Should I increase warmup? I guess I don’t need all those iters? Thinning is 20 for file size afterwards.

Anything else you can think of? Using Max Likelihood didn’t speed it up so i stopped trying with that.

###############################################################################
                     # Ordinal brms models on cluster  #
###############################################################################

# Packages to load.
library(magrittr)
library(caper)
library(dplyr)
library(effectsize)
library(phytools)
library(brms)
library(graph4lg)

# Clear the workspace.
rm(list=ls())
###############################################################################
                       #### Read in the data #####

# Functions.
source("Code/functions.R")

# Read in the tree.
model_tree <- read.tree("Data/Trees/prum_trees.tre")[[tree_number]]

# Read in the life history traits.
model_data <- read.csv("Data/eco_traits.csv")
model_data$tree_tip <- gsub(" ", "_", model_data$birdtree_name)

# Drop tips on the tree.
model_tree <- drop.tip(model_tree, setdiff(model_tree$tip.label, model_data$tree_tip))

# Make a covariance matrix, and order data the same.
model_covar <- ape::vcv.phylo(model_tree)

# Reorder the matrix so it's random, to maximise parallel processing speed.
mat_order <- sample(1:nrow(model_covar), size = nrow(model_covar), replace = FALSE)
model_covar <- reorder_mat(model_covar, rownames(model_covar)[mat_order])
row.names(model_data) <- model_data$tree_tip
model_data <- model_data[row.names(model_covar),]


###############################################################################
              #### Prepare predictor variables ######

# Set as factor, then re-level for appropriate reference group.
model_data %<>% mutate(
  trait_1 = relevel(as.factor(trait_1), ref = "No"),
  trait_2 = relevel(as.factor(trait_2), ref = "Weak"),
  trait_3 = relevel(as.factor(trait_3), ref = "Primary"),
)

# Center categorical predictors.
model_data %<>% mutate(
  trait_1 _c = center_categorical(trait_1),
  trait_2 _c = center_categorical(trait_2),
  trait_3 _c = center_categorical(trait_3),
)

# Scale continuous predictors to two SD.
model_data %<>% mutate(
  trait_4_z = standardize(trait_4, two_sd = TRUE),
)

###############################################################################
                #### Set model formula ######

# Model forumla. 
# (3 categorical dummy variables, 1 continuous, 2 interaction terms, group level effect).
model_formula <- "response_score  ~  
                       trait_1_c + trait_2_c +
                       trait_3_c + trait_4_z
                       trait_1_c:trait_3_c + 
                       trait_1_c:trait_4_z+ 
                       (1|gr(tree_tip, cov=A))"
 
# brms formula.
brms_formula <- brmsformula(model_formula, 
                            family = cumulative(threshold = "equidistant"), 
                            decomp = "QR")

# Add weak priors.
normal_priors <- c(prior(normal(0,5), class="Intercept"),
                   prior(normal(0,5), class="b"),
                   prior(gamma(2,1), "sd"))

# Model pathway.
model_pathway <- paste0("Results/Models/, example_model, ".rds") 

# Run models.
  brms_model <- brm(
    brms_formula,
    data = model_data,
    data2 = list(A=model_covar),
    prior = normal_priors,
    iter = 10000,
    warmup = 5000,
    chains = 2,
    thin = 20,
    cores = 50,
    init = 0,
    file = model_pathway,
    normalize = FALSE,
    backend = "cmdstanr",
    control = list(max_treedepth = 5),
    threads = threading(25),
  )

### OS + BRMS INFO ###

> Sys.info()
sysname  =  "Linux"             
release = "4.18.0-348.20.1.el8_5.x86_64"
version = "#1 SMP Tue Mar 8 12:56:54 EST 2022"                            
machine  = "x86_64" 

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.5 (Ootpa)

Matrix products: default
BLAS/LAPACK: /rds/general/user/home/anaconda3/envs/env1/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] brms_2.18.0  Rcpp_1.0.8.3


If the only problem is fitting jobs into 72 hours, then you should be able to run these by checkpointing as in:
https://cran.r-project.org/web/packages/chkptstanr/vignettes/chkpt_brms.html

Thinking about ways to speed up the fitting (your actual question :)… Assuming that it’s not worth delving into the Stan code itself here to try to find speed-ups (either because brms is already using very efficient code and/or because the conveniences of fitting and post-processing with brms are important to you), I see one main place to squeeze out a potentially large speed-up. Hopefully and presumably, inference doesn’t vary wildly across the 50 trees. If that’s the case, you’ll probably get some speedup early in warmup by using a sample from a previous fit as the inits and using the inverse metric from a previous chain as the initial inverse metric. Assuming it works as intended, that’ll be a modest speedup by itself, because it will (hopefully) avoid the need to saturate the treedepth with tiny step sizes early during warmup.
The bigger potential speedup will involve the possibility that you can substantially shorten warmup by reusing this information. In the best-case scenario, if the posteriors are really quite similar across trees, you could get away with dropping the windowed phase entirely and running just an init buffer and a term buffer.
Maybe this is too optimistic, but even then you might be able to get away with running just an init buffer, a single long window, and a term buffer, which would cut your warmup by roughly half.

In addition, a few extra thoughts:

  1. I commend you for paying close attention to possible non-convergence represented by r-hat of 1.01, but I think that might be overkill here. Personally, I would happily make confident inference at r-hat 1.01.
  2. Given that you are increasing the number of iterations due to dissatisfaction with r-hat, it is probably not helpful to decrease the maximum treedepth. The reason you need so many iterations is quite possibly/probably because you are several factors of two short of reaching the stopping criterion, if running at max_treedepth = 5.
  3. As long as you are not seeing divergences, don’t mess with adapt_delta.
1 Like

Thanks!
I actually had no idea you could checkpoint brms, despite searching for years for the possibility. I probably started using brms before it was a function, and missed it’s introduction. So thanks already! That’s super useful.

As for the trees, they don’t converge on the exact same values, and the rhat of combined models is sometimes ~1.30-1.40. I tend to ignore this because each individual model converges and i’m trying to make my models robust to uncertainty by combining all the slightly different values. Sometimes a predictor could vary from 1.1-1.5, so I don’t know if it’s useful to supply as new inits. Do you think it would be much quicker than simply inits = 0?

I’m open to trying it because it seems cool, but it’s something I’ve never tried before.

EDIT

I actually didn’t mean to include that control = list(max_treedepth = 5) argument.
They’re not in my original models, which use the default of 10, and default adapt_delta of 0.8.
I was just playing around with it hoping it’d speed up my code. It definitely does but looks like it also increases rhats too so defeats the purpose.

Is it possible to bypass chkptstanr and do this directly in cmdstanr? I ask as chkptstanr needs R 4.1.0 which, in turn, requires more than linux mint 18.3. I know I’ll need to upgrade at some point but would like to postpone that.

thanks

You can bypass chkptstanr by handling the adaptation manually in a series of repeated calls to cmdstanr::sample. To do this, you need to understand how adaptation in Stan works (init buffer, windowed phase, term buffer), and you’ll need to understand how to extract and pass the inverse metric (or compute the inverse metric yourself from unconstrained draws), step size, and inits corresponding to the parameters on the unconstrained scale. You’ll also need to understand how Stan boosts the step-size at the end of each windowed adaptation phase.

If the above more-or-less makes sense to you, then good luck! If not, I strongly recommend against trying to checkpoint directly in cmdstanr.

1 Like

thanks. I may try it!

It makes sense, and I can do this in a simpler case, letting cmdstanr proceed through the full warmup and a single sample, then extracting the step size, inv_metric, and parameters to feed back as initial values in a new run without warmup. That works, and may help in my particular problem. As to checkpointing during warmup, I was unable to learn how the step size is boosted, though I can see it happens. I couldn’t find an explanation of this in the manual. Can you point me to something?

I want to say it’s boosted 10x at the beginning of each window, but I don’t really know details or where to find it in the code. Tagging @Bob_Carpenter

@Bob_Carpenter @jsocolar
thanks. I know the best plan is just to upgrade linux and R, and then use chkptstanr but having opened this pandora’s box I’m curious. Everything works fine on a test file with two smooths and a factor with 10 levels, if I allow cmdstanr to go through the full warmup + one sample and then extract stepsize and inv_metric. So really this is just curiosity now.

For the stepsize adjustment during warmup, it looks complicated.
attached is a plot of some test output with save_warmup = TRUE.
one of many things I don’t understand is why the output file has 753 rows when the warmup was 2000.
test
out_cmd2-1.csv (357.1 KB)

thanks again

Can you provide a reproducible example?

what is the ESS? group level effects are hard for these models. Rhat of 1.01 should be OK. get the ESS you need for the precision you want, no more.

init = 0 initializes all parameters to zero, otherwise all parameters are initialized from uniform (-2, 2). if this interval is too wide (probably), you could make it closer to 0 - init = 0.1 or something like that.

CmdStanR doesn’t save warmup iterations by default.

1 Like

that problem re the output file disappeared. I’m putting the whole thing as a reproducible example in a new topic

thanks

@mitzimorris
Hi, thanks for the feedback, sorry been away for a month so v slow to reply.
What do you mean by the ESS you need for the precision you want? Maybe it’s a fundamental part of bayesian models, but I always just run models until they’ve converged, and didn’t think running any longer would increase the precision of model estimates.

Assuming that you’ve achieved convergence in the sense of drawing somewhat mixed samples from the target distribution, then the more samples you draw the greater the precision you’ll have in understanding the properties of the target distribution. ESS attempts to answer the question “how many independent samples would I need to draw from the target to characterize the target to the same degree of precision as what I can do using my posterior sample”.

The simplest concrete example to see how our ability to characterize the distribution scales with ESS is the Gaussian distribution, where the standard error to our estimate of its mean scales as \frac{\sigma}{\sqrt{ESS}}, where \sigma is the true posterior standard deviation.