How to choose warmup length for very large models

Hello all,

I provide a description of my model and Stan code at the end of this message. However, I believe that my question is potentially of general interest irrespective of the details of the model, and so I first ask the question in general form.

I am preparing to fit a complex model to a large dataset. I expect a runtime on the order of weeks (at least), and I expect the effective number of posterior samples to be substantially lower than the nominal number of posterior samples, even with reasonably well-tuned HMC sampling. In addition to ensuring that I parameterize the likelihood efficiently and write efficient Stan code, I am concerned with selecting a reasonable number of warmup iterations to use, balancing the duration of the warmup period with the efficiency of posterior exploration later on.

Here are some initial ideas that I have for how to choose a plausible warmup length, and I would really appreciate feedback or further thoughts from this community.

  • Use campfire for adaptive warmup?
  • rstan’s default is to use iter/2 iterations for warmup and the other half for sampling. I could make a reasonable guess about how many sampling iterations I might need to achieve reasonable effective sample sizes and set the warmup length equal to that number? (A sketch of how I would specify this in rstan follows this list.)
  • Study the behavior on smaller datasets (either with fewer data points per level of the main random effects terms or with fewer levels in those terms), and use some intuition (you’ll need to clue me in as to what this intuition might be!) about how the appropriate warmup period scales with model size to guess at an appropriate warmup period for the large data?
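
For concreteness, here is a minimal sketch of what the second option would look like in rstan (the warmup length is set with the `warmup` argument; `stan.model` and `stan.data` are the objects defined in the full script further down):

library(rstan)

# Hypothetical numbers: guess that ~2000 sampling iterations will be needed and
# set the warmup length equal to that, rather than relying on the default iter/2 split.
fit <- stan(model_code = stan.model, data = stan.data,
            warmup = 2000,   # adaptation iterations
            iter   = 4000,   # total iterations per chain (warmup + sampling)
            chains = 4, cores = 4)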

Gory model details
I am fitting a multi-species site occupancy model similar to this example from Dorazio et al, translated to Stan by Bob Carpenter. My model differs in four key respects. First, I am not using data-augmentation to model never-observed species. Second, I have site- and visit-specific covariates on detection that require a more complicated parameterization of the likelihood (I need visit-specific Bernoulli sampling, rather than site-specific binomial sampling). Third, and relatedly, I ultimately want to fit a model that incorporates fairly rich covariate information, with covariates involving site characteristics, visit characteristics, species characteristics, and interactions between site characteristics and species characteristics. Fourth, the size of my data is MUCH larger: 850 sites, 4 visits, and (the kicker) 1500 species. Note that ‘species’ in this model is a grouping factor for random effects (intercepts and slopes) on both occupancy and detection.

So far, I have generated simulated datasets with 900 sites, ranging in size from 30 to 240 species, and I have analyzed these using a toy model similar to (but simpler than) what I ultimately want to fit. For the time being, I have fit these models using rstan with 2000 iterations (i.e. 1000 warmup and 1000 sampling). These models run without divergences, and R-hat diagnostics indicate convergence across four chains. The runtime seems to scale roughly linearly with the number of species (with substantial variance across different simulated datasets and smaller variance between chains within a fit), but the number of effective samples also decreases substantially as the data size grows (from ~800 for the worst-case parameter in the 30-species case to ~200 for the worst-case parameter in the 240-species case). Ultimately, I expect to have access to high-performance computing resources, and I hope to take advantage of map_rect when I fit the final model.

R and Stan code follows. Because I write the Stan code in a string directly in R, I provide the full R script first and then paste just the Stan model below with Stan code formatting.
The following code is neither pretty nor fully optimized, but in case anybody is still reading, here it is. It will simulate data and analyze it under a toy model.

  • I expect that I will be back with more questions related to this model in the coming weeks/months!
  • I am aware of some minor optimizations still available (vectorizing priors over the model coefficients). While not the focus of my current post, if anybody spots a major potential efficiency gain, I would of course be very interested!
  • The current toy model doesn’t perfectly match the data simulation because (look closely at the linear predictor in the transformed parameters block) I am currently ignoring the effects of some of the simulated coefficients.

Full R Script

library(rstan)
rstan_options(auto_write = TRUE)

# i indexes points, j indexes visits, k indexes species.

# Define data size
n_cluster <- 300 # points are grouped into clusters, and cluster-by-species will ultimately be a random effect on occupancy
ppc <- 3 # three points per cluster
n_point <- n_cluster*ppc
n_species <- 1500
n_visit <- rep(4, n_point)

# Define covariates (covariates associated with species, visit, and point; point covariates include continuous covariates that 
# do not vary within clusters as well as continuous covariates that do vary within clusters)
clusterID <- rep(c(1:n_cluster), ppc)
sp_cov1 <- runif(n_species) - .5
sp_cov2 <- runif(n_species) - .5
pt_cov1 <- runif(n_point) - .5
cl.cov.1 <- runif(n_cluster) - .5
pt_cov2 <- rep(NA, n_point)
for(i in 1:n_point){
  pt_cov2[i] <- cl.cov.1[clusterID[i]]
}

vis_cov1 <- matrix(data = NA, nrow = n_point, ncol = max(n_visit))
for(i in 1:n_point){
  vis_cov1[i, ] <- runif(n_visit[i]) - .5
}

exists_visit <- matrix(data = 0, nrow = n_point, ncol = max(n_visit))
for(i in 1:n_point){
  for(j in 1:n_visit[i]){
    exists_visit[i,j] <- 1
  }
}

# Define hyperparameters
occ.hyper <- list(b0 = c(0, .5), b1 = c(0, 1), b2 = c(1, .5), b3 = c(-1, 1), b4 = c(0, .2),
                  b5 = c(1,2), b6 = c(2, 2))
b0 <- rnorm(n_species, occ.hyper$b0[1], occ.hyper$b0[2])
b1 <- matrix(data = rnorm(n_cluster*n_species, occ.hyper$b1[1], occ.hyper$b1[2]), nrow = n_cluster)
b2 <- rnorm(n_species, occ.hyper$b2[1], occ.hyper$b2[2])
b3 <- rnorm(n_species, occ.hyper$b3[1], occ.hyper$b3[2])
b4 <- rnorm(n_species, occ.hyper$b4[1], occ.hyper$b4[2])
b5 <- rnorm(n_species, occ.hyper$b5[1], occ.hyper$b5[2])
b6 <- rnorm(n_species, occ.hyper$b6[1], occ.hyper$b6[2])

det.hyper <- list(d0 = c(-2, .5), d1 = c(0, 1), d2 = c(0, .5), d3 = c(1, 1), d4 = c(0, .2),
                  d5 = c(2, .5))
d0 <- rnorm(n_species, det.hyper$d0[1], det.hyper$d0[2])
d1 <- matrix(data = rnorm(n_point*n_species, det.hyper$d1[1], det.hyper$d1[2]), nrow = n_point)
d2 <- rnorm(n_species, det.hyper$d2[1], det.hyper$d2[2])
d3 <- rnorm(n_species, det.hyper$d3[1], det.hyper$d3[2])
d4 <- rnorm(n_species, det.hyper$d4[1], det.hyper$d4[2])
d5 <- rnorm(n_species, det.hyper$d5[1], det.hyper$d5[2])

# The below simulation is super inefficient, but the version with for-loops helps me keep straight
# everything that is going on.

# Simulate parameters from hyperparameters 
logit.occ <- psi <- matrix(NA, nrow = n_point, ncol = n_species)
for(i in 1:n_point){
  for(k in 1:n_species){
    logit.occ[i, k] <- b0[k] + b1[clusterID[i],k] + b2[k]*sp_cov1[k] + b3[k]*sp_cov2[k] + 
      b4[k]*pt_cov1[i] + b5[k]*pt_cov2[i] + b6[k]*sp_cov1[k]*pt_cov2[i]
    psi[i, k] <- boot::inv.logit(logit.occ[i, k])
  }
}

logit.det <- theta <- array(NA, dim = c(n_point, max(n_visit), n_species))
for(i in 1:n_point){
  for(j in 1:n_visit[i]){
    for(k in 1:n_species){
      logit.det[i, j, k] <- d0[k] + d1[i,k] + d2[k]*sp_cov1[k] + d3[k]*sp_cov2[k] + 
        d4[k]*pt_cov1[i] + d5[k]*vis_cov1[i,j]
      theta[i, j, k] <- boot::inv.logit(logit.det[i, j, k])
    }
  }
}


# simulate data from parameters
Z <- matrix(NA, nrow = n_point, ncol = n_species)
for(i in 1:n_point){
  for(k in 1:n_species){
    Z[i, k] <- rbinom(1, 1, psi[i, k])
  }
}

det_data <- array(NA, dim = c(n_point, max(n_visit), n_species))
for(i in 1:n_point){
  for(j in 1:n_visit[i]){
    for(k in 1:n_species){
      det_data[i,j,k] <- Z[i, k] * rbinom(1, 1, theta[i,j,k])
    }
  }
}

Q <- apply(det_data, c(1,3), function(x){return(as.numeric(sum(x) > 0))})


stan.model <- '
data {
  int<lower=1> n_point; //number of sites
  int<lower=1> n_visit; //fixed number of visits
  int<lower=1> n_species; //number of species
  int<lower=1> n_cluster; //number of clusters (points are grouped into clusters)

  int<lower=0, upper=1> det_data[n_point, n_visit, n_species]; //detection history
  int<lower=0, upper=1> Q[n_point, n_species]; //at least one detection

  int<lower=0, upper=n_cluster> clusterID[n_point]; //cluster identifier (for random effects)
  vector[n_species] sp_cov1; //species covariate 1
  vector[n_species] sp_cov2; //species covariate 2
  vector[n_point] pt_cov1; //point covariate 1
  vector[n_point] pt_cov2; //point covariate 2
  matrix[n_point, n_visit] vis_cov1; //visit covariate 1
}

parameters {
  real mu_b0;
  real<lower=0> sigma_b0;
  vector[n_species] b0_raw;
  
  real<lower=0> sigma_b1;
  matrix[n_species, n_cluster] b1_raw;
  
  real b2;
  
  real b3;
  
  real mu_b4;
  real<lower=0> sigma_b4;
  vector[n_species] b4_raw;
  
  real mu_b5;
  real<lower=0> sigma_b5;
  vector[n_species] b5_raw;
  
  real mu_b6;
  real<lower=0> sigma_b6;
  vector[n_species] b6_raw;
  
  real mu_d0;
  real<lower=0> sigma_d0;
  vector[n_species] d0_raw;
  
  real<lower=0> sigma_d1;
  matrix[n_species, n_point] d1_raw;
  
  real d2;
  real d3;
  real mu_d4;
  real<lower=0> sigma_d4;
  vector[n_species] d4_raw;
  
  real mu_d5;
  real<lower=0> sigma_d5;
  vector[n_species] d5_raw;
}

transformed parameters{
  vector[n_species] b0 = mu_b0 + b0_raw * sigma_b0;
  matrix[n_species, n_cluster] b1 = b1_raw * sigma_b1;

  vector[n_species] b4 = mu_b4 + b4_raw * sigma_b4;
  vector[n_species] b5 = mu_b5 + b5_raw * sigma_b5;
  vector[n_species] b6 = mu_b6 + b6_raw * sigma_b6;
  
  vector[n_species] d0 = mu_d0 + d0_raw * sigma_d0;
  matrix[n_species, n_point] d1 = d1_raw * sigma_d1;

  vector[n_species] d4 = mu_d4 + d4_raw * sigma_d4;
  vector[n_species] d5 = mu_d5 + d5_raw * sigma_d5;

  real logit_psi[n_point, n_species];
  real logit_theta[n_point, n_visit, n_species];
  matrix[n_point, n_species] log_prob_increment;
  for(i in 1:n_point){
    for(k in 1:n_species){
      logit_psi[i,k] = b0[k] + b1[k, clusterID[i]] + //b2*sp_cov1[k] + b3*sp_cov2[k] + 
                            b4[k]*pt_cov1[i]; // + b5[k]*pt_cov2[i] + b6[k]*sp_cov1[k]*pt_cov2[i]);
    }
  }
  for(i in 1:n_point){
    for(j in 1:n_visit){
      for(k in 1:n_species){
        logit_theta[i,j,k] = d0[k] + d2*sp_cov1[k] + // d3*sp_cov2[k] + 
                        // d4[k]*pt_cov1[i] + 
                        d5[k]*vis_cov1[i,j];
      }
    }
  }
  
  for(i in 1:n_point){
    for(k in 1:n_species){
      if(Q[i,k] == 1)
        log_prob_increment[i,k] = log_inv_logit(logit_psi[i,k]) + 
                                    bernoulli_logit_lpmf(det_data[i,1,k] | logit_theta[i,1,k]) + 
                                    bernoulli_logit_lpmf(det_data[i,2,k] | logit_theta[i,2,k]) +
                                    bernoulli_logit_lpmf(det_data[i,3,k] | logit_theta[i,3,k]) +
                                    bernoulli_logit_lpmf(det_data[i,4,k] | logit_theta[i,4,k]);
      else
        log_prob_increment[i,k] = log_sum_exp(log_inv_logit(logit_psi[i,k]) + log1m_inv_logit(logit_theta[i,1,k]) + 
                                                    log1m_inv_logit(logit_theta[i,2,k]) + log1m_inv_logit(logit_theta[i,3,k]) + 
                                                    log1m_inv_logit(logit_theta[i,4,k]), 
                                                log1m_inv_logit(logit_psi[i,k]));
    }
  }
}
 
model {
  //Hyper-priors:
  mu_b0 ~ normal(0,10);
  b2 ~ normal(0,10);
  b3 ~ normal(0,10);
  mu_b4 ~ normal(0,10);
  mu_b5 ~ normal(0,10);
  mu_b6 ~ normal(0,10);
  
  mu_d0 ~ normal(0,10);
  d2 ~ normal(0,10);
  d3 ~ normal(0,10);
  mu_d4 ~ normal(0,10);
  mu_d5 ~ normal(0,10);
  
  sigma_b0 ~ normal(0,10);
  sigma_b1 ~ normal(0,10);
  sigma_b4 ~ normal(0,10);
  sigma_b5 ~ normal(0,10);
  sigma_b6 ~ normal(0,10);
  
  sigma_d0 ~ normal(0,10);
  sigma_d1 ~ normal(0,10);
  sigma_d4 ~ normal(0,10);
  sigma_d5 ~ normal(0,10);
  
  //Random Effects
  b0_raw ~ normal(0, 1);
  to_vector(b1_raw) ~ normal(0, 1);
  b4_raw ~ normal(0, 1);
  b5_raw ~ normal(0, 1);
  b6_raw ~ normal(0, 1);
  
  d0_raw ~ normal(0, 1);
  to_vector(d1_raw) ~ normal(0, 1);
  d4_raw ~ normal(0, 1);
  d5_raw ~ normal(0, 1);
  
  //Likelihood (data level)
  target += sum(log_prob_increment);
}'


stan.data <- list(n_point = n_point, n_species = n_species, 
                  n_cluster = n_cluster, n_visit = 4,
                  det_data = det_data, 
                  Q = Q, 
                  clusterID = clusterID, 
                  sp_cov1 = sp_cov1, sp_cov2 = sp_cov2, 
                  pt_cov1 = pt_cov1, pt_cov2 = pt_cov2,
                  vis_cov1 = vis_cov1)

nc <- 4

start.time <- Sys.time()
stan.samples <- stan(model_code = stan.model, data = stan.data, iter = 2000, chains = nc, cores = nc,
                     pars = c('logit_psi', 'logit_theta', 'log_prob_increment', 'd1_raw', 'd1', 'b1_raw', 'b1'),
                     include = FALSE, refresh = 10)
elapsed <- Sys.time() - start.time

summary.samples <- summary(stan.samples)
which(summary.samples$summary[,9] == min(summary.samples$summary[,9]))  # parameter with the smallest n_eff (column 9)
max(summary.samples$summary[,10])  # largest R-hat (column 10)
object.size(stan.samples)

######

# 30 species, all params saved, small model:
#    min neff = 862
#    max R-hat = 1.004
#    grad. eval. range: .046 - .072
#    elapsed range: 7427 - 7490
#    object.size: 15178522976 bytes

# 30 species, few params saved, small model:
#    min neff = 809
#    max R-hat = 1.012
#    grad. eval. range: .049 - .077
#    elapsed range: 5649 - 6839


# 60 species, few params saved, 2000 iterations, small model:
#    min neff = 303.35
#    max R-hat = 1.022
#    grad. eval. range: .088 - .152
#    elapsed range: 13573 - 16148
#    object.size: 104002808 bytes

# 120 species, few params saved, 2000 iterations, small model:
#    min neff = 453.734
#    max R-hat = 1.014
#    grad. eval. range: .180 - .312
#    elapsed range: 25435 - 33187

# 240 species, few params saved, 2000 iterations, small model:
#    min neff = 229
#    max R-hat = 1.017
#    grad. eval. range: .363 - .531
#    elapsed range: 89908 - 97478

# 1500 species, few params saved, 2000 iterations, small model, CXX flags updated:
# Started near 1410h on 22/4/20
#    min neff =
#    max R-hat = 
#    grad. eval. range: 2.90 - 3.63
#    elapsed range: 

Just the Stan code

data {
  int<lower=1> n_point; //number of sites
  int<lower=1> n_visit; //fixed number of visits
  int<lower=1> n_species; //number of species
  int<lower=1> n_cluster; //number of clusters (points are grouped into clusters)

  int<lower=0, upper=1> det_data[n_point, n_visit, n_species]; //detection history
  int<lower=0, upper=1> Q[n_point, n_species]; //at least one detection

  int<lower=0, upper=n_cluster> clusterID[n_point]; //cluster identifier (for random effects)
  vector[n_species] sp_cov1; //species covariate 1
  vector[n_species] sp_cov2; //species covariate 2
  vector[n_point] pt_cov1; //point covariate 1
  vector[n_point] pt_cov2; //point covariate 2
  matrix[n_point, n_visit] vis_cov1; //visit covariate 1
}

parameters {
  real mu_b0;
  real<lower=0> sigma_b0;
  vector[n_species] b0_raw;
  
  real<lower=0> sigma_b1;
  matrix[n_species, n_cluster] b1_raw;
  
  real b2;
  
  real b3;
  
  real mu_b4;
  real<lower=0> sigma_b4;
  vector[n_species] b4_raw;
  
  real mu_b5;
  real<lower=0> sigma_b5;
  vector[n_species] b5_raw;
  
  real mu_b6;
  real<lower=0> sigma_b6;
  vector[n_species] b6_raw;
  
  real mu_d0;
  real<lower=0> sigma_d0;
  vector[n_species] d0_raw;
  
  real<lower=0> sigma_d1;
  matrix[n_species, n_point] d1_raw;
  
  real d2;
  real d3;
  real mu_d4;
  real<lower=0> sigma_d4;
  vector[n_species] d4_raw;
  
  real mu_d5;
  real<lower=0> sigma_d5;
  vector[n_species] d5_raw;
}

transformed parameters{
  vector[n_species] b0 = mu_b0 + b0_raw * sigma_b0;
  matrix[n_species, n_cluster] b1 = b1_raw * sigma_b1;

  vector[n_species] b4 = mu_b4 + b4_raw * sigma_b4;
  vector[n_species] b5 = mu_b5 + b5_raw * sigma_b5;
  vector[n_species] b6 = mu_b6 + b6_raw * sigma_b6;
  
  vector[n_species] d0 = mu_d0 + d0_raw * sigma_d0;
  matrix[n_species, n_point] d1 = d1_raw * sigma_d1;

  vector[n_species] d4 = mu_d4 + d4_raw * sigma_d4;
  vector[n_species] d5 = mu_d5 + d5_raw * sigma_d5;

  real logit_psi[n_point, n_species];
  real logit_theta[n_point, n_visit, n_species];
  matrix[n_point, n_species] log_prob_increment;
  for(i in 1:n_point){
    for(k in 1:n_species){
      logit_psi[i,k] = b0[k] + b1[k, clusterID[i]] + //b2*sp_cov1[k] + b3*sp_cov2[k] + 
                            b4[k]*pt_cov1[i]; // + b5[k]*pt_cov2[i] + b6[k]*sp_cov1[k]*pt_cov2[i]);
    }
  }
  for(i in 1:n_point){
    for(j in 1:n_visit){
      for(k in 1:n_species){
        logit_theta[i,j,k] = d0[k] + d2*sp_cov1[k] + // d3*sp_cov2[k] + 
                        // d4[k]*pt_cov1[i] + 
                        d5[k]*vis_cov1[i,j];
      }
    }
  }
  
  for(i in 1:n_point){
    for(k in 1:n_species){
      if(Q[i,k] == 1)
        log_prob_increment[i,k] = log_inv_logit(logit_psi[i,k]) + 
                                    bernoulli_logit_lpmf(det_data[i,1,k] | logit_theta[i,1,k]) + 
                                    bernoulli_logit_lpmf(det_data[i,2,k] | logit_theta[i,2,k]) +
                                    bernoulli_logit_lpmf(det_data[i,3,k] | logit_theta[i,3,k]) +
                                    bernoulli_logit_lpmf(det_data[i,4,k] | logit_theta[i,4,k]);
      else
        log_prob_increment[i,k] = log_sum_exp(log_inv_logit(logit_psi[i,k]) + log1m_inv_logit(logit_theta[i,1,k]) + 
                                                    log1m_inv_logit(logit_theta[i,2,k]) + log1m_inv_logit(logit_theta[i,3,k]) + 
                                                    log1m_inv_logit(logit_theta[i,4,k]), 
                                                log1m_inv_logit(logit_psi[i,k]));
    }
  }
}
 
model {
  //Hyper-priors:
  mu_b0 ~ normal(0,10);
  b2 ~ normal(0,10);
  b3 ~ normal(0,10);
  mu_b4 ~ normal(0,10);
  mu_b5 ~ normal(0,10);
  mu_b6 ~ normal(0,10);
  
  mu_d0 ~ normal(0,10);
  d2 ~ normal(0,10);
  d3 ~ normal(0,10);
  mu_d4 ~ normal(0,10);
  mu_d5 ~ normal(0,10);
  
  sigma_b0 ~ normal(0,10);
  sigma_b1 ~ normal(0,10);
  sigma_b4 ~ normal(0,10);
  sigma_b5 ~ normal(0,10);
  sigma_b6 ~ normal(0,10);
  
  sigma_d0 ~ normal(0,10);
  sigma_d1 ~ normal(0,10);
  sigma_d4 ~ normal(0,10);
  sigma_d5 ~ normal(0,10);
  
  //Random Effects
  b0_raw ~ normal(0, 1);
  to_vector(b1_raw) ~ normal(0, 1);
  b4_raw ~ normal(0, 1);
  b5_raw ~ normal(0, 1);
  b6_raw ~ normal(0, 1);
  
  d0_raw ~ normal(0, 1);
  to_vector(d1_raw) ~ normal(0, 1);
  d4_raw ~ normal(0, 1);
  d5_raw ~ normal(0, 1);
  
  //Likelihood (data level)
  target += sum(log_prob_increment);
}

Nah, that’s just for development, and it’ll probably be really slow in practice (it’ll definitely be super slow for models with more than 100 parameters or so because of the way it computes Hessians).

Definitely as you add more data your posterior changes, but I guess if you’re gonna generalize advice from somewhere then doing it from a subset of your data makes about as much sense as anything. This is probably the way to go.

Cool!

Is your simulated data set about the same size as what you expect to measure? If you didn’t change the dataset size as you added more parameters, then this might just be reflective of the sampler having difficulty sampling the prior. If that’s the case, is it possible to tighten up your prior (maybe using prior predictives as a guide: https://www.youtube.com/watch?v=ZRpo41l02KQ&t=2694)?
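
For example, a minimal prior-predictive sketch in R using the normal(0, 10) hyperpriors from your model, looking at just the species-level occupancy intercept on the probability scale:

# Draw hyperparameters from their priors, then one species intercept, and map it
# to an occupancy probability; with normal(0, 10) hyperpriors most of the prior
# mass piles up near probabilities of 0 and 1.
n_draws  <- 1000
mu_b0    <- rnorm(n_draws, 0, 10)
sigma_b0 <- abs(rnorm(n_draws, 0, 10))  # half-normal(0, 10)
b0       <- rnorm(n_draws, mu_b0, sigma_b0)
hist(plogis(b0))

If most of that mass sits at the extremes, tighter hyperpriors may be justified.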

reduce_sum just got released yesterday and is easier to work with: Reduce Sum: A Minimal Example
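
For the mechanics, here is a generic sketch (not the occupancy likelihood itself; `y`, `eta`, and `partial_sum` are placeholder names). For this model you would more likely slice over points or species and compute the `log_prob_increment` terms inside the partial-sum function:

functions {
  // partial sum over a slice of the outcomes; 'eta' is passed along unsliced
  real partial_sum(int[] y_slice, int start, int end, vector eta) {
    return bernoulli_logit_lpmf(y_slice | eta[start:end]);
  }
}
data {
  int<lower=1> N;
  int<lower=0, upper=1> y[N];  // binary outcomes
}
parameters {
  vector[N] eta;               // observation-level logits (placeholder)
}
model {
  eta ~ normal(0, 1);
  // slice y into chunks whose log-likelihood contributions are summed in
  // parallel; a grainsize of 1 lets the scheduler pick the chunk sizes
  target += reduce_sum(partial_sum, y, 1, eta);
}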


Out of curiosity (and I’m probably revealing my ignorance here), does 100 parameters here count the number of top-level hyperparameters or the entire sum of the number of levels of each random effect? I should be well below 100 of the former, but waaaaay above 100 of the latter (in fact, I hope to have a single random effect term with a level for each species-by-siteCluster, or ~400,000 levels).

What would ‘going this way’ look like in practice? In this model, there is one binary detection/non-detection data point for each species-site-visit combination. Both species and site are involved in random effects terms, and reducing the number of visits below 3 (in my real data I have 4) runs into identifiability issues. So there’s no way to study a smaller dataset without changing the number of levels in the random effects terms. I’ve already found (see below) that varying the data size by varying the number of species leads to substantial changes in the mixing behavior, so it’s very unclear to me how to generalize results on optimal warmup time from these smaller cases to the big case (if that’s even possible).

By the way, something like “just use 5000 warmup iterations and hope for the best” is ultimately an acceptable answer to me if that’s genuinely best practice for my use case.

No, what I’ve done so far is to add new data for new levels of ‘species’, so that the 240-species example has 8 times as much data as the 30-species example (in this model, a data point is a binary detection/non-detection outcome for a species-site-visit combination; I’ve held sites and visits constant and varied the number of species in the model).

Sounds like this might not be relevant anyway, but is there really any way that a bunch of independent normal distributions on random-effect hypermeans and independent half-normals on the variances could yield a prior that is substantially difficult to sample?

Wow that’s fantastic!!! Guess it’s time for me to move over to CmdStanR!


I think an example of this is the centered versus non-centered parameterization, where depending on whether your data is in the tall regime or the wide regime, one tends to sample better than the other. When you have a multivariate model this can get complicated; I suspect that some parameters might be better sampled with the non-centered parameterization and some with the centered one, but I haven’t seen any case studies supporting this intuition.
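
For concreteness, using the species intercept `b0` from the model above (the posted code already uses the second, non-centered form):

// Centered parameterization: sample b0 directly given its hyperparameters
parameters {
  vector[n_species] b0;
}
model {
  b0 ~ normal(mu_b0, sigma_b0);
}

// Non-centered parameterization (as in the posted model): sample a standardized
// b0_raw and rescale it, which typically mixes better when the data per group are sparse
parameters {
  vector[n_species] b0_raw;
}
transformed parameters {
  vector[n_species] b0 = mu_b0 + b0_raw * sigma_b0;
}
model {
  b0_raw ~ normal(0, 1);
}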


I think I must be confused about terminology here. When I read “the sampler having difficulty sampling the prior”, I thought that it literally just meant drawing samples from the prior (i.e. simulating from the prior, which in my case is weak and not highly curved regardless of how it’s parameterized; it’s the posterior that might have high curvature that’s ameliorated by reparameterization). I guess “difficulty sampling the prior” is properly interpreted to mean something else here?

In addition to reduce_sum, some indexing magic would probably make this much faster. @Bob_Carpenter and @bbbales2 recently made one big hierarchical logistic regression model many times faster, and they may have some ideas for your code, too.

My main points are about how to make warmup shorter:

  • if the number of parameters is the same with smaller and bigger data, use the smaller data to get a mass matrix estimate and a guess for initial values
  • if the number of parameters changes for bigger data, use the smaller experiments to figure out the approximate location and scale for the different types of parameters. It doesn’t matter if you don’t get them exactly right; even getting, e.g., the magnitude of the scale about right helps. Then use offset and multiplier as shown, e.g., at https://mc-stan.org/docs/2_23/reference-manual/univariate-data-types-and-variable-declarations.html (see the sketch below). Using offset and multiplier makes the posterior have a scale closer to 1, which is the initial guess for the mass matrix and step size.
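
A minimal sketch of what that declaration looks like, with placeholder location/scale values standing in for numbers that would come from the pilot runs:

parameters {
  // Placeholder values: suppose small-data pilot runs suggest the detection
  // hypermean sits around -2 with a scale of about 0.5. Declaring it with
  // offset/multiplier makes its unconstrained scale roughly 1 at initialization,
  // matching the sampler's initial mass matrix and step size guesses.
  real<offset=-2, multiplier=0.5> mu_d0;
  real<offset=0, multiplier=0.5> mu_b0;  // another placeholder example
}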

Also

  • use cmdstanr and follow the sampling by looking at the output files, so that if something is clearly wrong you can stop early and not wait for a week.

These two help to start the sampling with a mass matrix and step size closer to useful values, so the warmup can then be made shorter.
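
As a rough sketch of the cmdstanr suggestion (file and directory names are placeholders): the output CSVs are written to disk while the sampler runs, so the chains can be inspected, e.g. with `tail -f`, long before the fit finishes.

library(cmdstanr)

# Compile and run, writing CSV output to a known directory so the chains can be
# monitored from the shell while they are still running.
mod <- cmdstan_model("occupancy.stan")  # placeholder file name
fit <- mod$sample(data = stan.data, chains = 4, parallel_chains = 4,
                  iter_warmup = 1000, iter_sampling = 1000,
                  refresh = 100, output_dir = "cmdstan_output")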


All of these would be the parameters I’m counting :D.

Well probably this then lol. And what Aki said. Use cmdstan (probably a good idea for long running jobs anyway) and watch the output as it comes in.

What are your slowest mixing parameters? Is there a way to simplify the model structure there? One binary term per interaction sounds really hard to fit, honestly. Like binary data isn’t very informative.

If we want to estimate a basketball player’s shooting percentage and we just have them shoot shots, they need to shoot a few hundred shots before we can estimate that percentage to within a few percentage points.
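
Concretely, the standard error of a binomial proportion is

$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}},$$

so at $p = 0.5$ it takes $n \approx 400$ shots to get the standard error down to about $0.025$, i.e. roughly $\pm 2.5$ percentage points.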

A basic hierarchical model is a funnel, and that’s the prototypical bad example.

Then you non-center it, so I guess if everything you have is already non-centered and whatnot it’s okay.

But if there’s any deviations from that it still might be difficult.

It just means running your model without any likelihood terms.

I agree with Aki’s points if that’s possible given the difficulty with taking a data subset, or whatever.

The slowest mixing parameters tend to be the hypermeans for the coefficients on occupancy or detection. I haven’t noticed a strong pattern in exactly which of these coefficients are worst, except that they always involve a random effect of species, and most (but not all) of the time they involve just a 1500-level random effect with one level per species (each level then has 900 sites x 4 visits binary data points which are used to separately identify the occupancy probability and the detection probability).

None of my random effects is quite so bad as to contain only one binary data point (or even one site, which is effectively one data-point in the occupancy sub-model but consists of four binary detection/nondetection data points), but in the occupancy sub-model I have a random intercept for species-by-siteCluster, with each cluster consisting of three sites; i.e. three “four-visit data points” per level. This sort of small sample size within the levels isn’t unusual for logistic regressions predicting species occurrence in ecology. In my case, the sites come in blocks of three, and we need to account for this in analysis. The hope is that with well-specified covariate terms the variance of the massively multi-level random effect terms will be sufficiently small to yield a substantial amount of pooling.

Total brain fart on my part. For some reason my brain only went as far as sampling the hyperparameters from their priors, rather than the full model. D’oh.

Did you ever get the full model to run? Any tips learnt along the way? What compute platform did you settle on?

I did get the full model to run (and even more complicated variations with up to 250K parameters), but the Stan code ultimately only vaguely resembles the code in my original post here. Is your question specific to occupancy modeling, or is it generally about running long-compute-time models?

I mostly ran the model on a Linux cluster, allocating (depending on the run) between 5 and 10 cores per chain with reduce_sum; this gave me wall times of less than a week for 1500 warmup and 1500 sampling iterations, with good convergence and effective sample sizes. More recently, I’ve done a couple of chains parallelized on 2 cores each on a new Apple M1, and this was only marginally slower than using 5 cores per chain on the cluster.
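
Roughly, the within-chain parallelization setup looks like this in cmdstanr (the model file name and thread counts are placeholders):

library(cmdstanr)

# Enable threading at compile time so reduce_sum can parallelize within a chain,
# then give each chain several threads.
mod <- cmdstan_model("occupancy_reduce_sum.stan",  # placeholder file name
                     cpp_options = list(stan_threads = TRUE))
fit <- mod$sample(data = stan.data, chains = 4, parallel_chains = 4,
                  threads_per_chain = 5,           # e.g. 5 cores per chain
                  iter_warmup = 1500, iter_sampling = 1500)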

How did you decide on the number of samples in the end? Why didn’t you just resample until you had reasonable Neff?

That’s roughly what I did, but importantly I managed to get a model that yielded better ESS per iteration, so it was less of an issue. The bigger question here was how to choose the warmup length, and in the end I just played with the model until I had something that worked and fit within the resources and timeframe available to me, and went with that.


Have you seen campfire?

Yeah; at the time @bbbales2 told me that it would scale very poorly to models with huge numbers of parameters, so I didn’t pursue it further…

@bbbales2 Would that be because it’s computing the diagnostics so frequently? If so, would it make sense to compute the diagnostics only on lp__?

Or maybe @yizhang could chime in on the lots-of-parameters case?

@mike-lawrence from way upthread, here’s what @bbbales2 had to say about campfire (second post on this thread, for context). No idea whether this information is still current.


Huh. I guess I don’t understand campfire (or warmup more generally) as much as I thought, then; I didn’t think it was computing much extra compared to standard warmup other than R-hat & ESS (which, looking just now, seems to be using lp__ only, at least in @yizhang’s implementation) plus a pooled e matrix?


I don’t think anyone can say anything definitive on how the algorithm will fare on large-scale models. On the other hand I’m not computing Hessians in my implementation.
