For my work in genomics, I want to fit a model that has about 10,000 parameters, and the posterior is probably going to be multimodal. I am aware that this may not be possible using any out-of-the-box software, so I am currently trying to assess my options and break it down into smaller questions. I combed through the examples on this site, and some seem to regard even 100 parameters as high dimensional. But, I don’t have much of a sense for that yet. So, one of the smaller questions I have identified is: how quickly does Stan’s performance degrade as the dimension increases? And what exactly is the limiting factor that makes higher dimensions hard?
I ran the following code to test Stan on an iid Gaussian distribution of arbitrary dimension. It generates two samples, one from Stan and the other using `rnorm`, and compares them by training a random forest classifier to discriminate between them.
Up to D=3, accuracy is about 50%, meaning Stan and `rnorm` are indistinguishable. At D=4 and 5, accuracy creeps up to 51%. At D=10, it is 53%. At D=100, it's 58%. At D=1000, it's 63%, meaning there are regions where Stan's samples or `rnorm`'s samples are enriched, but those regions contain only a small fraction of the data. In each case, sampling is very fast; the random forest actually takes longer to run than Stan does!
This is an easy case – unimodal with contours of constant curvature – but it’s still much better performance than I expected. It seems to me there’s nothing intrinsically bad about running HMC in high dimensional parameter spaces, and I should focus on other issues such as multimodality and funnels.
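One piece of intuition behind this (my own aside, not part of the experiment): for an iid standard Gaussian, nearly all the mass sits in a thin shell of radius about sqrt(D), and the width of that shell stays roughly constant as D grows. A gradient-based sampler like HMC can follow the geometry into that shell, which is consistent with the benign scaling above. A quick NumPy sketch of the concentration:

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (10, 100, 1000):
    # Draw 10,000 iid standard Gaussian points in D dimensions.
    x = rng.standard_normal((10000, D))
    # Distance of each point from the origin.
    r = np.linalg.norm(x, axis=1)
    # The mean radius tracks sqrt(D), while the spread stays ~0.7
    # regardless of D: the "typical set" is a thin shell.
    print(D, np.sqrt(D), r.mean(), r.std())
```

The limiting factors people usually worry about are not dimension per se but geometry: varying curvature, funnels, and multimodality.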
Do you agree with my interpretation? How would you make this experiment more informative?
Here’s my code (with RStan).
data {
  // Dimension will be provided by an R wrapper.
  int<lower=0> D;
}
parameters {
  // A big fat iid standard Gaussian
  vector[D] X;
}
model {
  // Vectorized; equivalent to looping over each element.
  X ~ normal(0, 1);
}
set.seed(0)
library("rstan")
library("randomForest")
library("magrittr")
setwd("~/Desktop/software_projects/cellmatch_notes/experiments/iid_gaussian")
stanfile = "stan_high_dim.stan"
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
run_gaussian_benchmark = function(D) {
  fitted_mc = rstan::stan(file = stanfile, data = list(D = D), seed = 0)
  # Drop the lp__ column, keeping only the D parameter columns.
  samples_mc = as.matrix(fitted_mc)[, -(D + 1)]
  # A reference sample of the same size, drawn directly with rnorm.
  samples_genuine = matrix(rnorm(prod(dim(samples_mc))),
                           nrow = nrow(samples_mc),
                           ncol = ncol(samples_mc))
  dimnames(samples_genuine) = dimnames(samples_mc)
  both = rbind(samples_mc, samples_genuine)
  label = c(rep(0, nrow(samples_mc)),
            rep(1, nrow(samples_genuine)))
  forest_fitted = randomForest::randomForest(x = both, y = factor(label))
  # $confusion has a third column, class.error; drop it before normalizing
  # so the out-of-bag accuracy comes out of the counts alone.
  conf = forest_fitted$confusion[, 1:2] / sum(forest_fitted$confusion[, 1:2])
  acc = sum(diag(conf))
  print(conf)
  print(acc)
  plot(colMeans(samples_mc) - colMeans(both), forest_fitted$importance)
  return(list(forest = forest_fitted, acc = acc))
}
run_gaussian_benchmark(D = 2)
run_gaussian_benchmark(D = 3)
run_gaussian_benchmark(D = 4)
run_gaussian_benchmark(D = 5)
run_gaussian_benchmark(D = 10)
run_gaussian_benchmark(D = 100)
run_gaussian_benchmark(D = 1000)