Hello,
I am trying to fit a Poisson regression in Stan, through brms in R. My prior is specified as a multivariate normal distribution, with non-zero mean, and non-zero correlations between different variables.
I have two models, one with ~350 features, the other with ~700. I noticed the larger one was taking significantly longer to sample from (about 30x, even with the same number of data points). I would expect the larger model to take longer (maybe 4x or 8x) but not 30x. So I tried to debug by setting the sample_prior = "only"
argument in brms, which, as the name suggests, ignores data and samples from the prior only. I fed the prior into the first model, and it took about 52 seconds for one chain, 500 warmup samples, 1750 total. For the second model, it takes well over 30 minutes [EDIT: first chain clocked in at 44 minutes] to achieve the same number of samples.
Here is the R code:
library(brms)
library(mvtnorm) # for initFunctions
load("priors.Rda")
prior1 <- set_prior("multi_normal(mean1, vcov1)")
stanvars1 <- stanvar(mean1) + stanvar(vcov1)
initFunction1 <- function() {
list(b = rmvnorm(1, mean = mean1, sigma = vcov1)[1, ])
}
# Fake data, but shouldn't be used anyway
X = data.frame(t(as.matrix(c(rep(0, 100), 1, 1, rep(0, 261), 75), ncol = 1)))
colnames(X)[364] = 'y'
model1 <- brm(
y ~ . + 0,
family = brmsfamily("poisson", link = "identity"),
data = X,
prior = prior1,
stanvars = stanvars1,
init = initFunction1,
chains = 4,
iter = 1750, warmup = 500, cores = 4, refresh = 50,
sample_prior = "only"
)
prior2 <- set_prior("multi_normal(mean2, vcov2)")
stanvars2 <- stanvar(mean2) + stanvar(vcov2)
initFunction2 <- function() {
list(b = rmvnorm(1, mean = mean2, sigma = vcov2)[1, ])
}
# Fake data, but shouldn't be used anyway
X = data.frame(t(as.matrix(c(rep(0, 100), 50, rep(0, 300), 50, rep(0, 324), 100), ncol = 1)))
colnames(X)[727] = 'y'
model2 <- brm(
y ~ . + 0,
family = brmsfamily("poisson", link = "identity"),
data = X,
prior = prior2,
stanvars = stanvars2,
init = initFunction2,
chains = 4,
iter = 1750, warmup = 500, cores = 4, refresh = 50,
sample_prior = "only"
)
priors.Rda
is an Rda file containing mean1
, vcov1
, mean2
, vcov2
(before I realized I couldn’t upload Rda). But the individual objects are attached to this post as CSVs. The sample X
is fake data, but it does reflect the sparsity of my actual data (not that that matters when just sampling from the prior).
You’ll notice the scales of the two datasets are different. I tried to increase the scale of the second dataset to match that of the first, and it didn’t make any difference. I also tried to scale the first column in the second dataset (to make it more in line with the rest of the columns), and that also made no difference.
What gives? My model is just a simple Poisson GLM, so I doubt it is misspecified, plus I am just sampling from the prior anyway, so the model specification shouldn’t matter at all.
mean1.csv (8.0 KB)
mean2.csv (16.4 KB)
vcov1.csv (2.2 MB)
vcov2.csv (7.5 MB)