Sample_prior = "only" and size of dataset


I am running some prior predictions using sample_prior ‘only’.

Just now, everything is very slow, I suspect this is because I have over 100k datapoints. As I’m only sampling the prior at this stage, I should be able to save a lot of time by handing the model a much smaller dataset that only contains 1 observation per cell.

is this reasonable?

on a related note, can anybody recommend any resources on setting priors for random effect structures? Just now I am leaving all those to the defaults, but they appear to give me a really wild range of prior predictions.


are you sure it’s because n=1e5? What are your priors? When sampling only priors then you don’t have to run that many iterations (and perhaps even use one chain only).


and thanks for your reply. I think it is due to the sample size, as I’ve reduced things to 1 row per cell (I’m modelling a cog psych experiment, so there are many repetitions of each experimental condition)and things run pretty quickly.

But I am a little puzzled as to why the size of the dataset should matter, so perhaps I am wrong on this!

Just now, my model is:

m <- brm(
rt ~ 0 + d_feature + log(Nd):d_feature + (log(Nd):d_feature|p_id),
data = df,
family = lognormal(),
prior = my_priors,
chains = 1,
sample_prior = ppc,
iter = n_itr)


my_priors <- c(
prior_string(“normal(-0.5, 0.1)”, class = “b”, coef = intercepts),
prior_string(“normal(0, 1)”, class = “b”, coef = slopes))

(d_feature has 6 levels, and I’m basically fitting y = bxNd + c for each level of d_feature.

This when I plot this, things look reasonable (first row… the other rows are exploring using a lognormal and shifted_lognormal).

I’m now fitting the model to the data… and it seems unreasonably slow. I suspect need to either simplify my random effect structure or, use better priors.

The time required to sample the prior will depend on the size of the data set (really, the ‘prior’ is conditional on the specific, observed values of the covariates). There are two contributors:

  • The amount of time it takes to compute X\beta / the linear predictor (this can be large even when doing simple Monte Carlo to sample the prior predictive).
  • Very broad / unevenly scaled priors that necessitate a large number of HMC leapfrog steps per iteration.

The line between prior and posterior gets very fuzzy here; It is useful to keep in mind that to Stan there is only a ‘log target density’, and you can fall into all of the usual pitfalls whether your target is a prior or a posterior.

I’m not sure this is true – the prior can be just as difficult to sample as the posterior, particularly for high dimensional parameters. You don’t know until you run your MCMC sampler (in cases where you can’t do simple Monte Carlo)

Some questions / code to run:

  • How many rows/column are left in your reduced matrix for your prior (and how many in your posterior? You can use prior_data <- make_standata(....) from BRMS to generate the Stan data, and call dim(prior_data$X) and dim(prior_data$Z_*) (I don’t remember the naming convention for Z off the top of my head)
  • How many leapfrog steps per iteration are being taken for the prior/posterior (you can check with rstan::get_num_leapfrog_per_iteration(brms_fit@fit)

you could do prior checks without sampling from the data. Simply randomly sample from each prior and bunch them together with you likelihood (perhaps using a link function) and then plot them?

Are there any useful wrapper functions for this? I’ve done things this way before, but it sometimes feels like it requires quite a bit of long-winded code! [Not terribly difficult, but sometimes a little fiddly]

The nice thing about using sample_priors = “only” is that I can then plot my prior and posterior in the same way.

No, sorry to say I do it manually always :( And I agree, it’s always nicer to use the methods we already have to plot posteriors :)

Unless your model explicitly includes measurement error, then it is rare to put a ‘prior’ on the design matrix. When doing linear regression: Y = X\beta + \varepsilon, we simulate from the prior predictive distribution for Y by simulating values of \beta and \varepsilon from their prior densities, then computing the appropriate value of Y. We (well, at least I do not) typically do not simulate a new design matrix X for each simulated value of (\beta, \varepsilon), mostly because there is no explicit generative model for X. I think ‘is X data or not’ is more of a philosophical question, but say the columns of X represent continuous covariates and I add a large number to every entry in X. If we keep the prior on \beta the same, then the prior predictive distribution for Y will change substantially.

Sometimes it is possible to code the generative model in Stan (using the generated quantities block and the relevant _rng functions) and call Stan with algorithm = 'fixed_param', which at least gets the samples into a known format (which can be passed to bayesplot, etc).

Of course I meant when we don’t put them in X, i.e., a design matrix.

1 Like