Simulating Data Using

In order to overcome some of my bad habits, I’ve begun to simulate data before fitting a model to my actual data.

Where the models are simple, this is fine. But it can become arduous quickly. In the case of some of brms’ functions, I am aware of what they are for, but have no clue how I could simulate data for them. This problem compounds itself when more than one of these functions are included in the model.

The obvious solution is: learn how to simulate the data. Yes, that’s true and worth doing. But there must still be a case for a convenience function that does the same job.

My question, then, is whether it’s possible to specify arbitrary parameters for any brms model, for example by specifying priors for each parameter in the model, from which I can simulate a data set for later use?

1 Like

The workflow is to specify priors and set sample_prior = "only" in brm. Then, use posteror_predict to obtain simulated responses.

4 Likes

Thanks, Paul. That’s excellent.

Hi there, I’m trying this out at the moment as well. For a hierarchical model. Something is not right… Is it meant to work for hierarchical models? I’ve put something like this:
To make the priors:

bprior <- c(
prior(normal(-0.12,0.04),class = "Intercept"),
prior(normal(-0.14,0.01),class="b",coef="Coef1"),
prior(normal(0.66,0.03),class="sd",coef="Intercept",group="ParticipantID"),
prior(normal(0.07,0.01),class="sd",coef="Coef1",group="ParticipantID"),
prior(normal(0.58,0.01),class="sigma"))

Then I’ve made simulations like this:

simulModel= brm(y~1+Coef1+(1+Coef1|ParticipantID),data=df,sample_prior="only")

newY=predict(simulModel)

And finally I tried to fit simulation like this (first using adding newY to data frame df->‘newdf’)

fit= brm(newY ~ 1+Coef1+(1+Coef1|ParticipantID),data=newdf)

From the results it looks like the parameters that were used to make the simulation were without any noise:
sd(Intercept) and sd(Coef1) is 0.01.
Est.Error on Intercept and Coef1 are 0 and sigma is 0.01

Did I do something wrong trying to make the simulation or would this anyway not work for a hierarchical simulation? What I would try to do coding manually would be to first draw regression weights for each person from the normal distribution across people and then create predictions for each measurement per person.

1 Like

Hi Jacquie,

I think your approach should work for hierarchical models as well. One thing i noticed is that you dont use your priors in simulModel which is probably just a typo.

I think what might be the problem here might be the newY = predict(simulModel) uses the default summary = TRUE which gives the means of all samples and therefore does already average out the sampling-noise, which is probably why your parameters do not show the correct SE. Changing this to newY=predict(simulModel, newdata = df, summary=FALSE, nsamples=1)[1,] should do the trick.

Long answer + example

Data generation

library(sjPlot)
library(ggplot2)
library(MASS)
library(brms)
library(plyr)
library(gridExtra)


# data generation function
gen_design <- function(n_participant, n_stimulus, n_levels){
  participants <- rep(1:n_participant, each = n_stimulus)
  #stims <- rep(1:n_stimulus, times = n_participant)
  x <- c()
  stims <- as.vector(sapply(seq(1, n_participant), function(x) sample(seq(1,n_stimulus), size = n_stimulus)))
  for(i in 1:n_participant){
    x <- append(x, seq(0,(n_levels-1)))
    #x <- append(x, sapply(seq(1,n_stimulus), function(x) sample(seq(1,n_levels), size = n_levels)))
  }
  expdesign <- data.frame(participant = participants, stimulus = stims, x = x)
  return(expdesign)
}

sp <- list(n_participant = 30, n_stimulus = 20, n_levels =2) # data with 30 participants each seeing 20 stimuli (which will not be modelled in this example) with 2 levels (10 obs per participant per cell)
exp1 <- gen_design(n_participant = sp$n_participant, n_stimulus = sp$n_stimulus, n_levels = sp$n_levels)

priors

all_priors <- c(set_prior("normal(10,1)", class = "Intercept"), 
                set_prior("normal(1,1)", class = "b"),
                set_prior("normal(5,1)", class = "sd", coef = "Intercept", group = "participant"),
                set_prior("normal(2,1)", class = "sd", coef = "x", group = "participant"),
                set_prior("normal(10,1)", class = "sigma"))

fit intial model

y <- rnorm(1:nrow(exp1), 1, 0) # just some arbitrary values for this column
m1 <- brm(y ~ x + (1 + x | participant), data = exp1, sample_prior = "only", prior = all_priors)

make and fit predictions

new_y_sum  <- predict(m1)[,1]
new_y<- predict(m1, newdata = exp1, summary = F, nsamples = 1)[1,]
exp1$new_y_sum <- new_y_sum
exp1$new_y <- new_y
m1_new_sum <- brm(new_y_sum ~ x + (1 + x | participant), data = exp1, chains = 4, cores = 4)
m1_new <- brm(new_y ~ x + (1 + x | participant), data = exp1, chains = 4, cores = 4)

model output for predict(m1) - too small SEs

fixef(m1_new_sum)
          Estimate  Est.Error      Q2.5    Q97.5
Intercept 9.488095 0.01631371 9.4562210 9.520331
x         1.009386 0.01550281 0.9784959 1.039939

model output for predict(m1, summary = F) intended SEs

fixef(m1_new)
            Estimate Est.Error      Q2.5     Q97.5
Intercept 10.0320100 1.1200567  7.892897 12.145149
x          0.3468883 0.8816665 -1.388313  2.085174
More formal simualtion of SEs for predict(m1, summary = F)
set.seed(1)
fixef_vec_int <- list()
fixef_vec_x <- list()
fixef_vec_x_lb <- c()

for(i in 1:1000){
  new_y <- predict(m1, newdata = exp1, summary = F, nsamples = 1)[1,]
  exp1$new_y <- new_y
  m1_tmp <- update(m1_new, newdata = exp1, chains = 1)
  fixef_vec_int[[i]] <- c(fixef(m1_tmp)[1], fixef(m1_tmp)[3])
  fixef_vec_x[[i]] <- c(fixef(m1_tmp)[2], fixef(m1_tmp)[4])
  fixef_vec_x_lb[i] <- fixef(m1_tmp)[6]
}

fixef_x_means <- unlist(lapply(1:100, function(x) fixef_vec_x[[x]][1]))
fixef_x_sds <- unlist(lapply(1:100, function(x) fixef_vec_x[[x]][2]))

mean(fixef_x_means) # the mean of the coef-prior
mean(fixef_x_sds) # the mean of the sampled SE

That sounds correct to me as well. Schad, Betancourt, & Vasishth have a good paper + script for these kind of simulations for mixed models.

Hope this helps,
Julian

I would just add that you should thoroughly investigate the re_formula and sample_new_levels arguments to predict to ensure you’re getting what you expect.

1 Like

thank you, that’s really useful!

1 Like