Simulating Data Using

jackbailey · October 4, 2018, 6:34pm

In order to overcome some of my bad habits, I’ve begun to simulate data before fitting a model to my actual data.

Where the models are simple, this is fine. But it can become arduous quickly. In the case of some of brms’ functions, I am aware of what they are for, but have no clue how I could simulate data for them. This problem compounds itself when more than one of these functions are included in the model.

The obvious solution is: learn how to simulate the data. Yes, that’s true and worth doing. But there must still be a case for a convenience function that does the same job.

My question, then, is whether it’s possible to specify arbitrary parameters for any brms model, for example by specifying priors for each parameter in the model, from which I can simulate a data set for later use?

paul.buerkner · October 4, 2018, 6:42pm

The workflow is to specify priors and set sample_prior = "only" in brm. Then, use posteror_predict to obtain simulated responses.

jackbailey · October 4, 2018, 7:03pm

Thanks, Paul. That’s excellent.

Jacquie · June 29, 2020, 9:22am

Hi there, I’m trying this out at the moment as well. For a hierarchical model. Something is not right… Is it meant to work for hierarchical models? I’ve put something like this:
To make the priors:

bprior <- c(
prior(normal(-0.12,0.04),class = "Intercept"),
prior(normal(-0.14,0.01),class="b",coef="Coef1"),
prior(normal(0.66,0.03),class="sd",coef="Intercept",group="ParticipantID"),
prior(normal(0.07,0.01),class="sd",coef="Coef1",group="ParticipantID"),
prior(normal(0.58,0.01),class="sigma"))

Then I’ve made simulations like this:

simulModel= brm(y~1+Coef1+(1+Coef1|ParticipantID),data=df,sample_prior="only")

newY=predict(simulModel)

And finally I tried to fit simulation like this (first using adding newY to data frame df->‘newdf’)

fit= brm(newY ~ 1+Coef1+(1+Coef1|ParticipantID),data=newdf)

From the results it looks like the parameters that were used to make the simulation were without any noise:
sd(Intercept) and sd(Coef1) is 0.01.
Est.Error on Intercept and Coef1 are 0 and sigma is 0.01

Did I do something wrong trying to make the simulation or would this anyway not work for a hierarchical simulation? What I would try to do coding manually would be to first draw regression weights for each person from the normal distribution across people and then create predictions for each measurement per person.

julianquandt · July 8, 2020, 2:23pm

Hi Jacquie,

I think your approach should work for hierarchical models as well. One thing i noticed is that you dont use your priors in simulModel which is probably just a typo.

I think what might be the problem here might be the newY = predict(simulModel) uses the default summary = TRUE which gives the means of all samples and therefore does already average out the sampling-noise, which is probably why your parameters do not show the correct SE. Changing this to newY=predict(simulModel, newdata = df, summary=FALSE, nsamples=1)[1,] should do the trick.

Long answer + example

Data generation

library(sjPlot)
library(ggplot2)
library(MASS)
library(brms)
library(plyr)
library(gridExtra)


# data generation function
gen_design <- function(n_participant, n_stimulus, n_levels){
  participants <- rep(1:n_participant, each = n_stimulus)
  #stims <- rep(1:n_stimulus, times = n_participant)
  x <- c()
  stims <- as.vector(sapply(seq(1, n_participant), function(x) sample(seq(1,n_stimulus), size = n_stimulus)))
  for(i in 1:n_participant){
    x <- append(x, seq(0,(n_levels-1)))
    #x <- append(x, sapply(seq(1,n_stimulus), function(x) sample(seq(1,n_levels), size = n_levels)))
  }
  expdesign <- data.frame(participant = participants, stimulus = stims, x = x)
  return(expdesign)
}

sp <- list(n_participant = 30, n_stimulus = 20, n_levels =2) # data with 30 participants each seeing 20 stimuli (which will not be modelled in this example) with 2 levels (10 obs per participant per cell)
exp1 <- gen_design(n_participant = sp$n_participant, n_stimulus = sp$n_stimulus, n_levels = sp$n_levels)

priors

all_priors <- c(set_prior("normal(10,1)", class = "Intercept"), 
                set_prior("normal(1,1)", class = "b"),
                set_prior("normal(5,1)", class = "sd", coef = "Intercept", group = "participant"),
                set_prior("normal(2,1)", class = "sd", coef = "x", group = "participant"),
                set_prior("normal(10,1)", class = "sigma"))

fit intial model

y <- rnorm(1:nrow(exp1), 1, 0) # just some arbitrary values for this column
m1 <- brm(y ~ x + (1 + x | participant), data = exp1, sample_prior = "only", prior = all_priors)

make and fit predictions

new_y_sum  <- predict(m1)[,1]
new_y<- predict(m1, newdata = exp1, summary = F, nsamples = 1)[1,]
exp1$new_y_sum <- new_y_sum
exp1$new_y <- new_y
m1_new_sum <- brm(new_y_sum ~ x + (1 + x | participant), data = exp1, chains = 4, cores = 4)
m1_new <- brm(new_y ~ x + (1 + x | participant), data = exp1, chains = 4, cores = 4)

model output for predict(m1) - too small SEs

fixef(m1_new_sum)

          Estimate  Est.Error      Q2.5    Q97.5
Intercept 9.488095 0.01631371 9.4562210 9.520331
x         1.009386 0.01550281 0.9784959 1.039939

model output for predict(m1, summary = F) intended SEs

fixef(m1_new)

            Estimate Est.Error      Q2.5     Q97.5
Intercept 10.0320100 1.1200567  7.892897 12.145149
x          0.3468883 0.8816665 -1.388313  2.085174

More formal simualtion of SEs for predict(m1, summary = F)

set.seed(1)
fixef_vec_int <- list()
fixef_vec_x <- list()
fixef_vec_x_lb <- c()

for(i in 1:1000){
  new_y <- predict(m1, newdata = exp1, summary = F, nsamples = 1)[1,]
  exp1$new_y <- new_y
  m1_tmp <- update(m1_new, newdata = exp1, chains = 1)
  fixef_vec_int[[i]] <- c(fixef(m1_tmp)[1], fixef(m1_tmp)[3])
  fixef_vec_x[[i]] <- c(fixef(m1_tmp)[2], fixef(m1_tmp)[4])
  fixef_vec_x_lb[i] <- fixef(m1_tmp)[6]
}

fixef_x_means <- unlist(lapply(1:100, function(x) fixef_vec_x[[x]][1]))
fixef_x_sds <- unlist(lapply(1:100, function(x) fixef_vec_x[[x]][2]))

mean(fixef_x_means) # the mean of the coef-prior
mean(fixef_x_sds) # the mean of the sampled SE

That sounds correct to me as well. Schad, Betancourt, & Vasishth have a good paper + script for these kind of simulations for mixed models.

Hope this helps,
Julian

franzsf · July 8, 2020, 2:42pm

I would just add that you should thoroughly investigate the re_formula and sample_new_levels arguments to predict to ensure you’re getting what you expect.

Jacquie · July 13, 2020, 7:11am

thank you, that’s really useful!

Topic		Replies	Views
Simulation using brms General brms	3	726	February 18, 2021
How to use fit models to create new data Modeling simulation-based-calibration , brms	1	56	August 5, 2024
Advice on model structure brms	3	676	October 30, 2019
Help understanding prior simulation in brms - dummy coding not giving expected result Modeling brms	1	1101	May 25, 2023
Simulating datasets given a model with brms brms	4	1488	July 30, 2020

Simulating Data Using

Related topics