How to simulate data?

Jordan_Howell · January 19, 2020, 8:22pm

As I learn more and more about STAN and the bayesian workflow, I am getting stuck on the step of simulating the data. It’s easy enough to simulate my target variable as it follows a poisson or gamma distribution for a GLM. But how about the input variables? Once you get to up to 10 or 15 variables, how can one accurately simulate the different combinations that a matrix of those variables can take? Is ther e a best practice out there to accomplish this?

jtimonen · January 20, 2020, 1:29am

It depends on what your input variables are. In the most general case you would need to learn the joint distribution of the input variables without parametric assumptions, a difficult task on its own. If you are using GLMs, I assume that your input variables are continuous. Then if you assume that the input variables are jointly Gaussian, you could compute a sample mean vector \mu and sample covariance matrix \Sigma, and then simulate your input data by sampling from the corresponding multivariate Gaussian.

Jordan_Howell · January 20, 2020, 2:36am

Thank you @jtimonen. Do you have any resources or links that demonstrate how to do that?

Most of my input variables are continous although some are categorical binary columns.

jtimonen · January 20, 2020, 10:00pm

I don’t have any good resources in mind. But the multivariate Gaussian thing is not applicable for those binary covariates, since they cannot be generated from a normal distribution. If it is realistic to assume that your binary and continuous variables are independent, then you can do the simulation separately for binary and continuous variables. And if the binary variables themselves are independent of each other, you can simulate each one of them separately by drawing from Bernoulli distribution, where the success parameter is set to proportion of ones for that covariate.

Jordan_Howell · January 20, 2020, 10:53pm

Thank you. Last question. If I have 49 binary variables that aren’t dependent (a dummy variable for all but one state in the United States) and only one out of those can be a “1” is there a way to set that up in python?

Jordan

jtimonen · January 21, 2020, 3:46am

This will give you an N x 49 matrix X where each row contains only one 1 and rest are zeros.

import numpy as np
N = 10 # number of data points to simulate
S = 49 # number of states
p = np.repeat(1/S, S) # probabilities of each state
X = np.random.multinomial(n=1, pvals=p, size=N)

If you want to simulate your data so that some states appear more frequently, then you can set the pvals argument to something other than uniform distribution.

Bob_Carpenter · January 23, 2020, 9:49pm

You can always just use your real predictors x if they’re not modeled.

If you really do want to generate predictors, then try to generate them with similar structure to the real predictors (all positive? some 0/1 ones? heavy correlation? etc.)

Topic		Replies	Views
Simulate design matrix using Stan General	1	991	May 9, 2017
How to model mixed variables Gaussian, Ranking binary and categorical in stan all together to generate a joint distribution of them? Modeling	2	453	October 14, 2023
Generating quantities from Gaussian processes Developers gaussian-process	2	280	March 28, 2024
Generating values from multivariate normal distribution Modeling cognitive-science	4	916	December 21, 2018
How to simulate the confidence which was measured by participants' self-reports? General	6	63	April 17, 2025

How to simulate data?

Related topics