How to simulate data?

As I learn more and more about STAN and the bayesian workflow, I am getting stuck on the step of simulating the data. It’s easy enough to simulate my target variable as it follows a poisson or gamma distribution for a GLM. But how about the input variables? Once you get to up to 10 or 15 variables, how can one accurately simulate the different combinations that a matrix of those variables can take? Is ther e a best practice out there to accomplish this?

It depends on what your input variables are. In the most general case you would need to learn the joint distribution of the input variables without parametric assumptions, a difficult task on its own. If you are using GLMs, I assume that your input variables are continuous. Then if you assume that the input variables are jointly Gaussian, you could compute a sample mean vector \mu and sample covariance matrix \Sigma, and then simulate your input data by sampling from the corresponding multivariate Gaussian.

Thank you @jtimonen. Do you have any resources or links that demonstrate how to do that?

Most of my input variables are continous although some are categorical binary columns.

I don’t have any good resources in mind. But the multivariate Gaussian thing is not applicable for those binary covariates, since they cannot be generated from a normal distribution. If it is realistic to assume that your binary and continuous variables are independent, then you can do the simulation separately for binary and continuous variables. And if the binary variables themselves are independent of each other, you can simulate each one of them separately by drawing from Bernoulli distribution, where the success parameter is set to proportion of ones for that covariate.

Thank you. Last question. If I have 49 binary variables that aren’t dependent (a dummy variable for all but one state in the United States) and only one out of those can be a “1” is there a way to set that up in python?

Jordan

This will give you an N x 49 matrix X where each row contains only one 1 and rest are zeros.

import numpy as np
N = 10 # number of data points to simulate
S = 49 # number of states
p = np.repeat(1/S, S) # probabilities of each state
X = np.random.multinomial(n=1, pvals=p, size=N)

If you want to simulate your data so that some states appear more frequently, then you can set the pvals argument to something other than uniform distribution.

You can always just use your real predictors x if they’re not modeled.

If you really do want to generate predictors, then try to generate them with similar structure to the real predictors (all positive? some 0/1 ones? heavy correlation? etc.)