Mix of continuous and categorical predictors


How do I model a mix of continuous and categorical predictors in Stan? There are 4 continuous and 8 categorical predictors. In https://stackoverflow.com/questions/29183577/how-to-represent-a-categorical-predictor-rstan dummy variables are suggested, and Ben recommends using a design matrix of predictors. I wonder if a categorical predictor should be a simplex instead. What is the best practice?

Another question is about the choice of distribution. If the outcome is 0 or 1 but the experiment is replicated 10 times at each design point, should a Bernoulli or a binomial distribution be used?


If you want to work with your predictors in a regression framework, I suggest, as Ben did, working with design matrices and using, for instance, dummy coding for the categorical predictors (although the type of coding depends on what you are interested in).
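To make the design-matrix idea concrete, here is a minimal sketch of dummy coding one hypothetical categorical predictor ("color") next to a continuous one ("dose") — both names and values are made up for illustration. In R you would typically let model.matrix() do this and pass the resulting matrix to Stan.

```python
def dummy_code(values, levels):
    """One column per level except the first, which serves as the reference level."""
    return [[1.0 if v == lev else 0.0 for lev in levels[1:]] for v in values]

color = ["red", "blue", "green", "blue"]   # hypothetical categorical predictor
dose = [0.1, 0.5, 0.9, 1.3]                # hypothetical continuous predictor

levels = ["red", "blue", "green"]          # "red" is the reference level
dummies = dummy_code(color, levels)

# Full design matrix: intercept, continuous column, then dummy columns.
X = [[1.0, d] + row for d, row in zip(dose, dummies)]
print(X[0])  # [1.0, 0.1, 0.0, 0.0] -> a "red" row: both dummies are 0
```

With this coding, the coefficient on each dummy column is the shift relative to the reference level, which is usually easier to interpret than treating the category labels as numbers.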

If you don’t expect anything to change across these 10 trials, you can use a binomial model, which is faster than a Bernoulli model over the individual trials but otherwise equivalent in this case.
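The equivalence is easy to check numerically: with made-up trial outcomes and a fixed success probability, the binomial log-likelihood and the sum of the 10 Bernoulli log-likelihoods differ only by the log binomial coefficient, a constant that does not involve the parameter, so inference about p is identical.

```python
import math

p = 0.3
trials = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]  # made-up replicate outcomes
k, n = sum(trials), len(trials)

# Sum of per-trial Bernoulli log-likelihoods.
bernoulli_ll = sum(math.log(p) if y else math.log(1 - p) for y in trials)

# Single binomial(n, p) log-likelihood for k successes.
binomial_ll = (math.log(math.comb(n, k))
               + k * math.log(p) + (n - k) * math.log(1 - p))

# The gap is exactly log C(n, k), independent of p.
gap = binomial_ll - bernoulli_ll
print(abs(gap - math.log(math.comb(n, k))) < 1e-9)
```

This is also why the binomial version is faster in Stan: it evaluates one likelihood term per design point instead of ten.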


Many thanks. Is it OK if, instead of dummies, I assign the predictor values “red” and “blue” to 1 & 2?

The response is binary (0 or 1). Any suggestions on how to speed up the attached Stan code?
I am still not sure that logistic regression is the way to go. Are there alternative approaches for handling binary responses?

The design matrix x has N rows and Nx columns. The response is a matrix of 0s and 1s with Ny columns.

Thanks for any suggestions.


ablation.stan (1.68 KB)

I don’t see any good reason why you should prefer 1 & 2 over 0 & 1 for coding your categorical predictors.

If you are an R user and want to do logistic regression with Stan, I recommend the brms or rstanarm packages. For instance:

brms::brm(response ~ predictor1 + predictor2 + …, data = your_data, family = bernoulli())


rstanarm::stan_glm(response ~ predictor1 + predictor2 + …, data = your_data, family = binomial())


Many thanks. Can I calculate the posterior predictive distribution in rstanarm? I am familiar with rstan, but maybe it is time to explore…

What I meant by “alternative”: is logistic regression the only way to fit binary responses?


See for instance posterior_predict and pp_check. You can also change the link function of the binomial / Bernoulli family. Logistic regression is not the only model, but I would go with it unless you have strong reasons to use something else (something you know is better in your particular situation).

Thanks a lot.

The reason I want to try different models is that I am a bit unsatisfied with the predictive power of logistic regression. I am new to such problems (I worked in the continuous domain before), so I value any insight. In my problem I have a binary response vector of length Ny for each design point. What I really want is for the model to predict the number of 0s in the response vector. A good model in my case is one that matches the observed number of 0s well. For some reason I thought logistic regression would help, but the posterior predictive distribution of the number of 0s didn’t match the observed number of 0s.
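For reference, a check like this can be simulated directly. The sketch below uses a single made-up fitted probability in place of real posterior draws (p_hat, Ny, and observed_zeros are all hypothetical): for each replication, draw Ny binary outcomes, count the zeros, and see where the observed count falls in the replicated distribution — essentially what pp_check does for a chosen test statistic.

```python
import random

random.seed(1)
Ny = 10              # replicates per design point
p_hat = 0.7          # hypothetical fitted P(y = 1) at one design point
observed_zeros = 4   # hypothetical observed count of 0s

# Replicated zero-counts under the fitted model.
reps = [sum(random.random() >= p_hat for _ in range(Ny)) for _ in range(4000)]

# Tail probability of seeing at least as many zeros as observed.
p_tail = sum(r >= observed_zeros for r in reps) / len(reps)
print(0.0 <= p_tail <= 1.0)
```

A tail probability near 0 or 1 would indicate that the model systematically under- or over-predicts the number of 0s at that design point.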

Any ideas why?


Maybe your expectations are unrealistic for this data, maybe your predictors don’t have enough predictive power, or there may be non-linear relationships that you haven’t modeled yet. You might also try something like random forests or similar typical machine learning models, but that’s not an area of my expertise.