I am not an expert in statistics but I have used some models in STAN before. Currenlty, I am trying to run a multivariate outcome logistic regression model (multivariate outcome logit model).

The below link contains information on how I can perfom this for probit model and linear regression. I was wondering if eny code is available for the logit version?

Personally, I cannot really answer from reading the documentation alone since Iâ€™m not familiar with the â€śtrick introduced by Albert and Chib (1993)â€ť. Iâ€™m sure there will be others here who are, but as an alternative you could unpack how that is equivalent to a multivariate probit, and it may become easier to formulate a logit equivalent.

You mean a logit regression with categorical outcomes? This code should work.

data {
int<lower=1> N; // n of observations (number of individuals times number of choice scenarios per individual)
int<lower=1> I; // n of individuals
int<lower=2> J; // n of alternatives
int<lower=0> K; // n of covariates/variables
array[N] int<lower=1, upper=J> choices; // choice or outcome index for each observation
array[N] matrix[J, K] X; // covariates for each choice situation
}
parameters {
vector[J - 1] alpha_raw; // J-1 free intercepts, one intercept is always = 0
vector[K] beta; // choice covariates
}
transformed parameters {
vector[J] alpha = append_row(0, alpha_raw); // alpha[1] = 0, alpha[2]...alpha[J] are free
}
model {
alpha_raw ~ normal(0, 2);
beta ~ normal(0, 2);
for (i in 1:N) {
choices[i] ~ categorical_logit(alpha + X[i] * beta);
}
}

I mean the multivariate probit model. The documentation description is quite detailed, but for the purpose of getting useful responses here it may help to summarized the extensive description into the core of what makes it a multivariate probit.
The difference between the regular logistic and probit regression is trivial (using one or the other function), so maybe by distilling the essence of this â€śtrickâ€ť it can be easier to formulate its multivariate version.

Where all y are binary and correlated. You can think of yi as personâ€™s response to a number of life satisfaction questions. Although all questions are separate, they are correlated to some degree. Each model is a logistic regression (regression with binary outcome). I want to model them simultaneously.

The problem with coding this model with logits and not probits is that Stan doesnâ€™t have a multivariate logistic distribution to replace the multivariate normal. I think it should be possible to take a multivariate normal, rescale it so that its margins are logistic rather than normal and use that with the code from the users guide to do a multivariate logit regression, but thatâ€™s just an intuition, not a knowledgable claim. FWIW, Stan does have a built-in multivariate student t, so I think it should be readily achievable to implement multivariate robit regression.

@caesoma the model in the doc is equivalent to a multivariate probit because the inverse probit is the standard normal cdf. So to sample y from a bernoulli distribuiton with probability parameter equal to the inverse probit of x is the same thing as sampling from a standard normal distribution and asking whether the response is less than x. Given this scheme for sampling a bernoulli variate from the inverse probit of x, we can generalize to sampling multiple bernoulli variates from the inverse probit of x, where correlation between the multiple variates can be induced by sampling from a multivariate normal (whose margins are still all standard normals).

Thanks, @jsocolar. Would you know if there is any package that does this task in R without requring me to write the code of the model myself? (not necessarily in the Bayesian form)

@tomas.rossetti , can you provide any guidance for how one might estimate such a model with these packages? Since blavaan can use Stan on the backend, it seems likely that if it can estimate this model we could generate the Stan code.

@MaryBo I donâ€™t know of any particular packages for this, but thatâ€™s not to say they donâ€™t exist.

Iâ€™m past the limits of my expertise here and so this is speculative, but based on this post Multivariate logistic distribution - Cross Validated it seems like it might not be too hard to implement a multivariate logistic distribution in Stan code, in which case the example from the users guide might be readily adaptable.

Edit: A question that I should have asked at the very beginning: @MaryBo, are you able to write down R code that would simulate data under your desired model? If so, we should take a look at that code and figure out what itâ€™s doing, and weâ€™ll almost surely be able to implement the model in Stan. In that case, figuring out whether or not that model is equivalent to the model Iâ€™m suggesting based on a multivariate logistic will be something that we should verify rather than assume. Alternatively, if you are on the hunt for a multivariate model whose output is interpretable as log-odds and that otherwise is similar to the multivariate probit outline in the userâ€™s guide, but you donâ€™t have a clear pre-conceived notion of exactly what the data-generating process would look like, then I feel somewhat confident in suggesting that the proposed model, based on a multivariate logistic distribution, is probably what you seek.

I just remembered that blavaan only implements probits for ordinal/binary outcomes, but lavaan does have the logit link. I guess itâ€™s a starting point.

If I understand the model correctly, x1 + x2 + ... is the structural equation of latent variable that represents life satisfaction, according to what @MaryBo mentioned earlier. y1 through y4 would be indicators for that latent variable. In other words, this is a confirmatory factor analysis.

The code for (b)lavaan would look like this:

# Measurement equation
life_satisfaction =~ y1 + y2 + y3 + y4
# Structural equation
life_satisfaction ~ x1 + x2 + x3
# Explicit residuals (redundant, but just for the sake of clarity)
y1 ~~ y2 + y3 + y4
y2 ~~ y3 + y4
y3 ~~ y4

Implementing the logit version of this in Stan shouldnâ€™t be difficult, although I would expect it to be quite slow.

Cool! This is a neat and potentially relevant approach. FWIW, Iâ€™m pretty sure that it is not the same model presented in the users guide. The SEM approach looks like itâ€™s inducing correlation in the y by modifying the probabilities underlying each y in the same direction when sampling the latent variable, whereas the seemingly-unrelated regression approach induces correlation by retaining the marginal probabilities associated with each y but modifying the correlation structure. If the modification of the probabilities in the SEM approach is chosen to be â€śjust soâ€ť, then the two-step approach of modify-then-sample might turn out to be equivalent to â€śsample with a given correlation structureâ€ť, but this would need to be verified, first to ensure that the distribution chosen for the degree of modification preserves the mean of the inverse-logit, and second to ensure that it induces the desired correlation structure. For example, taking the latent variable to be Gaussian would not preserve the mean of the inverse logit.

A second key difference is that seemingly unrelated regression does not require that the linear predictors for the different ys have the same estimated coefficients or even the same covariates. Again, Iâ€™m not too familiar with SEM and I could be missing something about the flexibility of the approach.

Thanks, @tomas.rossetti. I know that SEM models can be used to estimate a system of models (similar to my example). But, I would likt to use my model for prediction. SEM is generally not well-defined for prediction. So, I donâ€™t think it can be used in this case.

@Fatih_Bozdag , it looks like a link is missing from your post.

Note that categorical regression is easy in Stan as well, including via brms, which provides the families categorical and multinomial. What @MaryBo is asking about is not multinomial/categorical regression, but is the â€śseemingly unrelated regressionâ€ť model from the Userâ€™s Guide, where multiple binary outcomes are modeled as correlated.

Thanks a lot for sharing this. Yes, it seems multinomial regression models are readily available in a few packages and the code is also less complicated to write. But, as @jsocolar kindly mentioned, this is not a solution for this question. The thing is that, I can combine all the four binary outcomes and create 2^4 categories and then model them using multinomial models or similar approaches. But, such methods clearly do not capture the dependence between the four binary outcomes and just treat them as categories.

Having studied the possible approaches for a few days, I think I have two options now:

Use the multiple outcome probit model detailed in the user guide. It should be possible to modify the distribution to account for logit (which I am not sure how to do for now).

Use copula to model multiple outcomes by creating a joint cumulative distribution.

in this case, p(x,y) function as the response term tells the model that the outcome is the proportion resulting from dividing x over y, like x = working hours, y=sleeping hours etcâ€¦

However, since your outcomes are binary already, not sure if it will work. Iâ€™ll update the response if I find the actual example in pyMC archives, which are not very tidy, though.