Multivariate outcome logit

Hello all,

I am not an expert in statistics but I have used some models in STAN before. Currenlty, I am trying to run a multivariate outcome logistic regression model (multivariate outcome logit model).

The below link contains information on how I can perfom this for probit model and linear regression. I was wondering if eny code is available for the logit version?

Thanks

1 Like

Personally, I cannot really answer from reading the documentation alone since I’m not familiar with the “trick introduced by Albert and Chib (1993)”. I’m sure there will be others here who are, but as an alternative you could unpack how that is equivalent to a multivariate probit, and it may become easier to formulate a logit equivalent.

1 Like

You mean a logit regression with categorical outcomes? This code should work.

data {
  int<lower=1> N; // n of observations (number of individuals times number of choice scenarios per individual) 
  int<lower=1> I; // n of individuals 
  int<lower=2> J; // n of alternatives 
  int<lower=0> K; // n of covariates/variables
  array[N] int<lower=1, upper=J> choices; // choice or outcome index for each observation 
  array[N] matrix[J, K] X; // covariates for each choice situation   
}
parameters {
  vector[J - 1] alpha_raw; // J-1 free intercepts, one intercept is always = 0 
  vector[K] beta; // choice covariates   
}
transformed parameters {
  vector[J] alpha = append_row(0, alpha_raw); // alpha[1] = 0, alpha[2]...alpha[J] are free
}
model {  
  alpha_raw ~ normal(0, 2); 
  beta ~ normal(0, 2); 

  for (i in 1:N) {    
    choices[i] ~ categorical_logit(alpha + X[i] * beta);
  }
}

I also recommend checking Jim Savage’s blog.

1 Like

I mean the multivariate probit model. The documentation description is quite detailed, but for the purpose of getting useful responses here it may help to summarized the extensive description into the core of what makes it a multivariate probit.
The difference between the regular logistic and probit regression is trivial (using one or the other function), so maybe by distilling the essence of this “trick” it can be easier to formulate its multivariate version.

1 Like

Thanks for your response. No, I do not mean categorical outcome. Consider the below system of models:

y1 ~ x1 + x2 + …
y2 ~ x1 + x2 + …
y3 ~ x1 + x2 + …
y4 ~ x1 + x2 + …

Where all y are binary and correlated. You can think of yi as person’s response to a number of life satisfaction questions. Although all questions are separate, they are correlated to some degree. Each model is a logistic regression (regression with binary outcome). I want to model them simultaneously.

The problem with coding this model with logits and not probits is that Stan doesn’t have a multivariate logistic distribution to replace the multivariate normal. I think it should be possible to take a multivariate normal, rescale it so that its margins are logistic rather than normal and use that with the code from the users guide to do a multivariate logit regression, but that’s just an intuition, not a knowledgable claim. FWIW, Stan does have a built-in multivariate student t, so I think it should be readily achievable to implement multivariate robit regression.

2 Likes

@caesoma the model in the doc is equivalent to a multivariate probit because the inverse probit is the standard normal cdf. So to sample y from a bernoulli distribuiton with probability parameter equal to the inverse probit of x is the same thing as sampling from a standard normal distribution and asking whether the response is less than x. Given this scheme for sampling a bernoulli variate from the inverse probit of x, we can generalize to sampling multiple bernoulli variates from the inverse probit of x, where correlation between the multiple variates can be induced by sampling from a multivariate normal (whose margins are still all standard normals).

1 Like

Thanks, @jsocolar. Would you know if there is any package that does this task in R without requring me to write the code of the model myself? (not necessarily in the Bayesian form)

I believe you could use blavaan or lavaan to estimate this model.

@tomas.rossetti , can you provide any guidance for how one might estimate such a model with these packages? Since blavaan can use Stan on the backend, it seems likely that if it can estimate this model we could generate the Stan code.

@MaryBo I don’t know of any particular packages for this, but that’s not to say they don’t exist.

I’m past the limits of my expertise here and so this is speculative, but based on this post Multivariate logistic distribution - Cross Validated it seems like it might not be too hard to implement a multivariate logistic distribution in Stan code, in which case the example from the users guide might be readily adaptable.

Edit: A question that I should have asked at the very beginning: @MaryBo, are you able to write down R code that would simulate data under your desired model? If so, we should take a look at that code and figure out what it’s doing, and we’ll almost surely be able to implement the model in Stan. In that case, figuring out whether or not that model is equivalent to the model I’m suggesting based on a multivariate logistic will be something that we should verify rather than assume. Alternatively, if you are on the hunt for a multivariate model whose output is interpretable as log-odds and that otherwise is similar to the multivariate probit outline in the user’s guide, but you don’t have a clear pre-conceived notion of exactly what the data-generating process would look like, then I feel somewhat confident in suggesting that the proposed model, based on a multivariate logistic distribution, is probably what you seek.

1 Like

I just remembered that blavaan only implements probits for ordinal/binary outcomes, but lavaan does have the logit link. I guess it’s a starting point.

If I understand the model correctly, x1 + x2 + ... is the structural equation of latent variable that represents life satisfaction, according to what @MaryBo mentioned earlier. y1 through y4 would be indicators for that latent variable. In other words, this is a confirmatory factor analysis.

The code for (b)lavaan would look like this:

# Measurement equation
life_satisfaction =~ y1 + y2 + y3 + y4

# Structural equation
life_satisfaction ~ x1 + x2 + x3                                                                                                                                                                                                               

# Explicit residuals (redundant, but just for the sake of clarity)
y1 ~~ y2 + y3 + y4
y2 ~~ y3 + y4
y3 ~~ y4

Implementing the logit version of this in Stan shouldn’t be difficult, although I would expect it to be quite slow.

1 Like

Cool! This is a neat and potentially relevant approach. FWIW, I’m pretty sure that it is not the same model presented in the users guide. The SEM approach looks like it’s inducing correlation in the y by modifying the probabilities underlying each y in the same direction when sampling the latent variable, whereas the seemingly-unrelated regression approach induces correlation by retaining the marginal probabilities associated with each y but modifying the correlation structure. If the modification of the probabilities in the SEM approach is chosen to be “just so”, then the two-step approach of modify-then-sample might turn out to be equivalent to “sample with a given correlation structure”, but this would need to be verified, first to ensure that the distribution chosen for the degree of modification preserves the mean of the inverse-logit, and second to ensure that it induces the desired correlation structure. For example, taking the latent variable to be Gaussian would not preserve the mean of the inverse logit.

A second key difference is that seemingly unrelated regression does not require that the linear predictors for the different ys have the same estimated coefficients or even the same covariates. Again, I’m not too familiar with SEM and I could be missing something about the flexibility of the approach.

Thanks, @tomas.rossetti. I know that SEM models can be used to estimate a system of models (similar to my example). But, I would likt to use my model for prediction. SEM is generally not well-defined for prediction. So, I don’t think it can be used in this case.

Not stan or R, but the categorical (multinominal) regression is easier with pymc and bambi. Here is the example

@Fatih_Bozdag , it looks like a link is missing from your post.

Note that categorical regression is easy in Stan as well, including via brms, which provides the families categorical and multinomial. What @MaryBo is asking about is not multinomial/categorical regression, but is the “seemingly unrelated regression” model from the User’s Guide, where multiple binary outcomes are modeled as correlated.

1 Like

Sorry for the missing link;

Here it is.

https://bambinos.github.io/bambi/notebooks/categorical_regression.html

Cool! This is the same model that brms supports with the categorical family :)

2 Likes

Thanks a lot for sharing this. Yes, it seems multinomial regression models are readily available in a few packages and the code is also less complicated to write. But, as @jsocolar kindly mentioned, this is not a solution for this question. The thing is that, I can combine all the four binary outcomes and create 2^4 categories and then model them using multinomial models or similar approaches. But, such methods clearly do not capture the dependence between the four binary outcomes and just treat them as categories.

2 Likes

Having studied the possible approaches for a few days, I think I have two options now:

  1. Use the multiple outcome probit model detailed in the user guide. It should be possible to modify the distribution to account for logit (which I am not sure how to do for now).

  2. Use copula to model multiple outcomes by creating a joint cumulative distribution.

2 Likes

Something like that is also possible with PyMC, but I cannot remember where I saw the actual examples.

For instance, you can write a model as below.

bmb.Model("p(x, y) ~ 0 + Var1", df, family="binomial")

in this case, p(x,y) function as the response term tells the model that the outcome is the proportion resulting from dividing x over y, like x = working hours, y=sleeping hours etc…

However, since your outcomes are binary already, not sure if it will work. I’ll update the response if I find the actual example in pyMC archives, which are not very tidy, though.