Multivariate priors for dichotomous variables

I am not really into Bayesian statistics, but I am facing a problem that I think you are probably familiar with.

I have a set of dichotomous variables A, B, C, … and I have assumptions about their probabilities P(A), P(B), … and also about the probabilities of each pair P(A&B), P(A&C), P(B&C), …
Or in other words, I have their variance-covariance matrix.
But I have no further assumptions about P(A&B&C) for example.

From this, I would like to construct the joint distribution of my variables with the maximum entropy.

How would you do this? Or how would you construct a reasonable joint distribution with this information? Is there an easy way to do this in Stan?

I hope this question is not too obvious, but I did not find the answer in BDA 3, so if you can recommend some literature, I would be happy about that as well.

Hi @Maximilian_Ernst! Welcome to the Stan forums!

So, just to make sure I get this right: you have data A, B, C, which are binary and correlated. I think the usual lingo is that when you assign a distribution to parameters you speak of priors, and when you assign a distribution to data you call it the likelihood. It’s not hugely important, since the two are entangled anyway, and the distinction between parameters and data can get blurry, too.

If you have something like a variance-covariance matrix, you can fit a multivariate Probit regression (scroll down a bit). Specifying priors can be a bit tricky, but usually in this model you treat the correlation matrix (variances are set to 1 to identify the model) as a parameter and estimate it.

Is that what you are looking for? If not feel free to follow up with questions, or maybe some details on the problem or what you are trying to achieve.

Cheers! :)
Max

Hi @Max_Mantei !
Thanks! Yes, I would like to assign a distribution to data.

I will take a look at the link later (since it takes me a while to understand it), but I maybe can already try to give some details.

My problem is as follows: I have a bunch of diseases A, B, C, … and from the scientific literature, I know their probabilities (e.g., 12% of people get disease A in their lifetime) and their pairwise relations (e.g., if you have disease A, it is more likely that you also have disease B).
I have no real dataset at all! Only those published probabilities.

What I would like to estimate is how likely it is to have disease A and B and C and not D and not F … all together. Unfortunately, I cannot compute these probabilities directly from the information I have, but that information does restrict their possible values.

Therefore, I would like to estimate a joint distribution of all my diseases that is in accordance with the information I have. As I said, there are many distributions that satisfy those conditions, and with linear programming, I would be able to obtain the range of possible values.
However, I think the distribution that satisfies those conditions and has the maximum entropy would be the best one for further use in my model.
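To sketch what I mean (all numbers below are hypothetical): the maximum-entropy joint over the 2^3 outcome cells of three diseases has a log-linear form with main effects and pairwise interactions, and its coefficients can be found by minimising the convex dual \log Z(\theta) - \theta^\top m, where m collects the target moments P(A), …, P(B \& C).

# sketch with hypothetical numbers: maximum-entropy joint for 3 diseases
# the max-ent distribution is p(x) proportional to exp(theta' f(x)), where f(x)
# collects the indicators x_A, x_B, x_C and their pairwise products;
# theta minimises the convex dual log Z(theta) - theta' m

X <- as.matrix(expand.grid(A = 0:1, B = 0:1, C = 0:1))      # the 8 cells
feat <- cbind(X, X[, 1] * X[, 2], X[, 1] * X[, 3], X[, 2] * X[, 3])

# target moments: P(A), P(B), P(C), P(A&B), P(A&C), P(B&C) (hypothetical)
m <- c(0.12, 0.20, 0.30, 0.05, 0.02, 0.08)

dual <- function(theta) log(sum(exp(feat %*% theta))) - sum(theta * m)
theta_hat <- optim(rep(0, 6), dual, method = "BFGS")$par

p_cell <- drop(exp(feat %*% theta_hat))
p_cell <- p_cell / sum(p_cell)          # maximum-entropy joint over the 8 cells

cbind(X, p = round(p_cell, 4))          # the row A = B = C = 1 gives p(A,B,C)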

Ah ok!

What does the variance-covariance matrix that you have look like?

Or do you “only” have conditional probabilities, such as Pr(B | A) = 0.2?

Yes, I “only” have conditional probabilities, but since P(A) determines the variance of A, and P(A|B) and P(B) determine the pairwise joint distribution, I think I also have their variance-covariance matrix.

I need the model to work for arbitrary diseases, but if it would help I could post a realistic example.
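To make that concrete, here is a tiny sketch with hypothetical numbers (the values are made up): Var(A) = P(A)[1 - P(A)] for a dichotomous variable, and Cov(A, B) = P(A|B)P(B) - P(A)P(B).

# tiny sketch with hypothetical numbers
p_A <- 0.12                 # P(A), e.g. lifetime prevalence of disease A
p_B <- 0.20                 # P(B)
p_AcondB <- 0.30            # P(A | B)

var_A   <- p_A * (1 - p_A)          # variance of the indicator of A
p_AandB <- p_AcondB * p_B           # P(A & B) = P(A | B) * P(B)
cov_AB  <- p_AandB - p_A * p_B      # covariance of the indicators of A and B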

I see. I was just thinking about doing something like this (edit: I hope you know some R, sorry for assuming that):

# a little helper: share of values above zero
freq_pos <- function(x) {
  sum(x > 0) / length(x)
}

# assume some marginal probs
p_A <- 0.3
p_B <- 0.6

p <- c(p_A, p_B)

# convert to z-score
mu <- qnorm(p)

# number of simulations
n_sims <- 1e6

# identity correlation matrix (no correlation), implying Pr(B) = Pr(B|A)
S_uncorr <- diag(c(1, 1))

# draw from multivariate Normal
Y_uncorr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S_uncorr)

# compute marginal probs
apply(Y_uncorr, 2, freq_pos)

# identifier for the case A = 1
A_true <- Y_uncorr[, 1] > 0

# computing the conditional frequencies given A; the second entry is Pr(B|A)
apply(Y_uncorr[A_true,], 2, freq_pos)

# specify positive correlation, implying Pr(B) < Pr(B|A)
S_corr <- matrix(c(1, 0.5,
                   0.5, 1), nrow = 2)

Y_corr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S_corr)

apply(Y_corr, 2, freq_pos)

A_true <- Y_corr[, 1] > 0

apply(Y_corr[A_true,], 2, freq_pos)

I’m just not sure how to cleverly map from specific conditional probabilities to correlations… But if this is figured out, then generating data is as easy as drawing from a (multivariate) Normal distribution.

That being said… there is probably a more straightforward way to do all this…
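One option might be to invert the relationship numerically (a sketch, assuming the mvtnorm package and a hypothetical target joint probability): for a given latent correlation \rho, the implied P(A \& B) is a bivariate normal orthant probability, so you can solve for the \rho (i.e., the tetrachoric correlation) that reproduces the target with uniroot.

library(mvtnorm)

p_A <- 0.3
p_B <- 0.6
p_AandB <- 0.25                      # hypothetical target joint probability

mu <- qnorm(c(p_A, p_B))

# P(A & B) implied by the latent bivariate Normal with correlation rho
joint_prob <- function(rho) {
  S <- matrix(c(1, rho, rho, 1), nrow = 2)
  pmvnorm(lower = c(0, 0), upper = c(Inf, Inf), mean = mu, sigma = S)
}

# solve joint_prob(rho) = p_AandB for rho
rho_latent <- uniroot(function(rho) joint_prob(rho) - p_AandB,
                      interval = c(-0.99, 0.99))$root
rho_latent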

Let p(A) and p(B) denote the marginal probabilities of events A and B and let p(A, B) denote the joint probability of these events.
Then the correlation coefficient is
\rho_{AB} = \frac{p(A, B) - p(A)p(B)}{\sqrt{p(A) \left[1-p(A) \right] p(B) \left[1-p(B) \right]}}

There is probably a more straightforward way, but I think it would also be much more complicated… so maybe your approach is a good idea.

However, I tried it for 3 variables, but it does not give the right conditional probabilities.

Does anybody have an idea why?

# 3 Variables -------------------------------------------------------------

# correlation of two binary variables from their marginal and joint probabilities
cor_bin <- function(p1, p2, p12){
  (p12-p1*p2)/sqrt(p1*(1-p1)*p2*(1-p2))
}

# variance of a Bernoulli(p) variable (not used below)
var_bin <- function(p){
  p*(1-p)
}

p_A <- 0.2
p_B <- 0.2
p_C <- 0.3

mu <- qnorm(c(p_A, p_B, p_C))

p_AcondB <- 0.3 # positive correlation
p_AcondC <- 0.1 # negative correlation
p_BcondC <- 0.2 # no correlation

p_AandB <- p_B*p_AcondB
p_AandC <- p_C*p_AcondC
p_BandC <- p_C*p_BcondC

cor_AB <- cor_bin(p_A, p_B, p_AandB)
cor_AC <- cor_bin(p_A, p_C, p_AandC)
cor_BC <- cor_bin(p_B, p_C, p_BandC)

S <- matrix(c(1, cor_AB, cor_AC,
              cor_AB, 1, cor_BC,
              cor_AC, cor_BC, 1), nrow = 3)

n_sims <- 1e6

Y_corr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S)

apply(Y_corr, 2, freq_pos)

A_true <- Y_corr[, 1] > 0
B_true <- Y_corr[, 2] > 0
C_true <- Y_corr[, 3] > 0


apply(Y_corr[C_true,], 2, freq_pos) # P(A|C) should be 0.1, but is estimated as 0.15

Hey! Well, the correlation of the two binary variables A, B that Max pointed out is something different from the correlation of the latent variables in the multivariate Probit approach.

Right now I don’t really have time to think that much about it, but if there’s a way to express the correlation induced by the marginal and conditional probabilities in terms of Kendall’s \tau, then I can point you to a paper where they discuss a conversion for the latent variable approach. (I’m on my phone right now, so I will probably link that paper tomorrow.)
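To illustrate the difference with a quick simulation (a sketch; latent correlation 0.5, both thresholds at zero):

# sketch: the thresholded binary variables correlate less strongly than
# the latent Normal variables they are generated from
set.seed(1)
rho_latent <- 0.5
Y <- MASS::mvrnorm(1e6, mu = c(0, 0),
                   Sigma = matrix(c(1, rho_latent, rho_latent, 1), nrow = 2))
B <- (Y > 0) * 1            # threshold at zero
cor(B[, 1], B[, 2])         # roughly 0.33, clearly not 0.5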

I think I found the correlation we were looking for:

At first glance, this seems to give a useful distribution. I will now start to test it, but I have a good feeling about it :)

@Max_Mantei thank you very, very much for the idea!

# polychoric correlation solution -----------------------------------------

library(tidyverse)
library(polycor)

p_A <- 0.2
p_B <- 0.2
p_C <- 0.3

mu <- qnorm(c(p_A, p_B, p_C))

p_AcondB <- 0.3 # positive correlation
p_AcondC <- 0.1 # negative correlation
p_BcondC <- 0.2 # no correlation

p_AandB <- p_B*p_AcondB
p_AandC <- p_C*p_AcondC
p_BandC <- p_C*p_BcondC

## obtain latent (tetrachoric) correlations from 2x2 tables of joint probabilities

corl_AB <- polychor(matrix(c(
  1 - p_A - p_B + p_AandB, p_B - p_AandB,
  p_A - p_AandB, p_AandB
), nrow = 2))

corl_AC <- polychor(matrix(c(
  1 - p_A - p_C + p_AandC, p_C - p_AandC,
  p_A - p_AandC, p_AandC
), nrow = 2))

corl_BC <- polychor(matrix(c(
  1 - p_B - p_C + p_BandC, p_C - p_BandC,
  p_B - p_BandC, p_BandC
), nrow = 2))


S_lat <- matrix(c(1, corl_AB, corl_AC,
                  corl_AB, 1, corl_BC,
                  corl_AC, corl_BC, 1), nrow = 3)

n_sims <- 1e6

Y_corr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S_lat)

daty <- Y_corr %>% as.data.frame()

names(daty) <- c("A", "B", "C")

# threshold the latent draws at zero to get binary indicators
daty <- daty %>% mutate_all(~ifelse(. > 0, 1, 0))

get_p <- function(df){
  probs <- 
    data.frame(name = c("p(A)", 
                        "p(B)",
                        "p(C)",
                        "p(A,B)",
                        "p(A,C)",
                        "p(B,C)",
                        "p(A,B,C)"),
               value = c(
                 sum(df$A)/nrow(df),
                 sum(df$B)/nrow(df),
                 sum(df$C)/nrow(df),
                 sum(df$A&df$B)/nrow(df),
                 sum(df$A&df$C)/nrow(df),
                 sum(df$C&df$B)/nrow(df),
                 sum(df$A&df$B&df$C)/nrow(df)
               )
    )
  return(probs)
}

get_p(daty)
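And a quick check of the simulated probabilities against the target values (a small addition; p(A,B,C) has no target, it is implied by the latent-Normal construction):

# compare simulated probabilities to the targets (NA for p(A,B,C),
# which is not pinned down by the marginal and pairwise inputs)
targets <- c(p_A, p_B, p_C, p_AandB, p_AandC, p_BandC, NA)
cbind(get_p(daty), target = targets)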
