Multivariate priors for dichotomous variables

I am not really into Bayesian statistics, but I am facing a problem that I think you are probably familiar with.

I have a set of dichotomous variables A, B, C, … and I have assumptions about their probabilities P(A), P(B), … and also about the probabilities of each pair P(A&B), P(A&C), P(B&C), …
Or in other words, I have their variance-covariance matrix.
But I have no further assumptions about P(A&B&C) for example.

From this, I would like to construct the joint distribution of my variables with the maximum entropy.

How would you do this? Or how would you construct a reasonable joint distribution with this information? Is there an easy way to do this in Stan?

I hope this question is not too obvious, but I did not find the answer in BDA 3, so if you can recommend some literature, I would be happy about that as well.

Hi @Maximilian_Ernst! Welcome to the Stan forums!

So, just to make sure I get this right: you have data A, B, C, which are binary and correlated. I think the usual lingo is that when you assign a distribution to parameters you speak of priors, and when you assign a distribution to data you call it the likelihood. It’s not hugely important, since the two are entangled anyway, and the distinction between parameters and data can get blurry, too.

If you have something like a variance-covariance matrix, you can fit a multivariate Probit regression (scroll down a bit). Specifying priors can be a bit tricky, but usually in this model you treat the correlation matrix (variances are set to 1 to identify the model) as a parameter and estimate it.

Is that what you are looking for? If not feel free to follow up with questions, or maybe some details on the problem or what you are trying to achieve.

Cheers! :)
Max

Hi @Max_Mantei !
Thanks! Yes, I would like to assign a distribution to data.

I will take a look at the link later (since it takes me a while to understand it), but I maybe can already try to give some details.

My problem is as follows: I have a bunch of diseases A, B, C, … and from the scientific literature, I know their probabilities (e.g., 12% of people get disease A in their lifetime) and their pairwise relations (e.g., if you have disease A, it is more likely that you also have disease B).
I have no real dataset at all! Only those published probabilities.

What I would like to estimate is how likely it is to have disease A and B and C and not D and not F … all together. Unfortunately, I cannot compute these probabilities directly from the information I have, but that information does restrict their possible values.

Therefore, I would like to estimate a joint distribution of all my diseases that is in accordance with the information I have. As I said, there are many distributions that satisfy those conditions, and with linear programming, I would be able to obtain the range of possible values.
However, I think the distribution that satisfies those conditions and has the maximum entropy would be the best one for further use in my model.
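To sketch what I mean (all numbers below are hypothetical): the maximum-entropy joint over the 2^3 outcome cells of three diseases has a log-linear form with main effects and pairwise interactions, and its coefficients can be found by minimising the convex dual \log Z(\theta) - \theta^\top m, where m collects the target moments P(A), …, P(B \& C).

# sketch with hypothetical numbers: maximum-entropy joint for 3 diseases
# the max-ent distribution is p(x) proportional to exp(theta' f(x)), where f(x)
# collects the indicators x_A, x_B, x_C and their pairwise products;
# theta minimises the convex dual log Z(theta) - theta' m

X <- as.matrix(expand.grid(A = 0:1, B = 0:1, C = 0:1))      # the 8 cells
feat <- cbind(X, X[, 1] * X[, 2], X[, 1] * X[, 3], X[, 2] * X[, 3])

# target moments: P(A), P(B), P(C), P(A&B), P(A&C), P(B&C) (hypothetical)
m <- c(0.12, 0.20, 0.30, 0.05, 0.02, 0.08)

dual <- function(theta) log(sum(exp(feat %*% theta))) - sum(theta * m)
theta_hat <- optim(rep(0, 6), dual, method = "BFGS")$par

p_cell <- drop(exp(feat %*% theta_hat))
p_cell <- p_cell / sum(p_cell)          # maximum-entropy joint over the 8 cells

cbind(X, p = round(p_cell, 4))          # the row A = B = C = 1 gives p(A,B,C)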

Ah ok!

What does the variance-covariance matrix that you have look like?

Or do you “only” have conditional probabilities, such as Pr(B | A) = 0.2?

Yes, I “only” have conditional probabilities, but since P(A) determines the variance of A, and P(A|B) and P(B) determine the pairwise joint distribution, I think I also have their variance-covariance matrix.

I need the model to work for arbitrary diseases, but if it would help I could post a realistic example.
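To make that concrete, here is a tiny sketch with hypothetical numbers (the values are made up): Var(A) = P(A)[1 - P(A)] for a dichotomous variable, and Cov(A, B) = P(A|B)P(B) - P(A)P(B).

# tiny sketch with hypothetical numbers
p_A <- 0.12                 # P(A), e.g. lifetime prevalence of disease A
p_B <- 0.20                 # P(B)
p_AcondB <- 0.30            # P(A | B)

var_A   <- p_A * (1 - p_A)          # variance of the indicator of A
p_AandB <- p_AcondB * p_B           # P(A & B) = P(A | B) * P(B)
cov_AB  <- p_AandB - p_A * p_B      # covariance of the indicators of A and B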

I see. I was just thinking about doing something like this (edit: I hope you know some R, sorry for assuming that):

# a little helper: share of values above zero
freq_pos <- function(x) {
  sum(x > 0) / length(x)
}

# assume some marginal probs
p_A <- 0.3
p_B <- 0.6

p <- c(p_A, p_B)

# convert to z-score
mu <- qnorm(p)

# number of simulations
n_sims <- 1e6

# identity correlation matrix (no correlation), implying Pr(B) = Pr(B|A)
S_uncorr <- diag(c(1, 1))

# draw from multivariate Normal
Y_uncorr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S_uncorr)

# compute marginal probs
apply(Y_uncorr, 2, freq_pos)

# identifier for the case A = 1
A_true <- Y_uncorr[, 1] > 0

# computing the conditional frequencies given A; the second entry is Pr(B|A)
apply(Y_uncorr[A_true,], 2, freq_pos)

# specify positive correlation, implying Pr(B) < Pr(B|A)
S_corr <- matrix(c(1, 0.5,
                   0.5, 1), nrow = 2)

Y_corr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S_corr)

apply(Y_corr, 2, freq_pos)

A_true <- Y_corr[, 1] > 0

apply(Y_corr[A_true,], 2, freq_pos)

I’m just not sure how to cleverly map from specific conditional probabilities to correlations… But if this is figured out, then generating data is as easy as drawing from a (multivariate) Normal distribution.

That being said… there is probably a more straightforward way to do all this…
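One option might be to invert the relationship numerically (a sketch, assuming the mvtnorm package and a hypothetical target joint probability): for a given latent correlation \rho, the implied P(A \& B) is a bivariate normal orthant probability, so you can solve for the \rho (i.e., the tetrachoric correlation) that reproduces the target with uniroot.

library(mvtnorm)

p_A <- 0.3
p_B <- 0.6
p_AandB <- 0.25                      # hypothetical target joint probability

mu <- qnorm(c(p_A, p_B))

# P(A & B) implied by the latent bivariate Normal with correlation rho
joint_prob <- function(rho) {
  S <- matrix(c(1, rho, rho, 1), nrow = 2)
  pmvnorm(lower = c(0, 0), upper = c(Inf, Inf), mean = mu, sigma = S)
}

# solve joint_prob(rho) = p_AandB for rho
rho_latent <- uniroot(function(rho) joint_prob(rho) - p_AandB,
                      interval = c(-0.99, 0.99))$root
rho_latent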

Let p(A) and p(B) denote the marginal probabilities of events A and B and let p(A, B) denote the joint probability of these events.
Then the correlation coefficient is
\rho_{AB} = \frac{p(A, B) - p(A)p(B)}{\sqrt{p(A) \left[1-p(A) \right] p(B) \left[1-p(B) \right]}}

There is probably a more straightforward way, but I think it would also be much more complicated… so maybe your approach is a good idea.

However, I tried it for 3 variables, but it does not give the right conditional probabilities.

Does anybody have an idea why?

# 3 Variables -------------------------------------------------------------

# correlation of two binary variables from their marginal and joint probabilities
cor_bin <- function(p1, p2, p12){
  (p12-p1*p2)/sqrt(p1*(1-p1)*p2*(1-p2))
}

# variance of a Bernoulli(p) variable (not used below)
var_bin <- function(p){
  p*(1-p)
}

p_A <- 0.2
p_B <- 0.2
p_C <- 0.3

mu <- qnorm(c(p_A, p_B, p_C))

p_AcondB <- 0.3 # positive correlation
p_AcondC <- 0.1 # negative correlation
p_BcondC <- 0.2 # no correlation

p_AandB <- p_B*p_AcondB
p_AandC <- p_C*p_AcondC
p_BandC <- p_C*p_BcondC

cor_AB <- cor_bin(p_A, p_B, p_AandB)
cor_AC <- cor_bin(p_A, p_C, p_AandC)
cor_BC <- cor_bin(p_B, p_C, p_BandC)

S <- matrix(c(1, cor_AB, cor_AC,
              cor_AB, 1, cor_BC,
              cor_AC, cor_BC, 1), nrow = 3)

n_sims <- 1e6

Y_corr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S)

apply(Y_corr, 2, freq_pos)

A_true <- Y_corr[, 1] > 0
B_true <- Y_corr[, 2] > 0
C_true <- Y_corr[, 3] > 0


apply(Y_corr[C_true,], 2, freq_pos) # P(A|C) should be 0.1, but is estimated as 0.15

Hey! Well, the correlation of the two binary variables A, B that Max pointed out is something different from the correlation of the latent variables in the multivariate Probit approach.

Right now I don’t really have time to think that much about it, but if there’s a way to express the correlation induced by the marginal and conditional probabilities in terms of Kendall’s \tau, then I can point you to a paper where they discuss a conversion for the latent variable approach. (I’m on my phone right now, so I will probably link that paper tomorrow.)
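To illustrate the difference with a quick simulation (a sketch; latent correlation 0.5, both thresholds at zero):

# sketch: the thresholded binary variables correlate less strongly than
# the latent Normal variables they are generated from
set.seed(1)
rho_latent <- 0.5
Y <- MASS::mvrnorm(1e6, mu = c(0, 0),
                   Sigma = matrix(c(1, rho_latent, rho_latent, 1), nrow = 2))
B <- (Y > 0) * 1            # threshold at zero
cor(B[, 1], B[, 2])         # roughly 0.33, clearly not 0.5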

I think I found the correlation we were looking for:

At first glance, this seems to give a useful distribution. I will now start to test it, but I have a good feeling about it :)

@Max_Mantei thank you very, very much for the idea!

# polychoric correlation solution -----------------------------------------

library(tidyverse)
library(polycor)

p_A <- 0.2
p_B <- 0.2
p_C <- 0.3

mu <- qnorm(c(p_A, p_B, p_C))

p_AcondB <- 0.3 # positive correlation
p_AcondC <- 0.1 # negative correlation
p_BcondC <- 0.2 # no correlation

p_AandB <- p_B*p_AcondB
p_AandC <- p_C*p_AcondC
p_BandC <- p_C*p_BcondC

## obtain latent (tetrachoric) correlations from 2x2 tables of joint probabilities

corl_AB <- polychor(matrix(c(
  1 - p_A - p_B + p_AandB, p_B - p_AandB,
  p_A - p_AandB, p_AandB
), nrow = 2))

corl_AC <- polychor(matrix(c(
  1 - p_A - p_C + p_AandC, p_C - p_AandC,
  p_A - p_AandC, p_AandC
), nrow = 2))

corl_BC <- polychor(matrix(c(
  1 - p_B - p_C + p_BandC, p_C - p_BandC,
  p_B - p_BandC, p_BandC
), nrow = 2))


S_lat <- matrix(c(1, corl_AB, corl_AC,
                  corl_AB, 1, corl_BC,
                  corl_AC, corl_BC, 1), nrow = 3)

n_sims <- 1e6

Y_corr <- MASS::mvrnorm(n = n_sims, mu = mu, Sigma = S_lat)

daty <- Y_corr %>% as.data.frame()

names(daty) <- c("A", "B", "C")

# threshold the latent draws at zero to get binary indicators
daty <- daty %>% mutate_all(~ifelse(. > 0, 1, 0))

get_p <- function(df){
  probs <- 
    data.frame(name = c("p(A)", 
                        "p(B)",
                        "p(C)",
                        "p(A,B)",
                        "p(A,C)",
                        "p(B,C)",
                        "p(A,B,C)"),
               value = c(
                 sum(df$A)/nrow(df),
                 sum(df$B)/nrow(df),
                 sum(df$C)/nrow(df),
                 sum(df$A&df$B)/nrow(df),
                 sum(df$A&df$C)/nrow(df),
                 sum(df$C&df$B)/nrow(df),
                 sum(df$A&df$B&df$C)/nrow(df)
               )
    )
  return(probs)
}

get_p(daty)
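And a quick check of the simulated probabilities against the target values (a small addition; p(A,B,C) has no target, it is implied by the latent-Normal construction):

# compare simulated probabilities to the targets (NA for p(A,B,C),
# which is not pinned down by the marginal and pairwise inputs)
targets <- c(p_A, p_B, p_C, p_AandB, p_AandC, p_BandC, NA)
cbind(get_p(daty), target = targets)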
