Inferring Latent Discrete Variables from Marginal Distributions

I have a tricky problem that I implemented a solution for previously in JAGS, but I'm now looking to either convert that solution to Stan (if that's possible) or try an alternative model. The full background is in this paper, https://doi.org/10.12688/wellcomeopenres.15849.3, under "Long-term condition prevalence and correlation models"; our implementation is fully described on pages 114-118 of https://theses.gla.ac.uk/83512/1/2022ChadwickPhD.pdf, and the JAGS code is at https://github.com/dmcalli2/covid19_yll_final/blob/master/Scripts/Extracting_comorbidity_Distributions/0201_JAGS_Model_Unidentifiable_Converged.R.

Essentially, we want to understand the correlations between binary variables using only their marginal counts. The application was COVID-19: we had two summary tables of data on those who had died (individual-level data was not available). Table A gave the number of people who died with COPD, diabetes, heart disease, etc.; Table B gave the number of people who died with 1 disease, 2 diseases, or 3+ diseases. We wanted to estimate the plausible range of correlations between the diseases given the constraints presented in the two tables. Our medic collaborators told us that a multivariate probit was an adequate way to model the latent individual-level data and that the distribution of comorbidity counts (Table B) was Poisson. Given the complexity of the problem, we were happy to make these assumptions!

We approached the problem by imagining the individual-level data as a binary table (Table C), filled with NAs, with a row for each patient and a column for each disease. Table A gave us the column sums (the number of people with each disease) and Table B gave us the row sums (the number of 1s in each row). We needed to estimate the parameters of a multivariate probit that would generate a Table C consistent with Tables A and B.

Can anyone suggest the best way to address this problem in Stan? In JAGS, we essentially simulate Table C from a multivariate probit and do rejection sampling, plus some slightly hacky soft constraints to make things (approximately) equal to each other; a rough sketch of the idea follows.
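For concreteness, here's a minimal R sketch of the rejection idea (this is not our actual JAGS code, which swaps the hard equalities below for the soft constraints shown later in the thread; mu, Sigma, table_A, and table_B_counts are placeholder inputs standing in for the two summary tables):

# Sketch of the rejection-sampling idea: draw latent scores from a
# multivariate normal, threshold at 0 to get a candidate Table C, and
# accept the draw only if it reproduces both sets of observed margins.
simulate_table_C <- function(mu, Sigma, n) {
  Z <- MASS::mvrnorm(n, mu, Sigma)  # latent disease scores
  (Z >= 0) * 1                      # 0/1 diagnoses
}

accept_draw <- function(C, table_A, table_B_counts) {
  # table_A = column sums; table_B_counts = counts of people with 0..K diseases
  row_counts <- tabulate(rowSums(C) + 1, nbins = ncol(C) + 1)
  all(colSums(C) == table_A) && all(row_counts == table_B_counts)
}

Exact matching has a vanishingly small acceptance probability for a large population, which is why the JAGS implementation replaces the hard equalities with high-precision soft constraints.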

Interesting problem! I don’t have a good answer to your question, but maybe some follow-up clarification/confirmation would help. I put together some R code (below) that I think represents the data situation you have in mind. Hopefully I got that right.

library(dplyr)
library(purrr)
library(tidyr)

population_size <- 10000

mu <- qnorm(c(0.05, 0.12, 0.03))
Sigma <- matrix(c(1, 0.2, 0.4,
                  0.2, 1, 0.1,
                  0.4, 0.1, 1), nrow = 3)

diseases <- c('COPD', 'Diabetes', 'Heart Disease')
names(mu) <- diseases
colnames(Sigma) <- diseases
rownames(Sigma) <- diseases

# Unobserved person-level data...
# ...unobservable disease score
latent_X <- MASS::mvrnorm(population_size, mu, Sigma) %>%
  as_tibble()

# ...observable disease diagnosis
Table_C <- latent_X %>%
  map_df(function(x) { as.numeric(x >= 0)}) 


# Observed population-level data...
# ...count of people who have a diagnosis for each disease
Table_A <- Table_C %>%
  gather('Disease', 'Diagnosis') %>%
  group_by(Disease) %>%
  summarize(Num_Diagnoses = sum(Diagnosis)) %>%
  ungroup()

# ...count of people who have a certain number of disease diagnoses
Table_B <- Table_C %>%
  mutate(Num_Diseases = rowSums(.)) %>%
  group_by(Num_Diseases) %>%
  count() %>%
  ungroup()

Table A

Disease         Num_Diagnoses
COPD                      501
Diabetes                 1203
Heart Disease             305

Table B

Num_Diseases       n
0               8197
1               1610
2                180
3                 13

Table C (first few rows)

COPD   Diabetes   Heart Disease
1      0          0
0      1          1
0      0          0
0      0          0
0      1          0
0      0          0
0      0          0

Are these two assumptions (multivariate probit and Poisson comorbidity counts) compatible? My understanding is that for the comorbidity counts to be even approximately Poisson-distributed, the latent multivariate normal variables would need to be close to independent, with small marginal probabilities. At the extreme, imagine three perfectly correlated conditions, so that the only possible comorbidity counts are 0 and 3. Or imagine three independent conditions where one has a rate of 99% and the others have rates of 0.01%, so the count is almost always exactly 1. Neither count distribution looks remotely Poisson.
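For example, a quick simulation in the same vein as the code above (the 0.99 correlations and 10% prevalences are just illustrative values):

# With near-perfect correlation, the comorbidity counts pile up on 0 and 3,
# very unlike a Poisson distribution with the same mean.
Sigma_extreme <- matrix(0.99, nrow = 3, ncol = 3)
diag(Sigma_extreme) <- 1
Z <- MASS::mvrnorm(10000, qnorm(rep(0.1, 3)), Sigma_extreme)
counts <- rowSums(Z >= 0)
table(counts)                            # almost entirely 0s and 3s
round(dpois(0:3, mean(counts)) * 10000)  # Poisson with the same mean, for comparison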

I suspect modelling the unobserved person-level data is inefficient. At least for a reasonably small number of conditions, I wonder if you could instead model the probabilities of each combination of diseases and then calculate the correlation from the posterior distribution of those cell probabilities. For example, with two conditions the model might look something like the code below. (Note that this model isn't the correct joint distribution of Table A and Table B, since both tables are derived from the same individuals and so their likelihoods aren't independent, but it's just meant to illustrate the idea.)

data {
    int<lower=0> N;
    array[2] int TableA;
    array[3] int TableB;
}
parameters {
    simplex[4] cell_probs; // 1=[0,0], 2=[1,0], 3=[0,1], 4=[1,1]
}
transformed parameters {
    vector[2] TableA_probs;
    vector[3] TableB_probs;

    TableA_probs[1] = cell_probs[2] + cell_probs[4];
    TableA_probs[2] = cell_probs[3] + cell_probs[4];

    TableB_probs[1] = cell_probs[1];
    TableB_probs[2] = cell_probs[2] + cell_probs[3];
    TableB_probs[3] = cell_probs[4];
}
model {
    TableA ~ binomial(N, TableA_probs);
    TableB ~ multinomial(TableB_probs);
}
generated quantities {
    real corr = some_function(cell_probs); // placeholder for the correlation calculation
}
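For two binary conditions, a natural choice for the some_function placeholder is the Pearson correlation of the two indicators (the phi coefficient), which is a simple function of the cell probabilities. A minimal R version (the function name and the example numbers are made up):

# Pearson correlation (phi coefficient) of two binary variables from the
# four cell probabilities, ordered [0,0], [1,0], [0,1], [1,1].
phi_from_cells <- function(cell_probs) {
  p1  <- cell_probs[2] + cell_probs[4]  # P(condition 1 present)
  p2  <- cell_probs[3] + cell_probs[4]  # P(condition 2 present)
  p11 <- cell_probs[4]                  # P(both present)
  (p11 - p1 * p2) / sqrt(p1 * (1 - p1) * p2 * (1 - p2))
}

phi_from_cells(c(0.85, 0.05, 0.08, 0.02))  # e.g., applied to one posterior draw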

It’s really hard to read code that’s not indented and has more comments than code. ChatGPT sorted it out for me:

# Soft constraints linking the latent presence/absence table (pres) to the
# observed comorbidity counts: zeroes[] is presumably a data vector of 0s,
# so each dnorm term pulls sum(pres[i,]) towards comorb_count[i] (precision 4).
for (i in 1:markers[2]) {
  zeroes[i] ~ dnorm(sum(pres[i,]) - comorb_count[i], 4)
  comorb_count[i] ~ dpois(lambda)
}
# Patients in the "3+" category: the truncation restricts the count to 3-11.
for (j in (markers[2] + 1):markers[3]) {
  zeroes[j] ~ dnorm(sum(pres[j,]) - comorb_count[j], 4)
  comorb_count[j] ~ dpois(lambda) T(3, 11)
}
# Soft constraint matching each column sum of pres to the observed
# per-disease totals (Table A).
for (D in 1:n_dis) {
  diseases[D] ~ dnorm(sum(pres[, D]), 0.5)
}
# Multivariate probit: latent scores Z, thresholded at 0 to give presence.
for (p in 1:pop_size) {
  Z[p, 1:n_dis] ~ dmnorm(mu[p, 1:n_dis], Sigma[,])
  for (d in 1:n_dis) {
    pres[p, d] <- step(Z[p, d])
    mu[p, d] <- intercept[d]
  }
}
# Separate block for patients with individual-level data, which informs the
# same intercepts and correlation structure via a near-deterministic Bernoulli.
for (ipd in 1:ipd_pop_size) {
  Z_ipd[ipd, 1:n_dis] ~ dmnorm(mu_ipd[ipd, 1:n_dis], Sigma[,])
  for (d_ipd in 1:n_dis) {
    data_ipd[ipd, d_ipd] ~ dbern(ifelse(step(Z_ipd[ipd, d_ipd]), 0.999, 0.001))
    mu_ipd[ipd, d_ipd] <- intercept[d_ipd]
  }
  comorb_count_ipd[ipd] ~ dpois(lambda)
}
# Priors
for (D in 1:n_dis) {
  intercept[D] ~ dnorm(0, 0.2)
}
lambda ~ dgamma(0.1, 0.1)
Sigma ~ dscaled.wishart(rep(1, n_dis), 50)

One of the reasons I designed Stan the way I did was to make it clear what was data and what was a parameter. Are there discrete parameters anywhere here, or is all the discrete stuff observed (e.g., data_ipd, comorb_count, pres[p,d], comorb_count_ipd)?

This can mostly be translated to Stan directly, but there are discontinuities due to the use of step functions that will cause problems for Hamiltonian Monte Carlo.

If you do translate to Stan, it would help to reparameterize using Cholesky factors. We suggest a scaled correlation matrix parameterization in the User's Guide, but we've added Cholesky-factored Wisharts, too. I would also recommend a more weakly informative prior than dgamma(0.1, 0.1).

Most of this can be vectorized into more concise code in Stan. ChatGPT is reasonably adept at doing a crude translation to Stan, and it would do even better if you told it what was data and what was parameter and gave it a hint to use a Cholesky-factor parameterization. But you have to understand Stan well enough to figure out whether its translation is sound.
