Mixture of Multivariate Bernoulli

Hi everyone,

I’m trying to fit a mixture model of multivariate (conditionally) independent Bernoulli variables. Here’s the Stan model used:

data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  int<lower=0> D;          // Dimension of the data
  int y[N,D];              // Observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  matrix<lower=0,upper=1>[K,D] p;             
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K)
        for (d in 1:D)
          lps[k] += bernoulli_lpmf(y[n,d] | p[k,D]);
    target += log_sum_exp(lps);
  }
}

And the Python code for generating the data and compiling the model:

# Data creation
N1 = 200
N2 = 200
D = 4
K = 2
N = N1 + N2 

p1 = np.sort(np.random.random(D))
p2 = np.sort(np.random.random(D))

data1 = np.array([np.random.binomial(size=N1, n=1, p= probability) for probability in p1 ]).T
data2 = np.array([np.random.binomial(size=N2, n=1, p= probability) for probability in p2 ]).T

data = np.concatenate([data1,data2])
np.random.shuffle(data)  # In-place operation that shuffles the rows

# Model compilation
mixture_bernouilli_dat = {'N': N,
                          'K': K,
                          'D': D,
                          'y': data}

sm = pystan.StanModel(model_code=mixture_bernouilli_code)
fit = sm.sampling(data=mixture_bernouilli_dat, iter=1000, chains=4, n_jobs=1, verbose=True)

The problem is that this model performs poorly (the probabilities of each component are not recovered), even though the code compiles without error. I suspect that the issue is in the target increment part, but I didn’t manage to make it work.
Does someone have an idea about what the problem could be in the definition of the model?

Thanks in advance.
AD

There are a few problems.

(1) You have a typo in your model towards the end; you wrote | p[k,D] but presumably you meant | p[k,d] (a corrected sketch follows this list). Once that is fixed you’ll see your model doesn’t converge.

(2) A mixture of Bernoulli distributions is just another Bernoulli distribution if I remember correctly (marginally, P(y_d = 1) = \sum_k \theta_k p_{k,d}, which is itself a single Bernoulli parameter), so your model is literally impossible to identify as an inverse problem.

(3) If you were to have used something other than Bernoulli distributions, your model would still have a massive label-switching problem. Your priors are exchangeable between your mixture components, and it’s possible to generate new modes in your model by just switching the dimensions in your data around. Have a read of Michael Betancourt’s case study on this and other mixture model problems. Eliminating just the label switching from your model by imposing an ordering constraint may be impossible, because your data is a mixture column-wise (not row-wise) and it would require that every element in your K parameter vectors was ordered, i.e. that p[1,d] > \ldots > p[K,d] for all d \in \{1, \ldots, D\}, which wouldn’t fit your data.
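For reference, a minimal sketch of the corrected model, written as a PyStan model string in the same style as the rest of this thread (the variable name corrected_mixture_code is just illustrative): the only substantive change is indexing row k of p rather than always column D, with the inner loop over d vectorized. As noted in (1) and (2), fixing the index alone still won’t make the model converge.

corrected_mixture_code = """
data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  int<lower=0> D;          // dimension of the data
  int y[N,D];              // observations
}
parameters {
  simplex[K] theta;                  // mixing proportions
  matrix<lower=0,upper=1>[K,D] p;    // per-component, per-dimension probabilities
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K)
      lps[k] += bernoulli_lpmf(y[n] | p[k]);  // row k of p, vectorized over d
    target += log_sum_exp(lps);
  }
}
"""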


Hi emiruz,

Thanks for your answer, it helps a LOT : )

  1. Corrected

  2. You’re right. It seems to be an ill-defined problem. So, why does the posterior sampling exhibit the right modes?

  3. What do you mean by “your data is a mixture column-wise (not row-wise)”?
    Thank you for the link. I was wondering about the exchangeability issue.

My final goal is to study the following model: \mathbf{y}^{n} is the n-th sample of a D-dimensional distribution drawn from component k with parameter \theta_k (D-dimensional), \boldsymbol{\beta} and \boldsymbol{\gamma} being D-dimensional hyperparameters (notation: Multi = Multinomial, Ber = Bernoulli)

\begin{aligned} \boldsymbol{\pi} | \alpha & \sim \operatorname{Dir}(\alpha / K, \alpha / K, \ldots, \alpha / K) \\ \boldsymbol{\theta}_{k} | \boldsymbol{\beta}, \boldsymbol{\gamma} & \sim \operatorname{Beta}(\boldsymbol{\beta}, \boldsymbol{\gamma}) \\ z^{n} | \boldsymbol{\pi} & \sim \operatorname{Multi}(\boldsymbol{\pi}) \\ \mathbf{y}^{n} | z^{n}, \boldsymbol{\theta} & \sim \operatorname{Ber}\left(\boldsymbol{\theta}_{z^{n}}\right) \end{aligned}

and add some feature discovery (feature saliency), with \phi_l being the indicator that dimension l is drawn from p (which depends on the mixture component k) and not from q (which is independent of the mixture component k):

p\left(\mathbf{y} ^{n}| \Phi,\left\{\pi_{k}\right\},\left\{\theta_{k l}\right\},\left\{\lambda_{l}\right\}\right)=\sum_{k=1}^{K} \pi_{k} \prod_{l=1}^{D}\left(p\left(y_{l}^{n} | \theta_{kl}\right)\right)^{\phi_{l}}\left(q\left(y_{l}^{n} | \lambda_{l}\right)\right)^{1-\phi_{l}}.

In plain English, my goal is to discover a K-component mixture of multivariate Bernoulli samples with feature selection (more precisely, discovering whether a dimension depends on the mixture component k or not).

Not sure what your graph shows. For any given column i in your data, and index j in that column, d_{i,j} = 1 with probability \lambda \times \theta_{i,1} + (1-\lambda)\times \theta_{i,2}=\theta_i and d_{i,j} = 0 with probability (1-\theta_i).

In your example, you’re generating data from two vectors of 4 Bernoulli parameters, which produces two datasets with 4 columns each. Your concatenation operation then joins corresponding columns together from the 2 datasets, and you randomise the row order for good measure. Hence the mixing between the two datasets happens within each column.
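To see this concretely, here is a quick numeric check (a sketch that assumes p1, p2 and data from the generation code above): with a 50/50 row mix, each column of data should have an empirical mean close to the average of the two generating probabilities for that column, i.e. each column on its own looks like a single Bernoulli.

import numpy as np

# Sanity check of the column-wise mixing, assuming p1, p2 and data come from
# the data-generation snippet above (N1 == N2, so the row mix is 50/50).
expected = 0.5 * (np.asarray(p1) + np.asarray(p2))  # marginal P(y_d = 1) per column
observed = data.mean(axis=0)                        # empirical frequency per column

for d in range(data.shape[1]):
    print('column {}: expected {:.3f}, observed {:.3f}'.format(d, expected[d], observed[d]))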

If the problem wasn’t degenerate, you could model each column to get the parameter in each vector like this:

data {
  int<lower=1> N;
  int y[N];
}
parameters {
  real<lower=0.0001,upper=.9999> lambda;
  ordered[2] theta;
}
model {
  theta~normal(0,6);
  for (n in 1:N)
    target += log_mix(lambda,
                      bernoulli_logit_lpmf(y[n]|theta[1]),
                      bernoulli_logit_lpmf(y[n]|theta[2]));
}
generated quantities {
  vector[2] p = inv_logit(theta);
}

You’d then end up with 4 well-extracted pairs of parameters, and since your mixture is 50/50 you’d also end up with 2^4 possible ways of constructing your 2 vectors that would give you exactly the same result, due to label switching. Again, if the problem wasn’t degenerate (and I think it is) and the mixing proportion wasn’t 50/50, then, since \lambda in the model above would come out the same for every column (due to how you’ve generated your data), you’d have a natural way of deciding which parameters belong together. Hence, you’re only really mixing columns.
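If you want to try this, here is a rough sketch of fitting the column-wise model above to each column of your data with PyStan; column_model_code is an assumed variable holding the Stan program above as a string, and data is the N x D array from your generation code.

import pystan

# Sketch: fit the two-component column-wise mixture to each column separately,
# assuming column_model_code holds the Stan program above and data is the
# N x D integer array generated earlier.
sm_col = pystan.StanModel(model_code=column_model_code)

for d in range(data.shape[1]):
    fit_d = sm_col.sampling(data={'N': data.shape[0], 'y': data[:, d].astype(int)},
                            iter=1000, chains=4, n_jobs=1)
    p_draws = fit_d.extract()['p']  # posterior draws of the two per-column probabilities
    print('column {}: posterior mean p = {}'.format(d, p_draws.mean(axis=0)))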


Oh, I think I understand. You’re saying that the problem can be re-written independently for each column (I thought that you meant that there was some mixing between columns) and solved using the code that you’ve written. For each vector (column) i I’d obtain \theta_{i,1} and \theta_{i,2}.

Yes, it wasn’t clear. Here’s the complete Python code and the complete output (which is the posterior distribution), taking D=2 and \lambda=.8 (since in my real-world application it is unlikely to have an even distribution between components).

Data Generation

N1 = 800
N2 = 200
D = 2
K = 2
N = N1 + N2 

p1 = [.2,.9]
p2 = [.4,.6]

data1 = np.array([np.random.binomial(size=N1, n=1, p=probability) for probability in p1]).T
data2 = np.array([np.random.binomial(size=N2, n=1, p=probability) for probability in p2]).T

data = np.concatenate([data1, data2])
np.random.shuffle(data)

print('The data has dimension {} and {} samples'.format(data.shape[1], data.shape[0]))
print('''
\t\t\t ----Component 1----
                theta[1,0]: {} \t  theta[1,1]: {} 
                
\t\t\t ----Component 2----
                theta[2,0]: {} \t  theta[2,1]: {} '''.format(p1[0], p1[1],p2[0],p2[1]))

Output:

The data has dimension 2 and 1000 samples

			 ----Component 1----
                theta[1,0]: 0.2 	  theta[1,1]: 0.9 
                
			 ----Component 2----
                theta[2,0]: 0.4 	  theta[2,1]: 0.6 

Model

mixture_bernouilli_code = """
data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  int<lower=0> D; 
  int y[N,D];
}
parameters {
  real <lower = 0, upper = 1> lambda; // probabilities of being in one group
  real <lower = 0, upper = 1> theta[D, K];
  
}
model {
  for (n in 1:N) {
      target += log_mix(lambda,
                      bernoulli_lpmf(y[n, ] | theta[, 1]),
                      bernoulli_lpmf(y[n, ] | theta[, 2]));
  }
}
"""

mixture_bernouilli_dat = {'N': N,
                          'K': K,
                          'D': D,
                          'y': data}

sm = pystan.StanModel(model_code=mixture_bernouilli_code)
fit = sm.sampling(data=mixture_bernouilli_dat, iter=1000, chains=4, n_jobs=1, verbose=True)

print(fit)

fit.plot()

Output:

Inference for Stan model: anon_model_653e3e600c802a655e03a79c6055e890.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

             mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
lambda        0.7    0.06   0.22   0.11   0.57   0.76   0.88   0.97     14   1.32
theta[1,1]   0.21    0.05   0.15   0.01   0.14    0.2   0.23   0.66     10   1.61
theta[2,1]   0.88    0.04   0.13   0.49   0.88    0.9   0.94   0.99     11   1.56
theta[1,2]   0.49    0.05   0.21    0.1   0.36   0.46   0.63   0.95     18   1.26
theta[2,2]   0.61    0.05   0.24   0.04   0.47   0.68   0.78   0.96     27   1.16
lp__       -989.8    0.12    1.9 -994.7 -990.8 -989.4 -988.4 -987.4    269    1.0

Samples were drawn using NUTS at Sat Sep 28 21:01:11 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

So, the model seems to be able to recover the parameters in this case.

But re-running the same model will often give a different result, with the parameters not recovered, showing that the problem is indeed non-identifiable.

Wouldn’t imposing a non-decreasing ordering on my mixture components solve my problem?

Last question: if I had an informative prior distribution, the model would become identifiable, right ?

Notice that the Rhat values in your result (up to about 1.6, where values close to 1 are expected at convergence) indicate that your model is not mixing, i.e. the individual chains are producing different results.

No, because even if \theta_{i,1}>\theta_{i,2} there are still infinitely many ways to mix two distributions to create exactly the same marginal \theta_i for the Bernoulli. Here is some example code with a model that has an ordering constraint.

import numpy as np
import pystan as st

np.random.seed(100)
data = np.concatenate([np.random.binomial(n=1,p=.1,size=50),
                       np.random.binomial(n=1,p=.7,size=100),])
np.random.shuffle(data)

model_str = """
data {
  int<lower=1> N;
  int y[N];
}
parameters {
  real<lower=0.0001,upper=.9999> lambda;
  ordered[2] theta;
}
model {
  theta~normal(0,6);
  for (n in 1:N)
    target += log_mix(lambda,
                      bernoulli_logit_lpmf(y[n] | theta[1]),
                      bernoulli_logit_lpmf(y[n] | theta[2]));
}
generated quantities {
  vector[2] p = inv_logit(theta);
}
"""

sm  = st.StanModel(model_code=model_str)
fit = sm.sampling(data={'N': len(data),'y': data},
                  iter=1000,
                  control=dict(adapt_delta=.99))

Notice that the probability implied by the generating model of some data point being d_i=1 is \frac{0.1}{3}+\frac{0.7\times2}{3}=0.5 (the mixing weights are 50/150 and 100/150). The Stan model above mixes just fine. Notice that the recovered parameters are totally different due to non-identifiability, but the implied probability of d_i=1 is right (0.49 \times 0.14 + 0.51 \times 0.88 \approx 0.5), because the recovered parameters achieve equivalent results.

Inference for Stan model: anon_model_0fc2471cecb9a991a2257c55f1680512.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

           mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
lambda     0.49  9.5e-3   0.19   0.08    0.4   0.48   0.57   0.93    417    1.0
theta[1]  -3.89    0.24   3.51 -12.21  -5.94  -2.99  -1.07    0.1    210   1.02
theta[2]   4.22    0.19   3.74   0.11   1.28   3.24   6.27  13.19    393   1.01
p[1]       0.14    0.01   0.17 5.0e-6 2.6e-3   0.05   0.26   0.52    218   1.02
p[2]       0.88  9.4e-3   0.15   0.53   0.78   0.96    1.0    1.0    254   1.01
lp__     -104.9    0.07   1.31 -108.2 -105.4 -104.5 -104.0 -103.5    365   1.01
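As a quick numeric illustration of both points (a sketch in plain Python, using only numbers already quoted in this thread): the generating mixture and the recovered mixture imply almost the same marginal probability, and so does a third, entirely different mixture that also respects the ordering constraint, which is why the individual parameters cannot be pinned down.

# Marginal P(d_i = 1) implied by the generating model: 50 points with p=0.1
# and 100 points with p=0.7, so the mixing weights are 1/3 and 2/3.
print((50 / 150) * 0.1 + (100 / 150) * 0.7)   # 0.5

# Marginal implied by the posterior means recovered above
# (lambda ~ 0.49, p[1] ~ 0.14, p[2] ~ 0.88).
print(0.49 * 0.14 + 0.51 * 0.88)              # ~ 0.52

# A different (lambda, p1, p2), still with p1 < p2 as the ordering constraint
# requires, gives exactly the same marginal of 0.5: non-identifiability.
print(0.25 * 0.2 + 0.75 * 0.6)                # 0.5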

If they were also non-exchangeable, then maybe. You need to introduce enough structure for the right solution to also be the overwhelmingly likely solution.
