How to inform a semi-supervised mixture


#1

I’m trying to understand how to incorporate the occasional known cluster label into a mixture model, à la semi-supervised learning.


Below is the data-generating process. Notice the line y[1] = normal_rng(mu[3],1); where I make a hard assignment of a single data point to a specific cluster (3). All the other data points are treated as having unknown cluster assignments.

library(rstan)

inject_fit <- stan(file = 'data_generation_normal_mixture.stan', 
                   iter = 1, chains = 1, algorithm = "Fixed_param")

data_generation_normal_mixture.stan:

transformed data {
   
   int M = 100; // number of data points 
   int K = 3; // number of clusters 

   //LATENT PARAMETERS
   simplex[K] theta = [0.5,0.25,0.25]'; // proportion of each cluster
   vector[K] mu; //mean value of each cluster
   
   mu[1] = -5;
   mu[2] = 0;
   mu[3] = 5;
}

generated quantities {
   //theta ~ dirichlet(alpha);
   vector[M] y;

   //SEMISUPERVISED DATA
   //I KNOW FOR SURE THAT THE FIRST OBSERVATION IS FROM THE THIRD CLUSTER
   y[1] = normal_rng(mu[3],1);
   
   // GENERATE DATA FOR THE REST OF THE OBSERVATIONS
   for (m in 2:M) {
      int this_cluster = categorical_rng(theta);
      y[m] = normal_rng(mu[this_cluster],1);
   }
}


Now here is my failed attempt to pass this known cluster assignment to the model through the vector “cluster_assignment”, where 0 means the cluster is unknown. Note the line lambda[k] += positive_infinity(); which is where I tried (unsuccessfully) to incorporate the known value. Any advice on the proper way to incorporate this information into a model? Thanks!

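M <- 100 # MUST MATCH M IN data_generation_normal_mixture.stan
K <- 3   # MUST MATCH K IN data_generation_normal_mixture.stan

cluster_assignment=rep(0,M) # 0 = UNKNOWN CLUSTER
cluster_assignment[1]=3 #SET THE FIRST OBSERVATION AS THE 3RD CLUSTER

y = extract(inject_fit)$y[1,]

stan_data <- list(M=M, K=K, y=y, cluster_assignment=cluster_assignment)

fit <- stan(file = 'fit_normal_mixture.stan', data = stan_data,
               iter = 1000, chains = 1)  

fit_normal_mixture.stan:
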
data {
   
int<lower=0> M; // number of data points 
int<lower=1> K; // number of clusters 
vector[M] y;
vector<lower=0,upper=K>[M] cluster_assignment; //KNOWN CLUSTER ASSIGNMENT

}

transformed data {
   
   real sigma = 1.0;
}

parameters {
   //LATENT PARAMETERS
   simplex[K] theta; // proportion of each cluster
   vector[K] mu; //mean value of each cluster
}

model {
   
   //priors
   theta ~ dirichlet(rep_vector(1,K));
   mu ~ normal(0,5);

   
   // DATA FOR THE OBSERVATIONS
   for (m in 1:M) {
      
      vector[K] lambda = rep_vector(0.0,K);
      
      for (k in 1:K) {
         lambda[k] += categorical_lpmf(k | theta);
      
         //IF THIS IS A KNOWN CLUSTER ASSIGNMENT
         if(cluster_assignment[m] == k) {
            
            //!!!!!!THIS IS THE PART I DONT KNOW WHAT TO DO WITH!!!!!!!
            lambda[k] += positive_infinity();
         } else {
            //ALL OF THE UNKNOWN CLUSTER ASSIGNMENTS
            lambda[k] += normal_lpdf(y[m] | mu[k],sigma);
         }
      }
      target += log_sum_exp(lambda);
      
   }
}
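
A short numerical aside on why the positive_infinity() line cannot work: adding +infinity to one entry of lambda makes log_sum_exp(lambda) itself +infinity, so the target is no longer a finite log density. If the marginalised form is to be kept, one option is to mask the other components with negative infinity instead, which reduces the sum to the known component's own term. The plain-Python sketch below only mirrors that arithmetic. The observation value 4.2 and the helper names are made up for illustration; theta and mu are the values from the data-generating program above.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def log_sum_exp(values):
    """Stable log(sum(exp(v))); an infinite maximum is returned as-is."""
    m = max(values)
    if math.isinf(m):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


theta = [0.5, 0.25, 0.25]   # mixing proportions from the generating program
mu = [-5.0, 0.0, 5.0]       # cluster means from the generating program
sigma = 1.0
y1 = 4.2                    # hypothetical observed value for the labelled point
known = 3                   # 1-based cluster label, as in cluster_assignment[1]

# The attempted approach: +infinity on the known component.
lam_attempt = [
    math.log(theta[k]) + (math.inf if k + 1 == known
                          else normal_lpdf(y1, mu[k], sigma))
    for k in range(len(mu))
]
attempt = log_sum_exp(lam_attempt)

# One alternative: keep the known component's term, mask the others with -infinity.
lam_masked = [
    math.log(theta[k]) + normal_lpdf(y1, mu[k], sigma) if k + 1 == known
    else -math.inf
    for k in range(len(mu))
]
masked = log_sum_exp(lam_masked)

print(attempt)  # infinite: not a usable contribution to the target
print(masked)   # finite: log(theta[3]) + normal_lpdf(y1 | mu[3], sigma)
```

Whether the log(theta) term belongs in the known point's contribution depends on how the labelled points were obtained; here the first observation was assigned to cluster 3 by hand rather than drawn from theta.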



#2

Hi,
I believe the problem you are trying to solve is misspecified - could you provide some more details on what the real-world task is?

My best guess would be that you should split the input data into two parts: data points with known clusters and data points with unknown clusters. The data points in the known clusters only inform mu. For the data points in unknown clusters you should then explicitly model the probability they belong to each cluster as a simplex.

Note that this model would break if there are any clusters you don’t have any observations for.
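
The suggestion above can be sketched numerically as follows. This is plain Python rather than Stan, and it shows only one reading of the advice: labelled points contribute a single normal term for their own cluster (so they inform mu but not theta), while unlabelled points are marginalised over clusters with weights theta. All function and argument names are hypothetical.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if math.isinf(m):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def semi_supervised_log_lik(y_known, z_known, y_unknown, theta, mu, sigma):
    """Log-likelihood with labelled and unlabelled points handled separately.

    z_known holds 1-based cluster labels, matching Stan's indexing.
    """
    total = 0.0
    # Labelled points: only the component they are known to come from.
    for y, z in zip(y_known, z_known):
        total += normal_lpdf(y, mu[z - 1], sigma)
    # Unlabelled points: marginalise the cluster out with weights theta.
    for y in y_unknown:
        total += log_sum_exp([
            math.log(theta[k]) + normal_lpdf(y, mu[k], sigma)
            for k in range(len(mu))
        ])
    return total


# Illustrative values only; the data points here are made up.
theta = [0.5, 0.25, 0.25]
mu = [-5.0, 0.0, 5.0]
ll = semi_supervised_log_lik([4.8], [3], [0.0, -5.3], theta, mu, 1.0)
print(ll)
```

In a Stan model the same two loops would each add to the target, which keeps the labelled points out of the log_sum_exp entirely.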


#3

Thanks Martin, that was exactly the guidance I needed. I switched to bringing in the data in 2 different data structures and it works now!

   //DATA FOR THE KNOWN OBSERVATIONS
   for (m in 1:M_known_cluster) {
      int k = cluster_assignment[m];
      y_known_cluster[m] ~ normal(mu[k], sigma);
   }
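
For reference, the split into two data structures might look like the sketch below. It is written in Python for illustration (the thread itself uses R) and treats 0 as "unknown", as in the original cluster_assignment vector. The names M_known_cluster, y_known_cluster and cluster_assignment come from the snippet above; M_unknown_cluster and y_unknown_cluster are assumed names for the remaining points, and the function name is hypothetical.

```python
def split_for_stan(y, cluster_assignment):
    """Split observations into labelled and unlabelled groups.

    cluster_assignment uses 0 for "unknown" and 1..K for a known cluster.
    Returns a dict shaped like a Stan data list.
    """
    y_known, z_known, y_unknown = [], [], []
    for value, label in zip(y, cluster_assignment):
        if label > 0:
            y_known.append(value)
            z_known.append(int(label))
        else:
            y_unknown.append(value)
    return {
        "M_known_cluster": len(y_known),
        "y_known_cluster": y_known,
        "cluster_assignment": z_known,          # labels for known points only
        "M_unknown_cluster": len(y_unknown),    # assumed name
        "y_unknown_cluster": y_unknown,         # assumed name
    }


# Made-up example: the first point is known to belong to cluster 3.
example = split_for_stan([4.8, -5.1, 0.3, 5.2], [3, 0, 0, 0])
print(example["M_known_cluster"], example["M_unknown_cluster"])
```

With the data in this shape, cluster_assignment can be declared as an integer array in Stan, which is what allows it to index mu directly.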