# How to inform a semi-supervised mixture

I’m trying to understand how to incorporate occasionally known cluster assignments into a mixture model, semi-supervised style.

Below is the data generating process. Notice the line `y[1] = normal_rng(mu[3], 1);`, where I make a hard assignment of a single data point to a specific cluster (the third). All the other data points are treated as unknown cluster assignments.

``````
inject_fit <- stan(file = 'data_generation_normal_mixture.stan',
                   iter = 1, chains = 1, algorithm = "Fixed_param")
``````

data_generation_normal_mixture.stan:

``````
transformed data {
  int M = 100; // number of data points
  int K = 3;   // number of clusters

  // LATENT PARAMETERS
  simplex[K] theta = [0.5, 0.25, 0.25]'; // proportion of each cluster
  vector[K] mu;                          // mean value of each cluster

  mu[1] = -5;
  mu[2] = 0;
  mu[3] = 5;
}

generated quantities {
  //theta ~ dirichlet(alpha);
  vector[M] y;

  // SEMI-SUPERVISED DATA:
  // I KNOW FOR SURE THAT THE FIRST OBSERVATION IS FROM THE THIRD CLUSTER
  y[1] = normal_rng(mu[3], 1);

  // GENERATE DATA FOR THE REST OF THE OBSERVATIONS
  for (m in 2:M) {
    int this_cluster = categorical_rng(theta);
    y[m] = normal_rng(mu[this_cluster], 1);
  }
}
``````

Now here is my failed attempt to supply the known cluster via the vector `cluster_assignment`. Note the line `lambda[k] += positive_infinity();`, which is my failed attempt to incorporate the known value. Any advice on the proper way to incorporate this information into the model? Thanks!

``````
M <- 100
K <- 3

cluster_assignment <- rep(0, M)
cluster_assignment[1] <- 3 # SET THE FIRST OBSERVATION AS THE 3RD CLUSTER

y <- extract(inject_fit)$y[1, ]

stan_data <- list(M = M, K = K, y = y, cluster_assignment = cluster_assignment)

fit <- stan(file = 'fit_normal_mixture.stan', data = stan_data,
            iter = 1000, chains = 1)
``````
``````
data {
  int<lower=0> M; // number of data points
  int<lower=1> K; // number of clusters
  vector[M] y;
  int<lower=0, upper=K> cluster_assignment[M]; // KNOWN CLUSTER ASSIGNMENT (0 = unknown)
}

transformed data {
  real sigma = 1.0;
}

parameters {
  // LATENT PARAMETERS
  simplex[K] theta; // proportion of each cluster
  vector[K] mu;     // mean value of each cluster
}

model {
  // priors
  theta ~ dirichlet(rep_vector(1, K));
  mu ~ normal(0, 5);

  // DATA FOR THE OBSERVATIONS
  for (m in 1:M) {
    vector[K] lambda = rep_vector(0.0, K);

    for (k in 1:K) {
      lambda[k] += categorical_lpmf(k | theta);

      // IF THIS IS A KNOWN CLUSTER ASSIGNMENT
      if (cluster_assignment[m] == k) {
        // !!!!!! THIS IS THE PART I DON'T KNOW WHAT TO DO WITH !!!!!!
        lambda[k] += positive_infinity();
      } else {
        // ALL OF THE UNKNOWN CLUSTER ASSIGNMENTS
        lambda[k] += normal_lpdf(y[m] | mu[k], sigma);
      }
    }
    target += log_sum_exp(lambda);
  }
}

``````

Hi,
I believe the problem as posed is misspecified - could you give some more detail on the real-world task you are trying to solve?

My best guess would be that you should split the input data into two parts: data points with known clusters and data points with unknown clusters. The data points in the known clusters only inform `mu`. For the data points in unknown clusters you should then explicitly model the probability they belong to each cluster as a `simplex`.

Note that this model would break if there are any clusters you don’t have any observations for.
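Concretely, such a split might use a data block along these lines (a rough sketch; the names `M_known_cluster`, `M_unknown_cluster`, `y_known_cluster`, `y_unknown_cluster` are illustrative, and `cluster_assignment` now only covers the known observations):

``````
data {
  int<lower=1> K;                 // number of clusters
  int<lower=0> M_known_cluster;   // observations with a known cluster
  int<lower=0> M_unknown_cluster; // observations with an unknown cluster
  vector[M_known_cluster] y_known_cluster;
  vector[M_unknown_cluster] y_unknown_cluster;
  int<lower=1, upper=K> cluster_assignment[M_known_cluster];
}
``````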

Thanks Martin, that was exactly the guidance I needed. I switched to passing the data in as two separate structures, and it works now!

``````
// DATA FOR THE KNOWN OBSERVATIONS
for (m in 1:M_known_cluster) {
  int k = cluster_assignment[m];
  y_known_cluster[m] ~ normal(mu[k], sigma);
}
``````
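For completeness, the unknown-cluster observations are then handled with the usual mixture marginalization, summing the per-cluster log densities with `log_sum_exp` (a sketch, assuming the same names as in the snippet above plus a hypothetical `y_unknown_cluster` vector):

``````
// DATA FOR THE UNKNOWN OBSERVATIONS: marginalize out the latent cluster
for (m in 1:M_unknown_cluster) {
  vector[K] lambda;
  for (k in 1:K)
    lambda[k] = log(theta[k]) + normal_lpdf(y_unknown_cluster[m] | mu[k], sigma);
  target += log_sum_exp(lambda);
}
``````

Note that the known observations contribute a plain `normal` log density with no `log_sum_exp`, which is exactly what the `positive_infinity()` hack was trying to approximate.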