I’m trying to understand how to incorporate occasional known cluster labels into a mixture model, à la semi-supervised learning.

Below is the data-generating process. Notice the line `y[1] = normal_rng(mu[3],1);`, where I make a hard assignment of a single data point to a specific cluster (3). All the other data points are treated as having unknown cluster assignments.

```
library(rstan)

# Run the generator once; Fixed_param because the program has no parameters block
inject_fit <- stan(file = 'data_generation_normal_mixture.stan',
                   iter = 1, chains = 1, algorithm = "Fixed_param")
```

data_generation_normal_mixture.stan:

```
transformed data {
  int M = 100; // number of data points
  int K = 3;   // number of clusters
  // LATENT PARAMETERS
  simplex[K] theta = [0.5, 0.25, 0.25]'; // proportion of each cluster
  vector[K] mu;                          // mean value of each cluster
  mu[1] = -5;
  mu[2] = 0;
  mu[3] = 5;
}
generated quantities {
  // theta ~ dirichlet(alpha);
  vector[M] y;
  // SEMI-SUPERVISED DATA
  // I KNOW FOR SURE THAT THE FIRST OBSERVATION IS FROM THE THIRD CLUSTER
  y[1] = normal_rng(mu[3], 1);
  // GENERATE DATA FOR THE REST OF THE OBSERVATIONS
  for (m in 2:M) {
    int this_cluster = categorical_rng(theta);
    y[m] = normal_rng(mu[this_cluster], 1);
  }
}
```
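
Written as densities, the generating program above amounts to the following. This is only a sketch: $z_m$ is notation I'm introducing here for the cluster label of observation $m$ (it does not appear in the code), and the standard deviation is fixed at 1 as in the program.

$$
\begin{aligned}
y_1 \mid z_1 = 3 &\sim \mathcal{N}(\mu_3, 1), \\
z_m &\sim \operatorname{Categorical}(\theta), \quad y_m \mid z_m = k \sim \mathcal{N}(\mu_k, 1), \qquad m = 2, \dots, M, \\
p(y_m \mid \theta, \mu) &= \sum_{k=1}^{K} \theta_k \, \mathcal{N}(y_m \mid \mu_k, 1), \qquad m = 2, \dots, M.
\end{aligned}
$$

So the first observation depends only on $\mu_3$, while every other observation becomes a mixture over all $K$ clusters once $z_m$ is summed out.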

Here is my attempt to pass in this known cluster through the vector `cluster_assignment`, where 0 means the assignment is unknown. Note the line `lambda[k] += positive_infinity();`, which is where I tried (and failed) to incorporate the known value. Any advice on the proper way to incorporate this information into a model? Thanks!

```
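M <- 100  # number of data points (matches the generating program)
K <- 3    # number of clusters (matches the generating program)
cluster_assignment <- rep(0, M)  # 0 = unknown cluster
cluster_assignment[1] <- 3       # SET THE FIRST OBSERVATION AS THE 3RD CLUSTER
y <- extract(inject_fit)$y[1, ]
stan_data <- list(M = M, K = K, y = y, cluster_assignment = cluster_assignment)
fit <- stan(file = 'fit_normal_mixture.stan', data = stan_data,
            iter = 1000, chains = 1)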
```

fit_normal_mixture.stan:

```
data {
  int<lower=0> M; // number of data points
  int<lower=1> K; // number of clusters
  vector[M] y;
  vector<lower=0,upper=K>[M] cluster_assignment; // KNOWN CLUSTER ASSIGNMENT
}
transformed data {
  real sigma = 1.0;
}
parameters {
  // LATENT PARAMETERS
  simplex[K] theta; // proportion of each cluster
  vector[K] mu;     // mean value of each cluster
}
model {
  // priors
  theta ~ dirichlet(rep_vector(1, K));
  mu ~ normal(0, 5);
  // DATA FOR THE OBSERVATIONS
  for (m in 1:M) {
    vector[K] lambda = rep_vector(0.0, K);
    for (k in 1:K) {
      lambda[k] += categorical_lpmf(k | theta);
      // IF THIS IS A KNOWN CLUSTER ASSIGNMENT
      if (cluster_assignment[m] == k) {
        // !!!!!!THIS IS THE PART I DONT KNOW WHAT TO DO WITH!!!!!!!
        lambda[k] += positive_infinity();
      } else {
        // ALL OF THE UNKNOWN CLUSTER ASSIGNMENTS
        lambda[k] += normal_lpdf(y[m] | mu[k], sigma);
      }
    }
    target += log_sum_exp(lambda);
  }
}
```