Hi all,
I have data generated by a two-component Gaussian mixture, and I want to estimate the mixing proportion:
y \sim (1 - \theta)\,\mathcal{N}(\mu_1, \sigma_1^2) + \theta\,\mathcal{N}(\mu_2, \sigma_2^2)
I can ‘label’ some of the datapoints, i.e. sample them and reveal which component they belong to.
I want to use both the labeled and the unlabeled data to estimate the mixing proportion. How can I correct for the bias when weighted sampling is used to pick the datapoints for labeling? Say, for example, the probability of picking datapoint i for labeling is
\exp(y_i) / \sum_j \exp(y_j)
I get to decide how the sampling weights are generated, so we know them exactly. N_labeled << N_unlabeled, so I’m hoping to just ignore the complication that sampling is done without replacement.
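To make the setup concrete, here is a minimal Python sketch (all parameters are illustrative, not from my actual data) of the sampling scheme, and of one standard correction for the label fraction alone: the self-normalised inverse-probability (Hajek) estimate. I'm not claiming this is the fix for the full model, just that it shows the kind of bias the softmax sampling introduces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a two-component mixture (illustrative parameters).
N, theta = 100_000, 0.3
z = rng.random(N) < theta                       # True -> second component
y = np.where(z, rng.normal(2.0, 1.0, N), rng.normal(-2.0, 1.0, N))

# Softmax sampling probabilities, as in the question.
p = np.exp(y) / np.exp(y).sum()

# Label a subset drawn with these probabilities (with replacement here,
# as an approximation, since the labeled set is small relative to N).
idx = rng.choice(N, size=5_000, replace=True, p=p)
labels = z[idx].astype(int)

# The raw label fraction over-represents the high-y component...
naive = labels.mean()

# ...while weighting each labeled point by 1/p corrects for the
# sampling design (Hajek / self-normalised inverse-probability estimate).
w = 1.0 / p[idx]
theta_ipw = (w * labels).sum() / w.sum()
```

With these parameters the naive fraction lands far above the true theta of 0.3, while the weighted estimate is close to it.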
I’ve tried weighting the log likelihood as below. This kind of re-weighting worked for estimating the parameters of a single Gaussian from which a biased sample was taken, but it doesn’t seem to work for my Gaussian mixture.
```stan
data {
  int<lower=0> N_unlabeled;
  int<lower=0> N_labeled;
  vector[N_unlabeled] y_unlabeled;
  vector[N_labeled] y_labeled;
  array[N_labeled] int<lower=0, upper=1> labels;
  vector[N_labeled] sampling_weight;  // p(sampling) for each labeled point
}
transformed data {
  // Normalising constant used for re-weighting.
  real<lower=0> weights_mult = N_labeled / sum(sampling_weight);
}
parameters {
  ordered[2] mu;
  vector<lower=0>[2] sigma;
  real<lower=0, upper=1> theta;
}
model {
  mu ~ normal(0, 3);
  sigma ~ cauchy(0, 5);
  // Unlabeled data: marginalise over component membership.
  for (n in 1:N_unlabeled) {
    target += log_mix(theta,
                      normal_lpdf(y_unlabeled[n] | mu[2], sigma[2]),
                      normal_lpdf(y_unlabeled[n] | mu[1], sigma[1]));
  }
  // Labeled data: component is known; re-weight by the sampling probability.
  for (n in 1:N_labeled) {
    if (labels[n] == 0) {
      target += (log1m(theta) + normal_lpdf(y_labeled[n] | mu[1], sigma[1]))
                / (sampling_weight[n] * weights_mult);
    } else {
      target += (log(theta) + normal_lpdf(y_labeled[n] | mu[2], sigma[2]))
                / (sampling_weight[n] * weights_mult);
    }
  }
}
```
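For reference, the single-Gaussian case mentioned above, where this re-weighting did work for me, can be sketched outside Stan as follows (a minimal Python check with illustrative parameters; the weighting here is the plain inverse-probability form rather than the exact normalisation used in the Stan model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Single Gaussian population; sampling proportional to exp(y) biases
# a naive estimate of the mean upward (roughly by sigma^2 here).
N, mu, sigma = 100_000, 1.0, 1.0
y = rng.normal(mu, sigma, N)
p = np.exp(y) / np.exp(y).sum()

idx = rng.choice(N, size=5_000, replace=True, p=p)
w = 1.0 / p[idx]                      # inverse-probability weights

naive_mean = y[idx].mean()            # biased upward by the sampling
ipw_mean = (w * y[idx]).sum() / w.sum()  # close to the true mu
```

The weighted mean recovers mu while the unweighted mean does not, which is why I expected the same trick to carry over to the mixture.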