Weighted sample of a Gaussian mixture

Hi all,

I have data generated by a two-component Gaussian mixture. I want to estimate the mixing proportion.

y \sim \theta \mathcal{N}(\mu_1, \sigma_1^2) + (1 - \theta) \mathcal{N}(\mu_2, \sigma_2^2)

I can ‘label’ some of the datapoints, i.e., sample them and reveal which component they belong to.

I want to use both labeled and unlabeled data to estimate the mixing proportion. How can I correct for the fact that weighted sampling was used to pick the datapoints for labeling? Suppose, e.g., that the probability of picking datapoint i for labeling is

\exp(y_i) / \sum_j \exp(y_j)

I get to decide how the sampling weights are generated, so they are known exactly. N_unlabeled >> N_labeled, so I’m hoping I can just ignore the complication that sampling is done without replacement.

I’ve tried weighting the log likelihood as below. This kind of re-weighting worked for estimating the parameters of a single Gaussian from which a biased sample was taken, but it doesn’t seem to work for my Gaussian mixture.
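To spell out what the code below does with the labeled points, writing p_n for the sampling probability above and z_n for the observed label, the weighted term it adds to the target is

\sum_{n=1}^{N_\text{labeled}} \frac{\log p(y_n, z_n \mid \theta, \mu, \sigma)}{\tilde{w}_n}, \qquad \tilde{w}_n = p_n \, \frac{N_\text{labeled}}{\sum_m p_m}

where \log p(y_n, z_n \mid \cdot) is \log \theta + \log \mathcal{N}(y_n \mid \mu_2, \sigma_2^2) when z_n = 1 and \log(1 - \theta) + \log \mathcal{N}(y_n \mid \mu_1, \sigma_1^2) when z_n = 0.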

data {
    int<lower=0> N_unlabeled;
    int<lower=0> N_labeled;
    vector[N_unlabeled] y_unlabeled;
    vector[N_labeled] y_labeled;
    array[N_labeled] int<lower=0, upper=1> labels;
    vector[N_labeled] sampling_weight; // p(sampling)
}
transformed data {
    real<lower=0> weights_mult; // used for re-weighting
    weights_mult = N_labeled / sum(sampling_weight);
}
parameters {
    ordered[2] mu;
    vector<lower=0>[2] sigma;
    real<lower=0, upper=1> theta; 
} 
model {
    for (i in 1:2) {
        mu[i] ~ normal(0, 3);
    }
    sigma ~ cauchy(0, 5);

    for (n_unlab in 1:N_unlabeled) {
        // marginalize over the unknown component; theta weights component 2 (mu[2])
        target += log_mix(theta,
                          normal_lpdf(y_unlabeled[n_unlab] | mu[2], sigma[2]),
                          normal_lpdf(y_unlabeled[n_unlab] | mu[1], sigma[1]));
    }

    for (n_lab in 1:N_labeled) {
        // divide each labeled point's log density by its normalized sampling probability
        if (labels[n_lab] == 0) {
            target += (log1m(theta) + normal_lpdf(y_labeled[n_lab] | mu[1], sigma[1]))
                      / (sampling_weight[n_lab] * weights_mult);
        } else {
            target += (log(theta) + normal_lpdf(y_labeled[n_lab] | mu[2], sigma[2]))
                      / (sampling_weight[n_lab] * weights_mult);
        }
    }
}

I found the reversal of mu[1] and mu[2] between the two loops a bit confusing.

The code you wrote looks OK for the weighting. In what way is it failing? Did you simulate data and fail to recover the parameters? There’s an issue with recovering variances under weighting, I believe.
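If you haven’t, a quick check is to simulate from the generative process and see whether fitting recovers the truth. Here’s a minimal sketch (run with algorithm=fixed_param; N and the true parameter values passed in as data are placeholders, not values from your problem):

data {
  int<lower=0> N;
  real<lower=0, upper=1> theta;   // true mixing proportion
  ordered[2] mu;
  vector<lower=0>[2] sigma;
}
generated quantities {
  vector[N] y;
  array[N] int<lower=0, upper=1> z;
  vector[N] p_sample;
  for (n in 1:N) {
    z[n] = bernoulli_rng(theta);                       // z == 1 -> component 2, matching your model
    y[n] = normal_rng(mu[z[n] + 1], sigma[z[n] + 1]);
  }
  p_sample = softmax(y);                               // exp(y_i) / sum_j exp(y_j)
}

Then draw the labeled subset outside of Stan using p_sample as the selection probabilities and check whether theta comes back.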

You can also vectorize this:

mu ~ normal(0, 3);  // vectorize

I would put a tighter prior on sigma unless you expect it to sometimes take values in the 1000s.
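For example (the scale here is just a guess; pick one that matches the spread of your data):

sigma ~ normal(0, 2);  // half-normal, given the <lower=0> constraint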

You can use compound declare-and-define in transformed data:

real<lower=0> weights_mult = N_labeled / sum(sampling_weight);

I wrote a blog post on Andrew’s blog simulating why weighting screws up estimates, after @Lauren first explained to me what was going on! I wrote it back when LaTeX was still working on the blog; the hosting he has now is not ideal.

https://statmodeling.stat.columbia.edu/2019/10/29/non-random-missing-data-weights-generative-model/