Weighted sample of a Gaussian mixture

Hi all,

I have data generated by a two-component Gaussian mixture. I want to estimate the mixing proportion.

y \sim \theta \mathcal{N}(\mu_1, \sigma_1^2) + (1 - \theta) \mathcal{N}(\mu_2, \sigma_2^2)

I can ‘label’ some of the datapoints, i.e., sample them and reveal which component they belong to.

I want to use both labeled and unlabeled data to estimate the mixing proportion. How can I correct for the fact that weighted sampling was used to pick the datapoints for labeling? Suppose, e.g., that the probability of picking datapoint i for labeling is

\exp(y_i) / \sum_j \exp(y_j)

I get to decide how the sampling weights are generated, so they are known exactly. N_unlabeled >> N_labeled, so I’m hoping I can just ignore the complication that sampling is done without replacement.

I’ve tried weighting the log likelihood as below. This kind of re-weighting worked for estimating the parameters of a single Gaussian from which a biased sample was taken, but it doesn’t seem to work for my Gaussian mixture.
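To spell out what the code below does with the labeled points, writing p_n for the sampling probability above and z_n for the observed label, the weighted term it adds to the target is

\sum_{n=1}^{N_\text{labeled}} \frac{\log p(y_n, z_n \mid \theta, \mu, \sigma)}{\tilde{w}_n}, \qquad \tilde{w}_n = p_n \, \frac{N_\text{labeled}}{\sum_m p_m}

where \log p(y_n, z_n \mid \cdot) is \log \theta + \log \mathcal{N}(y_n \mid \mu_2, \sigma_2^2) when z_n = 1 and \log(1 - \theta) + \log \mathcal{N}(y_n \mid \mu_1, \sigma_1^2) when z_n = 0.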

data {
    int<lower=0> N_unlabeled;
    int<lower=0> N_labeled;
    vector[N_unlabeled] y_unlabeled;
    vector[N_labeled] y_labeled;
    array[N_labeled] int<lower=0, upper=1> labels;
    vector[N_labeled] sampling_weight; // p(sampling)
}
transformed data {
    real<lower=0> weights_mult; // used for re-weighting
    weights_mult = N_labeled / sum(sampling_weight);
}
parameters {
    ordered[2] mu;
    vector<lower=0>[2] sigma;
    real<lower=0, upper=1> theta; 
} 
model {
    for (i in 1:2) {
        mu[i] ~ normal(0, 3);
    }
    sigma ~ cauchy(0, 5);

    for (n_unlab in 1:N_unlabeled) {
        // marginalize over the unknown component; theta weights component 2 (mu[2])
        target += log_mix(theta,
                          normal_lpdf(y_unlabeled[n_unlab] | mu[2], sigma[2]),
                          normal_lpdf(y_unlabeled[n_unlab] | mu[1], sigma[1]));
    }

    for (n_lab in 1:N_labeled) {
        // divide each labeled point's log density by its normalized sampling probability
        if (labels[n_lab] == 0) {
            target += (log1m(theta) + normal_lpdf(y_labeled[n_lab] | mu[1], sigma[1]))
                      / (sampling_weight[n_lab] * weights_mult);
        } else {
            target += (log(theta) + normal_lpdf(y_labeled[n_lab] | mu[2], sigma[2]))
                      / (sampling_weight[n_lab] * weights_mult);
        }
    }
}

I found the reversal of mu[1] and mu[2] between the two loops a bit confusing.

The code you wrote looks OK for the weighting. In what way is it failing? Did you simulate data and fail to recover the parameters? There’s an issue with recovering variances under weighting, I believe.
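If you haven’t, a quick check is to simulate from the generative process and see whether fitting recovers the truth. Here’s a minimal sketch (run with algorithm=fixed_param; N and the true parameter values passed in as data are placeholders, not values from your problem):

data {
  int<lower=0> N;
  real<lower=0, upper=1> theta;   // true mixing proportion
  ordered[2] mu;
  vector<lower=0>[2] sigma;
}
generated quantities {
  vector[N] y;
  array[N] int<lower=0, upper=1> z;
  vector[N] p_sample;
  for (n in 1:N) {
    z[n] = bernoulli_rng(theta);                       // z == 1 -> component 2, matching your model
    y[n] = normal_rng(mu[z[n] + 1], sigma[z[n] + 1]);
  }
  p_sample = softmax(y);                               // exp(y_i) / sum_j exp(y_j)
}

Then draw the labeled subset outside of Stan using p_sample as the selection probabilities and check whether theta comes back.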

You can also vectorize this:

mu ~ normal(0, 3);  // vectorize

I would put a tighter prior on sigma unless you expect it to sometimes take values in the 1000s.
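For example (the scale here is just a guess; pick one that matches the spread of your data):

sigma ~ normal(0, 2);  // half-normal, given the <lower=0> constraint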

You can use compound declare-and-define in transformed data:

real<lower=0> weights_mult = N_labeled / sum(sampling_weight);

I wrote a blog post on Andrew’s blog simulating why weighting screws up estimates, after @Lauren first explained to me what was going on! I wrote it back when LaTeX was still working on the blog; the hosting he has now is not ideal.

https://statmodeling.stat.columbia.edu/2019/10/29/non-random-missing-data-weights-generative-model/