# Finite mixture model where the sum of a group's characteristics is known

I am working on an unusual finite mixture model to try to tackle a historical source. The problem combines a two-component normal finite mixture with some additional information, and I am unsure how to incorporate that additional information into my model. I want both to estimate the means of the mixture components and to get posterior probabilities of which component each element belongs to. I’m posting here to beg for smart ideas.

I have a historical source which provides me with the following information: there are N companies indexed i divided into G groups. The companies are of two types z \in \{0,1\}, and associated with each company is an outcome y_{ig} for which I strongly believe that y_{ig}^{(1)} \sim N(\mu_1, \sigma_1) and y_{ig}^{(0)} \sim N(\mu_0, \sigma_0) with \mu_1 > \mu_0. In other words, a mixture of normals.

For each group g the source tells me how many companies are of type 0 (n_0) and how many are of type 1 (n_1). Thus within each group g \in G I know the mixing probability p_g = \frac{n_{1g}}{n_{0g} + n_{1g}} (really I’d rather say I have a very strong prior on \lambda_g \sim N(p_g, \sigma_p) as historical sources are error prone, and I’ll do that a bit further down).

So far, this is a straightforward finite-mixture problem with known mixing probabilities as is well-described in the user guide on finite mixtures as well as in the great case study by Betancourt on identifying mixture models. There is the small wrinkle of having a mixing parameter per group but that seems surmountable. In Stan I wrote this to look like:

```stan
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] prop; // prior proportions
}

parameters {
  ordered[2] mu;
  real<lower=0> sigma[2];
}

model {
  sigma ~ normal(0.1, 0.1);
  mu ~ normal(0.3, 0.1);
  for (n in 1:N) {
    target += log_mix(prop[n], // mixture probability
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
  }
}
```



and on simulated data this appears to work.
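For completeness, here is a minimal sketch of how such data could be simulated (in Python rather than the attached R script; the parameter values are purely illustrative, roughly matching the priors above):

```python
import random

random.seed(1)

# Illustrative parameter values only (not the real estimates)
mu = [0.2, 0.4]        # component means, mu[0] < mu[1]
sigma = [0.05, 0.05]   # component standard deviations
N = 200
prop = [0.35] * N      # per-observation weight on the first component

y, z = [], []
for n in range(N):
    zn = 0 if random.random() < prop[n] else 1  # pick a component
    z.append(zn)
    y.append(random.gauss(mu[zn], sigma[zn]))   # draw the outcome
```

Fitting the Stan program to `y` and `prop` should then recover `mu` and `sigma` reasonably well.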

However, the source also tells me the total capitalization of each type of company within each group. Thus if a company’s nominal market capitalization is w_{ig} the source reports \sum_{i} z_{ig} w_{ig} = C_g. In words, the source reports information like “for this list of 20 companies, 7 are good, 13 are bad, and the aggregate capitalization of the good companies is 2 million”. Clearly, this is useful additional information and there should be some way for me to exploit it!

For instance, in one case the source reports 1 good company and there is only one company for which w_{ig} = C_g, and therefore I can recover the vector of z_i's for that group. It seems reasonable to me that, given the vector \mathbf{w_g} and the constraint C_g, some elements w_{ig} are more likely to be part of the solution than others. For instance, a company whose capitalization exceeds the constraint is obviously excluded, and one that is large would eliminate lots of other solutions, thus making its inclusion unlikely. Googling suggests this is what computer scientists call the ‘subset sum’ problem, but so far that information hasn’t helped me find papers that estimate the probability that an element is a component of the solution.
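For small groups this "subset sum with fixed cardinality" can simply be enumerated, which also gives the empirical frequency with which each element appears across solutions. A brute-force sketch (variable names are made up, and this is only feasible when the number of size-n_1 subsets is manageable):

```python
from itertools import combinations

def inclusion_freqs(w, C, n1):
    """Enumerate all size-n1 index subsets of w summing exactly to C;
    return the number of solutions and, for each element, the fraction
    of solutions that include it."""
    sols = [s for s in combinations(range(len(w)), n1)
            if sum(w[i] for i in s) == C]
    freqs = [sum(i in s for s in sols) / len(sols) if sols else 0.0
             for i in range(len(w))]
    return len(sols), freqs
```

For example, with `w = [3, 5, 8, 9]`, `C = 12`, `n1 = 2`, the only solution is {3, 9}, so those two elements get inclusion frequency 1 and the others 0.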

One way I tried to tackle the problem was to create a length-N parameter \lambda_{ig} which I draw centered on the reported mixing probabilities p_g. I then add to the model the soft constraint C_g \sim N(\mathbf{\lambda_g}^T\mathbf{w_g}, \sigma_C), where I just pick a plausible value for \sigma_C. Basically, I am trying to get a better estimate of \lambda_{ig} by forcing it to approximately solve the constraints. My efforts are below:

```stan
data {
  int<lower=0> N;   // n obs
  int<lower=0> G;   // n groups
  int gs[G];        // group sizes, e.g. [20, 13, 22, ...]
  vector[N] y;      // outcomes
  vector[N] w;      // characteristics (capitalizations)
  vector[N] prop;   // prior proportions
  vector[G] C;      // group-level constraints
}

parameters {
  ordered[2] mu;
  real<lower=0> sigma[2];
  real<lower=0, upper=1> lambda[N];
}

model {
  int pos;                      // to cope with ragged group indexing
  sigma ~ normal(0.1, 0.1);     // small-variance prior
  mu ~ normal(0.3, 0.1);        // tight mean prior
  lambda ~ normal(prop, 0.1);   // truncated normal around the reported proportion
  pos = 1;                      // start indexing by position
  for (g in 1:G) {
    // the constraint is normally distributed around the weighted sum;
    // 0.1 is a hand-picked plausible value for sigma_C
    C[g] ~ normal(dot_product(to_vector(segment(lambda, pos, gs[g])),
                              segment(w, pos, gs[g])),
                  0.1);
    pos = pos + gs[g]; // update position
  }

  for (n in 1:N) { // estimate the mixture model for each observation
    target += log_mix(lambda[n], // mixture probability
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
  }
}

generated quantities {
  real p1[N]; // posterior probability that z_i = 1
  for (n in 1:N) {
    p1[n] = exp(log(lambda[n]) + normal_lpdf(y[n] | mu[1], sigma[1])
                - log_sum_exp(log(lambda[n]) + normal_lpdf(y[n] | mu[1], sigma[1]),
                              log1m(lambda[n]) + normal_lpdf(y[n] | mu[2], sigma[2])));
  }
}
```
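As a cross-check, the membership probability computed in `generated quantities` is just Bayes' rule over the two component densities, and can be reproduced outside Stan (a sketch with made-up parameter values):

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of N(mu, sigma) evaluated at y."""
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def p_component1(y, lam, mu, sigma):
    """P(observation belongs to component 1 | y), mirroring p1[n] above."""
    lp1 = math.log(lam) + normal_lpdf(y, mu[0], sigma[0])
    lp2 = math.log1p(-lam) + normal_lpdf(y, mu[1], sigma[1])
    m = max(lp1, lp2)  # stabilize the log-sum-exp
    return math.exp(lp1 - m) / (math.exp(lp1 - m) + math.exp(lp2 - m))
```

For a point exactly midway between two equal-variance components with `lam = 0.5`, this returns 0.5, as it should.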


This is not working as I wished it might (although perhaps working as I expected it would!), insofar as the additional constraint doesn’t really seem to result in the individual \lambda_{ig} parameters being more accurately estimated. The means and variances are estimated accurately:

But in general the individual \lambda_i parameters don’t really depart from their priors as you can see in the figure below where I plot some of the posterior mixing coefficients against the true z_i values:

As you can see, there isn’t a lot of variation in the values of \lambda_i, and they look pretty much like their prior. This makes me think maybe they are not identified, or I am not parameterizing the problem in a meaningful way, or there are so many ways to satisfy the constraint that it adds little information to the model. I thought perhaps I would need to draw z_{ig} given \lambda, but I don’t know how to do that in Stan, as it would seem to entail a discrete parameter.

Any ideas about how to incorporate this constraint information would be greatly appreciated! I’m also open to any broader thoughts, suggestions, or criticisms of how I am approaching the problem. (Stan and R code attached: bayes_mixture.R (1.7 KB), bayes_mixture.stan (1.4 KB).)


Hi, that’s an interesting problem. It might actually be more of an algorithms/computer science/operations research/combinatorial problem than a statistical one. I am honestly unsure how to best use the sum constraint; it sounds hard. The issue is that the structure enforced by the sum constraint can be very complex, and the intuition that a large capitalization by itself makes inclusion unlikely is not completely supported. A few examples to show some of the complexity:

1. w_g = \{2, 4, 6, 8, 10, 11\}, C_g = 13, n_1 = 2 - there is only one solution (and it involves the biggest number): \{2, 11\}
2. w_g = \{1, 3, 4, 5, 7, 10\}, C_g = 15, n_1 = 3 - there are two completely disjoint solutions: \{3, 5, 7\} and \{1, 4, 10\}
3. w_g = \{1, 1, 1, 1, 1, 1, 1, 1\}, C_g = 3, n_1 = 3 - there are 56 solutions (any subset of three will do)
4. w_g = \{1, 1, 1, 1, 1, 1, 1, 1\}, C_g = 4, n_1 = 3 - there are no solutions
5. w_g = \{1, 1, 1, 1, 1, 1, 1, 2\}, C_g = 4, n_1 = 3 - there are 21 solutions, all including the last element
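These counts are easy to verify by brute force (a quick sketch):

```python
from itertools import combinations

def count_solutions(w, C, n1):
    """Number of size-n1 index subsets of w whose values sum to C."""
    return sum(1 for s in combinations(range(len(w)), n1)
               if sum(w[i] for i in s) == C)

cases = [([2, 4, 6, 8, 10, 11], 13, 2),  # example 1
         ([1, 3, 4, 5, 7, 10], 15, 3),   # example 2
         ([1] * 8, 3, 3),                # example 3
         ([1] * 8, 4, 3),                # example 4
         ([1] * 7 + [2], 4, 3)]          # example 5
counts = [count_solutions(w, C, n1) for w, C, n1 in cases]
# counts == [1, 2, 56, 0, 21]
```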

First, notice how even a small change in the input can drastically alter the solution space (this is very typical of NP-complete problems, of which this is an example).

Second, it is not very meaningful to speak about the probability of a single element being in the solution - you need to consider the whole solution as a unit. For example, in case 2 each element is part of exactly one solution (so in some sense has a 50% probability of being in the solution), but once you choose a single element, all the others are determined.

Technically, you have an additional constraint on top of the subset-sum problem in that you also specify the number of elements, so it is not exactly the subset-sum problem, but I am quite confident (although I cannot prove it right now) that this does not change the problem fundamentally.

So what to do about this? If you can trust your source to have the valuations and sums consistent, you can try to enumerate all solutions for each group via exhaustive search (most likely outside of Stan). If there are not too many, you can treat the response as a mixture over all the possible solutions (e.g. if you have k solutions, define a simplex[k] solution_probs that defines the mixture probabilities of each solution). Each solution then identifies the component for all data points. This will, however, be impractical if k is large - in that case, I would probably look for elements that appear in all solutions or in none, exclude them from the mixture, and then just ignore the constraint.
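To make the mixture-over-solutions idea concrete: given k enumerated solutions, a group's marginal log likelihood is a log-sum-exp over the complete-data log likelihoods implied by each solution. A Python sketch with uniform solution weights standing in for the simplex (`mu`, `sigma`, and the data are placeholders):

```python
import math
from itertools import combinations

def normal_lpdf(y, mu, sigma):
    """Log density of N(mu, sigma) evaluated at y."""
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def group_log_lik(y, w, C, n1, mu, sigma):
    """log p(y) marginalized over all size-n1 subsets of w summing to C,
    with uniform weights over solutions (in Stan, a simplex parameter)."""
    sols = [set(s) for s in combinations(range(len(w)), n1)
            if sum(w[i] for i in s) == C]
    lps = []
    for s in sols:
        lp = -math.log(len(sols))  # uniform solution weight
        for i, yi in enumerate(y):
            comp = 1 if i in s else 0  # solution fixes each company's type
            lp += normal_lpdf(yi, mu[comp], sigma[comp])
        lps.append(lp)
    m = max(lps)  # log-sum-exp over solutions
    return m + math.log(sum(math.exp(lp - m) for lp in lps))
```

When there is exactly one solution, this reduces to the complete-data log likelihood with every company's type known.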

If you can’t trust the source to have this correct, this introduces additional complexity (one could e.g. look for solutions within some tolerance, but this would likely notably inflate the number of solutions).


Dear @martinmodrak , thanks so much for your thoughtful response! It’s been very helpful in clarifying my thinking around this problem.

You are quite right that (1) it is not that meaningful to think about the probability that an element is a component of a solution, as that cannot be assessed independently of all the other elements of the solution, and (2) my intuitions about what makes an element likely to be in the solution were quite wrong.

Your suggestion of getting the sets of solutions outside the model and putting a simplex on the set of solutions is very clever but I just don’t trust the source enough to take the constraints as hard rather than soft, and with soft constraints I think this will blow up too much.

In the end, I am moving towards a slight re-parameterization where I draw the mixing weights uniformly for each observation and then update them given the mixing proportions reported in the source. That way, in those cases where the source lets me determine a company’s type for sure outside the model, I just set the data for that observation to 0 or 1. So the mixing weights are parameterized:

\lambda_i \sim \text{Beta}(1, 1)
\text{prop}_i \sim N(\lambda_i, \sigma_{prop})

where prop is what the source tells me the mixing weights are, or what I can figure out from the source. Thus prop might look like [0, 1, 0.5, 0.5, 0.3, 0.3, 1, 0, 0, ...] depending on how I manage to translate the source into actual classifications given the constraint. Hopefully this lets me take advantage of the extra information where I can!
