Finite mixture model where the sum of a group's characteristics is known

Hi, that’s an interesting problem! It might actually be more of an algorithms / computer science / operations research / combinatorics problem than a statistical one. I am honestly unsure how to best use the sum constraint; it sounds hard. The issue is that the structure enforced by the sum constraint can be very complex, and the intuition that a large capitalization is by itself unlikely is not fully supported. A few examples to show some of the complexity:

  1. w_g = \{2, 4, 6, 8, 10, 11\}, C_g = 13, n_1 = 2 - there is only one solution (and it involves the biggest number): \{2, 11\}
  2. w_g = \{1, 3, 4, 5, 7, 10\}, C_g = 15, n_1 = 3 - there are two completely disjoint solutions: \{3, 5, 7\} and \{1, 4, 10\}
  3. w_g = \{1, 1, 1, 1, 1, 1, 1, 1\}, C_g = 3, n_1 = 3 - there are 56 solutions (any subset of three will do)
  4. w_g = \{1, 1, 1, 1, 1, 1, 1, 1\}, C_g = 4, n_1 = 3 - there are no solutions
  5. w_g = \{1, 1, 1, 1, 1, 1, 1, 2\}, C_g = 4, n_1 = 3 - there are 21 solutions, all including the last element
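The solution counts above can be checked by brute force. Here is a minimal sketch (the function name `solutions` is mine) that enumerates index subsets of a fixed size whose weights sum to the target; enumerating indices rather than values means repeated weights count as distinct elements:

```python
from itertools import combinations

def solutions(w_g, C_g, n_1):
    """All index subsets of size n_1 whose weights sum to C_g."""
    return [s for s in combinations(range(len(w_g)), n_1)
            if sum(w_g[i] for i in s) == C_g]

print(len(solutions([2, 4, 6, 8, 10, 11], 13, 2)))  # 1
print(len(solutions([1, 3, 4, 5, 7, 10], 15, 3)))   # 2
print(len(solutions([1] * 8, 3, 3)))                # 56
print(len(solutions([1] * 8, 4, 3)))                # 0
print(len(solutions([1] * 7 + [2], 4, 3)))          # 21
```

This is exponential in the group size, of course, but fine for small groups.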

First, notice how even a small change in the input can drastically alter the solution space (this is very typical of NP-complete problems, of which this is an example).

Second, it is not very meaningful to speak about the probability of a single element being in the solution - you need to consider the whole solution as a unit. For example, in case 2 each element is part of exactly one of the two solutions (so in some sense has a 50% probability of being in the solution), but once you fix a single element, all the others are determined.
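To make that concrete, here is a small sketch computing marginal inclusion probabilities for case 2 by full enumeration. Every marginal comes out as 0.5, even though the joint structure is completely rigid once a single element is fixed:

```python
from itertools import combinations

# Case 2 from above: two disjoint solutions, {3, 5, 7} and {1, 4, 10}.
w_g, C_g, n_1 = [1, 3, 4, 5, 7, 10], 15, 3
sols = [s for s in combinations(range(len(w_g)), n_1)
        if sum(w_g[i] for i in s) == C_g]
# Fraction of solutions containing each element.
marginals = [sum(i in s for s in sols) / len(sols) for i in range(len(w_g))]
print(marginals)  # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```

The marginals are completely uninformative here; all the information lives in the dependence between elements.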

Technically you have an additional constraint on top of the subset-sum problem in that you also specify the number of elements, so it is not exactly subset sum, but I am quite confident (although I cannot prove it right now) that this does not change the problem fundamentally.

So what to do about this? If you can trust your source to have the valuations and sums consistent, you can try to enumerate all solutions for each subset via exhaustive search (most likely outside of Stan). If there are not too many, you can treat the response as a mixture over all the possible solutions (e.g. if you have k solutions, define a simplex[k] solution_probs that defines the mixture probabilities of each solution). Each solution then identifies the component for all data points at once. This will however be impractical if k is large - in that case, I would probably look whether there are some elements that are in all or no solutions, exclude those from the mixture, and then just ignore the constraint for the rest.
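The fallback at the end of the previous paragraph can be sketched as follows (again outside of Stan): enumerate the solutions, then take the intersection and union over them to find elements that are forced in (appear in every solution) or forced out (appear in none). Using case 5 above, the weight-2 element is forced in:

```python
from itertools import combinations

# Case 5 from above: seven 1s and one 2, target 4, three elements.
w_g, C_g, n_1 = [1, 1, 1, 1, 1, 1, 1, 2], 4, 3
sols = [set(s) for s in combinations(range(len(w_g)), n_1)
        if sum(w_g[i] for i in s) == C_g]
forced_in = set.intersection(*sols)                    # in every solution
forced_out = set(range(len(w_g))) - set.union(*sols)   # in no solution
print(forced_in)   # {7} - the index of the weight-2 element
print(forced_out)  # set()
```

The forced elements can then be assigned to their component deterministically, and the (now weaker) constraint on the remaining elements dropped.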

If you can’t trust the source to have this correct, that introduces additional complexity (one could e.g. look for solutions within some tolerance, but this would likely inflate the number of solutions considerably).
