Solving a biased urn problem without hypergeometric distribution?


So my problem is this:

I have 40 balls of different weights (we can assume each weight is unique). The weights are the probability I draw that ball during a particular draw. I sequentially draw 30 balls without replacement. I want to know the probability of drawing each ball at every position (e.g. ball A has a 50% chance of being drawn first, a 30% chance of being drawn second, a 20% chance of being drawn third, and a 0% chance of being drawn anywhere else). I do not know the weights of the balls, but I do have a best-guess predicted draw order for them (ball A, first; ball B, third; ball C, second; etc.).

As far as I am aware, the categorical family will not work because I am sampling without replacement. Wallenius' noncentral hypergeometric distribution would probably be the most accurate, but I don’t believe it is implemented in Stan or brms, although some work has been done along those lines. Is there a way for me to solve this problem and calculate these probabilities by transforming the data in some way, or by a novel application of one of the existing distribution families?

Here’s a toy version of the dataset: Trostle_toy_data_set - Sheet1.csv (280 Bytes). What I’m really after is the probabilities predicted by estimated_pick, and I’d be happy with any solution so long as it’s valid and extendable.

I don’t understand what data you have and what you want to estimate.

Do you mean that each “weight” is actually a vector of length 40, or that at each draw we normalize over the weights of the remaining balls? In the latter case, this:

is impossible.
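To make the second reading concrete, here is a minimal simulation sketch. It assumes that each draw picks one of the remaining balls with probability proportional to its weight, which is my interpretation rather than something the question confirms. The function name `simulate_position_probs` and the four weights are made up for illustration; they are not taken from the toy dataset.

```python
import random


def simulate_position_probs(weights, n_draws, n_sims=20000, seed=0):
    """Estimate P(ball i is drawn at position k) by simulation.

    Assumption: at every draw, the weights of the balls still in the
    urn are renormalized and one ball is picked accordingly.
    """
    rng = random.Random(seed)
    n = len(weights)
    counts = [[0] * n_draws for _ in range(n)]
    for _ in range(n_sims):
        remaining = list(range(n))
        for k in range(n_draws):
            w = [weights[i] for i in remaining]
            pick = rng.choices(remaining, weights=w, k=1)[0]
            counts[pick][k] += 1
            remaining.remove(pick)  # sampling without replacement
    return [[c / n_sims for c in row] for row in counts]


# Hypothetical weights for four balls, of which three are drawn.
probs = simulate_position_probs([5.0, 3.0, 1.5, 0.5], n_draws=3)
for ball, row in enumerate(probs):
    print(ball, [round(p, 3) for p in row])
```

Each column of the result sums to 1, because exactly one ball is drawn at each position. Under this assumption, any ball with positive weight has a nonzero chance of landing at every position.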

Do you mean that you want to find weights consistent with estimated_pick, and you use no other data? The likelihood is maximized by making the weight of the 39th pick arbitrarily large compared to that of the 40th, then making the weight of the 38th arbitrarily large compared to the 39th, and so forth. That seems like it can’t be what you want, and your toy data apparently includes an observation of a single realization from the process. Do you ultimately observe one realization? Many? And how do you want to weight this information relative to whatever information is encoded in estimated_pick?
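The degeneracy can be seen numerically with a small sketch. It again assumes that each draw renormalizes over the remaining balls, and it treats estimated_pick as if it were a single observed ordering. The helper `ordering_log_lik` and the evenly spaced log-weights are illustrative assumptions only.

```python
import math


def ordering_log_lik(log_weights_in_pick_order):
    """Log-probability of drawing the balls in exactly the given order.

    Assumption: at each draw, a ball is picked with probability
    proportional to its weight among the balls not yet drawn.
    """
    lw = log_weights_in_pick_order
    total = 0.0
    for k in range(len(lw)):
        rest = lw[k:]  # balls still in the urn at draw k
        m = max(rest)
        # log-sum-exp of the remaining log-weights, computed stably
        log_norm = m + math.log(sum(math.exp(x - m) for x in rest))
        total += lw[k] - log_norm
    return total


# Five balls; each pick's log-weight exceeds the next pick's by `gap`.
for gap in (0.0, 1.0, 5.0, 20.0):
    log_w = [-gap * k for k in range(5)]
    print(gap, ordering_log_lik(log_w))
```

With equal weights the log-likelihood is the log of 1/5!, and it climbs toward 0 as the gap grows. So with no other data there is no finite maximum, only ever more separated weights.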