Over time, I have realized that the Wallenius non-central hyoergeometric distribution is not the correct distribution for my needs. The reason is: I do not believe the probability of selection depends upon the number of items available.

An example that I constructed that enumerates the type of problem I am trying to solve:

Assume we have c different t-shirts, each with a different probability of being purchased, p_i, i=1...c. For each t-shirt, we have m_i, i=1...c copies of it.

Now, a customer is shown all of the t-shirts we have in stock (that is, each t-shirt where m_i > 0). They purchase one according to the probabilities of the shirts that are available. Then this t-shirt is selected, m_i is decremented by one and the next customer is shown the t-shirts.

This example that I listed seems to be a more “categorical sampling at each realization” where the probabilities only change if we “stick out” of an item.

Does this correspond to a Fisher non-central hypergeometric distribution? If not, does anyone know of a non-central hypergeometric distribution that satisfies this type of sampling?

If so, and c is large, does anyone have insight as to how to compute the denominator (the P_0) in the Fisher distribution?

I am a bit puzzled as what you describe sounds exactly as simple sampling without replacement which was already suggested in the original thread (Constraining Output of Multinomial Distribution - #3 by LucC). Could you explain once again why that would not work for you?

Hey Martin. The difference is that the probability of selection is different across the items, which is not accounted for in the multivariate hypergeometric distribution – there, each probability is uniform and only depends upon the number of “balls in the urn.” For my case, the probabilities do not depend upon the number of items total, only that there is at least one item, but they do differ across the items.

So the way I am modeling it is now to use the Wallenius noncentral hypergeometric distribution but under the assumption that there is a latent probability for each “ball”, p_i, i=1,...,k.. Then we define the “weight” of the distribution as omega_i = p_i / m_i, where m_i denotes the number of balls of color i,
so that the probability of drawing a ball is not dependent upon the number of items in the urn.

I don’t know if this is correct, but I have been unsuccessful in trying to derive the likelihood for a “sequential categorical distribution.”

Again, if anyone has any thoughts on this, I would be super appreciative!

The problem is that the order in which the t-shirts were chosen matters, so say we have three categories where the max counts are (1,5,5) then for example

So for each category where the count was exhausted you need to include a term for each possible position in the sequence when the last item was taken AND each possible split of the other items taken before and after this position. I don’t see any easy way how to sum that without explicitly enumerating all the terms… Maybe there is a way to express this in terms of some hypergeometric function (which I don’t really understand, but they crop up all over the place in this region of stats :-) , but those are themselves AFAIK usually not very tractable computationally anyway… A hint to the difficulty is that the Fisher’s noncentral hypergeometric distribution, which is close, but intuitively easier than what you have requires explicitly iterating over the whole support (or evaluating a hypergeometric function)

I presume you don’t have access the disaggegated data as the order in which the t-shirts were purchased can carry a significant amount of information (in the extreme when everything available is sold, the total count carries no information on the relative preference while the order contains all the information)?

Yes this is all totally correct @martinmodrak ! I thought by trying to “aggregate” my data (and not having individual sales) I would “help” the sampling.

Since then, I just tried out doing inference for the model you wrote down. That is, for each sale, I know which T-shirts I have in stock, and which one was selected.

So my model finds weights omega of size num_unique_tshirts where the selection probability is

softmax(omega[indicies_where_we_have_tshirts])

My data is: number of unique t-shirts (call it N) and for each selection event

which T-shirts I have (actually the indicies of the “unique T-shirts list” )

if there are k T-shirts at this selection, then I know which T-shirt is chosen (the index \in {1,...,k})

And the likelihood looks like:

for (selection in 1:number_of_selection_events) {
target += categorical_logit_lupmf(selected_index[n] | omega[indicies_for_tshirts_available[n])

but in reality I pad the vectors (at the end) to use reduce_sum so indicies_for_tshirts_available is the same size for each n

The model fits super fast (using reduce_sum and GPU on my Macbook I fit the entire model in under 10 minutes with over 50 million selection events and about 1000 unique T-shirts).

An issue I am encountering though: there are some T-shirts that are basically never chosen, and no prior that I have come up with (as yet) will set the omegas for these T-shirts small enough. (since they are never selected, their prior weight is basically not updated, and that pulls all of the other probabilities down).

If you have any insight as to what priors to try here (I don’t want to “tell” the model with T-shirts were never chosen!) that would be fantastic!

That’s the “fun” with modern computers - quite often doing a clever thing will have worse performance than doing a simple thing. Doing a lot of identical simple operations is something the modern hardware is just very good at. ¯_(ツ)_/¯

That sounds weird - not being selected should IMHO pull the weights down pretty quickly. If we look at just a binary model where either a specific t-short is selected or any other t-shirt is selected and ignore any covariates, we have a model that has analytical solution (Beta distribution - Wikipedia). Assuming a uniform prior, the posterior for the probability of being selected after never being selected in k trials should be \mathrm{Beta}(1, k + 1), i.e. it should go down relatively rapidly with increasing k.

So maybe there is something problematic going on in your model. One way to diagnose would be to just use no predictors and two t-shirt categories where both are always available and see if that matches the analytic solutions (moduolo any differences in priors) - but that probably warrants a new forum thread.