My data consists of words that can belong to one of several categories (let’s say two for simplicity, A and B, but there could be more). For each word, I have counts of how many times it appears in each category:
| word | A   | B   |
|------|-----|-----|
| w1   | 100 | 0   |
| w2   | 3   | 0   |
| w3   | 1   | 100 |
| w4   | …   | …   |
Now, my hypothesis is that there are three distinct classes: words that can only belong to A, words that can only belong to B, and words that can belong to both (call this class AB).
I can also predict the class assignment of each word with relatively good accuracy, based on several predictors.
Now, my problem is this: how can I properly handle words for which I have very low counts?
Something like a binomial model with counts and trials does not really seem to capture the idea that there are three discrete classes.
The issue with a categorical model is that, for training, I would have to assume that items like w2 in the example belong to class A, but I have no certainty about this.
Is there any way of including this type of uncertainty in a categorical model?
Are there other approaches I could look at for this particular problem?
You could also model this as A vs. everything else and B vs. everything else, no? Then a Binomial model would do?
Hm, maybe you are looking for the Multinomial distribution?
In the binary case you might also want to look at the Beta-Binomial family of models. This generalizes to the multivariate (multi-category) case as the Dirichlet-Multinomial. Maybe this helps with thinking in terms of (un)certainty about categories.
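For concreteness, a minimal sketch of the conjugate updating this implies, assuming a symmetric Dirichlet prior over each word’s category probabilities (the concentration `alpha = 1` is my choice for illustration):

```python
import numpy as np
from scipy import stats

alpha = 1.0                  # assumed symmetric Dirichlet prior concentration

# Counts for one word in categories A and B (like w2 above).
counts = np.array([3, 0])
# With a Dirichlet-Multinomial, the posterior over the word's category
# probabilities is again Dirichlet, with parameters alpha + counts.
print(stats.dirichlet(alpha + counts).mean())  # -> [0.8, 0.2], still quite uncertain

# The same word seen far more often pulls the posterior toward A.
print(stats.dirichlet(alpha + np.array([1000, 0])).mean())  # -> [~0.999, ~0.001]
```

Note this expresses uncertainty about a word’s category *proportions*, not directly about your three discrete classes.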
I am not sure this helps. The main issue I (think I) have is that I am not sure about the ground truth of the labels. If I have an item which appears 10 times in the corpus with A and 0 times in the corpus with B, I know that it must belong to either A or AB, and I believe it likely belongs to A. If instead of 10 to 0 it is 1000 to 0, then my belief that the item is in fact of class A and not AB is much stronger. However, these counts are not among the parameters I want to estimate; they are rather priors on my belief about the class assignment.
If I were certain about the labels, the model would be something like Class ~ predictor_1 + predictor_2 …, where predictor_1 has nothing to do with the counts.
Maybe? I want to be able to use those counts as priors on the probability of class assignment, with the particular property that I can have absolute certainty that an item belongs to class AB, but I can never have absolute certainty that an item belongs to class A or class B.
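To make my intuition concrete, here is a rough sketch of the kind of updating I have in mind; the assumption that an AB word lands in corpus A with some fixed probability `theta` per occurrence is purely for illustration:

```python
def p_class_A(n_A, theta=0.5, prior_A=0.5):
    """P(class = A | n_A occurrences in A, none in B).

    Assumes an A word can only appear in corpus A, while an AB word
    lands in A with probability theta per occurrence.
    """
    lik_A = 1.0              # class A always yields counts only in A
    lik_AB = theta ** n_A    # class AB yielding n_A hits in A and none in B
    return prior_A * lik_A / (prior_A * lik_A + (1 - prior_A) * lik_AB)

print(p_class_A(1))     # ~0.667: a single observation leaves real doubt
print(p_class_A(10))    # ~0.999: 10-to-0 already makes A quite likely
print(p_class_A(1000))  # ~1.0: numerically 1, but never exactly certain
```

This has exactly the asymmetry I mean: P(A) approaches but never reaches 1, whereas a single occurrence in both corpora would pin the class to AB with certainty.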
I’m still having difficulty thinking about this. As I understand it, it’s a series of binary “decisions” (for lack of a better term…).
If you observe at least one w_1 in A and at least one in B, then P(w_1 = AB) = 1 and P(w_1 = A) = P(w_1 = B) = 0. Like you said, you know the category with certainty.
If you observe at least one w_2 in A but zero occurrences in B, then P(w_2 = A) = p_A and, clearly, P(w_2 = B) = 0, because we have observed at least one w_2 in A. Thus, P(w_2 = AB) = 1 - p_A. The same holds if we switch the labels. You can estimate p_A (or p_B) with a Binomial or Beta-Binomial.
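In code, the rule might look like this; `p_A` and `p_B` are taken as given here (and could themselves depend on the counts via the Binomial/Beta-Binomial estimate):

```python
import math

def class_probs(n_A, n_B, p_A=0.8, p_B=0.8):
    """Return (P(A), P(B), P(AB)) for a word seen n_A times in A and n_B in B."""
    if n_A > 0 and n_B > 0:
        return 0.0, 0.0, 1.0          # seen in both: certainly class AB
    if n_A > 0:
        return p_A, 0.0, 1.0 - p_A    # seen only in A: class A or AB
    if n_B > 0:
        return 0.0, p_B, 1.0 - p_B    # seen only in B: class B or AB
    return math.nan, math.nan, math.nan  # never observed: no information

print(class_probs(100, 0))  # (0.8, 0.0, 0.2)
print(class_probs(1, 100))  # (0.0, 0.0, 1.0)
```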
I’m not sure how you can restrict a categorical model in this way, but I’m pretty sure you can hand-code something in Stan using the above approach.
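Not Stan, but here is a numpy sketch of the marginalization you would hand-code there (in Stan you would accumulate the same terms with `log_sum_exp` in the model block). The softmax class probabilities from the predictors and the Binomial split for AB words are assumptions for illustration:

```python
import numpy as np
from scipy.stats import binom

def log_lik_word(n_A, n_B, log_class_probs, theta=0.5):
    """log P(counts | predictors), marginalizing the latent class over {A, B, AB}.

    log_class_probs: log softmax probabilities of (A, B, AB) from the predictors.
    theta: assumed probability that an AB-word occurrence lands in corpus A.
    """
    n = n_A + n_B
    terms = []
    if n_B == 0:  # class A is only consistent with zero B counts
        terms.append(log_class_probs[0])
    if n_A == 0:  # class B is only consistent with zero A counts
        terms.append(log_class_probs[1])
    # Class AB is always consistent; Binomial(n, theta) models the A/B split.
    terms.append(log_class_probs[2] + binom.logpmf(n_A, n, theta))
    return np.logaddexp.reduce(terms)

# A word seen 3 times in A and never in B, with flat class probabilities.
print(log_lik_word(3, 0, np.log([1 / 3, 1 / 3, 1 / 3])))
```

Summing this over words gives a likelihood you can combine with your `Class ~ predictor_1 + predictor_2 …` structure by letting the predictors drive `log_class_probs`.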