Modeling event count data

I would like to use Stan to model daily multivariate count data of events per user of an application (e.g., number of reads, creates, updates). My end goal is to detect anomalous users, for example, a user who has 20 “create” events when the mean is 2 or 3.

There are a few constraints that make it challenging:

  1. I need a single probability to “score” a user as anomalous across 15-20 event types.
    This suggests to me that simple Poisson or negative binomial outputs won’t do. They would need to be combined in some way.
  2. There is some covariance structure between all the events I’m modeling (they’re not independent).
    This suggests to me that a multinomial output would not reflect the observed values.
  3. There exist “groups” of users where one group’s counts of events might not be anomalous, but the same counts would be flagged as an anomaly for another group.
    This suggests that a hierarchical model would be appropriate.

Does anyone have suggestions on how to model this?

I’d aim for a latent (unobserved) multivariate normal on the log scale: exponentiate it to get per-event intensities, model it further to add the covariance and the hierarchy, and then use a Poisson (or similar) likelihood to relate the intensities to the actual counts. You don’t really need to label users as anomalous at this stage. You need to produce some output that can be used for such scoring, but until you settle on the costs, consequences, and benefits of flagging users, you can’t really decide what the criteria should be.
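
To make that structure concrete, here is a minimal sketch in plain Python (not Stan, and not a fitted model). It assumes three event types, one shared covariance matrix, and a mean vector per user group. All numbers and names (`COV`, `GROUP_MU`, `log_predictive`) are made up for illustration; in a real Stan model they would be parameters to estimate. The single number it produces is a Monte Carlo estimate of the joint log probability of a user’s count vector, which is one possible input to a scoring rule.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L with L L^T == cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def draw_log_rates(rng, mu, L):
    """One draw of the latent log-intensity vector eta ~ MVN(mu, L L^T)."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]


def log_poisson_pmf(y, log_rate):
    """log Poisson(y | rate = exp(log_rate))."""
    return y * log_rate - math.exp(log_rate) - math.lgamma(y + 1)


def log_predictive(y, mu, L, rng, draws=4000):
    """Monte Carlo estimate of log p(y), averaging the Poisson likelihood over eta."""
    logs = []
    for _ in range(draws):
        eta = draw_log_rates(rng, mu, L)
        logs.append(sum(log_poisson_pmf(yk, ek) for yk, ek in zip(y, eta)))
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs) / draws)


# Hypothetical values for three event types: read, create, update.
COV = [[0.25, 0.10, 0.05],
       [0.10, 0.25, 0.10],
       [0.05, 0.10, 0.25]]
# Hypothetical group-level means on the log scale (the "hierarchy" part).
GROUP_MU = {
    "casual": [math.log(10.0), math.log(2.5), math.log(3.0)],
    "power": [math.log(40.0), math.log(15.0), math.log(12.0)],
}

L = cholesky(COV)
rng = random.Random(0)
for counts in ([10, 2, 3], [10, 20, 3]):
    for group, mu in GROUP_MU.items():
        # Lower values mean the counts are more surprising for that group.
        print(counts, group, round(log_predictive(counts, mu, L, rng), 2))
```

A lower value means the counts are more surprising for that group, so the same count vector can look ordinary in one group and unusual in another. Where to put a cutoff is a separate decision that depends on the cost of flagging.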

That sounds like a great approach. Thank you.

Always happy to give suggestions I don’t have to implement and check. I think this could be a great way to ID Twitter bots, but somehow I don’t have the time on my hands to do it, and the meat of the problem is in how you model the MVN anyway :)