Predict outcome based on previous counts

Hi! I’m very new to Bayesian analysis and Stan (but have 15 yrs as a software dev), and I have a bunch of problems that ‘feel’ like a good fit. I’ve gone through about 150 pages of “Statistical Rethinking”, so I kind of understand the very basics… but taking a real-world data set and analyzing it is pushing my limits.

Any help understanding how to approach this problem (or even just the name of the problem I’m trying to solve), or resources/articles that address it, would be very appreciated.
Also, if there is a better place to ask this, let me know… I’m new to the community, so I chose the best-looking place.

Problem

Here’s the first problem I’d like to look at:

If a person attempts an activity X times and has failed Y times, what’s the probability they will succeed (or fail) the next time?

The objective would be to penalize/down-rank(??) those who fail more over more attempts, without over-penalizing those who have only tried a few times.

Sample Data Set

  • opportunities - how many times they have tried before the outcome (in this case I included the sample instance (+1) to move off 0, but I don’t know if this is good/bad - open to changing if bad)
  • failures - how many total times they have failed before the outcome
  • outcome - What was the outcome - success or failure
opportunities   failures  outcome
1               0         success   # Succeded 1 of 1 time
1               1         failure   # Failed 1 of 1 time
2               0         success   # Succeeded 2 of 2 times
2               2         failure   # Failed 2 of 2 times
10              2         failure   # Failed 2 of 10 times
100             3         success
200             25        success
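For concreteness, the sample above as an R data frame (column names taken from the table; a sketch, not my actual data):

# Sample data from the table above; outcome records the result of
# the attempt being predicted
train <- data.frame(
  opportunities = c(1, 1, 2, 2, 10, 100, 200),
  failures      = c(0, 1, 0, 2, 2, 3, 25),
  outcome       = factor(c("success", "failure", "success", "failure",
                           "failure", "success", "success"))
)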

The whole data set is highly skewed right - nearly ‘negative exponential’ looking - meaning:

  • There are a lot of people that have only tried 1,2,3 times, and some that have tried up to 800 times
  • Most people succeed most of the time - I think the avg fail rate is like 5-10%, maybe lower in this data set
  • ‘Newer people’, people with fewer tries, generally fail more often (after failing so many times, they are provided fewer opportunities)
  • Some people have 100+ tries, and succeeded every time, or failed only 1-2 times

Do I have to balance my dataset?? I know other ML algorithms require this - but I didn’t think Bayes needed that.

What about scaling? I assumed that Bayes could just use counts.

Setup

I’m using R, and the library(rstanarm) package.

After going through this article, it seems like neg_binomial_2 might be a better fit (but I really don’t have any idea).

I assume I want to do something like this to maybe get a prior for a logistic (binomial(link = "logit")) regression (but again I have no idea at this point):

model_bayes <- stan_glm(failures ~ opportunities,
                        data=train, 
                        family = neg_binomial_2,
                        seed=42)

Very Naive Logistic

Also threw this at the wall and… well, it doesn’t seem useful at all, so figured I’d ask for help.

model_bayes_log <- stan_glm(result ~ opportunities + failures,
                        data=train2,
                        family = binomial(link = "logit"),
                        seed=42)
model_bayes_log
# stan_glm
# family:       binomial [logit]
# formula:      result ~ opportunities + failures
# observations: 12263
# predictors:   3
# ------
#   Median MAD_SD
# (Intercept)   0.9    0.0   
# opportunities 0.0    0.0   
# failures      0.0    0.0   

Hi, to be able to help you, can you clarify a few things?

  • What is the “result” variable in the brms model that you specified?
  • Do you have only the aggregated data format that you describe in your post or do you have the time of when each opportunity and fail or success occurred?

In advance of understanding those details, my first intuitions are:

  • Start simple and leave the “learning” aspect for later. First test and explore a simple model.
  • The outcome you would want to model is actually the number of successes (total or as a function of time). The number of opportunities (or specific opportunities at specific time points) is either the denominator (logistic regression) or the offset (negative binomial or Poisson model) of your model. On top of this, every individual can be considered to have an intrinsic ability (talent, before the first opportunity), which could be captured by a random intercept per individual (being mindful of whether this is identifiable in a negative binomial model). I would start there.
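As a hedged sketch of the offset option described above, in rstanarm syntax (the column names successes and opportunities, and the per-person summary table agg, are placeholders, not names from the thread):

library(rstanarm)

# Count-model option: successes is the outcome, and the number of
# opportunities enters as an exposure offset on the log scale
m_counts <- stan_glm(successes ~ 1 + offset(log(opportunities)),
                     data = agg,  # hypothetical one-row-per-person table
                     family = neg_binomial_2,
                     seed = 42)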

Now, if in addition to the above you want to model some sort of reinforcement / learning as a function of the number of fails and successes, it would be useful to know in what order these occurred, and maybe even better, when exactly in time. Only that way will you be able to disentangle the two-way causality between the total number of tries (which depends on the number of successes so far?) and the number of successes at some point in time (which depends on the number of opportunities, conditional on the person’s ability). The person’s ability will be a latent (unobserved) variable that’s either fixed or increasing over time (assuming that practice doesn’t make you worse).

@LucC Thank you very much for the response. That gave me a ton to think about - sounds like my approach was off. I think I’m following you theoretically, but I’m struggling to connect the dots and construct the model/formula. Some responses below. Any thoughts on how to get the first/simplest model working would be incredible.

  • What is the “result” variable in the brms model that you specified?

It is just a factor of “Success” or “Failure”. It’s whether the ‘attempt’ was successful or not. (Sorry - I may have confused ‘outcome’ and ‘result’ - they mean the same thing)

outcome = factor(ifelse(test= outcome == "SUCCESS", yes="success", no="failure")),
  • Do you have only the aggregated data format that you describe in your post or do you have the time of when each opportunity and fail or success occurred?

I have very detailed data - that was just sample to keep things simple. I can easily get a timestamp/epoch seconds when the ‘event’ occurred. I can make something like this, where time is ‘epoch’(int) or whatever:

person_id   time   opportunities   failures  outcome
23          100    1               0         success
23          250    2               0         success
23          300    3               0         failure
23          325    4               1         success
23          350    5               1         success
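In case it helps, one base-R way to collapse this long format into one row per person (events is a hypothetical data frame shaped like the table above):

# One row per person: count rows (opportunities) and successes,
# then derive failures from the difference
agg <- aggregate(cbind(opportunities = 1,
                       successes = outcome == "success") ~ person_id,
                 data = events, FUN = sum)
agg$failures <- agg$opportunities - agg$successes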

The outcome you would want to model is actually the number of successes (total or as a function over time).

Whoa! I’ve been trying to predict the ‘result’ of their next attempt - but because some will fail 1 of every 3 times, and some 1 of every 10 or 50 times, my prediction accuracy and kappa are terrible (in other models).
That seems to make a lot of sense. Now… how to do that?!? lol.

Also over time would be potentially very useful too. Some people do the activity often, others less often.

The number of opportunities (or specific opportunities at specific time points) are either the denominator (logistic regression)

I don’t think I’m totally following here. By denominator, do you mean NumberOfSuccesses / NumberOfOpportunities? Is that for the probability? How do I integrate that into the model?

Or should the ‘formula’ be successes ~ opportunities?? But successes needs to be a 0/1 in a logistic.

I think I see where you’re going, but not sure how to connect the dots in code.
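For what it’s worth, a minimal sketch of the ‘denominator’ formulation in rstanarm: the response is supplied as two columns of counts via cbind(), not as a 0/1 (assuming a hypothetical table agg with per-person successes and failures counts):

library(rstanarm)

# Aggregated logistic regression: models successes out of
# (successes + failures) trials per row
m_binom <- stan_glm(cbind(successes, failures) ~ 1,
                    data = agg,  # hypothetical per-person summary table
                    family = binomial(link = "logit"),
                    seed = 42)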

every individual can be considered to have an intrinsic ability (talent, before the first opportunity)

That’s dead on! Think: we’re modeling whether a person shows up to their job and does a good job or not. In the ‘population’ there are responsible people and irresponsible people; those who do a good job and those who don’t. I’d imagine there would be changes/shifts to this over time for an individual.

How would a random intercept be built in to the model and associated with the person?
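In rstanarm that would be stan_glmer() with a (1 | person_id) term; a sketch, again assuming a hypothetical per-person table agg with successes and failures counts:

library(rstanarm)

# Random intercept per person: each person_id gets its own baseline
# log-odds of success, partially pooled toward the population mean
m_re <- stan_glmer(cbind(successes, failures) ~ (1 | person_id),
                   data = agg,
                   family = binomial(link = "logit"),
                   seed = 42)

# Per-person intercept estimates - the latent "ability" on the
# log-odds scale
ranef(m_re)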

model some sort of reinforcement / learning as a function of the number of fails and successes

I assume you mean that as a person does more, they learn and improve. Yes, that would be excellent - probably as step 3 or 4, though.

Only that way will you be able to disentangle the double-way causality…

I’m just barely following you… but I think I’m tracking. In my problem, it is possible that some people haven’t been given enough opportunities, so we don’t see their value yet. Or they had one or two failures early on, and that biases them moving forward - they aren’t given many more opportunities. (Hope that makes sense.)

The person’s ability will be a latent

If I understand ‘latent’ correctly, this is what I really want to ‘extract’ somehow. Then I can give more opportunities to those with ‘better abilities’ and to those whose ability I don’t know yet. Those with lower abilities I can allocate to ‘different areas’ where they pose less of a risk.