Hi! I’m very new to Bayesian analysis and Stan (but 15 yrs as a software dev), and have a bunch of problems that ‘feel’ like a good fit. I’ve gone through about 150 pages of “Statistical Rethinking”, so I kind of understand the very basics… but taking a real-world data set and analyzing it is pushing my limits.
Any help understanding how to approach this problem (or the name of the problem I’m trying to solve), or resources/articles that address it, would be very appreciated.
Also, if there is a better place to ask this, let me know… I’m new to the community, so I chose the best-looking place.
Problem
Here’s the first problem I’d like to look at:
If a person attempts an activity X times, and they failed Y times, what’s the probability they will succeed (or fail) the next time?
The objective would be to penalize/down-rank(?) those who fail more over more attempts, without over-penalizing those who have only tried a few times.
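From my reading so far, this smells like a per-person beta-binomial update: put a Beta prior on each person’s success rate and update it with their counts, which automatically shrinks people with few attempts toward the prior. A minimal sketch in base R, where the Beta(1, 1) prior is just a placeholder I made up, not something fitted to my data:

```r
# Beta(1, 1) (uniform) prior on each person's success rate.
# These hyperparameters are placeholders, not fitted values.
a <- 1
b <- 1

# Posterior mean probability of succeeding on the next attempt,
# after `attempts` tries with `fails` failures.
posterior_success <- function(attempts, fails) {
  successes <- attempts - fails
  (a + successes) / (a + b + attempts)
}

posterior_success(1, 1)    # failed 1 of 1: ~0.33, pulled toward the prior, not 0
posterior_success(100, 50) # failed 50 of 100: 0.5, dominated by the data
```

Is that the right framing, and is there a principled way to pick (or fit) a and b from the whole data set?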
Sample Data Set

opportunities
 how many times they have tried before the outcome (in this case I included the sample instance (+1) to move off 0, but I don’t know if this is good/bad; open to changing if bad)
failures
 how many total times they have failed before the outcome
outcome
 what the outcome was: success or failure
opportunities failures outcome
1 0 success # Succeeded 1 of 1 time
1 1 failure # Failed 1 of 1 time
2 0 success # Succeeded 2 of 2 times
2 2 failure # Failed 2 of 2 times
10 2 failure # Failed 2 of 10 times
100 3 success
200 25 success
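For reference, the sample above as an R data frame (just the seven rows shown):

```r
train <- data.frame(
  opportunities = c(1, 1, 2, 2, 10, 100, 200),
  failures      = c(0, 1, 0, 2, 2, 3, 25),
  outcome       = factor(c("success", "failure", "success", "failure",
                           "failure", "success", "success"))
)
```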
The whole data set is highly skewed right, nearly ‘negative exponential’ looking, meaning:
 There are a lot of people that have only tried 1,2,3 times, and some that have tried up to 800 times
 Most people succeed most of the time; I think the avg fail rate is like 5-10%, maybe lower in this data set
 ‘Newer people’, people with fewer tries, generally fail more often (after failing so many times, they are provided fewer opportunities)
 Some people have 100+ tries, and succeeded every time, or failed only 1-2 times
Do I have to balance my dataset? I know other ML algos require this, but I didn’t think Bayes needed that.
What about scaling? I assumed that Bayes could just use counts.
Setup
I’m using R and the rstanarm package (library(rstanarm)).
After going through this article, it seems like neg_binomial_2 might be better (but I really don’t have any idea).
I assume I want to do something like this to maybe get a prior for a logistic (binomial(link = "logit")) regression (but again, I have no idea at this point):
model_bayes <- stan_glm(failures ~ opportunities,
                        data = train,
                        family = neg_binomial_2,
                        seed = 42)
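From what I’ve read, another option (I’m not sure it’s the right one for my data) is an aggregated binomial, where each row carries its own number of trials via a two-column cbind(successes, failures) response instead of modelling failures as a count outcome:

```r
library(rstanarm)

# Aggregated binomial sketch: model each person's success count out of
# their `opportunities` trials. Assumes `train` has the columns described
# above; intercept-only, so this estimates an overall success rate.
model_bayes_agg <- stan_glm(cbind(opportunities - failures, failures) ~ 1,
                            data = train,
                            family = binomial(link = "logit"),
                            seed = 42)
```

Would that be a more natural fit here than the negative binomial?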
Very Naive Logistic
Also threw this at the wall and… well, it doesn’t seem useful at all, so I figured I’d ask for help.
model_bayes_log <- stan_glm(result ~ opportunities + failures,
                            data = train2,
                            family = binomial(link = "logit"),
                            seed = 42)
model_bayes_log
# stan_glm
# family: binomial [logit]
# formula: result ~ opportunities + failures
# observations: 12263
# predictors: 3
# 
# Median MAD_SD
# (Intercept) 0.9 0.0
# opportunities 0.0 0.0
# failures 0.0 0.0
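Since the coefficients print as 0.0 at one decimal place, maybe they’re just small rather than exactly zero? I was planning to look at more digits and at predicted probabilities before giving up on it; something like this (using the model object above, with made-up histories in newdata):

```r
# Show the fit with more precision; "0.0" may just be rounding
summary(model_bayes_log, digits = 4)

# Posterior mean success probability for a few hypothetical histories
newdata <- data.frame(opportunities = c(1, 10, 100),
                      failures      = c(1, 2, 25))
colMeans(posterior_epred(model_bayes_log, newdata = newdata))
```

Is that the right way to sanity-check it, or is the model itself mis-specified for this problem?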