Hi! I’m very new to Bayesian analysis and Stan (but have 15 yrs as a software dev), and I have a bunch of problems that ‘feel’ like a good fit. I’ve gone through about 150 pages of “Statistical Rethinking”, so I kind of understand the very basics… but taking a real-world data set and analyzing it is pushing my limits.
Any help understanding how to approach this problem (or even the name of the problem I’m trying to solve), or resources/articles that address it, would be very appreciated.
Also, if there is a better place to ask this, let me know… I’m new to the community, so I chose the best-looking place.
Here’s the first problem I’d like to look at:
If a person attempts an activity X times, and they failed Y times, what’s the probability they will succeed (or fail) the next time?
The objective would be to penalize/down-rank(??) those who fail more over more attempts, without over-penalizing those who had only tried a few times.
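As a sanity check on that objective, I tried the simplest closed-form version I could think of: treat each person’s success probability as Beta-distributed, start from a flat Beta(1, 1) prior, and update it with that person’s own counts. The posterior mean then shrinks people with few attempts toward the prior instead of over-penalizing them. (The function name and the flat prior are just my assumptions, not anything from a package.)

```r
# Posterior mean of a Beta-Binomial model with a Beta(a, b) prior.
# With the default flat prior Beta(1, 1), this is Laplace's
# "rule of succession": (successes + 1) / (attempts + 2).
next_success_prob <- function(attempts, failures, a = 1, b = 1) {
  successes <- attempts - failures
  (successes + a) / (attempts + a + b)
}

next_success_prob(1, 1)    # failed 1 of 1    -> 0.333, not dragged to 0
next_success_prob(2, 0)    # succeeded 2 of 2 -> 0.75, not inflated to 1
next_success_prob(100, 3)  # failed 3 of 100  -> ~0.961
```

This seems to do roughly what I want (someone who failed their only attempt isn’t ranked at 0%), but it treats every person independently, which is why I’m looking at regression-style models below.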
- opportunities - how many times they have tried before the outcome (in this case I included the sampled instance itself (+1) to move off 0, but I don’t know if this is good/bad - open to changing if bad)
- failures - how many times they have failed before the outcome
- outcome - what the outcome was: success or failure
```
opportunities failures outcome
1             0        success   # Succeeded 1 of 1 time
1             1        failure   # Failed 1 of 1 time
2             0        success   # Succeeded 2 of 2 times
2             2        failure   # Failed 2 of 2 times
10            2        failure   # Failed 2 of 10 times
100           3        success
200           25       success
```
The whole data set is highly skewed right - nearly ‘negative exponential’ looking - meaning:
- There are a lot of people that have only tried 1,2,3 times, and some that have tried up to 800 times
- Most people succeed most of the time - I think the avg fail rate is like 5-10%, maybe lower in this data set
- ‘Newer people’, people with fewer tries, generally fail more often (after failing so many times, they are provided fewer opportunities)
- Some people have 100+ tries, and succeeded every time, or failed only 1-2 times
Do I have to balance my dataset?? I know other ML algos require this, but I didn’t think Bayes needed that.
What about scaling? I assumed that Bayes could just use counts.
I’m using R with rstanarm’s stan_glm. After going through this article, it seems like neg_binomial_2 might be a better family (but I really don’t have any idea).
I assume I want to do something like this, maybe to get a prior for a logistic (binomial(link = "logit")) regression (but again, I have no idea at this point):
```r
model_bayes <- stan_glm(failures ~ opportunities, data = train,
                        family = neg_binomial_2, seed = 42)
```
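Another formulation I considered (and I’m not sure it’s right either): since my data are counts of successes and failures per person, model them directly as binomial using the two-column response that glm-style functions accept, rather than modeling failures as a negative-binomial outcome. This assumes my frame has one row per person with the `opportunities` and `failures` columns described above; everything else is my guess.

```r
library(rstanarm)

# Derive successes from the counts I already have.
train$successes <- train$opportunities - train$failures

# Intercept-only binomial model: cbind(successes, failures) is the
# standard two-column response for aggregated binomial data, so the
# model estimates one overall success probability on the logit scale.
model_bayes_bin <- stan_glm(cbind(successes, failures) ~ 1,
                            data = train,
                            family = binomial(link = "logit"),
                            seed = 42)
```

My (possibly wrong) understanding is that this at least uses the counts the way a binomial likelihood expects, instead of feeding `opportunities` in as a predictor.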
I also threw this at the wall and… well, it doesn’t seem useful at all, so I figured I’d ask for help.
```r
model_bayes_log <- stan_glm(result ~ opportunities + failures,
                            data = train2,
                            family = binomial(link = "logit"),
                            seed = 42)
model_bayes_log
# stan_glm
#  family:       binomial [logit]
#  formula:      result ~ opportunities + failures
#  observations: 12263
#  predictors:   3
# ------
#               Median MAD_SD
# (Intercept)   0.9    0.0
# opportunities 0.0    0.0
# failures      0.0    0.0
```