I’m looking for any practical advice for a newbie. I have ran the same model formula in the frequentist context with python’s statsmodels package and in pystan on the same data. I took the posterior samples and chose the sample that resulted in the smallest sum of squared errors.
I compared those predictions using the coefficients from that specific sample to the frequentist model and the frequentist model was a lot more accurate.
Will anyone give practical advice on how to diagnose why the Bayesian framework produced worse results?
data {
int<lower=0> N;
vector[N] BIlmt_model2;
vector[N] multi_policy_count_model;
vector[N] unit_value_model2;
vector[N] yrs_owned_model2;
vector[N] class_model2_Sport;
vector[N] class_model2_Street;
vector[N] class_model2_other;
vector[N] Single_Driver_Age_model;
vector[N] Driver_Age_model;
int<lower=0> y1[N];
}
parameters {
real BIlmt_coeff;
real unit_value_model2_coeff;
real yrs_owned_model2_coeff;
real class_model2_Sport_coeff;
real class_model2_Street_coeff;
real class_model2_other_coeff;
real Single_Driver_Age_model_coeff;
real Driver_Age_model_coeff;
real intercept; // parameter for mean or intercept
}
model {
BIlmt_coeff ~ normal(0, 1);
unit_value_model2_coeff ~ normal(0, 1);
yrs_owned_model2_coeff ~ normal(0, 1);
class_model2_Sport_coeff ~ normal(0, 1);
class_model2_Street_coeff ~ normal(0, 1);
class_model2_other_coeff ~ normal(0, 1);
Single_Driver_Age_model_coeff ~ normal(0, 1);
Driver_Age_model_coeff ~ normal(0, 1);
intercept ~ normal(0,5);
y1 ~ poisson_log(BIlmt_model2*BIlmt_coeff +
yrs_owned_model2*yrs_owned_model2_coeff +
class_model2_Sport*class_model2_Sport_coeff +
class_model2_Street*class_model2_Street_coeff +
class_model2_other*class_model2_other_coeff +
Single_Driver_Age_model*Single_Driver_Age_model_coeff +
Driver_Age_model*Driver_Age_model_coeff +
intercept);
}
Then I go through with my new data like so.
sm_fit = sm_liab.sampling(data=data1, warmup=500, iter = 1000, verbose=True, n_jobs = 1)
my_samples = sm_fit.to_dataframe()
#make arrays of coeffs
BIlmt_coeff = my_samples['BIlmt_coeff']
unit_value_model2_coeff = my_samples['unit_value_model2_coeff']
yrs_owned_model2_coeff = my_samples['yrs_owned_model2_coeff']
class_model2_Sport_coeff = my_samples['class_model2_Sport_coeff']
class_model2_Street_coeff = my_samples['class_model2_Street_coeff']
class_model2_other_coeff = my_samples['class_model2_other_coeff']
Single_Driver_Age_model_coeff = my_samples['Single_Driver_Age_model_coeff']
Driver_Age_model_coeff = my_samples['Driver_Age_model_coeff']
intercept = my_samples['intercept']
#make array of new 'test' values
BIlmt_model2_new = np.asarray(x_test['BIlmt_model2'])
multi_policy_count_model_new = np.asarray(x_test['multi_policy_count_model'])
unit_value_model2_new = np.asarray(x_test['unit_value_model2'])
yrs_owned_model2_new = np.asarray(x_test['yrs_owned_model2'])
class_model2_Sport_new = np.asarray(x_test['class_model2_Sport'])
class_model2_Street_new = np.asarray(x_test['class_model2_Street'])
class_model2_other_new = np.asarray(x_test['class_model2_other'])
Single_Driver_Age_model_new = np.asarray(x_test['marital_status_model_S:Driver_Age_model'])
Driver_Age_model_new = np.asarray(x_test['Driver_Age_model'])
posterior = []
for i in range(num_samples):
y_new = np.exp(intercept[i] + BIlmt_model2_new*BIlmt_coeff[i] +
unit_value_model2_new*unit_value_model2_coeff[i] +
yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
class_model2_Sport_new*class_model2_Sport_coeff[i] +
class_model2_Street_new*class_model2_Street_coeff[i] +
class_model2_other_new*class_model2_other_coeff[i] +
Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +
Driver_Age_model_new*Driver_Age_model_coeff[i])
posterior.append(y_new)
#join to actual results (y_test)
final = y_test.join(pd.DataFrame(posterior, columns=y_test.index).T)
#find lowest SSE
sse = []
for column in columns:
final['error'] = (final['frequency'] - final[column])**2
error_squared = final['error'].sum()
#print(final[column].name, final['error_sq
uared'])
sse.append((column, error_squared))
min(sse, key= lambda t: t[1])
#merge to test data and compare
test_4 = test_4.merge(final[713], left_index = True, right_index = True, how = 'left')
That script goes through and finds sample 713 has the lowest SSE out of 2000 samples. Then i compare it to the frequentist model and the actual by sorting the test data by the ratio (old model/new model) of the two models, breaking up the data into deciles and see how the actual lines up. That looks like the below:
First of all, you should check if your model ran correctly. This is the workflow that is usually recommended. If the usual diagnostics are not well, there is (usually) no point in taking the model results seriously.
Second, are you sure you are running and comparing the same models (apart from priors)? With such simple models as GLMs are, Bayesian and Frequentist solutions are usually almost identical – especially when you have a lot of data and the relevance of the priors is negligible.
Third, what’s your reasoning behind something like that:
That script goes through and finds sample 713 has the lowest SSE out of 2000 samples.
I’m pretty sure that’s something no one should ever do, tbh… But I might be missing a point here.
In a Bayesian analysis you’d want to look at the whole predictive distribution (of which the whole collection of posterior draws is an “estimate”, loosely speaking).
Thanks so much for the reply. I will check out this link tonight. When I stated that I ran the same model, I meant the model formula was the same. The data is the same and the testing data is the same.
I took out sample 713 because in order to put something like this into production, I do need point estimates for the coefficients…a distribution, for production won’t work. Or I should say I can’t conceptualize how it will work. I realize this is counterintuitive to the Bayesian methodology but for production, I have to give point estimate coefficients.
That said, I want to learn Bayesian because the distributions in my opinion gives an easier explanation for the model.
Oh! You can still give point estimates from the whole collection of posteriors. And get the uncertainty intervals around those. That’s a better workflow to go after.
Also for your priors it would be nice to see what the justification is for each one. Right now each prior says you expect the mean to be around zero but it could also be slightly positive or slightly negative influence on y1.
Yes the mean! Sometimes I need to report out a point estimate for stakeholders to use for management actions. But I always caution them that they really need to use the uncertainty intervals as well.
Alright. I thought the model should take a form of a + x_1B + x_2B...xnB where a is the intercept. I removed and rerunning now. I also downloaded the stan utilitiy package to do the diagnostics @Max_Mantei git page references.
Ah okay. Yes, but for a Bayesian approach you wouldn’t just get one (arbitrary) draw from the posterior, but rather other summaries of the posterior predictive distribution. For example taking the mean over all posterior predictions (for any given new observation) as a summary would minimize squared loss in terms of decision theory. Taking the median of the same distribution would minimize absolute loss, etc… In any case you want your single number, that you use in production, summarize the entire distribution of predictions of any given new observation. Does that make sense?
Yes I think I’m starting to follow. Are there resources out there on putting a Stan model into production? Maybe, instead of taking point draw from the samples for point estimates for the coefficients, I just pull the mean posterior prediction from all of the draws and let that feed into a decision.
Hey @Jordan_Howell, I’d consider taking out all of your priors for coefficients and the intercept and running your model without. You presently have informative priors which suggest that the mean of all your coefficients is zero, that they are symmetrical and that they have unit variance. This may shape the fitting considerably. The frequentist alternative would operate in effect with flat priors (to my understanding), so this could be an important difference.
With well beyond 100,000 observations the prior would likely not make a huge difference. Also, centering the prior at zero is not so bad imo, especially if the goal is prediction - it’s basically encoding shrinkage, or the belief that coefficients are small (which they usually are in a Poisson GLM) - think of a special case of Ridge Regression.
@Jordan_Howell I’ll have to look it up, but I think there were 1-2 talks in recent StanCons about using Stan in production. Not sure, if it’s helpful to your case. I’ll post a link later, when I find it.