My model is not as predictive as my GLM....I‘M not sure why


I’m looking for any practical advice for a newbie. I have ran the same model formula in the frequentist context with python’s statsmodels package and in pystan on the same data. I took the posterior samples and chose the sample that resulted in the smallest sum of squared errors.

I compared those predictions using the coefficients from that specific sample to the frequentist model and the frequentist model was a lot more accurate.

Will anyone give practical advice on how to diagnose why the Bayesian framework produced worse results?

Can you post the model code, data (or a subset of the date), and the code you used to run the model? That will help in giving some advice.

Sure. Here is my model.

data {
    int<lower=0> N;
    vector[N] BIlmt_model2;
    vector[N] multi_policy_count_model;
    vector[N] unit_value_model2;
    vector[N] yrs_owned_model2;
    vector[N] class_model2_Sport;
    vector[N] class_model2_Street;
    vector[N] class_model2_other;
    vector[N] Single_Driver_Age_model;
    vector[N] Driver_Age_model;
    int<lower=0> y1[N];
parameters {            
    real BIlmt_coeff;      
    real unit_value_model2_coeff;
    real yrs_owned_model2_coeff;
    real class_model2_Sport_coeff;
    real class_model2_Street_coeff;
    real class_model2_other_coeff;
    real Single_Driver_Age_model_coeff;
    real Driver_Age_model_coeff;
    real intercept;        // parameter for mean or intercept   
model {
    BIlmt_coeff ~ normal(0, 1);  
    unit_value_model2_coeff ~ normal(0, 1); 
    yrs_owned_model2_coeff ~ normal(0, 1); 
    class_model2_Sport_coeff ~ normal(0, 1); 
    class_model2_Street_coeff ~ normal(0, 1); 
    class_model2_other_coeff ~ normal(0, 1); 
    Single_Driver_Age_model_coeff ~ normal(0, 1); 
    Driver_Age_model_coeff ~ normal(0, 1); 
    intercept ~ normal(0,5);
    y1 ~ poisson_log(BIlmt_model2*BIlmt_coeff + 
			yrs_owned_model2*yrs_owned_model2_coeff + 
                  	class_model2_Sport*class_model2_Sport_coeff + 
			class_model2_Street*class_model2_Street_coeff +
                  	class_model2_other*class_model2_other_coeff + 
                  	Single_Driver_Age_model*Single_Driver_Age_model_coeff + 
                  	Driver_Age_model*Driver_Age_model_coeff + 

Then I go through with my new data like so.

sm_fit = sm_liab.sampling(data=data1, warmup=500, iter = 1000,  verbose=True, n_jobs = 1)

my_samples = sm_fit.to_dataframe()

#make arrays of coeffs

BIlmt_coeff = my_samples['BIlmt_coeff']
unit_value_model2_coeff = my_samples['unit_value_model2_coeff']
yrs_owned_model2_coeff = my_samples['yrs_owned_model2_coeff']
class_model2_Sport_coeff = my_samples['class_model2_Sport_coeff']
class_model2_Street_coeff = my_samples['class_model2_Street_coeff']
class_model2_other_coeff = my_samples['class_model2_other_coeff']
Single_Driver_Age_model_coeff = my_samples['Single_Driver_Age_model_coeff']
Driver_Age_model_coeff = my_samples['Driver_Age_model_coeff']
intercept = my_samples['intercept']
#make array of new 'test' values

BIlmt_model2_new = np.asarray(x_test['BIlmt_model2'])
multi_policy_count_model_new = np.asarray(x_test['multi_policy_count_model'])
unit_value_model2_new = np.asarray(x_test['unit_value_model2'])
yrs_owned_model2_new = np.asarray(x_test['yrs_owned_model2'])
class_model2_Sport_new = np.asarray(x_test['class_model2_Sport'])
class_model2_Street_new = np.asarray(x_test['class_model2_Street'])
class_model2_other_new = np.asarray(x_test['class_model2_other'])
Single_Driver_Age_model_new = np.asarray(x_test['marital_status_model_S:Driver_Age_model'])
Driver_Age_model_new = np.asarray(x_test['Driver_Age_model'])

posterior = []
for i in range(num_samples):
    y_new = np.exp(intercept[i] + BIlmt_model2_new*BIlmt_coeff[i] + 
                                    unit_value_model2_new*unit_value_model2_coeff[i] +
                                    yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
                                    class_model2_Sport_new*class_model2_Sport_coeff[i] + 
                                    class_model2_Street_new*class_model2_Street_coeff[i] +
                                    class_model2_other_new*class_model2_other_coeff[i] +
                                    Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +

#join to actual results (y_test)
final = y_test.join(pd.DataFrame(posterior, columns=y_test.index).T)
#find lowest SSE

sse = []

for column in columns:
    final['error'] = (final['frequency'] - final[column])**2
    error_squared = final['error'].sum()
    #print(final[column].name, final['error_sq

    sse.append((column, error_squared))
min(sse, key= lambda t: t[1])

#merge to test data and compare
test_4 = test_4.merge(final[713], left_index = True, right_index = True, how = 'left')

That script goes through and finds sample 713 has the lowest SSE out of 2000 samples. Then i compare it to the frequentist model and the actual by sorting the test data by the ratio (old model/new model) of the two models, breaking up the data into deciles and see how the actual lines up. That looks like the below:

As you can see above, the actual frequency lines up with my GLM model (model3_preds) and not the bayesian results 713.

I’m new to running this on my real world data so I’m looking to better my model ‘debugging’.

Hi! Glad to see that your model ran! :)

First of all, you should check if your model ran correctly. This is the workflow that is usually recommended. If the usual diagnostics are not well, there is (usually) no point in taking the model results seriously.

Second, are you sure you are running and comparing the same models (apart from priors)? With such simple models as GLMs are, Bayesian and Frequentist solutions are usually almost identical – especially when you have a lot of data and the relevance of the priors is negligible.

Third, what’s your reasoning behind something like that:

That script goes through and finds sample 713 has the lowest SSE out of 2000 samples.

I’m pretty sure that’s something no one should ever do, tbh… But I might be missing a point here.
In a Bayesian analysis you’d want to look at the whole predictive distribution (of which the whole collection of posterior draws is an “estimate”, loosely speaking).

1 Like

Thanks so much for the reply. I will check out this link tonight. When I stated that I ran the same model, I meant the model formula was the same. The data is the same and the testing data is the same.

I took out sample 713 because in order to put something like this into production, I do need point estimates for the coefficients…a distribution, for production won’t work. Or I should say I can’t conceptualize how it will work. I realize this is counterintuitive to the Bayesian methodology but for production, I have to give point estimate coefficients.

That said, I want to learn Bayesian because the distributions in my opinion gives an easier explanation for the model.

Oh! You can still give point estimates from the whole collection of posteriors. And get the uncertainty intervals around those. That’s a better workflow to go after.

Also for your priors it would be nice to see what the justification is for each one. Right now each prior says you expect the mean to be around zero but it could also be slightly positive or slightly negative influence on y1.

Thank you.

Are you referencing the ‘mean?’ The mean output of my bayesian model was not close to the frequentist GLM either. Or are you referencing another way?

Yes the mean! Sometimes I need to report out a point estimate for stakeholders to use for management actions. But I always caution them that they really need to use the uncertainty intervals as well.

Can you post part of the summary of the model?

I apologize as I missed the second part of your follow up. The mean comes because I’m uniformed.

Summary in two windows as I’m pulling off of an enterprise GIT and it’s not fitting the screen.


Here is the frequentist model. It’s hard to see but the intercept is on the top unlike my bayesian model where it’s on the bottom.


Well to start with I’d remove

intercept ~ normal(0,5);


+ intercept

from your model and run it for 2000 iterations. And run the diagnostics from the workflow that @Max_Mantei pointed out.

Thanks. Can you explain why by chance?

Yes, it simplifies things a bit and it looks a bit odd to me since it’s additive according to your model code. My intercepts usually take the form:

model {
beta[1] ~ some_prior_goes here;
other modely stuff goes here

Hopefully someone else can jump in a give a better explanation.

Alright. I thought the model should take a form of a + x_1B + x_2B...xnB where a is the intercept. I removed and rerunning now. I also downloaded the stan utilitiy package to do the diagnostics @Max_Mantei git page references.

1 Like

Oh yeah, sorry! Your model might need to take that form. I am not as familiar with these types. It may not run right.

Ah okay. Yes, but for a Bayesian approach you wouldn’t just get one (arbitrary) draw from the posterior, but rather other summaries of the posterior predictive distribution. For example taking the mean over all posterior predictions (for any given new observation) as a summary would minimize squared loss in terms of decision theory. Taking the median of the same distribution would minimize absolute loss, etc… In any case you want your single number, that you use in production, summarize the entire distribution of predictions of any given new observation. Does that make sense?

Yes I think I’m starting to follow. Are there resources out there on putting a Stan model into production? Maybe, instead of taking point draw from the samples for point estimates for the coefficients, I just pull the mean posterior prediction from all of the draws and let that feed into a decision.

What do you mean by production here? As in getting the results into the hands of a manager? Or informing some decision?

Hey @Jordan_Howell, I’d consider taking out all of your priors for coefficients and the intercept and running your model without. You presently have informative priors which suggest that the mean of all your coefficients is zero, that they are symmetrical and that they have unit variance. This may shape the fitting considerably. The frequentist alternative would operate in effect with flat priors (to my understanding), so this could be an important difference.

1 Like

With well beyond 100,000 observations the prior would likely not make a huge difference. Also, centering the prior at zero is not so bad imo, especially if the goal is prediction - it’s basically encoding shrinkage, or the belief that coefficients are small (which they usually are in a Poisson GLM) - think of a special case of Ridge Regression.

@Jordan_Howell I’ll have to look it up, but I think there were 1-2 talks in recent StanCons about using Stan in production. Not sure, if it’s helpful to your case. I’ll post a link later, when I find it.

1 Like