Hello. I have a model trying to predict whether a prospective customer will convert to an actual customer. Usually I’d use a decision tree for this, but I want to convert…myself into a Bayesian modeler. The problem is there are only 327 conversions out of about 9,000 observations. This leaves me with the warning:
UserWarning: Your data appears to have a single value or no finite values
warnings.warn("Your data appears to have a single value or no finite values")
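(For what it’s worth, a quick sanity check sketched in NumPy with the counts above suggests the observed outcome itself isn’t degenerate — it has two finite values — so I assume the warning refers to whatever replicated array is being plotted:)

```python
import numpy as np

# Rebuild the outcome from the counts in the post: 327 conversions out of ~9,000.
y = np.zeros(9000, dtype=int)
y[:327] = 1

# Two distinct finite values, so the observed data itself is fine.
assert len(np.unique(y)) == 2

# By contrast, a constant replicated array (e.g. the -1 sentinel fill from the
# generated quantities block below) is exactly the kind of input that
# density-plotting code flags with "a single value or no finite values".
y_rep = np.full(9000, -1)
assert len(np.unique(y_rep)) == 1
```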
Below is what the PPC looks like. Is this a “lack of conversions” problem, or is my model the problem?
data {
  int<lower=0> N;                       // number of prospective customers
  int<lower=0, upper=1> issued_flag[N]; // conversion indicator
  vector[N] n_lmt1;                     // normalized limit 1
  vector[N] n_pp_lmt;                   // normalized personal property limit
  vector[N] n_age;                      // normalized age of insured
  vector[N] n_aoh;                      // normalized age of home
}
parameters {
  real<lower=0> mu;
  real limit1_beta;
  real pplimit_beta;
  real age_beta;
  real aoh_beta;
}
model {
  mu ~ normal(0, 3);
  limit1_beta ~ normal(0, 5);
  pplimit_beta ~ normal(0, 5);
  age_beta ~ normal(0, 5);
  aoh_beta ~ normal(0, 5);
  issued_flag ~ bernoulli_logit(mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta + n_age*age_beta + n_aoh*aoh_beta);
}
generated quantities {
  vector[N] eta = mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta + n_age*age_beta + n_aoh*aoh_beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
    for (n in 1:N)
      y_rep[n] = bernoulli_rng(eta[n]);
  }
}
I would write the model differently: define a vector in the model block that holds
mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta + n_age*age_beta + n_aoh*aoh_beta
and pass that variable into the bernoulli_logit function.
model {
  vector[N] a;
  for (n in 1:N) {
    a[n] = mu + n_lmt1[n]*limit1_beta + n_pp_lmt[n]*pplimit_beta + n_age[n]*age_beta + n_aoh[n]*aoh_beta;
  }
  mu ~ normal(0, 3); // keep the prior on mu from the original model
  limit1_beta ~ normal(0, 5);
  pplimit_beta ~ normal(0, 5);
  age_beta ~ normal(0, 5);
  aoh_beta ~ normal(0, 5);
  issued_flag ~ bernoulli_logit(a);
}
If I were a better Stan user I could tell you exactly what your code did, but whatever it did, I think it was a little off. If you share your model output, it will be easier to see whether your model ran correctly.
My guess here is that there is a problem with the way you are passing your data. In your Stan code, there’s a problem in your generated quantities (see below), but I don’t think it could cause the error you are seeing (what interface are you using?).
In generated quantities, you have y_rep[n] = bernoulli_rng(eta[n]); but eta is on the logit scale, so you need to use bernoulli_logit_rng. The current code will produce warnings like
Chain 1 Exception: bernoulli_rng: Probability parameter is -0.0278299, but must be in the interval [0, 1] (in '/var/folders/j6/dg5l3gl11xb9v8w61w99ngh80000gn/T/RtmpjULs5c/model-27aa267750.stan', line 38, column 8 to column 41)
but it shouldn’t yield the behavior that you’re seeing.
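To make the scale mismatch concrete, here is a rough NumPy sketch (the eta values are hypothetical, though -0.0278299 echoes the exception above): bernoulli_rng expects a probability in [0, 1], whereas bernoulli_logit_rng first pushes the logit-scale value through the inverse logit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logit-scale linear-predictor values, not from the real fit.
eta = np.array([-3.3, -0.0278299, 0.5])

# Passing eta straight to a Bernoulli RNG fails: -0.0278299 is not in [0, 1],
# which is precisely what the Stan exception above is complaining about.
# What bernoulli_logit_rng does instead is apply the inverse logit first:
p = 1.0 / (1.0 + np.exp(-eta))   # every value now lands in (0, 1)
y_rep = rng.binomial(1, p)       # then draw 0/1 outcomes

assert np.all((0.0 < p) & (p < 1.0))
```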
Model output like the summary below?
                     Mean      MCSE   StdDev            5%          50%          95%   N_Eff  N_Eff/s  R_hat
name
lp__          -6300.00000  0.049000  1.70000  -6300.000000  -6300.00000  -6300.00000  1200.0      4.9    1.0
mu                0.00023  0.000004  0.00023      0.000014      0.00017      0.00068  3100.0     13.0    1.0
limit1_beta      -0.04200  0.005200  0.20000     -0.370000     -0.03900      0.29000  1500.0      6.2    1.0
pplimit_beta      0.05100  0.005200  0.20000     -0.280000      0.04800      0.38000  1500.0      6.2    1.0
age_beta         -0.00053  0.000360  0.02100     -0.035000     -0.00062      0.03400  3400.0     14.0    1.0
...                   ...       ...      ...           ...          ...          ...     ...      ...    ...
y_rep[9083]       0.00000       NaN  0.00000      0.000000      0.00000      0.00000     NaN      NaN    NaN
y_rep[9084]       0.00000       NaN  0.00000      0.000000      0.00000
y_rep[9085]       0.00000       NaN  0.00000      0.000000      0.00000
y_rep[9086]       0.00000       NaN  0.00000      0.000000      0.00000
Thank you. Here is the updated model I used:
data {
  int<lower=0> N;                       // number of policies
  int<lower=0, upper=1> issued_flag[N]; // conversion indicator
  vector[N] n_lmt1;                     // normalized limit 1
  vector[N] n_pp_lmt;                   // normalized personal property limit
  vector[N] n_age;                      // normalized age of insured
  vector[N] n_aoh;                      // normalized age of home
}
parameters {
  real<lower=0> mu;
  real limit1_beta;
  real pplimit_beta;
  real age_beta;
  real aoh_beta;
}
model {
  vector[N] a;
  limit1_beta ~ normal(0, 5);
  pplimit_beta ~ normal(0, 5);
  age_beta ~ normal(0, 5);
  aoh_beta ~ normal(0, 5);
  for (n in 1:N) {
    a[n] = mu + n_lmt1[n]*limit1_beta + n_pp_lmt[n]*pplimit_beta + n_age[n]*age_beta + n_aoh[n]*aoh_beta;
  }
  issued_flag ~ bernoulli_logit(a);
}
generated quantities {
  vector[N] eta = mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta + n_age*age_beta + n_aoh*aoh_beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
    for (n in 1:N)
      y_rep[n] = bernoulli_logit_rng(eta[n]);
  }
}
Here is the cumulative ppc output:
Here is the regular ppc output:
Do you think I should be using a different distribution for a convert/not convert target variable?
Since the outcome variable is binary, you are definitely using the right distribution (Bernoulli). It’s very hard to see what’s going on in your graphical posterior predictive plots because the only interesting outcomes are 0 and 1, and these don’t show up adequately on your continuous abscissa. Even from these graphs, however, it’s clear that something is going seriously wrong here.
What’s going wrong is that you’ve declared mu to be greater than zero in the parameters block, but mu is a logit-scale parameter. Given that there are only a few conversions, mu needs to take a negative value. Note that once mu is given a prior consistent with the data, posterior predictive checks based on the total frequency of 1’s and 0’s in the response will no longer be at all sensitive to model misspecification; this is related to the advice (e.g. in @jonah et al.’s paper here https://arxiv.org/pdf/1709.01449.pdf) to use statistics of the posterior predictive distribution that are orthogonal to the model parameters.
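A back-of-the-envelope calculation with the counts from the first post (327 conversions out of about 9,000) shows how far below zero the intercept needs to go:

```python
import math

conversions, total = 327, 9000
p_hat = conversions / total                 # observed conversion rate, about 0.036
logit_hat = math.log(p_hat / (1 - p_hat))   # about -3.28 on the logit scale

# With <lower=0>, mu can never get near -3.28, so the intercept is pinned at
# the boundary and the model is forced to over-predict conversions.
print(round(logit_hat, 2))                  # -3.28
```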
Ahhh.
I took this model from a previous Poisson model script. I’ve released the floor on mu (dropped the <lower=0> constraint). This is the result…