Logistic Model

Hello. I have a model trying to predict whether a prospective customer will convert to an actual customer. Usually I’d use a decision tree for this, but I want to convert…myself into a Bayesian modeler. The problem is that there are only 327 conversions out of about 9,000 observations, which leaves me with this warning:

UserWarning: Your data appears to have a single value or no finite values
  warnings.warn("Your data appears to have a single value or no finite values")

Below is what the ppc looks like. Is this a “lack of conversion problem” or is my model the problem?

data {
    int<lower=0> N; // number of prospective customers
    int<lower=0, upper=1> issued_flag[N]; // conversion indicator (0/1)
    vector[N] n_lmt1; // normalized limit 1
    vector[N] n_pp_lmt; // normalized personal property limit
    vector[N] n_age; // normalized age of insured
    vector[N] n_aoh; // normalized age of home
}

parameters {
    real<lower=0> mu;
    real limit1_beta;
    real pplimit_beta;
    real age_beta;
    real aoh_beta;
}

model {
    mu ~ normal(0, 3);
    limit1_beta ~ normal(0, 5);
    pplimit_beta ~ normal(0, 5);
    age_beta ~ normal(0, 5);
    aoh_beta ~ normal(0, 5);
    issued_flag ~ bernoulli_logit(mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta
                                  + n_age*age_beta + n_aoh*aoh_beta);
}

generated quantities {
    vector[N] eta = mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta
                    + n_age*age_beta + n_aoh*aoh_beta;
    int y_rep[N];
    if (max(eta) > 20) {
        // avoid overflow in poisson_log_rng
        print("max eta too big: ", max(eta));
        for (n in 1:N)
            y_rep[n] = -1;
    } else {
        for (n in 1:N)
            y_rep[n] = bernoulli_rng(eta[n]);
    }
}

I would write the model differently: define a vector in the model block that holds

mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta + n_age*age_beta + n_aoh*aoh_beta

and pass that variable into the bernoulli_logit function.

model {
    vector[N] a;

    for (n in 1:N) {
        a[n] = mu + n_lmt1[n]*limit1_beta + n_pp_lmt[n]*pplimit_beta
               + n_age[n]*age_beta + n_aoh[n]*aoh_beta;
    }

    limit1_beta ~ normal(0, 5);
    pplimit_beta ~ normal(0, 5);
    age_beta ~ normal(0, 5);
    aoh_beta ~ normal(0, 5);

    issued_flag ~ bernoulli_logit(a);
}
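Equivalently, since the predictors are all vectors, the loop can be collapsed into a single vectorized statement; this is just a minimal sketch of the same linear predictor, nothing new:

    // equivalent vectorized form of the loop above
    vector[N] a = mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta
                  + n_age*age_beta + n_aoh*aoh_beta;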

If I were a better Stan user I could tell you exactly what your code did, but whatever it did, I think it was a little off. If you shared your model output, it would be easier to see whether your model ran correctly.

My guess here is that there is a problem with the way you are passing your data. In your Stan code, there’s a problem in your generated quantities (see below), but I don’t think it could cause the error you are seeing (what interface are you using?).

In generated quantities you have y_rep[n] = bernoulli_rng(eta[n]);, but eta is on the logit scale, so you need bernoulli_logit_rng. As written, the code will produce warnings like

Chain 1 Exception: bernoulli_rng: Probability parameter is -0.0278299, but must be in the interval [0, 1] (in '/var/folders/j6/dg5l3gl11xb9v8w61w99ngh80000gn/T/RtmpjULs5c/model-27aa267750.stan', line 38, column 8 to column 41)

but it shouldn’t yield the behavior that you’re seeing.
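The fix is a one-line change (equivalently, you could squash eta through inv_logit first and keep bernoulli_rng):

        // eta is on the logit scale, so use the *_logit_rng variant
        y_rep[n] = bernoulli_logit_rng(eta[n]);
        // equivalent: y_rep[n] = bernoulli_rng(inv_logit(eta[n]));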

Model output like the summary below?

name                 Mean      MCSE   StdDev            5%          50%          95%   N_Eff  N_Eff/s  R_hat
lp__         -6300.00000  0.049000  1.70000  -6300.000000  -6300.00000  -6300.00000  1200.0      4.9    1.0
mu                0.00023  0.000004  0.00023      0.000014      0.00017      0.00068  3100.0     13.0    1.0
limit1_beta      -0.04200  0.005200  0.20000     -0.370000     -0.03900      0.29000  1500.0      6.2    1.0
pplimit_beta      0.05100  0.005200  0.20000     -0.280000      0.04800      0.38000  1500.0      6.2    1.0
age_beta         -0.00053  0.000360  0.02100     -0.035000     -0.00062      0.03400  3400.0     14.0    1.0
...                   ...       ...      ...           ...          ...          ...     ...      ...    ...
y_rep[9083]       0.00000       NaN  0.00000      0.000000      0.00000      0.00000     NaN      NaN    NaN
y_rep[9084]       0.00000       NaN  0.00000      0.000000      0.00000      0.00000     NaN      NaN    NaN
y_rep[9085]       0.00000       NaN  0.00000      0.000000      0.00000      0.00000     NaN      NaN    NaN
y_rep[9086]       0.00000       NaN  0.00000      0.000000      0.00000      0.00000     NaN      NaN    NaN

Thank you. Here is the updated model I used:

data {
    int<lower=0> N; // number of policies
    int<lower=0, upper=1> issued_flag[N]; // conversion indicator (0/1)
    vector[N] n_lmt1; // normalized limit 1
    vector[N] n_pp_lmt; // normalized personal property limit
    vector[N] n_age; // normalized age of insured
    vector[N] n_aoh; // normalized age of home
}

parameters {
    real<lower=0> mu;
    real limit1_beta;
    real pplimit_beta;
    real age_beta;
    real aoh_beta;
}

model {
    limit1_beta ~ normal(0,5);
    pplimit_beta ~ normal(0,5);
    age_beta ~ normal(0,5);
    aoh_beta ~ normal(0,5);

    vector[N] a;
    for (n in 1:N) {
        a[n] = mu + n_lmt1[n]*limit1_beta + n_pp_lmt[n]*pplimit_beta
               + n_age[n]*age_beta + n_aoh[n]*aoh_beta;
    }

    issued_flag ~ bernoulli_logit(a);
}

generated quantities {
    vector[N] eta = mu + n_lmt1*limit1_beta + n_pp_lmt*pplimit_beta
                    + n_age*age_beta + n_aoh*aoh_beta;
    int y_rep[N];
    if (max(eta) > 20) {
        // avoid overflow in poisson_log_rng
        print("max eta too big: ", max(eta));
        for (n in 1:N)
            y_rep[n] = -1;
    } else {
        for (n in 1:N)
            y_rep[n] = bernoulli_logit_rng(eta[n]);
    }
}

Here is the cumulative ppc output:

Here is the regular ppc output:

Do you think I should be using a different distribution for a convert/not convert target variable?

Since the outcome variable is binary, you are definitely using the right distribution (Bernoulli). It’s very hard to see what’s going on in your graphical posterior predictive checks, because the only possible outcomes are 0 and 1, and those don’t show up well on a continuous abscissa. Even from these graphs, however, it’s clear that something is going seriously wrong here.
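A more readable check for binary data is a discrete summary statistic, such as the replicated number of conversions. A minimal sketch (the name n_converted_rep is mine, not something in your model):

generated quantities {
    int y_rep[N];
    int n_converted_rep;
    for (n in 1:N)
        y_rep[n] = bernoulli_logit_rng(mu + n_lmt1[n]*limit1_beta
                                       + n_pp_lmt[n]*pplimit_beta
                                       + n_age[n]*age_beta + n_aoh[n]*aoh_beta);
    // compare the distribution of this count to the 327 observed conversions
    // instead of overlaying 0/1 "densities"
    n_converted_rep = sum(y_rep);
}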

What’s going wrong is that you’ve declared mu with a lower bound of zero in the parameters block, but mu is a logit-scale intercept. With only 327 conversions out of ~9,000 observations, mu wants to be strongly negative (the logit of 327/9000 is about −3.3), and the constraint is pinning it against zero. Note that once mu is allowed to take values consistent with the data, posterior predictive checks based on the total frequency of 1’s and 0’s in the response will no longer be sensitive to model misspecification, because the intercept can always match that frequency; this is related to the advice (e.g. in @jonah et al.’s paper here https://arxiv.org/pdf/1709.01449.pdf) to use statistics of the posterior predictive distribution that are orthogonal to the model parameters.
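Concretely, the fix is to drop the bound (the normal(0, 3) prior on mu below is the one from your first posted model, not a new suggestion):

parameters {
    real mu; // logit-scale intercept: must be free to go negative
    real limit1_beta;
    real pplimit_beta;
    real age_beta;
    real aoh_beta;
}

model {
    mu ~ normal(0, 3);
    // ... rest of the model unchanged
}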

Ahhh.

I took this model from a previous Poisson model script. I’ve removed the lower bound on mu. This is the result…