How to create a Stan model using a logit link when the response variable of a regression equation is fractional

Hi,

I posted a similar question in another forum, but was advised to post the problem here since this is the dedicated forum for Stan models.

I have the model below, written in Python’s pymc3, to estimate the parameters of a regression equation where the response variable is fractional. To model this fractional response variable I used a logit link function. Here is my model:

import pandas as pd
import pymc3 as pm
import arviz

myData = pd.DataFrame.from_dict({'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38}, 'y': {0: 0.0079235409492941, 1: 0.0086530073429249, 2: 0.0297400780486734, 3: 0.0196358416326437, 4: 0.0023902064076204, 5: 0.0258055591736283, 6: 0.17394835142698, 7: 0.156463554455613, 8: 0.329388185725557, 9: 0.0076443508881763, 10: 0.0162081480398152, 11: 0.0, 12: 0.0015759139941696, 13: 0.420025972703085, 14: 0.0001226236519444, 15: 0.133061480234834, 16: 0.565454216154227, 17: 0.0002819734812997, 18: 0.000559715156383, 19: 0.0270686389659072, 20: 0.918300537689865, 21: 7.8262468302e-06, 22: 0.0073241434191945, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, 29: 0.0, 30: 0.174071274611405, 31: 0.0432109713717948, 32: 0.0544400838264943, 33: 0.0, 34: 0.0907049925221286, 35: 0.616680102647887, 36: 0.0, 37: 0.0}, 'x': {0: 23.8187587698947, 1: 15.9991138359515, 2: 33.6495930512881, 3: 28.555818797764, 4: -52.2967967248258, 5: -91.3835208788233, 6: -73.9830692708321, 7: -5.16901145289629, 8: 29.8363012310241, 9: 10.6820057903939, 10: 19.4868517164395, 11: 15.4499668436458, 12: -17.0441644773509, 13: 10.7025053739577, 14: -8.6382953428539, 15: -32.8892974839165, 16: -15.8671863161348, 17: -11.237248036145, 18: -7.37978020066205, 19: -3.33500586334862, 20: -4.02629933182873, 21: -20.2413384726948, 22: -54.9094885578775, 23: -48.041459120976, 24: -52.3125732905322, 25: -35.6269065970458, 26: -62.0296155423529, 27: -49.0825017152659, 28: -73.0574478287598, 29: -50.9409090127938, 30: -63.4650928035253, 31: -55.1263264283842, 32: -52.2841103768755, 33: -61.2275334149805, 34: -74.2175990067417, 35: -68.2961107804698, 36: -76.6834643609286, 37: -70.16769103228}})

with pm.Model() as myModel:
    beta0 = pm.Normal('intercept', 0, 1)     # standard normal prior on the intercept
    beta1 = pm.Normal('x', 0, 1)             # standard normal prior on the slope
    mu = beta0 + beta1 * myData['x'].values  # linear predictor
    # Bernoulli likelihood with a logit link, with fractional y passed as "observed"
    pm.Bernoulli('obs', p = pm.invlogit(mu), observed = myData['y'].values)

with myModel:
    trace = pm.sample(50000, tune = 10000, step = pm.Metropolis(), random_seed = 1000)

arviz.summary(trace, round_to = 10)

The above implementation runs fine with my Python + pymc3 setup and gives the result below.


               mean        sd    hdi_3%   hdi_97%  mcse_mean   mcse_sd      ess_bulk      ess_tail     r_hat
intercept -2.537501  0.599667 -3.707061 -1.450243   0.004375  0.003118  18893.344191  22631.772985  1.000070
x          0.033750  0.024314 -0.007871  0.081619   0.000181  0.000133  18550.620475  20113.739639  1.000194

I could also obtain a typical frequentist estimate using R’s glm() function, which gives moderately close estimates:

myData = list(y = c(0.00792354094929414, 0.00865300734292492, 0.0297400780486734, 
0.0196358416326437, 0.00239020640762042, 0.0258055591736283, 
0.17394835142698, 0.156463554455613, 0.329388185725557, 0.00764435088817635, 
0.0162081480398152, 0, 0.00157591399416963, 0.420025972703085, 
0.000122623651944455, 0.133061480234834, 0.565454216154227, 0.000281973481299731, 
0.000559715156383041, 0.0270686389659072, 0.918300537689865, 
0.00000782624683025728, 0.00732414341919458, 0, 0, 0, 0, 0, 0, 
0, 0.174071274611405, 0.0432109713717948, 0.0544400838264943, 
0, 0.0907049925221286, 0.616680102647887, 0, 0), x = c(23.8187587698947, 
15.9991138359515, 33.6495930512881, 28.555818797764, -52.2967967248258, 
-91.3835208788233, -73.9830692708321, -5.16901145289629, 29.8363012310241, 
10.6820057903939, 19.4868517164395, 15.4499668436458, -17.0441644773509, 
10.7025053739577, -8.6382953428539, -32.8892974839165, -15.8671863161348, 
-11.237248036145, -7.37978020066205, -3.33500586334862, -4.02629933182873, 
-20.2413384726948, -54.9094885578775, -48.041459120976, -52.3125732905322, 
-35.6269065970458, -62.0296155423529, -49.0825017152659, -73.0574478287598, 
-50.9409090127938, -63.4650928035253, -55.1263264283842, -52.2841103768755, 
-61.2275334149805, -74.2175990067417, -68.2961107804698, -76.6834643609286, 
-70.16769103228), N = 38)

summary(glm(y ~ x, myData, family = binomial()))

Result:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.02368    0.63467  -3.189  0.00143 **
x            0.00641    0.01492   0.430  0.66742
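As an aside, glm() warns about non-integer successes with family = binomial() here; family = quasibinomial() gives the same coefficient estimates while also estimating the dispersion, which is the usual quasi-ML route for a fractional response:

# same point estimates; dispersion estimated instead of fixed at 1
summary(glm(y ~ x, myData, family = quasibinomial()))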

I am wondering whether this model can be expressed as a Stan model. I have not seen any online resource on how a Stan model can be defined for a regression equation where the response variable is strictly fractional and needs to be handled through a logit link function.

The model I am trying to fit with this data is described at Fractional model - Wikipedia.

Any pointer is highly appreciated.

Can you elaborate a bit on what is going on in the pymc3 model? It looks like you are modeling:

y \sim \textrm{Bernoulli}(\textrm{inv-logit}(\beta_0 + \beta_1 x))

But y is a continuous variable between 0 and 1? The support of the Bernoulli distribution is the discrete set \{0, 1\}, so y cannot be observations from a Bernoulli distribution. Some clarification would be helpful!

Hi, thanks for your reply. I modified my original post to describe what kind of model I am trying to fit. The definition can be found at Fractional model - Wikipedia.

Thanks for the additional context. I’m not familiar with this extended Bernoulli-logit method, but a natural way to model response data on (0, 1) would be a beta regression, where the response variable is beta distributed and the distribution is reparameterized in terms of location and scale parameters.

This vignette showing the approach using rstanarm could be worthwhile to read through.

The ultimate answer to your question is yes. Stan can certainly fit a regression model where the response is “fractional”.

Thanks for the vignette. Yes, I am aware that Stan is certainly capable of fitting a fractional regression model, but I am more interested in whether Stan can use a logit link to model such a variable, given that pymc3 can.

Sure, you can use a logit link. In fact, the rstanarm implementation in the vignette linked in the previous post uses a logit link. In raw Stan code you can use whatever link you want (though you’d be well advised to use one whose inverse maps to the unit interval) in a beta regression.

You’d set up the linear predictor as in any other generalized linear model, then take the inverse link. The tricky part with beta regression in Stan is that Stan’s beta distribution is not parameterized in terms of its mean, so you need to transform a beta distribution parameterized by its mean into Stan’s shape parameterization. Here’s a sketch of example Stan code to do it (the precision parameter phi and its exponential prior are illustrative choices):
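data {
  int<lower=0> N;
  vector<lower=0, upper=1>[N] y;   // the beta density needs y strictly inside (0, 1);
                                   // exact 0s, as in the data above, need special handling
  vector[N] x;
}
parameters {
  vector[2] beta;
  real<lower=0> phi;               // precision of the beta distribution
}
model {
  vector[N] mu = inv_logit(beta[1] + beta[2] * x);   // logit link
  beta ~ std_normal();
  phi ~ exponential(1);
  // convert (mean, precision) to Stan's (alpha, beta) shape parameters
  y ~ beta(mu * phi, (1 - mu) * phi);
}

Newer Stan versions also provide beta_proportion, which is parameterized directly by a mean and a concentration, so the conversion above isn’t needed.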

I guess you just want to use the Bernoulli distribution. I don’t think this is “Bayesian” in the sense of having a generative model, because the outcome is fractional. All discrete distributions in Stan are required to have integer outcomes, so it is not possible to implement this with the bernoulli distribution directly. However, you can write the Bernoulli log probability yourself and achieve what you want:

data {
  int<lower=0> N;   // number of observations
  vector[N] y;      // fractional response in [0, 1]
  vector[N] x;      // covariate
}
parameters {
  vector[2] beta;   // intercept and slope
}
transformed parameters {
  vector[N] mu = inv_logit(beta[1] + beta[2] * x);
}
model {
  beta ~ std_normal();
  // Bernoulli log probability written out by hand so non-integer y is accepted;
  // target += sums the elements of the vector expression
  target += y .* log(mu) + (1 - y) .* log1m(mu);
}
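As a side note, an equivalent and somewhat more numerically stable way to write the same likelihood keeps everything on the log-odds scale with Stan’s log_inv_logit and log1m_inv_logit (a sketch of the model block only, assuming the same data and parameters blocks and dropping transformed parameters):

model {
  vector[N] eta = beta[1] + beta[2] * x;   // linear predictor on the logit scale
  beta ~ std_normal();
  // log_inv_logit(eta) = log(mu) and log1m_inv_logit(eta) = log1m(mu)
  target += y .* log_inv_logit(eta) + (1 - y) .* log1m_inv_logit(eta);
}

Either version defines the same posterior.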

With your data I get

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
  lp__    -14.85 -14.54 1.01 0.73 -16.72 -13.90 1.00     1815     2490
  beta[1]  -1.59  -1.57 0.48 0.46  -2.39  -0.82 1.00     1952     2113
  beta[2]   0.01   0.01 0.01 0.01  -0.01   0.04 1.00     1976     2184

If I remove the normal prior on the intercept and slope, then I get close to what you posted:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
  lp__    -13.38 -13.06 1.03 0.75 -15.48 -12.38 1.00     1705     1703
  beta[1]  -2.25  -2.19 0.70 0.69  -3.49  -1.19 1.00     1306     1634
  beta[2]   0.01   0.01 0.02 0.02  -0.02   0.03 1.00     1393     1362

library(cmdstanr)

# myData: the R list defined above (y, x, N)
mod <- cmdstan_model("fractional_response.stan")

out <- mod$sample(
  data = myData,
  parallel_chains = 4
)
out

Many thanks. I think this is closer to the pymc3 implementation. But I wonder why there is a significant difference between the Stan and pymc3 estimates? In both cases I used a standard normal prior and exactly the same data.

I don’t know what pymc3 is doing for Bernoulli outcomes that are non-integer. However, thinking about this generatively, we can view it as a mixture of two Bernoullis where the outcome is the mixing weight.

It’s possible that pymc3 is doing this? In Stan this is:

data {
  int<lower=0> N;
  vector[N] y;      // fractional response, used as the mixing weight
  vector[N] x;
}
parameters {
  vector[2] beta;
}
transformed parameters {
  vector[N] mu = inv_logit(beta[1] + beta[2] * x);
}
model {
  beta ~ std_normal();
  // mixture of a Bernoulli(1) and a Bernoulli(0) outcome, weighted by y[n]
  for (n in 1:N)
    target += log_mix(y[n],
                      bernoulli_lpmf(1 | mu[n]),
                      bernoulli_lpmf(0 | mu[n]));
}
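To spell out what log_mix does here: log_mix(y[n], lp1, lp0) computes \log\big(y_n e^{\text{lp1}} + (1 - y_n) e^{\text{lp0}}\big), so each observation contributes \log(y_n \mu_n + (1 - y_n)(1 - \mu_n)) to the target, whereas the previous program contributed y_n \log \mu_n + (1 - y_n) \log(1 - \mu_n). The two agree only when y_n is exactly 0 or 1, which is why the two Stan programs give different estimates.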

whose output corresponds more closely to what pymc3 gives:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
  lp__    -11.45 -11.12 1.12 0.80 -13.65 -10.41 1.00     1632     2057
  beta[1]  -2.28  -2.24 0.63 0.61  -3.34  -1.27 1.00     2434     2481
  beta[2]   0.03   0.03 0.02 0.02   0.00   0.08 1.00     2083     1772

Interesting, this seems analogous to some methods for handling missing discrete data: essentially modelling the likelihood of the implied discrete response rather than the fractional outcome itself.


This all seems related to a Bayesian translation of quasi-MLE.


Many thanks. This is really awesome. However, given that we still see some difference between pymc3 and Stan (which becomes more prominent when I incorporate other independent variables into mu, as I checked with various combinations, inclusions, and exclusions of variables), is there any way to simulate artificial data from the above process and then check how close the Stan model’s estimates come to the parameters that generated the data? For example, if I modify my model as

transformed parameters {
  vector[N] mu = inv_logit(beta[1] + beta[2] * x * 0);
}

then I again see a large difference between Stan and pymc3.
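Something like the following is what I have in mind (a sketch only: the binomial-proportion generative process, the number of trials, and the “true” values beta0 = -2.5 and beta1 = 0.03 are all assumptions for illustration):

library(cmdstanr)

# simulate fractional data as binomial proportions, then refit the model above
set.seed(1000)
N <- 38
n_trials <- 100                     # assumed number of trials behind each proportion
x <- rnorm(N, mean = -30, sd = 30)  # covariate roughly on the scale of the real data
beta0 <- -2.5                       # assumed "true" intercept
beta1 <- 0.03                       # assumed "true" slope
mu <- plogis(beta0 + beta1 * x)     # inverse-logit link
y <- rbinom(N, size = n_trials, prob = mu) / n_trials

mod <- cmdstan_model("fractional_response.stan")  # the Stan program posted above
fit <- mod$sample(data = list(y = y, x = x, N = N), parallel_chains = 4)
fit  # compare the posterior means of beta[1] and beta[2] with beta0 and beta1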

I believe you may be overthinking this. Before going much further, I suggest you reach out to the pymc team and ask what happens with fractional responses (or dive into the source code yourself). The second thing is to test a few simple hypotheses; here are some:

Hypothesis: The y values are rounded to the nearest integer

There are 3 values that round to 1 and 35 that round to 0. That gives a sample probability of 1 of 3/38 ≈ 7.9%, which is close to pymc’s intercept on the probability scale: inv_logit(-2.53) ≈ 7.4%.
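A quick R check of that arithmetic, using the myData list from above:

sum(round(myData$y) == 1)   # 3 of the 38 observations round to 1
3 / 38                      # ~0.0789
plogis(-2.53)               # ~0.0738, pymc's intercept on the probability scale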

Model (Stan):

data {
  int<lower=0> N;
  array[N] int y;   // response rounded to 0 or 1
  vector[N] x;
}
parameters {
  vector[2] beta;
}
transformed parameters {
  vector[N] mu = beta[1] + beta[2] * x;   // linear predictor on the logit scale
}
model {
  beta ~ std_normal();
  y ~ bernoulli_logit(mu);
}

Output

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
  lp__    -13.58 -13.26 1.04 0.72 -15.61 -12.61 1.00     1821     2176
  beta[1]  -1.84  -1.81 0.51 0.50  -2.72  -1.06 1.00     1691     1723
  beta[2]   0.01   0.01 0.01 0.01  -0.01   0.03 1.00     2028     2339

Well, there are some big negative x values, so with the covariate this does make sense. Let’s remove that prior.

Mini hypothesis a: Remove the prior on beta

Kind of close:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
  lp__    -11.56 -11.25 1.03 0.77 -13.63 -10.55 1.00     1259     1849
  beta[1]  -2.80  -2.69 0.87 0.84  -4.39  -1.56 1.00     1191     1388
  beta[2]   0.00   0.00 0.02 0.02  -0.03   0.03 1.00     1357     1586

Mini hypothesis b: Test only the intercept and remove the std_normal prior
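A minimal sketch of that intercept-only model (assuming the same rounded 0/1 data as above, with the prior statement removed per the hypothesis):

data {
  int<lower=0> N;
  array[N] int y;   // response rounded to 0 or 1
}
parameters {
  real beta;        // intercept only
}
model {
  // no prior on beta for this mini hypothesis
  y ~ bernoulli_logit(beta);
}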

That looks pretty close!

variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
     lp__ -11.04 -10.75 0.74 0.34 -12.60 -10.50 1.00     1704     1673
     beta  -2.63  -2.58 0.66 0.65  -3.81  -1.63 1.00     1307     1441

Let’s see what happens if we add the prior back in:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
     lp__ -13.32 -13.05 0.74 0.34 -14.73 -12.80 1.00     1761     1860
     beta  -1.98  -1.96 0.45 0.46  -2.72  -1.27 1.00     1215     1648

Umm, OK. Well, there’s not much data here, so the fact that the prior pulls the estimated probability up makes sense.

At this point, I’d focus on figuring out the implementation details of pymc, which will tell you exactly what’s going on. If the implementation is similar to something tested here, then we have to wonder about the sampling algorithm; the Stan team is very confident in the HMC sampler. I think heading over to pymc and comparing against Stan’s results will be the most fruitful next step.


I had one thought: maybe they’re using the floor function. Let’s test:

# round_bern.stan is the Stan program from the previous post
mod_round <- cmdstan_model("round_bern.stan")
out_round <- mod_round$sample(
  data = list(y = floor(myData$y),   # floor() maps every y in [0, 1) to 0
              x = myData$x,
              N = myData$N),
  parallel_chains = 4
)

out_round
 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
  lp__    -6.26  -5.95 1.04 0.74 -8.28 -5.28 1.00     1574     1874
  beta[1] -2.53  -2.50 0.61 0.62 -3.57 -1.59 1.00     2029     1809
  beta[2]  0.03   0.03 0.02 0.02  0.00  0.08 1.00     2372     1920

so I think that’s what they’re doing, and it’s not the fractional model you want. If you can get confirmation from them on this, then you may want to request that they raise an error or a warning message when the inputs have been modified.


Hi -

I encourage you to look at my new model/paper on ordered beta regression, which can handle discrete (rounded to 0/1) responses in addition to continuous values:

https://osf.io/preprints/socarxiv/2sx6y/

There is an R package available to estimate the model with brms, but you can also get the Stan code here:

There has not yet been a PyMC implementation, but I would be happy to see one / help if I can.

all best

Also, I believe the model you’re using above, @volabos, is the fractional logit model, which I also have Stan code for in the repo above. However, it can be quite inefficient if the ratio of continuous to discrete responses is very high. Also, it isn’t a proper statistical distribution, so it doesn’t have an RNG function etc.