# How to create a Stan model using Logit link where response variable of a Regression equation is fractional

Hi,

I posted a similar question in another forum, but it was suggested that I post the problem here, since this is the dedicated forum for Stan models.

I have the model below, written in Python’s pymc3, to estimate the parameters of a regression equation whose response variable is fractional. To model this fractional response variable I used a logit link function. Here is my model:

import pandas as pd
import pymc3 as pm
import arviz

myData = pd.DataFrame.from_dict({'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38}, 'y': {0: 0.0079235409492941, 1: 0.0086530073429249, 2: 0.0297400780486734, 3: 0.0196358416326437, 4: 0.0023902064076204, 5: 0.0258055591736283, 6: 0.17394835142698, 7: 0.156463554455613, 8: 0.329388185725557, 9: 0.0076443508881763, 10: 0.0162081480398152, 11: 0.0, 12: 0.0015759139941696, 13: 0.420025972703085, 14: 0.0001226236519444, 15: 0.133061480234834, 16: 0.565454216154227, 17: 0.0002819734812997, 18: 0.000559715156383, 19: 0.0270686389659072, 20: 0.918300537689865, 21: 7.8262468302e-06, 22: 0.0073241434191945, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, 29: 0.0, 30: 0.174071274611405, 31: 0.0432109713717948, 32: 0.0544400838264943, 33: 0.0, 34: 0.0907049925221286, 35: 0.616680102647887, 36: 0.0, 37: 0.0}, 'x': {0: 23.8187587698947, 1: 15.9991138359515, 2: 33.6495930512881, 3: 28.555818797764, 4: -52.2967967248258, 5: -91.3835208788233, 6: -73.9830692708321, 7: -5.16901145289629, 8: 29.8363012310241, 9: 10.6820057903939, 10: 19.4868517164395, 11: 15.4499668436458, 12: -17.0441644773509, 13: 10.7025053739577, 14: -8.6382953428539, 15: -32.8892974839165, 16: -15.8671863161348, 17: -11.237248036145, 18: -7.37978020066205, 19: -3.33500586334862, 20: -4.02629933182873, 21: -20.2413384726948, 22: -54.9094885578775, 23: -48.041459120976, 24: -52.3125732905322, 25: -35.6269065970458, 26: -62.0296155423529, 27: -49.0825017152659, 28: -73.0574478287598, 29: -50.9409090127938, 30: -63.4650928035253, 31: -55.1263264283842, 32: -52.2841103768755, 33: -61.2275334149805, 34: -74.2175990067417, 35: -68.2961107804698, 36: -76.6834643609286, 37: -70.16769103228}})

with pm.Model() as myModel:
    beta0 = pm.Normal('intercept', 0, 1)
    beta1 = pm.Normal('x', 0, 1)
    mu = beta0 + beta1 * myData['x'].values
    pm.Bernoulli('obs', p = pm.invlogit(mu), observed = myData['y'].values)

with myModel:
    trace = pm.sample(50000, tune = 10000, step = pm.Metropolis(), random_seed = 1000)

arviz.summary(trace, round_to = 10)


The above implementation gives the result below and runs perfectly fine in my Python + pymc3 setup.


mean        sd    hdi_3%   hdi_97%  mcse_mean   mcse_sd      ess_bulk      ess_tail     r_hat
intercept -2.537501  0.599667 -3.707061 -1.450243   0.004375  0.003118  18893.344191  22631.772985  1.000070
x          0.033750  0.024314 -0.007871  0.081619   0.000181  0.000133  18550.620475  20113.739639  1.000194


I can also obtain typical frequentist estimates using R’s glm() function, which gives moderately close results:

myData = list(y = c(0.00792354094929414, 0.00865300734292492, 0.0297400780486734,
0.0196358416326437, 0.00239020640762042, 0.0258055591736283,
0.17394835142698, 0.156463554455613, 0.329388185725557, 0.00764435088817635,
0.0162081480398152, 0, 0.00157591399416963, 0.420025972703085,
0.000122623651944455, 0.133061480234834, 0.565454216154227, 0.000281973481299731,
0.000559715156383041, 0.0270686389659072, 0.918300537689865,
0.00000782624683025728, 0.00732414341919458, 0, 0, 0, 0, 0, 0,
0, 0.174071274611405, 0.0432109713717948, 0.0544400838264943,
0, 0.0907049925221286, 0.616680102647887, 0, 0), x = c(23.8187587698947,
15.9991138359515, 33.6495930512881, 28.555818797764, -52.2967967248258,
-91.3835208788233, -73.9830692708321, -5.16901145289629, 29.8363012310241,
10.6820057903939, 19.4868517164395, 15.4499668436458, -17.0441644773509,
10.7025053739577, -8.6382953428539, -32.8892974839165, -15.8671863161348,
-11.237248036145, -7.37978020066205, -3.33500586334862, -4.02629933182873,
-20.2413384726948, -54.9094885578775, -48.041459120976, -52.3125732905322,
-35.6269065970458, -62.0296155423529, -49.0825017152659, -73.0574478287598,
-50.9409090127938, -63.4650928035253, -55.1263264283842, -52.2841103768755,
-61.2275334149805, -74.2175990067417, -68.2961107804698, -76.6834643609286,
-70.16769103228), N = 38)

summary(glm(y ~ x, myData, family = binomial()))


Result:

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02368    0.63467  -3.189  0.00143 **
x            0.00641    0.01492   0.430  0.66742


I am wondering whether this model can be expressed as a Stan model. I have not found any online resources on how to define a Stan model for a regression equation where the response variable is strictly fractional and needs to be handled with a logit link function.

The model I am trying to fit with this data is the one described at Fractional model - Wikipedia.

Any pointer is highly appreciated.

Can you elaborate a bit on what is going on in the pymc3 model? It looks like you are modeling:

y \sim \textrm{Bernoulli}(\textrm{inv-logit}(\beta_0 + \beta_1x))

But y is a continuous parameter between 0 and 1? The support of the Bernoulli distribution is the discrete set \{0, 1\}, so y cannot be observations from a Bernoulli distribution. Some clarification would be helpful!
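To make the support issue concrete, here is a minimal NumPy sketch (my own illustration, not from the thread): the standard Bernoulli log-pmf is undefined for fractional y, while the fractional-logit extension evaluates the same algebraic expression at any y in [0, 1] and agrees with the Bernoulli log-pmf at the endpoints.

```python
import numpy as np

def bernoulli_logpmf(y, p):
    """Standard Bernoulli log-pmf; its support is the discrete set {0, 1}."""
    if y not in (0, 1):
        raise ValueError("Bernoulli support is {0, 1}")
    return np.log(p) if y == 1 else np.log(1 - p)

def fractional_logit_loglik(y, p):
    """Fractional-logit quasi-log-likelihood: y*log(p) + (1-y)*log(1-p),
    defined for any y in [0, 1], equal to the Bernoulli log-pmf at 0 and 1."""
    return y * np.log(p) + (1 - y) * np.log1p(-p)

p = 0.3
assert np.isclose(fractional_logit_loglik(1.0, p), bernoulli_logpmf(1, p))
assert np.isclose(fractional_logit_loglik(0.0, p), bernoulli_logpmf(0, p))
print(fractional_logit_loglik(0.42, p))  # well-defined for fractional y
```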

Hi, thanks for your reply. I modified my original post to clarify what kind of model I am trying to fit. The definition can be found at Fractional model - Wikipedia.

Thanks for the additional context. I’m not familiar with this extended Bernoulli-logit method, but a natural way to model response data on (0, 1) would be Beta regression, where the response variable is Beta distributed and can be reparameterized in terms of location and scale parameters.

This vignette, which shows the approach using rstanarm, could be worth reading through.

The ultimate answer to your question is yes. Stan can certainly fit a regression model where the response is “fractional”.

Thanks for the vignette. Yes, I am aware that Stan is definitely capable of fitting a fractional regression model, but I am more interested in whether Stan can use a logit link to model such a variable, given that pymc3 can.

Sure, you can use a logit link. In fact, the rstanarm implementation in the vignette linked in the previous post uses a logit link. In raw Stan code you can use whatever link you want (though you’d be well advised to use one whose inverse maps to the unit interval) in a beta regression.

You’d set up the linear predictor as in any other generalized linear model, then apply the inverse link. The tricky part with beta regression in Stan is that Stan’s beta distribution is not parameterized in terms of its mean, so you need to transform a beta distribution parameterized by its mean into Stan’s parameterization. Here’s example Stan code to do it.
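For reference, the mean-to-shape transform is just alpha = mu * phi and beta = (1 - mu) * phi, where phi is a precision parameter. A small Python check (illustrative only; the function name and `phi` notation are my own, not Stan's):

```python
import numpy as np
from scipy import stats

def beta_mean_precision(mu, phi):
    """Map a mean/precision parameterization (0 < mu < 1, phi > 0) to the
    shape parameters (alpha, beta) that Stan's beta distribution uses."""
    alpha = mu * phi
    beta = (1.0 - mu) * phi
    return alpha, beta

mu, phi = 0.2, 10.0
alpha, beta = beta_mean_precision(mu, phi)
dist = stats.beta(alpha, beta)
assert np.isclose(dist.mean(), mu)                        # mean is exactly mu
assert np.isclose(dist.var(), mu * (1 - mu) / (phi + 1))  # variance shrinks as phi grows
```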

I guess you just want to use the Bernoulli distribution. I don’t think this is “Bayesian” in the sense of having a generative model, because the outcome is fractional. All discrete distributions in Stan require integer outcomes, so it is not possible to implement this with the bernoulli distribution directly. However, you can write the Bernoulli log probability yourself and achieve what you want.

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  vector[2] beta;
}
transformed parameters {
  vector[N] mu = inv_logit(beta[1] + beta[2] * x);
}
model {
  beta ~ std_normal();
  target += y .* log(mu) + (1 - y) .* log1m(mu);
}


 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
lp__    -14.85 -14.54 1.01 0.73 -16.72 -13.90 1.00     1815     2490
beta[1]  -1.59  -1.57 0.48 0.46  -2.39  -0.82 1.00     1952     2113
beta[2]   0.01   0.01 0.01 0.01  -0.01   0.04 1.00     1976     2184
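One subtlety worth flagging: the `target +=` line above is only one way to extend the Bernoulli log-pmf to fractional y. An alternative generative reading treats y as the mixing weight of two Bernoullis, giving log(y*p + (1-y)*(1-p)) instead. The two agree at y = 0 and y = 1 but differ in between, which is one candidate explanation for discrepancies between implementations. A quick numeric illustration (my own, not from the thread):

```python
import numpy as np

p, y = 0.3, 0.25  # model probability and one fractional observation

# Extension 1: quasi-likelihood, as in the Stan model above
quasi = y * np.log(p) + (1 - y) * np.log1p(-p)

# Extension 2: mixture of two Bernoullis with y as the mixing weight,
# i.e. log_mix(y, ...) in Stan: log(y * p + (1 - y) * (1 - p))
mix = np.log(y * p + (1 - y) * (1 - p))

print(quasi, mix)   # the two differ for fractional y
assert mix > quasi  # Jensen's inequality: the mixture log-density is larger
```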


If I remove the normal prior on the intercept and slope, then I get close to what you posted:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
lp__    -13.38 -13.06 1.03 0.75 -15.48 -12.38 1.00     1705     1703
beta[1]  -2.25  -2.19 0.70 0.69  -3.49  -1.19 1.00     1306     1634
beta[2]   0.01   0.01 0.02 0.02  -0.02   0.03 1.00     1393     1362

library(cmdstanr)

# myData as defined in the glm() example above

mod <- cmdstan_model("fractional_response.stan")

out <- mod$sample(
  data = myData,
  parallel_chains = 4
)
out


Many thanks. I think this is closer to the pymc3 implementation. But I wonder why there is a significant difference between the Stan and pymc3 estimates? In both cases I used a standard normal prior and exactly the same data.

I don’t know what pymc3 is doing for Bernoulli outcomes that are non-integer. However, thinking about this, we can view it generatively as a mixture of two Bernoullis where the outcome is the mixing weight. It’s possible that pymc3 is doing this? In Stan this is

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  vector[2] beta;
}
transformed parameters {
  vector[N] mu = inv_logit(beta[1] + beta[2] * x);
}
model {
  beta ~ std_normal();
  for (n in 1:N)
    target += log_mix(y[n],
                      bernoulli_lpmf(1 | mu[n]),
                      bernoulli_lpmf(0 | mu[n]));
}

where the output corresponds more closely to what pymc3 gives:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__    -11.45 -11.12 1.12 0.80 -13.65 -10.41 1.00     1632     2057
 beta[1]  -2.28  -2.24 0.63 0.61  -3.34  -1.27 1.00     2434     2481
 beta[2]   0.03   0.03 0.02 0.02   0.00   0.08 1.00     2083     1772


Interesting, this seems analogous to some methods for handling missing discrete data: essentially modelling the likelihood of the implied discrete response rather than the fractional outcome itself.

Seems all related to a Bayesian translation of quasi-MLE.

Many thanks. This is really awesome. However, given that we still see some difference between pymc3 and Stan (and it appears more prominent if I incorporate other independent variables into mu, which I checked with various combinations, inclusions, and exclusions of variables), is there any way to simulate artificial data from the above process and then check how close the Stan model’s estimates come to the values that generated the data?
For example, if I modify my model as

transformed parameters {
  vector[N] mu = inv_logit(beta[1] + beta[2] * x * 0);
}

then I again see a large difference between Stan and pymc3.

I believe you may be overthinking this. Before embarking much further, I suggest you reach out to the pymc team and ask what’s happening for fractional responses (or dive into the source code yourself). The second thing is to test a few simple hypotheses; here are some:

## Hypothesis: the y values are rounded to the nearest integer

There are 3 values that round to 1 and 35 that round to 0. That gives a sample probability of 1 of 3/38, or 7.9%, which is close to the pymc3 intercept of inv_logit(-2.53) = 7.4%.

Model (Stan):

data {
  int<lower=0> N;
  array[N] int y;
  vector[N] x;
}
parameters {
  vector[2] beta;
}
transformed parameters {
  vector[N] mu = beta[1] + beta[2] * x;
}
model {
  beta ~ std_normal();
  y ~ bernoulli_logit(mu);
}

Output:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__    -13.58 -13.26 1.04 0.72 -15.61 -12.61 1.00     1821     2176
 beta[1]  -1.84  -1.81 0.51 0.50  -2.72  -1.06 1.00     1691     1723
 beta[2]   0.01   0.01 0.01 0.01  -0.01   0.03 1.00     2028     2339

Well, there are some big negative x values, so with the covariate this does make sense. Let’s remove that prior.

### Mini hypothesis a: remove the prior on beta

Kind of close:

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__    -11.56 -11.25 1.03 0.77 -13.63 -10.55 1.00     1259     1849
 beta[1]  -2.80  -2.69 0.87 0.84  -4.39  -1.56 1.00     1191     1388
 beta[2]   0.00   0.00 0.02 0.02  -0.03   0.03 1.00     1357     1586

### Mini hypothesis b: test only the intercept and remove the std_normal prior

That looks pretty close!

 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__    -11.04 -10.75 0.74 0.34 -12.60 -10.50 1.00     1704     1673
 beta     -2.63  -2.58 0.66 0.65  -3.81  -1.63 1.00     1307     1441

Let’s see what happens if we add the prior back in.
 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__    -13.32 -13.05 0.74 0.34 -14.73 -12.80 1.00     1761     1860
 beta     -1.98  -1.96 0.45 0.46  -2.72  -1.27 1.00     1215     1648

Umm, ok, well there’s not much data here, so the fact that the prior pushes the probability up makes sense.

At this point I’d focus on figuring out the implementation details of pymc, which will tell you exactly what’s going on. If the implementation is similar to something tested here, then we have to wonder about the sampling algorithm; the Stan team is very confident in the HMC sampler. I think heading over to pymc and comparing against Stan’s results will be the most fruitful next step.

I had one thought. Maybe they’re using the floor function. Let’s test:

# round_bern.stan corresponds to the Stan program in the previous post
mod_round <- cmdstan_model("round_bern.stan")
out_round <- mod_round$sample(
  data = list(y = floor(myData$y), x = myData$x, N = myData$N),
  parallel_chains = 4
)

out_round

 variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
lp__    -6.26  -5.95 1.04 0.74 -8.28 -5.28 1.00     1574     1874
beta[1] -2.53  -2.50 0.61 0.62 -3.57 -1.59 1.00     2029     1809
beta[2]  0.03   0.03 0.02 0.02  0.00  0.08 1.00     2372     1920


so I think that’s what they’re doing, and it’s not the fractional model you want. If you can get confirmation from them on this, you may want to request that they raise an error or a warning message that the inputs have been modified.
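On the earlier question of simulating artificial data: here is one possible sketch in Python (my own construction, not from the thread). It assumes a Beta data-generating process whose mean follows inv_logit(b0 + b1*x), then maximizes the fractional-logit quasi-likelihood with scipy instead of Stan; the point is that the quasi-MLE recovers the mean parameters even though the Beta likelihood is never used in fitting.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Simulate fractional responses with mean inv_logit(b0 + b1 * x).
# The Beta noise model is an assumption of this sketch, not the thread's.
rng = np.random.default_rng(1)
N, b0, b1, phi = 2000, -1.0, 0.05, 20.0
x = rng.normal(0, 30, size=N)
mu = expit(b0 + b1 * x)
y = rng.beta(mu * phi, (1 - mu) * phi)

def neg_quasi_loglik(beta):
    # Fractional-logit quasi-likelihood; clip p to keep the line search finite
    p = np.clip(expit(beta[0] + beta[1] * x), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))

fit = minimize(neg_quasi_loglik, x0=[0.0, 0.0])
print(fit.x)  # point estimates of (b0, b1)
```

With this seed and sample size the estimates come out close to the true (-1.0, 0.05); swapping in a Stan fit of the same quasi-likelihood on the simulated (x, y) would let you make the comparison the poster asked about.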


Hi -

I encourage you to look at my new model/paper on ordered beta regression, which can handle discrete (rounded) responses at 0/1 in addition to continuous values:

https://osf.io/preprints/socarxiv/2sx6y/

There is an R package available to estimate the model with brms, and you can also get the Stan code here:

There has not yet been a PyMC implementation but would be happy to see one/help if I can.

all best

Also, I believe the model you’re using above, @volabos, is the fractional logit model, which I also have Stan code for in the repo above. However, it can be quite inefficient if the ratio of continuous to discrete responses is very high. Also, it isn’t a proper statistical distribution, so it doesn’t have an RNG function, etc.