Hi,
I have posted a similar question for help in some other forum, but suggested to post the problem here as dedicated forum for stan
models.
I have below model written in Python’s pymc3
to estimate the parameters of a regression equation
where the response variable is Fractional. To model this fractional
response variable I used a Logit
link function. Below is my model
import pandas as pd
import pymc3 as pm
import arviz as arviz
myData = pd.DataFrame.from_dict({'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38}, 'y': {0: 0.0079235409492941, 1: 0.0086530073429249, 2: 0.0297400780486734, 3: 0.0196358416326437, 4: 0.0023902064076204, 5: 0.0258055591736283, 6: 0.17394835142698, 7: 0.156463554455613, 8: 0.329388185725557, 9: 0.0076443508881763, 10: 0.0162081480398152, 11: 0.0, 12: 0.0015759139941696, 13: 0.420025972703085, 14: 0.0001226236519444, 15: 0.133061480234834, 16: 0.565454216154227, 17: 0.0002819734812997, 18: 0.000559715156383, 19: 0.0270686389659072, 20: 0.918300537689865, 21: 7.8262468302e-06, 22: 0.0073241434191945, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, 29: 0.0, 30: 0.174071274611405, 31: 0.0432109713717948, 32: 0.0544400838264943, 33: 0.0, 34: 0.0907049925221286, 35: 0.616680102647887, 36: 0.0, 37: 0.0}, 'x': {0: 23.8187587698947, 1: 15.9991138359515, 2: 33.6495930512881, 3: 28.555818797764, 4: -52.2967967248258, 5: -91.3835208788233, 6: -73.9830692708321, 7: -5.16901145289629, 8: 29.8363012310241, 9: 10.6820057903939, 10: 19.4868517164395, 11: 15.4499668436458, 12: -17.0441644773509, 13: 10.7025053739577, 14: -8.6382953428539, 15: -32.8892974839165, 16: -15.8671863161348, 17: -11.237248036145, 18: -7.37978020066205, 19: -3.33500586334862, 20: -4.02629933182873, 21: -20.2413384726948, 22: -54.9094885578775, 23: -48.041459120976, 24: -52.3125732905322, 25: -35.6269065970458, 26: -62.0296155423529, 27: -49.0825017152659, 28: -73.0574478287598, 29: -50.9409090127938, 30: -63.4650928035253, 31: -55.1263264283842, 32: -52.2841103768755, 33: -61.2275334149805, 34: -74.2175990067417, 35: -68.2961107804698, 36: -76.6834643609286, 37: -70.16769103228}})
with pm.Model() as myModel :
beta0 = pm.Normal('intercept', 0, 1)
beta1 = pm.Normal('x', 0, 1)
mu = beta0 + beta1 * myData['x'].values
pm.Bernoulli('obs', p = pm.invlogit(mu), observed = myData['y'].values)
with myModel :
trace = pm.sample(50000, tune = 10000, step = pm.Metropolis(), random_seed = 1000)
arviz.summary(trace, round_to = 10)
Above implementation gives below result, which is running perfectly fine with my Python+pymc3
.
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.537501 0.599667 -3.707061 -1.450243 0.004375 0.003118 18893.344191 22631.772985 1.000070
x 0.033750 0.024314 -0.007871 0.081619 0.000181 0.000133 18550.620475 20113.739639 1.000194
I also could use typical Frequentist estimation using R’s glm()
function, which gives moderately close estimates
myData = list(y = c(0.00792354094929414, 0.00865300734292492, 0.0297400780486734,
0.0196358416326437, 0.00239020640762042, 0.0258055591736283,
0.17394835142698, 0.156463554455613, 0.329388185725557, 0.00764435088817635,
0.0162081480398152, 0, 0.00157591399416963, 0.420025972703085,
0.000122623651944455, 0.133061480234834, 0.565454216154227, 0.000281973481299731,
0.000559715156383041, 0.0270686389659072, 0.918300537689865,
0.00000782624683025728, 0.00732414341919458, 0, 0, 0, 0, 0, 0,
0, 0.174071274611405, 0.0432109713717948, 0.0544400838264943,
0, 0.0907049925221286, 0.616680102647887, 0, 0), x = c(23.8187587698947,
15.9991138359515, 33.6495930512881, 28.555818797764, -52.2967967248258,
-91.3835208788233, -73.9830692708321, -5.16901145289629, 29.8363012310241,
10.6820057903939, 19.4868517164395, 15.4499668436458, -17.0441644773509,
10.7025053739577, -8.6382953428539, -32.8892974839165, -15.8671863161348,
-11.237248036145, -7.37978020066205, -3.33500586334862, -4.02629933182873,
-20.2413384726948, -54.9094885578775, -48.041459120976, -52.3125732905322,
-35.6269065970458, -62.0296155423529, -49.0825017152659, -73.0574478287598,
-50.9409090127938, -63.4650928035253, -55.1263264283842, -52.2841103768755,
-61.2275334149805, -74.2175990067417, -68.2961107804698, -76.6834643609286,
-70.16769103228), N = 38)
summary(glm(y ~ x, myData, family = binomial()))
Result-
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02368 0.63467 -3.189 0.00143 **
x 0.00641 0.01492 0.430 0.66742
I am wondering if this model can be expressed as stan
model. I have not seen any online resource etc on how can a stan
model be defined for a Regression equation
where values of the response variable is strictly fractional which required to be addressed by a Logit
link function.
The model I am trying to fit with this data is Fractional model - Wikipedia
Any pointer is highly appreciated.