Hi,
Say I have the data below:
import numpy as np
import pandas as pd

np.random.seed(1)
x = np.random.uniform(0, 10, 10000)
err = np.random.normal(0, 1, 10000)
y = -3 + 0.7 * x + err          # linear predictor with noise on the logit scale
p = 1 / (1 + np.exp(-y))        # inverse-logit, i.e. exp(y) / (1 + exp(y))
dat = pd.DataFrame({'x': x, 'y': y, 'p': p})
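As a quick sanity check on the simulation (my own addition, not part of the original script), the logit of p should reproduce the noisy linear predictor y exactly, since p is just the inverse-logit of y:

```python
import numpy as np

np.random.seed(1)
x = np.random.uniform(0, 10, 10000)
err = np.random.normal(0, 1, 10000)
y = -3 + 0.7 * x + err          # true process: intercept -3, slope 0.7
p = 1 / (1 + np.exp(-y))        # inverse-logit

# logit(p) should round-trip back to y (up to floating point)
logit_p = np.log(p / (1 - p))
print(np.allclose(logit_p, y))  # True
```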
Based on these data, I am fitting the Stan model below:
import stan
modelStan1 = """
data {
  int<lower=0> N;
  vector[N] p;
  vector[N] x;
}
parameters {
  real intercept;
  real slope;
}
transformed parameters {
  vector[N] mu = inv_logit(intercept + slope * x);
}
model {
  intercept ~ std_normal();
  slope ~ std_normal();
  target += p .* log(mu) + (1 - p) .* log1m(mu);
}
"""
posterior = stan.build(modelStan1, data={"N": np.size(x), "p": p, "x": x})
fit = posterior.sample(num_chains=4, num_samples=10000)
print(np.mean(fit["intercept"]))
print(np.mean(fit["slope"]))
This gives estimates of the intercept and slope of about -1.4479 and 0.005, respectively.
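One check I would add (my own, independent of Stan) is to evaluate the same pseudo-log-likelihood directly in NumPy, once at the true parameters (-3, 0.7) and once at the posterior means reported above. If the model target expresses what I intend, the true parameters should score much higher:

```python
import numpy as np

np.random.seed(1)
x = np.random.uniform(0, 10, 10000)
err = np.random.normal(0, 1, 10000)
y = -3 + 0.7 * x + err
p = 1 / (1 + np.exp(-y))

def pseudo_loglik(intercept, slope):
    """Sum of p*log(mu) + (1-p)*log(1-mu), the same target as in the Stan model."""
    mu = 1 / (1 + np.exp(-(intercept + slope * x)))
    return np.sum(p * np.log(mu) + (1 - p) * np.log1p(-mu))

print(pseudo_loglik(-3, 0.7))         # at the true data-generating parameters
print(pseudo_loglik(-1.4479, 0.005))  # at the posterior means from the Stan fit
```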
Could you please help me understand why I am getting estimates so different from the actual data-generating process?
I also fit the model below on the same data:
import statsmodels.api as sm

glmFit = sm.GLM(dat['p'], sm.add_constant(dat['x']), family=sm.families.Binomial()).fit()
print(glmFit.summary())
This gives estimates somewhat closer to the actual data-generating process.
Any pointers on whether something is wrong in my Stan model would be very helpful.