Stan model gives estimates different from the actual data-generating process

Hi,

Let's say I have the data below:

import numpy as np
import pandas as pd

np.random.seed(1)
x = np.random.uniform(0, 10, 10000)
err = np.random.normal(0, 1, 10000)            # Gaussian noise with sd 1
y = -3 + 0.7 * x + err                         # linear predictor plus noise
p = np.exp(y) / (1 + np.exp(y))                # inverse-logit transform of y
dat = pd.DataFrame({'x': x, 'y': y, 'p': p})

Based on this data, I am fitting the Stan model below:

import stan
modelStan1 =  """
data {
  int<lower=0> N;
  vector[N] p;
  vector[N] x;
}
parameters {
  real intercept;
  real slope;
}
transformed parameters {
  vector[N] mu = inv_logit(intercept + slope * x);
}
model {
  intercept ~ std_normal();
  slope ~ std_normal();  
  target += p .* log(mu) + (1 - p) .* log1m(mu);
}
"""
posterior = stan.build(modelStan1, data={"N" : np.size(x), "p" : p, "x": x})
fit = posterior.sample(num_chains=4, num_samples=10000)
np.mean(fit["intercept"])
np.mean(fit["slope"])

This gives estimates for the intercept and slope of -1.4479 and 0.005, respectively.

Could you please help me understand why I am getting estimates so different from the actual data-generating process?

I also fit the model below on the same data:

import statsmodels.api as sm
sm.GLM(dat['p'], sm.add_constant(dat['x']), family=sm.families.Binomial()).fit().summary()

This gives estimates somewhat closer to the actual data-generating process.

Any pointers on whether something is wrong in my Stan model would be very helpful.

Bayes is not really a tool for producing point estimates. I’m not sure whether the numbers you report are means or medians, but regardless you need to look at the full posterior distribution, and also consider the influence of the priors, before concluding that Bayes isn’t “recovering” the parameter values used to generate the synthetic data.
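For example, rather than only taking np.mean of the draws, a quick way to see the spread (just a sketch; something like ArviZ's az.summary would give a fuller picture along with diagnostics) is:

# Median and 95% interval for each parameter, using the draws already in fit
print(np.percentile(fit["intercept"], [2.5, 50, 97.5]))
print(np.percentile(fit["slope"], [2.5, 50, 97.5]))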

Additionally, are you sure you have the math right? When generating data, you have:

y = -3 + .7*x + err
p = exp(y) / (1 + exp(y))

while in Stan you have:

mu = inv_logit(intercept + slope * x)
target += p .* log(mu) + (1 - p) .* log1m(mu);

So the Stan model applies inv_logit explicitly while the generating code writes it out as exp(y) / (1 + exp(y)), and, more importantly, the Python has a normally distributed noise term err that has no counterpart in the Stan model.
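If p really is observed directly as a probability generated that way, then logit(p) is just intercept + slope * x plus Gaussian noise, so one option, a rough, untested sketch on my part (note the extra noise scale sigma and the widened priors, since the true intercept is -3), would be to model logit(p) with a normal likelihood:

modelStan2 = """
data {
  int<lower=0> N;
  vector<lower=0, upper=1>[N] p;
  vector[N] x;
}
transformed data {
  // logit(p) recovers y exactly, since p = inv_logit(y) in the generating code
  vector[N] logit_p = logit(p);
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;   // plays the role of the sd of err in the generating code
}
model {
  intercept ~ normal(0, 5);
  slope ~ normal(0, 5);
  sigma ~ std_normal();
  logit_p ~ normal(intercept + slope * x, sigma);
}
"""

This treats the observed p as exact, which matches how the synthetic data were generated; if p were instead an empirical proportion from Bernoulli trials, a binomial likelihood would be the appropriate choice.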

Thanks. p is the observed probabilities. Would something like that logit-normal sketch be the correct Stan model given this data-generation process, or would you suggest something else?