# Stan model provides different estimate from actual process

Hi,

Let's say I have the data below:

``````import numpy as np
import pandas as pd

np.random.seed(1)
x = np.random.uniform(0, 10, 10000)
err = np.random.normal(0, 1, 10000)
y = -3 + 0.7 * x + err
p = np.exp(y) / (1 + np.exp(y))
dat = pd.DataFrame({'x': x, 'y': y, 'p': p})
``````

Based on this data, I am fitting the following `stan` model:

``````import stan

modelStan1 = """
data {
  int<lower=0> N;
  vector[N] p;
  vector[N] x;
}
parameters {
  real intercept;
  real slope;
}
transformed parameters {
  vector[N] mu = inv_logit(intercept + slope * x);
}
model {
  intercept ~ std_normal();
  slope ~ std_normal();
  target += p .* log(mu) + (1 - p) .* log1m(mu);
}
"""

posterior = stan.build(modelStan1, data={"N": np.size(x), "p": p, "x": x})
fit = posterior.sample(num_chains=4, num_samples=10000)
np.mean(fit["intercept"])
np.mean(fit["slope"])
``````

This gives estimates for the intercept and slope of `-1.4479` and `0.005`, respectively.

Could you please help me understand why I am getting estimates so different from the actual data-generating process?

I also fit the following model on the same data:

``````import statsmodels.api as sm

# e.g. a binomial-family GLM of the observed proportions p on x
res = sm.GLM(dat['p'], sm.add_constant(dat['x']), family=sm.families.Binomial()).fit()
print(res.params)
``````

This gives estimates somewhat closer to the actual data-generating process.

Any pointers to help me understand what might be wrong in my `stan` model would be very helpful.

Bayes is not really a tool to achieve point estimates. I’m not sure whether your description of the output reflects a mean/median, but regardless you need to look at the full posterior distribution, and additionally consider the influence of the prior, before jumping to conclusions that Bayes isn’t “recovering” the parameter values used to generate the synthetic data.
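For example, a quick way to look beyond a single point estimate (a minimal sketch; `draws` here is a synthetic stand-in for something like `fit["intercept"].ravel()` from a real fit):

```python
import numpy as np

# Synthetic stand-in for posterior draws; with a real fit this would be
# something like fit["intercept"].ravel()
rng = np.random.default_rng(0)
draws = rng.normal(-1.45, 0.2, size=4000)

# The median plus a 95% central interval says far more than the mean alone
lo, med, hi = np.percentile(draws, [2.5, 50, 97.5])
print(f"median {med:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```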

Additionally, are you sure you have the math right? When generating data, you have:

``````y = -3 + .7*x + err
p = exp(y) / (1 + exp(y))
``````

while in Stan you have:

``````mu = inv_logit(intercept + slope * x)
target += p .* log(mu) + (1 - p) .* log1m(mu);
``````

Note that `exp(y) / (1 + exp(y))` in the former is exactly `inv_logit(y)`, so the two links actually match; the real discrepancy is that the Python has a std_normal-distributed error `err` inside the logit that is absent in the Stan model.
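Given that generation process, `logit(p)` is exactly linear in `x` with Gaussian noise, so one option is to model it directly, e.g. `logit(p) ~ normal(intercept + slope * x, sigma);` in Stan (a sketch, untested). A quick plain-Python check that a linear fit on `logit(p)` recovers the generating values:

```python
import numpy as np

np.random.seed(1)
x = np.random.uniform(0, 10, 10000)
err = np.random.normal(0, 1, 10000)
y = -3 + 0.7 * x + err
p = np.exp(y) / (1 + np.exp(y))

# logit is the exact inverse of the generating transform, so logit(p) == y
logit_p = np.log(p / (1 - p))

# ordinary least squares on (x, logit(p)) should recover roughly (0.7, -3)
slope, intercept = np.polyfit(x, logit_p, 1)
print(slope, intercept)
```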

Thanks. `p` is the observed probability for each observation. What would you suggest as the correct `stan` model given this data-generating process?