Stan model gives estimates different from the actual data-generating process

Hi,

Let's say I have the data below:

import numpy as np
import pandas as pd

np.random.seed(1)
x = np.random.uniform(0, 10, 10000)
err = np.random.normal(0, 1, 10000)            # Gaussian noise with sd 1
y = -3 + 0.7 * x + err                         # linear predictor plus noise
p = np.exp(y) / (1 + np.exp(y))                # inverse-logit transform of y
dat = pd.DataFrame({'x': x, 'y': y, 'p': p})

Based on this data, I am fitting the Stan model below:

import stan
modelStan1 =  """
data {
  int<lower=0> N;
  vector[N] p;
  vector[N] x;
}
parameters {
  real intercept;
  real slope;
}
transformed parameters {
  vector[N] mu = inv_logit(intercept + slope * x);
}
model {
  intercept ~ std_normal();
  slope ~ std_normal();  
  target += p .* log(mu) + (1 - p) .* log1m(mu);
}
"""
posterior = stan.build(modelStan1, data={"N" : np.size(x), "p" : p, "x": x})
fit = posterior.sample(num_chains=4, num_samples=10000)
np.mean(fit["intercept"])
np.mean(fit["slope"])

This gives estimates for the intercept and slope of -1.4479 and 0.005, respectively.

Could you please help me understand why I am getting estimates so different from the actual data-generating process?

I also fit the model below on the same data:

import statsmodels.api as sm
sm.GLM(dat['p'], sm.add_constant(dat['x']), family=sm.families.Binomial()).fit().summary()

This gives estimates somewhat closer to the actual data-generating process.

Any pointers on whether something is wrong in my Stan model would be very helpful.

Bayes is not really a tool for producing point estimates. I’m not sure whether the numbers you report are means or medians, but regardless you need to look at the full posterior distribution, and also consider the influence of the priors, before concluding that Bayes isn’t “recovering” the parameter values used to generate the synthetic data.
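For example, rather than only taking np.mean of the draws, a quick way to see the spread (just a sketch; something like ArviZ's az.summary would give a fuller picture along with diagnostics) is:

# Median and 95% interval for each parameter, using the draws already in fit
print(np.percentile(fit["intercept"], [2.5, 50, 97.5]))
print(np.percentile(fit["slope"], [2.5, 50, 97.5]))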

Additionally, are you sure you have the math right? When generating data, you have:

y = -3 + .7*x + err
p = exp(y) / (1 + exp(y))

while in Stan you have:

mu = inv_logit(intercept + slope * x)
target += p .* log(mu) + (1 - p) .* log1m(mu);

So the Stan model applies inv_logit explicitly while the generating code writes it out as exp(y) / (1 + exp(y)), and, more importantly, the Python has a normally distributed noise term err that has no counterpart in the Stan model.
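If p really is observed directly as a probability generated that way, then logit(p) is just intercept + slope * x plus Gaussian noise, so one option, a rough, untested sketch on my part (note the extra noise scale sigma and the widened priors, since the true intercept is -3), would be to model logit(p) with a normal likelihood:

modelStan2 = """
data {
  int<lower=0> N;
  vector<lower=0, upper=1>[N] p;
  vector[N] x;
}
transformed data {
  // logit(p) recovers y exactly, since p = inv_logit(y) in the generating code
  vector[N] logit_p = logit(p);
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;   // plays the role of the sd of err in the generating code
}
model {
  intercept ~ normal(0, 5);
  slope ~ normal(0, 5);
  sigma ~ std_normal();
  logit_p ~ normal(intercept + slope * x, sigma);
}
"""

This treats the observed p as exact, which matches how the synthetic data were generated; if p were instead an empirical proportion from Bernoulli trials, a binomial likelihood would be the appropriate choice.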

Thanks. p is the observed probabilities. Would something like that logit-normal sketch be the correct Stan model given this data-generation process, or would you suggest something else?