Training a Bernoulli model using probabilities as inputs

I’m using two methods to train a Bernoulli model, and am trying to understand why they are not yielding similar results. For both methods, I have a length N array of probabilities \{\hat{y}^{(n)}\}_{n=1}^{N}, and I want to estimate the distribution of a length N array of parameters \{\theta^{(n)}\}_{n=1}^{N}. In method (1), for each \hat{y}^{(n)}, I sample M times from a Bernoulli distribution with probability \hat{y}^{(n)}, and use the resulting binary data as the input to my model. In method (2), I use \hat{y}^{(n)} directly by incrementing the log joint density using the following update rule:

\begin{equation*}
\begin{split}
\log p(\vec{y} \mid \theta)
&= \sum_{n=1}^{N} \log p(y^{(n)} \mid \theta^{(n)})\\
&\approx \sum_{n=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \log p(y^{(n)}_m \mid \theta^{(n)})\\
&\approx \sum_{n=1}^{N} E_{y^{(n)}_m}\left[\log p(y^{(n)}_m \mid \theta^{(n)})\right]\\
&= \sum_{n=1}^{N} E_{y^{(n)}_m}\left[\log\left({\theta^{(n)}}^{y^{(n)}_m} (1 - \theta^{(n)})^{1 - y^{(n)}_m}\right)\right]\\
&= \sum_{n=1}^{N} E_{y^{(n)}_m}\left[y^{(n)}_m \log \theta^{(n)} + (1 - y^{(n)}_m) \log(1 - \theta^{(n)})\right]\\
&= \sum_{n=1}^{N} \Pr(y^{(n)}_m = 1) \log \theta^{(n)} + \Pr(y^{(n)}_m = 0) \log(1 - \theta^{(n)})\\
&= \sum_{n=1}^{N} \hat{y}^{(n)} \log \theta^{(n)} + (1 - \hat{y}^{(n)}) \log(1 - \theta^{(n)})
\end{split}
\end{equation*}
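
As a quick numerical check of the last line (a standalone sketch, with arbitrary values for \theta, \hat{y}, and M), the Monte Carlo average of the Bernoulli log-pmf should converge to the closed-form expression:

import numpy as np

rng = np.random.RandomState(0)
theta, yhat, M = 0.3, 0.7, 100000

# Monte Carlo average of the Bernoulli log-pmf under y ~ Bernoulli(yhat)
y = rng.binomial(1, yhat, size=M)
mc = np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# closed form from the last line of the derivation
closed = yhat * np.log(theta) + (1 - yhat) * np.log(1 - theta)
print(mc, closed)  # these agree to roughly O(1/sqrt(M))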

In the following toy problem, I expect method (1) to approximate method (2) for large M, but I’m not getting similar estimates for \theta. I’d like to share the output plots, but unfortunately I can’t upload any images since I’m a new member. Thank you very much for your help, and please let me know if I can provide any further information.

Method (1):

import pystan
import numpy as np
import matplotlib.pyplot as plt

model = '''
data {
    int<lower=0> M;
    int<lower=0> N;
    int<lower=0, upper=1> y[M, N];
}
parameters {
    real<lower=0, upper=1> theta[N];
}
model {
    for (m in 1:M) {
        y[m] ~ bernoulli(theta);
    }
}
'''
sm = pystan.StanModel(model_code=model)

rng = np.random.RandomState(0)
M = 100
N = 100
theta_actual = rng.uniform(0, 1, N)

# draw M Bernoulli samples for each of the N probabilities
y = rng.binomial(1, theta_actual, size=(M, N))

data = {
    'M': M,    
    'N': N,
    'y': y  # the key must match the variable name in the Stan data block
}

fit = sm.sampling(data=data, verbose=True)
samples = fit.extract()
theta = np.moveaxis(samples['theta'], 0, -1)
theta_mean = theta.mean(axis=1)
theta_sd = theta.std(axis=1)
plt.scatter(theta_mean, theta_actual)
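
Since I can’t attach the plots, here is a numeric summary of the recovery for method (1) (using theta_mean, theta_sd, and theta_actual as defined above):

print(np.corrcoef(theta_mean, theta_actual)[0, 1])   # correlation with the true values
print(np.abs(theta_mean - theta_actual).mean())      # mean absolute error
print(theta_sd.mean())                               # average posterior uncertainty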

Method (2):

model = '''
data {
    int<lower=0> N;
    real<lower=0, upper=1> yhat[N];
}
parameters {
    real<lower=0, upper=1> theta[N];
}
model {
    for (n in 1:N) {
        target += yhat[n] * log(theta[n]) + (1 - yhat[n]) * log1m(theta[n]);
    }
}
'''
sm = pystan.StanModel(model_code=model)

# reuse N and theta_actual from method (1) so the two fits are directly comparable

data = {
    'N': N,
    'yhat': theta_actual
}

fit = sm.sampling(data=data, verbose=True)
samples = fit.extract()
theta = np.moveaxis(samples['theta'], 0, -1)
theta_mean = theta.mean(axis=1)
theta_sd = theta.std(axis=1)
plt.scatter(theta_mean, theta_actual)
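
And the same numeric summary for method (2), for comparison:

print(np.corrcoef(theta_mean, theta_actual)[0, 1])
print(np.abs(theta_mean - theta_actual).mean())
print(theta_sd.mean())  # average posterior uncertainty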

If I’m reading this right, model 1 sees 100 samples per parameter (and thus can form a somewhat sharp estimate of \theta), while model 2 only gets an amount of information equivalent to a single sample (and so must remain highly uncertain about \theta). If you weight the information in model 2 by the number of samples, it should give the same results as model 1:

target += M * (yhat[n] * log(theta[n])
               + (1 - yhat[n]) * log1m(theta[n]));
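
Concretely, a minimal sketch of the weighted version of your method (2) model (reusing your variable names, and assuming the weight M is the same sample count used to generate the binary data in method (1)):

model = '''
data {
    int<lower=0> M;
    int<lower=0> N;
    real<lower=0, upper=1> yhat[N];
}
parameters {
    real<lower=0, upper=1> theta[N];
}
model {
    for (n in 1:N) {
        // weight the fractional observation by the number of samples it summarizes
        target += M * (yhat[n] * log(theta[n])
                       + (1 - yhat[n]) * log1m(theta[n]));
    }
}
'''
data = {'M': M, 'N': N, 'yhat': theta_actual}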