Understanding log likelihood of sums

This question arises from quite a complex stan program, but after constant reduction of the model, I’ve ended up with a very simple program, showing that I clearly don’t understand how likelihoods work. Or maybe at least not in stan.
In case of the former, apologies for posting such simple quesitons here.

Here’s the simple problem:

Given three (iid) variables x1, x2, x3 ~ normal(0, D), I observe only their sum: s = x1 + x2 + x3 with some error n, i.e. s ~ Normal(x1+x2+x3, n).

Given some observations of s I wish to find the most likely value of D. One way to do this is to realize that one can just sample an x1 ~ Normal(0, sqrt(3)*D) and then s ~ Normal(x1, n). Here’s a stan program:

data {
  int<lower=1> N;
  real<lower=0> D;
  real<lower=0> noise;
  real s[N];
}
parameters {
  real x1[N];
}
model {
  // Intialise
  for (n in 1:N) {
    target += normal_lpdf(x1[n] | 0, sqrt(3)*D);
  }

  // Measurements
  for (n in 1:N) {
    s[n] ~ normal(x1[n], noise);
  }
}

If I optimize or sample for the values of x1 for differnent values of D, the maximum log likelihood for both optimizing and sampling gives the correct value of D. Here’s a pystan program to run the above:

import numpy as np 
import matplotlib.pyplot as plt
import pickle 
from pystan import StanModel
import sys, os

if True:
	sm = StanModel(file='stan_model3.stan')
	with open('model3.pkl', 'wb') as f:
	    pickle.dump(sm, f)
else:
	sm = pickle.load(open('model3.pkl', 'rb'))

N = 500
D = 5.
noise = 0.01

x1 = D*np.random.randn(N)
x2 = D*np.random.randn(N)
x3 = D*np.random.randn(N)
s_true = x1 + x2 + x3
s_mes = s_true + np.random.randn(N)*noise

D_list = np.linspace(0.2*D, 1.5*D, 20)
mean = np.zeros_like(D_list)
mean_sampling = np.zeros_like(D_list)

x1 = D*np.random.randn(N)
x2 = D*np.random.randn(N)
x3 = D*np.random.randn(N)
s_true = x1 + x2 + x3
s_mes = s_true + np.random.randn(N)*noise

# MLE
temp = np.zeros_like(D_list)
try:
	for i, D_guess in enumerate(D_list):
		print i
		data = dict(s=s_mes, N=N, D=D_guess, noise=noise)
		opt = sm.optimizing(data=data, iter=10000, as_vector=False)
		temp[i] = opt['value']
	mean += temp
except:
	print 'Exception'
	pass
# MCMC
temp = np.zeros_like(D_list)
try:
	for i, D_guess in enumerate(D_list):
		print i
		data = dict(s=s_mes, N=N, D=D_guess, noise=noise)
		fit = sm.sampling(data=data)
		temp[i] = np.mean(fit['lp__'])
	mean_sampling += temp
except:
	print 'Exception'
	pass

plt.plot(D_list, mean, '-bx')
plt.plot(D_list, mean_sampling, '-rx')
plt.xlabel('D')
plt.ylabel('log P')
plt.show()

Now I realize that I could have made D a parameter of the model. But I want the most likely D for any sampling of x1.

This all works very well, but it doesn’t generalize to what I actually need. If instead I write the stan program that explicitly samples x1, x2, x3, like this:

data {
  int<lower=1> N;
  real<lower=0> D;
  real<lower=0> noise;
  real s[N];
}
parameters {
  real x1[N];
  real x2[N];
  real x3[N];
}
model {
  // Intialise
  for (n in 1:N) {
    target += normal_lpdf(x1[n] | 0, D);
    target += normal_lpdf(x2[n] | 0, D);
    target += normal_lpdf(x3[n] | 0, D);
  }

  // Measurements
  for (n in 1:N) {
    s[n] ~ normal(x1[n] + x2[n] + x3[n], noise);
  }
}

It no longer works. Suddenly, both for sampling and optimizing, the optimal value of D is wrong. The optimum becomes D = D_true / sqrt(3).

I’ve used target += normal_lpdf to get the right log likelihoods, but since noise is not varied, ~ normal can be used for the final measurement.

I hope the question is clear. I know it’s very simple, but I have debugged for a long time now, and I simply cannot understand why this doesn’t just work.

Thank you in advance.

I think you’re confused about how Stan works. The model you wrote places a prior on your parameters (x1, x2, x3) and then defines a non-identified measurement model. When you increment target your are writing the calculation for the posterior density you want to sample/optimize. So the right place to start might be to figure out exactly what density you want to sample/optimize. At a glance I think your model will ultimately be identified because you have a hierarchical model with N noisy measurements contributing to the estimation of D so if N is big enough you should be able to estimate D.

Thank you for your answer.

I am not sure what exactly you mean, however. According to the stan manual, the only difference between doing

target += normal_lpdf(x1[n] | 0, D);

and

x1[n] ~ normal(0, D);

is that the normalization constants are included.

But you’re probably right that I am confused. I simply want to build a model with hidden/missing data. x1, x2, x3 are hidden and only s = x1 + x2 + x3 is observed. If not like the code above, how should this be written?

Sure but you’re talking like you want to estimate D, but your model passes D in as data so you’re clearly not writing the right likelihood… but I can’t tell what likelihood you want to write.

Thank you again.

I realize this is not a standard approach, and that if I want to estimate D in the MCMC way, I should have that as a parameter. But what I want is the maximum likelihood estimate of D, but not MLE of x1, x2, x3, these should be MCMC sampled.

In any case, even if I pass D as a parameter, shouldn’t I be able to expect that the correct D maximizes the log likelihood?

Again, I’m sorry if I’m talking nonsense…

It’s not nonsense, but what you want is a max marginal likelihood (when we view this as a multilevel model) where x1, x2, and x3 are marginalized. That’s not a thing Stan does so I can see how your approach of including D as data (I assume trying a range of values of D to get the max) might work.

Since these are all normals you could just combine then and your model block ends up something like (pretty sure I goofed up the back-and-forth between sd and variance somewhere but you get the idea)… now you don’t need to sample any parameters so you can just optimize on D.

model {
  target += normal(s, sqrt( (sqrt(3)*D)^2 + noise^2) )
}

Why not simply set D up as a parameter with a very broad prior, make the x values parameters, set up the observed sum as error on the sum of the parameters, and then sample, take the samples of D and do a Kernel Density Estimate and pick out the maximum a posteriori D?

D ~ some_broad_prior()

x1 ~ normal(0,D)
x2 ~ normal(0,D)
x3 ~ normal(0,D)

s ~ normal(x1+x2+x3,noise)

sample… select D, find the MAP value (in R, python, whatever, as post-processing)?

Thank you @sakrajda! This was exactly what I meant.

The example I gave was using normals, just so I know what the right solution is (which for instance is the one you post). But my real problem is a lot bigger and not gaussian, so I would like to use the approach of defining x1, x2, x3 seperately. I just don’t know why it doesn’t work. Optimizing over D (from the pystan side) yields the wrong optimum.

Thank klakelan. This is not what I want, however. This samples (x1, x2, x3, D), and the MAP will also choose specific values of x1, x2, x3.

No, after sampling, pool all the D values, do a KDE and pick the max density D. This would be done outside Stan in post-processing (say in R for example). This is max marginal density for D, and would be different than if you asked Stan to maximize the posterior and give you that D.

Ah! Now I see what you are saying. Yes, I guess this could work as well. I’ll try it.

But I still would like to understand why my current approach isn’t working.

Thank again!

Well, x1, x2, and x3 for a given N are completely non-identified so that might cause problems with exploring the density fully, but have you checked that using the marginalized form we discussed about gives the expected answer for max marginal of D?

@dlakelan : Your approach works. Thanks! This pretty much is what I was looking, however, I am still very curious to understand why my original approach yields the wrong D.

@sakrejda : Yes! This works as well. However, I don’t see how to generalize this to when x1, x2, x3 is not all Gaussian or some other simple distribution.

I’m not sure what your original approach was, because you were passing D in as data but then wanting to find a particular D? If you made D a parameter and did a maximization you’d wind up with the joint MAP which isn’t what you wanted to get, you wanted the marginal mode.

The basic approach of making the thing you’re interested in a parameter and then sampling and using the samples to decide on a particular value for D after the fact is the one that makes sense from a Bayesian Decision Theory perspective. First, do inference, second, decide what you want to do about it (ie. choose a particular value).

My original approach was to pass many different D's as data and see which had the highest mean likelihood on an MCMC run.

But I see that your approach is the right one, and I’ll definitely use that. (although, I iterate, I really don’t understand why my approach doesn’t even work. I understand it might not be optimal, but the fact that it returns wrong values escapes me).

Ah, I see, you were running many MCMC runs… well I’m not sure why that didn’t work, but one question were you taking the highest mean lp__ or highest mean exp(lp__) ??

Yes! Sorry if I didn’t explain that well enough.

I was taking highest mean lp__. That’s a good point, I should try mean exp(lp__). Thank you!!

This also doesn’t seem to work. Although, now I might have to worry about floating point accuracy…

Calculate M = max(lp__) in one of your samples, then do

mean(exp(lp__ - M))

so that the argument to exp is never too large

Note though: keep M constant across the whole calculation, not a different M per each batch