# ep-stan: separating out the normalized std

Tuomas, there is a common pattern I see throughout all the ep-stan sample models (see m4b for example). The mu and sigma are passed as elements of phi, and are therefore estimated using the distributed worker nodes. But there is always a global parameter, say eta, which is not incorporated into phi and is instead used in the transformed parameters block:

```stan
alpha <- mu_a + eta * sigma_a;
```

and eta itself is sampled as

```stan
eta ~ normal(0, 1);
```

May I ask why eta is treated as a global parameter?

This is an example of the ‘non-centred’ parameterization, which is used to make sampling from some (not all!) posteriors easier and more efficient. There’s a great overview on p. 341 of the Stan manual (v2.17).

Essentially, in some cases it can be difficult for Stan to sample from a `normal(mu_a, sigma_a)` distribution. Thankfully, however, `normal(mu_a, sigma_a)` is equivalent to `mu_a + normal(0, 1) * sigma_a`. This means that Stan only needs to sample from a `normal(0, 1)` distribution (which it’s pretty great at).
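To make the equivalence concrete, here is a quick numerical sanity check in plain Python (a hypothetical illustration with invented values, not part of any ep-stan model): draws of `mu_a + eta * sigma_a` with `eta ~ normal(0, 1)` match direct draws from `normal(mu_a, sigma_a)` in mean and standard deviation.

```python
import random

random.seed(0)
mu_a, sigma_a = 2.0, 3.0  # invented example values
n = 200_000

# Direct (centered) draws from normal(mu_a, sigma_a)
centered = [random.gauss(mu_a, sigma_a) for _ in range(n)]
# Non-centered draws: eta ~ normal(0, 1), then alpha = mu_a + eta * sigma_a
non_centered = [mu_a + random.gauss(0.0, 1.0) * sigma_a for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

print(mean(centered), std(centered))
print(mean(non_centered), std(non_centered))
```

Both print statements give a mean near 2 and a standard deviation near 3 (up to Monte Carlo error): the two parameterizations describe the same distribution, but the sampler only ever has to deal with `normal(0, 1)` in the non-centered version.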

For the parameters that you gave, this is equivalent to, but likely more efficient than, `alpha ~ normal(mu_a, sigma_a)`.

Ok, I accept that sampling from normal(0, 1) may be easier, but why are mu_a and sigma_a estimated using phi (distributed using EP) while eta is just global?

`mu_a` and `sigma_a` are estimated through phi (multivariate normal) because the two estimates are correlated (not independent). `eta` is estimated separately (univariate normal) because each value of eta is assumed to be uncorrelated with all of the others (independent).
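A minimal sketch of that distinction in plain Python (with invented numbers, purely illustrative): a draw of phi uses a full covariance matrix, so its components move together, while the eta components are drawn independently from standard normals.

```python
import random

random.seed(2)

# Invented 2x2 covariance for phi = (mu_a, log_sigma_a), with correlation
cov = [[1.0, 0.6],
       [0.6, 1.0]]

# Cholesky factor L of cov (cov = L L^T), written out by hand for the 2x2 case
l11 = cov[0][0] ** 0.5
l21 = cov[1][0] / l11
l22 = (cov[1][1] - l21 ** 2) ** 0.5

# Correlated draw of phi: one shared z feeds both components through L
z = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
phi_draw = [l11 * z[0], l21 * z[0] + l22 * z[1]]

# Independent draws of eta: no covariance structure at all
eta = [random.gauss(0.0, 1.0) for _ in range(5)]

print(phi_draw, eta)
```

The off-diagonal entries of `cov` are exactly what phi carries and eta does not, which is why phi needs the multivariate treatment.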

Andr, are you familiar with ep-stan? My questions are directly related to how some variables are solved through expectation propagation in a distributed fashion while the variable eta is not. I don’t believe your response is along those lines.

I especially have a problem visualizing why etb[j] in model m4b is calculated per group j while a single etb is shared among all D dimensions of beta[j]:

```stan
for (j in 1:J) {
  beta[j] <- mu_b + etb[j] .* sigma_b;
}
```

where beta[j], mu_b, and sigma_b are all of dimension D. The standard deviation for each dimension should be completely independent, so I fail to see the reasoning for sharing the normalized standard deviation.

That’s not quite what’s happening here, since etb[j] is also of dimension D:

```stan
vector[D] etb[J];
```

So there is an independent `etb` for each dimension D in each group J.
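To make the indexing concrete, here is a hypothetical plain-Python analogue of that loop (values invented): etb is a J-by-D array, so every group and every dimension gets its own independent standardized draw, and the elementwise `.*` just pairs up matching dimensions.

```python
import random

random.seed(1)
J, D = 3, 4  # invented sizes
mu_b = [0.5, -1.0, 2.0, 0.0]     # invented, dimension D
sigma_b = [1.0, 0.5, 2.0, 1.5]   # invented, dimension D

# One independent standard-normal draw per group j AND per dimension d
etb = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(J)]

# beta[j] = mu_b + etb[j] .* sigma_b, done elementwise per dimension
beta = [[mu_b[d] + etb[j][d] * sigma_b[d] for d in range(D)]
        for j in range(J)]

print(len(beta), len(beta[0]))  # J groups, each of dimension D
```

Nothing is shared across dimensions here: `etb[j][d]` is a distinct draw for each (j, d) pair, exactly as `vector[D] etb[J]` declares.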

Taking the example from the first post:

```stan
alpha <- mu_a + eta * sigma_a;
eta ~ normal(0, 1);
```

The aim is to estimate the variable `alpha`, which is normally distributed with mean `mu_a` and std `sigma_a`. These are the three parameters that need to be returned from each of the K partitions/nodes. `mu_a` and `sigma_a` can be sampled easily enough, but sampling a variable with a `normal(mu_a, sigma_a)` distribution is difficult, and so the non-centered parameterization (and `eta`) is used. `eta` does not need to be returned from each of the partitions/nodes, because it is simply a tool used to construct the variable `alpha`, which is the variable of interest.

TL;DR: from each partition/node we want to estimate the distribution of `alpha`. Rather than directly sampling `alpha` in each partition, the non-centered parameterization uses `eta` to construct `alpha`. Each partition then returns `alpha`.

Thanks, my bad for missing the dimension on etb[j]. That makes much more sense. I still don’t have the intuition for why eta and etb do not need to go through phi. Let me ask this way: in the equation

```stan
alpha <- mu_a + eta * sigma_a;
```

alpha is already a fixed value, not a random variable, is that right? In other words, eta was already sampled from

```stan
eta ~ normal(0, 1);
```

Doesn’t the exact sampled value for eta come from the input evidence involving the complicated likelihood? I am not getting why you can pick any sample from the standard Gaussian and simply plug it in to derive alpha.

Upon further inspection, I am more puzzled by the fact that the res_d_model.npz file only gives you mu_phi and cov_phi. How, then, do I get the actual values of alpha and beta? These are the parameters that the model is intended to derive.

Can I also ask: if you compare m1a and m4b, the first calculates beta directly from phi, while the second goes through this mu + normal(0, 1) * sigma calculation. How do you decide which method to use? Is the direct derivation of beta good enough?

After a few rounds of internal debate, I totally get your points. It’s a matter of the full posterior vs. a simple maximum a posteriori.

I am leaving my comments in place, so I can revisit and get back to my thought patterns.