Normal_sufficient signature

Since normal_sufficient has already been merged to stan-dev, I think now is a good time to talk about this.

The thing is that normal_sufficient has the following signature:

normal_sufficient_lpdf(const T_y& y_bar, const T_s& s_squared,
                       const T_n& n_obs, const T_loc& mu,
                       const T_scale& sigma)

So, the straightforward way is to expose it as:

y_bar ~ normal_sufficient(s_squared, n_obs, mu, sigma);
normal_sufficient_lpdf(y_bar | s_squared, n_obs, mu, sigma);

But the problem with that is that s_squared and n_obs are also data, so it would be better to have it as something like:

(y_bar, s_squared, n_obs) ~ normal_sufficient(mu, sigma);
normal_sufficient_lpdf(y_bar, s_squared, n_obs | mu, sigma);

Has this problem happened before with another distribution? How feasible is this in terms of parser and stuff?

It’s not so much about data vs parameters, but rather on what are you conditioning. The “straightforward” way that you note is the correct statistical description – y_bar follows a Gaussian distribution conditioned on s_squared, n_obs, mu, and sigma.

But isn’t more accurate to say that

(y_bar, s_squared, n_obs) follows a Gaussian distribution conditioned on mu, and sigma_

since the whole (the sufficient statistic) vector (y_bar, s_squared, n_obs) is a function of data (y_1, y_2, …, y_n)?

Yes, it’s reasonable for s_squared to be on the left, but n_obs is definitely not data as you know how many elements there are ahead of time! The challenge with putting more than one variable on the left of the bar is that we don’t have any convention for something like (y_bar, s_squared) ~ sufficient_normal(mu, sigma, N);.

We could write this:

[y_bar, s_squared]' ~ sufficient_normal(mu, sigma, N);

if we assume the signature is

sufficient_normal_lpdf(vector, real, real, int);

I think we may need to rename this, too. We’ve gone with distribution first and parameterization later. So shouldn’t this be normal_sufficient_lpdf rather than sufficient_normal_lpdf? Or is it more like lognormal_lpdf?

@betanalpha Yes, you are right, n_obs is not part of statistic, just (y_bar, s_squared).

@Bob_Carpenter This would be a possible solution, although it makes it a little more complex on the side the user, specially for when y_bar and s_squared already are vector (probably very unlikely to happen with sufficient_normal), but that’s why we have append_col anyway. But then, would we document the distribution as multivariate?

Also note that the current name already is normal_sufficient_lpdf.

It’d have to be multivariate if that were the case. The real question is what normalizes to a proper density? That’ll determine when users need to apply jacobians and should also detemrine how we decide what goes to the left of the vertical bar.

Yes, since [Y_bar, S_squared] is a statistic, therefore a function of data [Y_1, Y_2, ..., Y_n], then they must integrate to 1 given the true parameters mu and sigma (obviously provided we adjust the jacobian of the lpdf just like any change of variable of we do).

But then, I think it’s not worth it to vectorize [y_bar, s_squared] like was done for multi_normal (what I mean is, the function should accept only a single vector [y_bar, s_squared], not a std::vector of Eigen vector like multi_normal), since if that’s the case the user can (and should) add more samples using some straigtfoward algebra like y_bar_new = (y_bar_1*n1 + y_bar2*n2)/(n1+n2).