Request for final feedback: User controlled unnormalized (propto) distribution syntax

We are trying to add an option for users to manually specify a distribution that will only be calculated proportional to a constant (i.e. it will not be normalized). The PR is mostly done I believe and Niko has been great with his feedback and ideas, but I want some more eyes on it, just to double-check if we missed anything and if anyone has more thoughts on this.

Stanc3 currently does support the so-called _propto_ infix, but this was never documented and thus not released. We want to release this for 2.24 if possible. But not going to push it, if we feel it should be done in some other way.

There was already a discussion on this previously (and also a poll), that I will try to quickly summarize. For those that wish to wish to read through previous discussions here are the main links:

As @Bob_Carpenter mentioned in the previous Discourse thread

target += foo_lpdf(y | theta);

and

y ~ foo(theta);

behave differently. The functional form and hence target += includes normalizing constants (like log√2π in normal_lpdf). The sampling statement form (with ~ ) drops normalizing constants in the samplers and optimizers.

Based on the poll done in the linked Discourse thread, the version that “won” was the _lupdf suffix for unnormalized form.

Meaning that

target += foo_lupdf(y | theta);

would be equal to

y ~ foo(theta);

as in both would drop the normalizing constant as the tilde statement already does. And

target += foo_lpdf(y | theta);

would stay the normalized form as it is currently.

The unnormalized form would be allowed in the model block, the _lpdf / _lpmf function body, and the _lp function body. It would not be allowed in any other block. For reasons on why see Niko’s comment here.

The suffix would work for all lpdfs/lpmfs defined in the Stan Math library, as well as for user-defined lpdfs/lpmfs. The way it would work for user-defined functions is described in the example below. See comment next to the calls.

functions {
    real foo_lpdf(real y, real x) {
        return normal_lupdf(y| x, 1);
    }
    real goo_lpdf(real y, real x) {
        return normal_lpdf(y| x, 1);
    }
}
parameters {
    real y;
}
model{
    target += foo_lpdf(y| x); // normal_lpdf would be normalized
    target += foo_lupdf(y| x); // normal_lpdf would be unnormalized
    target += goo_lpdf(y| x); // normal_lpdf would be normalized
    target += goo_lupdf(y| x); // normal_lpdf would be normalized
}

And a UDF with a _lupdf suffix would not be allowed.

functions {
    real foo_lupdf(real y, real x) {
        //...
    }
}

There is also an option to allow for a stanc flag (for example “–unnormalized”) that would use the unnormalized form in all applicable cases. This is not implemented right now, but could also be if others find it useful.

If anyone has any thoughts on this, please comment. I don’t have any strong feelings on _lupdf vs _unnormalized_lpdf etc. We merely implemented what was the answer that “won”. But there is still time to rethink this.

Please tag anyone you feel might have thoughts on this.

6 Likes

Great summary.

This would be useful for quick testing of speed difference, but it’s dangerous in cases where normalized is actually needed as in mixture weights.

3 Likes

The “unnormalized” version does not drop only the normalization constants but also everything else that is not relevant for computing the autodiff gradient. That means the output of an unnormalized pdf is nondeterministic and context dependent.

Normalized PDFs are easy.

target += normal_lpdf(x| mu, sigma);

is always equivalent to

target += -0.5*square(x-mu)/sigma - log(sigma) - 0.5*log(2*pi());

Now let’s consider unnormalized PDF.

target += normal_lupdf(x| mu, sigma);

When both mu and sigma are unknown parameters the above is equivelent to

target += -0.5*square(x-mu)/sigma - log(sigma);

The obvious constant 0.5*log(2*pi()) is removed for efficiency. So far so good.
But supposed that sigma is known data. In that case the unnormalized PDF becomes

target += -0.5*square(x-mu)/sigma;

In the context of this particular model log(sigma) is constant and can be dropped. Note that there is no way to decide whether a given variable is data or parameter other than to look up where it was originally defined, possibly following it through function calls. The same function may have different results when called from transformed data block and model block in the same model. The restrictions on _lupdfs are there to avoid this weirdness.

Even 0.5*log(2*pi()) isn’t always dropped. RStan and PyStan expose the model’s log_prob method to users. When called through the interface it doesn’t use autodiff so it wouldn’t make sense to calculate the target “up to a constant”–the result is a constant. Consequently the exposed method computes full normalization.

The unpredictable nature of _lupdfs makes them a bit tricky to use correctly. Here’s an example.

real mixed1_lpdf(real x, real mu, real sigma1, sigma2) {
  return log_mix(0.5,
                 normal_lpdf(x | mu, sigma1),
                 normal_lpdf(x | mu, sigma2));
}
real mixed2_lpdf(real x, real mu1, real mu2, real sigma) {
  return log_mix(0.5,
                 normal_lupdf(x | mu1, sigma),
                 normal_lupdf(x | mu2, sigma));
}

mixed1 requires normalizing because the two normals have different normalizations if either sigma is constant. Even if you don’t need to normalize the result of mixed1 you still need to normalize the two normals used in the calculation.
On the other hand, in mixed2 the two distributions have the same normalization for any possible inputs. If you don’t need to normalize mixed2 then these normals can also be left unnormalized.

Although the user-defined function is defined with _lpdf suffix it can be invoked either as mixed2_lpdf or mixed2_lupdf. The former means that the caller needs full normalization (and thus forces the inner normal_lupdf calls to become normal_lpdfs) while the latter leaves the normalization optional.

5 Likes

Documentation log_prob and grad_log_prob functions — log_prob-methods • rstan says otherwise

log_prob returns a value (up to an additive constant) the log posterior. If gradient is TRUE, the gradients are also returned as an attribute with name gradient.
1 Like

You’re right. I guess I should stop assuming I know how these work without looking anything up.
The interfaces do not call the C++ model’s log_prob method directly but instead use stan::model::log_prob_propto helper function. That function converts the inputs to autodiff vars so it builds the autodiff stack and calculates all precomputed gradients but no normalizing constants.

3 Likes

The unnormalized form would be allowed in the model block

Wasn’t the latest thing that kicked this off unnormalized densities for reduce_sum? So that would mean evaluations in user defined functions.

The issue that @nhuurre found with transformed parameters being evaluated separately from the log density (which is totally wild lol) seems more of a bug with how we do the output that we should fix.

We could also just evaluate things with vars in generated quantities. It really wouldn’t cost us anything, and then we could protect our sanity from these things more.

3 Likes

reduce_sum is compatible with distribtutions. You’ll be able to call reduce_sum(mixed2_lupdf,...) and it’ll return the unnormalized density. Other higher-order functions might not be “propto-safe”. For example algebra_solver_fp evaluates the function without autodiff.

The rule is that you can call _lupdf only inside the model block or user-defined _lpdf function body. This is enough to keep the weirdness under control and can allow things like unnormalized reduce_sum.

Surely there’s some overhead to allocating the ad stack?
The distributions in Stan math library don’t have custom reverse mode autodiff. They calculate derivatives during the forward pass and store them in operands_and_partials structure.

2 Likes
mixed2_lupdf

Is this implicitly defined by the existence of:

real mixed2_lpdf(real x, real mu1, real mu2, real sigma) {
  return log_mix(0.5,
                 normal_lupdf(x | mu1, sigma),
                 normal_lupdf(x | mu2, sigma));
}

?

Yes, every _lpdf has an implicitly defined _lupdf.
At C++ level they differ by a template argument.

And defining a mixed2_lupdf would not be allowed:

real mixed2_lupdf(real x, real mu1, real mu2, real sigma) {
  return log_mix(0.5,
                 normal_lupdf(x | mu1, sigma),
                 normal_lupdf(x | mu2, sigma));
}

I agree.

Quoting @nhuurre’s comment from Github:

That does seem weird, but I also don’t have enough background here.

This is due to how the sampler works.

Each iteration of Hamiltonian Monte Carlo requires many evaluations of the target log density and its gradient, which each require evaluating the transformed parameters block. These evaluations, however, are not saved anywhere but the C++ in order to avoid unnecessary memory overhead.

It’s only after a new iteration has been birthed that the constrained parameters, transformed parameters, and generated quantities are evaluated and passed to writer callbacks from which the interfaces can access the values. Because the output of the transformed parameters is not saved along with each state it has to be run again, but because it is not being used for Hamiltonian Monte Carlo it doesn’t need to be run with vars.

2 Likes

I talked to @rok_cesnovar about this at the Math meeting and he cleared up some confusion I had about this.

The plan as-is seems good to me.

I’d still like to change how transformed parameters works so it’s not run twice. This seems like an extra unnecessary layer of confusion on top of the already-can-be-quite-confusing lp__.

There’s probably a way to do this that avoids much overhead.

2 Likes