Question about terminology: Change of variables section in Users Guide

Hi all,

In section 24.3 of the Users Guide, it says

“Changes of variables are applied when the transformation of a parameter is characterized by a distribution.”

See here: 24.3 Changes of variables | Stan User’s Guide

I think this phrasing is misleading and confusing. Instead of parameter one should say random variable. The consequence of using the word parameter here is that the observed data y is also called a parameter further down; this can confuse people (it certainly confused me).

It was also kind of odd to see y in the transformed parameters block. Maybe the block should have been called transformed variables?

Or do I misunderstand something fundamental here?

In this section of the User’s Guide, y is a parameter in the Stan snippets (there is no data block, and note that y is declared in the parameters block). These are simple examples meant only to illustrate the change of variables issue. I see why the choice of y as a parameter name could be confusing though!

Note also that data can legitimately appear in the transformed parameters block. For example, the following would be unproblematic:

data { real a; }
parameters { real theta; }
transformed parameters {
  real theta_transformed = theta^a;  // parameter transformed using data
}

Syntactically, it is legal to define quantities as transformed parameters when the quantity depends only on data, but this is always a bad idea because it will be very inefficient compared to declaring those quantities as transformed data; see 8.6 Program block: transformed parameters | Stan Reference Manual
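To illustrate the efficiency point, here is a minimal sketch of a quantity that depends only on data being computed in transformed data. The variable names (N, x, y, x_std, beta) and the toy regression are hypothetical, chosen only for illustration; N is assumed to be at least 2 so that sd(x) is defined.

```stan
data {
  int<lower=2> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  // Depends only on data, so it is evaluated once when the data are read,
  // rather than at every log density evaluation.
  vector[N] x_std = (x - mean(x)) / sd(x);
}
parameters {
  real beta;
}
model {
  beta ~ normal(0, 1);
  y ~ normal(beta * x_std, 1);
}
```

Declaring x_std in transformed parameters instead would compile, but the standardization would then be recomputed on every evaluation.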

Edit to add: circling back to the change of variables question, change of variables is not required when a quantity that depends only on transformed data is characterized by a distribution. The key feature is that some transformation of a parameter (which may or may not involve data in defining the transformation) is meant to be characterized by a distribution.

Thanks. I do understand all this, but this wasn’t really my question. I was really asking about the wording in the Users guide, which can confuse readers.

IMO it would be helpful to change the Users guide to refer to random variables, rather than parameters. So, the opening sentence should be:

“Changes of variables are applied when the transformation of a random variable is characterized by a distribution.”

The Jacobian adjustment has to do with transformations of random variables, not parameters per se. It just happens to be the case in Bayes that parameters are random variables.

But the models can also have random variables that are not parameters, so this change would make it less clear that the change of variables is needed only for the parameters.


I’m trying to understand in what sense y is a parameter. It’s a random variable that is being transformed. Don’t you think the reader might get confused when we start talking about y as a parameter?

Do you agree that the Jacobian adjustment has to do with transformations of random variables, not parameters per se?

I think the declaration in the parameters block makes it explicit that in this context y is a parameter. From the linked documentation page:

parameters {
  real<lower=0> y;
}

There can be confusion if the reader skips these parts and assumes y refers to the data (declared in the data block).

In general yes, but in the context of the linked documentation, the Jacobian adjustment is specifically for the transformations of parameters that have been declared in the parameters block.

The linked page also contains the following example

parameters {
  real<lower=0> y_inv;
}
transformed parameters {
  real<lower=0> y;
  y = 1 / y_inv;  // change variables
}
model {
  y ~ gamma(2,4);
  target +=  -2 * log(y_inv);  //  Jacobian adjustment;
}

Both y_inv and y are random variables. We need to add the Jacobian adjustment to the target, because y_inv is declared in the parameters block. In this context, it matters which variables have been declared in the parameters block, which defines in this context what the term parameter means.
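For readers wondering where the -2 * log(y_inv) term comes from, here is a sketch of the standard change-of-variables calculation for y = 1 / y_inv. The density symbols p_Y and p_{Y_inv} are notation introduced here for illustration and do not appear in the linked page.

```latex
% Density of y_inv induced by y = 1 / y_inv with y ~ Gamma(2, 4):
p_{Y_{\mathrm{inv}}}(y_{\mathrm{inv}})
  = p_Y\!\left(\frac{1}{y_{\mathrm{inv}}}\right)
    \left| \frac{d}{d y_{\mathrm{inv}}} \frac{1}{y_{\mathrm{inv}}} \right|
  = \mathrm{Gamma}\!\left(\frac{1}{y_{\mathrm{inv}}} \,\middle|\, 2, 4\right) y_{\mathrm{inv}}^{-2}

% On the log scale, the extra term is the one added to target:
\log \left| \frac{d y}{d y_{\mathrm{inv}}} \right|
  = \log y_{\mathrm{inv}}^{-2}
  = -2 \log y_{\mathrm{inv}}
```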

1 Like

If we all agree that data can also be random variables, then one example where the above statement would be confusing is when we have data y and parameters \mu and \sigma, and

\exp(y) \sim \mathcal{N}(\mu, \sigma)

I just came back to this discussion and I notice another subtlety that might be causing confusion: the distinction between the data and the prior predictive distribution for the data. [Sorry for the weird formatting below; inline latex seems to be broken right now?]

If I have some observed data \hat{y} that I model as lognormally distributed with \mathrm{log}(\hat{y})\sim \mathcal{N}(\mu,\sigma), then I don’t need any Jacobian adjustment. However, if I write down the corresponding Stan program to look at the prior predictive distribution for \hat{y}, I will declare a parameter to represent this distribution, yielding the model in the Users Guide (which calls this parameter y), and I will require a Jacobian adjustment.
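As a minimal sketch of the first case (observed data, no adjustment), the program might look like the following. The names N and y_obs are hypothetical stand-ins for the observed \hat{y}, and priors on mu and sigma are omitted for brevity.

```stan
data {
  int<lower=0> N;
  vector<lower=0>[N] y_obs;  // hypothetical name for the observed data
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // log(y_obs) is a fixed number once the data are read in, so the
  // Jacobian term would be a constant and is not needed.
  log(y_obs) ~ normal(mu, sigma);
}
```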

That the model for the observed data \hat{y} does not require a Jacobian adjustment is why @yizhang says above:

The model for the prior predictive distribution y does require a Jacobian adjustment, and it also requires declaring y as a parameter.
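A sketch of the prior predictive version, in the spirit of the User's Guide example, could look like this. Passing mu and sigma in as data is an assumption made here to keep the example short.

```stan
data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real<lower=0> y;  // now a parameter, not observed data
}
model {
  log(y) ~ normal(mu, sigma);
  // Jacobian adjustment: log |d/dy log(y)| = log(1 / y) = -log(y)
  target += -log(y);
}
```

With the adjustment included, this should define the same density for y as writing y ~ lognormal(mu, sigma) directly.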

Note that simulating a prior predictive distribution is not the only circumstance where one might want to declare a parameter with a lognormal distribution; we can have lognormal random effects in hierarchical models for example, and there’s nothing stopping us from giving the name y to the random effect parameter vector.


“Random variable” is perhaps the most overloaded and unhelpful term in statistics and in my humble opinion should be avoided at all costs. The confusion in the documentation is largely due to implicit assumptions and overloading of meanings, but there’s not much that can be done without changing the block names and/or requiring more probability theory background from users.

At its most abstract level a Stan program defines a joint probability density function over the product of two spaces which I will denote Y and \Theta. In general Y is itself a product space comprised of many component spaces represented by variables defined in the data or transformed data block but, confusingly, not necessarily all of the variables declared in those blocks. In other words we have

y = (y_1, \ldots, y_N)

where the variables on the right hand side include some but not necessarily all of the variables defined in the data and transformed data blocks. On the other hand, \Theta is also generally a product space, but one comprised of component spaces represented by all of the variables defined in the parameters block,

\theta = (\theta_1, \ldots, \theta_I).

The algorithms library then takes this joint density function and partially evaluates it by binding y_1, \ldots, y_N to values specified by the external interface, yielding something proportional to a conditional distribution,

\pi( \theta_1, \ldots, \theta_I \mid \tilde{y}_1, \ldots, \tilde{y}_N) \propto \pi(\tilde{y}_1, \ldots, \tilde{y}_N, \theta_1, \ldots, \theta_I ).

This unnormalized conditional density function is then used to inform estimates of conditional expectation values.

The problem with transformations is that if f : x\mapsto z then for any density function

\pi(z) \ne \pi( f(x) );

instead we need a Jacobian determinant correction. In the context of a Stan program transformations of any of the variables in Y or \Theta technically require a correction. If f : y_1 \mapsto z_1 then

\pi(z_1, \ldots, y_N, \theta_1, \ldots, \theta_I ) \ne \pi(f(y_1), \ldots, y_N, \theta_1, \ldots, \theta_I )

and if g : \theta_1 \mapsto \eta_1 then

\pi(y_1, \ldots, y_N, \eta_1, \ldots, \theta_I ) \ne \pi(y_1, \ldots, y_N, g(\theta_1), \ldots, \theta_I ).

All of this is pretty basic probability theory, but does require understanding the subtleties of probability density functions and product spaces.
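For reference, the correction in question can be sketched as follows. This assumes f is one-to-one and differentiable; the subscripts on \pi, which distinguish the density over x from the density over z = f(x), are notation added here rather than used in the post.

```latex
% One-dimensional case, with z = f(x):
\pi_X(x) = \pi_Z\big(f(x)\big) \left| \frac{d f}{d x}(x) \right|

% Multivariate case: the derivative becomes a Jacobian determinant.
\pi_X(x) = \pi_Z\big(f(x)\big) \left| \det \frac{\partial f}{\partial x}(x) \right|
```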

These corrections, however, are not so symmetric once we partially evaluate on the y_n variables. In this case the Jacobian corrections become constants – ignoring them does not change the behavior of the resulting unnormalized conditional density function. In other words the reason that one has to correct for “parameters” but not “data variables” is not a generic property of Stan programs but rather an accident of how Stan programs are used.
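The asymmetry can be sketched in formulas. Here \pi_Z and \pi_H are hypothetical labels for the joint density written in the transformed coordinates, and f and g are assumed one-to-one and differentiable.

```latex
% Transforming a conditioned variable, z_1 = f(\tilde{y}_1):
% the Jacobian factor does not depend on \theta, so it drops out.
\pi(\theta \mid \tilde{y}_1, \ldots, \tilde{y}_N)
  \propto \pi_Z\big(f(\tilde{y}_1), \tilde{y}_2, \ldots, \tilde{y}_N, \theta\big)
          \left| \frac{d f}{d y_1}(\tilde{y}_1) \right|
  \propto \pi_Z\big(f(\tilde{y}_1), \tilde{y}_2, \ldots, \tilde{y}_N, \theta\big)

% Transforming a free variable, \eta_1 = g(\theta_1):
% the Jacobian factor varies with \theta_1, so it has to stay.
\pi(\theta \mid \tilde{y})
  \propto \pi_H\big(\tilde{y}, g(\theta_1), \theta_2, \ldots, \theta_I\big)
          \left| \frac{d g}{d \theta_1}(\theta_1) \right|
```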

The second source of confusion is in the names of the blocks. Generally the spaces Y and \Theta can be used to implement all kinds of useful behavior. For prior predictive analyses where we don’t want to condition on anything Y would be empty and \Theta would correspond to the product of the observational and model configuration spaces. For posterior analyses we could take Y to be the observational space and \Theta to be the model configuration space, in which case the automatic partial evaluation gives an unnormalized posterior density function. Of course the blocks take their names from this last particular application, which makes it very confusing when trying to build a Stan program with any other interpretation.

In other words the block naming convention obscures the full potential of the language in an attempt to prioritize one particular application on which most users focused. More generic names could make the general probabilistic structure more clear, but also require more probability theory understanding from users. There are definitely design constraints on both sides here, not to say that the current names are the best compromise.

For example to @vasishth’s point the term parameter is sometimes used to describe variables taking values in a component of any product space; i.e. the variables “parameterizing” the product space. This particular interpretation would indeed be applicable equally well to the “Y” and “\Theta” spaces, both generally and when those spaces implicitly represent an observational space and model configuration space. The Stan block names assume that the term parameter refers to only variables that parameterize the model configuration space, which to be fair is sloppy terminology at best.
