Bounded parameters transformation

Hi all (mostly for the devs),

Maybe this was already addressed in past discussions, but I could not find it with a quick search.

Is there any particular theoretical reason why Stan uses the log-odds transform for lower- and upper-bounded scalars, as opposed to, e.g., a probit or any of many other transformations?

I can see that the log-odds transform is computationally easy (easy to compute and invert, easy Jacobian) but I was wondering if there were deeper theoretical reasons.
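
(For reference, the interval transform in question is roughly the following. This is just a hand-written sketch of what declaring real<lower=L, upper=U> does internally; the raw parameter name and the explicit data block are only for illustration.)

data {
  real L;
  real<lower=L> U;
}
parameters {
  real x_raw;  // unconstrained "raw" parameter, as Stan stores it internally
}
transformed parameters {
  // logit / log-odds interval transform: maps the real line onto (L, U)
  real x = L + (U - L) * inv_logit(x_raw);
}
model {
  // log Jacobian of x = L + (U - L) * inv_logit(x_raw)
  target += log(U - L) + log_inv_logit(x_raw) + log1m_inv_logit(x_raw);
  // prior and likelihood on x go here, exactly as if x had been declared
  // with real<lower=L, upper=U>
}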

(Follow up question: if it is just for computational convenience, would it make sense to consider other transformations that may be more expensive but have better properties?)

Thanks,
Luigi

I don’t believe this has been widely discussed and I don’t claim to know the historical reason. Probably @Bob_Carpenter or @betanalpha know more. What I do know is that the transforms in Stan are tested to work decently well, but they are not proven to be the best in all cases, and there are absolutely cases where you may want to use a different transform. There’s a particularly good discussion about this in the A better unit vector - #29 by betanalpha post (the link points to @betanalpha’s excellent wrap-up, which is relevant to why different parameterizations can yield better performance in different models).

You asked this question at a good time because @mjhajharia is planning on testing a bunch of transforms this summer and writing up her findings.

One of the limitations of using the Stan types for transforms is that the “raw” parameters that feed into the transform are hidden from the user (this has memory and speed benefits). Sometimes one may want to put priors on those raw parameters, or compose transforms in a way that adds additional information to the model, and these things are more easily accomplished by writing the transform out in your Stan model. For example, I recently wanted to add more prior information to a simplex. The easiest way for me to express that prior was from a logistic-normal perspective, so I put a (multi)normal prior on the “raw” parameters - stick_slices - and then did a logistic transform with a stick-breaking procedure (output on the log scale).

vector log_logistic_simplex_lp(vector stick_slices) {
  int K = num_elements(stick_slices) + 1;
  vector[K] log_pi;

  // log of the remaining stick length, starting from log(1) = 0
  real log_stick = 0;
  for (k in 1:K - 1) {
    real log_inv_logit_stick = log_inv_logit(stick_slices[k]);
    log_pi[k] = log_inv_logit_stick + log_stick;
    log_stick = log_diff_exp(log_stick, log_pi[k]);
    // the Jacobian adjustment for inv_logit would be
    //   target += log_inv_logit(y) + log1m_inv_logit(y);
    // but the transform here is log_inv_logit(y), so by the chain rule
    //   d log(inv_logit(y)) / dy
    //     = (d log(u) / du at u = inv_logit(y)) * d inv_logit(y) / dy,
    // which on the log scale is
    //   -log_inv_logit(y) + log_inv_logit(y) + log1m_inv_logit(y)
    //     = log1m_inv_logit(y)
    target += log1m_inv_logit(stick_slices[k]);
    target += log_stick + log1m_exp(log_stick) + log1m_exp(log_pi[k]);
  }

  log_pi[K] = log_stick;

  return log_pi;
}
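
A minimal sketch of how such a function might be used, with the function above placed in the functions block; K, mu, Sigma, and log_theta are placeholder names rather than the exact model from that project:

functions {
  // paste log_logistic_simplex_lp() from above here
}
data {
  int<lower=2> K;
  vector[K - 1] mu;
  cov_matrix[K - 1] Sigma;
}
parameters {
  vector[K - 1] stick_slices;  // unconstrained "raw" parameters
}
transformed parameters {
  // log of a K-simplex built by stick breaking on the log scale
  vector[K] log_theta = log_logistic_simplex_lp(stick_slices);
}
model {
  // logistic-normal-style prior, expressed directly on the raw parameters
  stick_slices ~ multi_normal(mu, Sigma);
  // ... use log_theta (or exp(log_theta)) in the likelihood
}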

To make these more accessible, there has been discussion in the GitHub Stan organization about composable transforms.

Feel free to add to those discussions.

AFAIK, there is reasonably low-hanging fruit that would enable Stan to automatically select the “best” transformation, which will depend on the posterior.

Do you have any specific application in mind?

Hi both, thanks for the answers! This is very interesting to know (in particular, the discussion about the better unit vector). I will go over the other linked discussions.

I was thinking that it would be interesting to test different transforms (at least for the common case of bounded variables). Indeed, one direction would be to move towards automated/adaptive transforms, as has been done in other contexts in the past (see e.g. Snoek et al., 2014). But even before getting to that, a benchmark of a non-adaptive set of transforms different from the standard ones might be useful.
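
For example, something like a probit version of the interval transform can already be benchmarked by writing it out by hand in a Stan program (just a sketch of the idea, not an existing built-in; L, U, and x_raw are illustrative names):

data {
  real L;
  real<lower=L> U;
}
parameters {
  real x_raw;  // unconstrained
}
transformed parameters {
  // probit-style interval transform: x = L + (U - L) * Phi(x_raw)
  real x = L + (U - L) * Phi(x_raw);
}
model {
  // log Jacobian: d x / d x_raw = (U - L) * (standard normal density at x_raw)
  target += log(U - L) + std_normal_lpdf(x_raw);
  // prior and likelihood on x as usual
}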

I don’t have a single specific application in mind. My group has been developing inference methods and we stumbled (again) upon this issue, so I thought I’d ask here, as it seems like a topic obviously relevant to Stan.

Originally the transformations were largely motivated by the link functions typical in statistics – the log link function unconstrains positive variables, the logit link function unconstrains interval variables, etc. While an explicit argument has not been made for this choice within Stan, these transformations are typical in statistics for a variety of theoretical and practical reasons. In particular they sit at the intersection of several useful mathematical properties – convexity, relatively uniform curvature (which also means that the Jacobians are nice), and preservation of algebraic structure – that often manifest in nice practical properties.

Alternative parameterizations have occasionally been discussed, but none proved to be substantially better than the current implementations.

One general way of thinking about alternative transformations is that they all reduce to the current transformation composed with some smooth, one-to-one transformation of the unconstrained space into itself. More formally, if X is the initial, one-dimensional space and \phi : X \rightarrow \mathbb{R} is the unconstraining transformation, then any* other smooth unconstraining transformation can be written as \psi = \gamma \circ \phi, where \gamma : \mathbb{R} \rightarrow \mathbb{R} is itself smooth and one-to-one.

*Pretty sure this is true in one dimension. In higher dimensions there may be exceptions without enough additional constraints, for example due to the existence of the exotic \mathbb{R}^{4}s.
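
As a concrete one-dimensional example, a probit interval transform is just the current logit transform composed with such a \gamma:

\psi_{\text{probit}}(x) = \Phi^{-1}\left( \frac{x - L}{U - L} \right) = (\gamma \circ \phi_{\text{logit}})(x), \qquad \gamma(y) = \Phi^{-1}(\text{logit}^{-1}(y)),

where \phi_{\text{logit}}(x) = \text{logit}\left( \frac{x - L}{U - L} \right) is the current transform for an interval (L, U) and \Phi is the standard normal CDF.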

From this perspective the question of alternative parameterizations reduces to the increasingly common discussion of “what general diffeomorphism will give the ideal posterior density function for my given computational method?”. For a formal discussion of this problem for Hamiltonian Monte Carlo in particular see, for example, [1910.09407] Incomplete Reparameterizations and Equivalent Metrics. These kinds of questions have come into vogue in the machine learning literature lately with the rise of “generative modeling” (quotes to indicate the machine learning use of “generative” rather than the probabilistic modeling use that’s more common in Stan discussions), but I strongly believe that automatically tuning bespoke reparameterizations for each Stan program is an intractable problem, which is one of the reasons why I’ve been trying to push back on the introduction of additional compositional features to the Stan compiler.

To clarify, the unit vector discussion has been centered around two transformations which aren’t actually compatible: one is an approximate transformation and one is exact, and the approximation introduces another layer of complexity to that particular discussion which can distract from the more relevant points here.

The typical use of Stan’s constrained types is for when the available domain expertise is most interpretable on the constrained space. For example, a half-normal prior is much easier to specify with a positively-constrained variable than by trying to work out what the corresponding density is for an unconstrained variable. Note that all of the constrained types have at least one natural, complementary prior model – gamma and inverse gamma for positive variables, beta for interval variables, Dirichlet for simplex variables, LKJ for correlation matrices, and the like.
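
For instance, a half-normal scale parameter (sigma here is just an illustrative name) is a one-liner with the constrained type,

parameters {
  real<lower=0> sigma;   // Stan applies the log transform and its Jacobian for you
}
model {
  sigma ~ normal(0, 1);  // half-normal prior, stated on the constrained scale
}

whereas writing the same model directly on the unconstrained scale requires working out the transformed density, here via an explicit exp transform and its Jacobian:

parameters {
  real log_sigma;        // unconstrained
}
transformed parameters {
  real sigma = exp(log_sigma);
}
model {
  sigma ~ normal(0, 1);
  target += log_sigma;   // Jacobian of sigma = exp(log_sigma)
}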

When the available domain expertise manifests better through some latent construction, the most useful Stan program will follow that construction rather than rely on constrained types (although once the construction is well understood it can be abstracted into a prior model directly on the constrained space; see for example Ordinal Regression).

Sometimes these constructions are compatible with the existing constraining transformations, but often they’re not. For example, because the stick-breaking construction for a simplex treats each component asymmetrically, it can be awkward for building exchangeable prior models. Not impossible, of course, just awkward. More often one needs a custom transformation that is better suited to the available domain expertise, which one can implement directly using the wonderful expressiveness of the Stan language.

We’ve long talked about exposing the transformation functions used for the constrained variable types in the Stan language. I do agree that this can be helpful in some cases and harmless in the worst cases, and hence worth exposing. That said, I don’t think that adding transformations in the compilation/post-processing of a Stan program facilitates this kind of construction.
