Bounded parameter transformations

Hi all (mostly for the devs),

Maybe this was already addressed in past discussions, but I could not find it with a quick search.

Is there any particular theoretical reason why Stan uses the log-odds transform for lower- and upper-bounded scalars, as opposed to, e.g., a probit or one of many other possible transformations?

I can see that the log-odds transform is computationally convenient (easy to compute and invert, with a simple Jacobian), but I was wondering whether there are deeper theoretical reasons.
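
For concreteness, here is a rough sketch, written out by hand, of what I understand the built-in lower/upper bound transform to do (the names L, U, y_raw, and x are just illustrative):

data {
  real L;
  real<lower=L> U;
}
parameters {
  real y_raw;  // unconstrained parameter
}
transformed parameters {
  // inverse log-odds (logit) transform onto the interval (L, U)
  real x = L + (U - L) * inv_logit(y_raw);
}
model {
  // log |dx / dy_raw| for this transform
  target += log(U - L) + log_inv_logit(y_raw) + log1m_inv_logit(y_raw);
  // ... prior on x and likelihood would go here ...
}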

(Follow-up question: if it is just for computational convenience, would it make sense to consider other transformations that may be more expensive but have better properties?)

Thanks,
Luigi

I don’t believe this has been widely discussed, and I don’t claim to know the historical reason. Probably @Bob_Carpenter or @betanalpha know more. What I do know is that the transforms in Stan are tested to work decently well, but they are not proven to be the best in all cases. There are absolutely cases where you may want to use a different transform. There’s a particularly good discussion of this in the A better unit vector - #29 by betanalpha thread (linked to @betanalpha’s excellent wrap-up, which is relevant to why different parameterizations can yield better performance in different models).

You asked this question at a good time because @mjhajharia is planning on testing a bunch of transforms this summer and writing up her findings.

One of the limitations of using the Stan types for transforms is that the “raw” parameters that feed into the transform are hidden from the user (this has memory and speed benefits). Sometimes one may want to put priors on the raw parameters or compose transforms in a way that adds additional information to the model; this is more easily accomplished by writing the transform out in your Stan model. For example, I recently wanted to add more prior information to a simplex. The easiest way for me to express that prior was from a logistic-normal perspective. I accomplished this by placing a (multivariate) normal prior on the “raw” parameters (stick_slices) and then applying a logistic transform via a stick-breaking procedure (with output on the log scale).

vector log_logistic_simplex_lp(vector stick_slices) {
  int K = num_elements(stick_slices) + 1;
  vector[K] log_pi;

  real log_stick = 0;
  for (k in 1:K - 1) {
    real log_inv_logit_stick = log_inv_logit(stick_slices[k]);
    log_pi[k] = log_inv_logit_stick + log_stick;
    log_stick = log_diff_exp(log_stick, log_pi[k]);
    // the log-Jacobian adjustment for inv_logit alone would be
    // target += log_inv_logit(y) + log1m_inv_logit(y);
    // but this transform is log_inv_logit(y), so by the chain rule,
    // with u = inv_logit(y):
    // log |d log(u) / dy| = log |d log(u) / du| + log |d u / dy|
    //                     = -log_inv_logit(y) + log_inv_logit(y) + log1m_inv_logit(y)
    //                     = log1m_inv_logit(y)
    target += log1m_inv_logit(stick_slices[k]);
    target += log_stick + log1m_exp(log_stick) + log1m_exp(log_pi[k]);
  }

  log_pi[K] = log_stick;

  return log_pi;
}
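
For context, usage looks something like the sketch below; the block layout and the specific prior here are illustrative, and K, mu, and Sigma are placeholders rather than the exact model I used:

functions {
  // log_logistic_simplex_lp as defined above
}
data {
  int<lower=2> K;
  vector[K - 1] mu;            // illustrative prior location
  cov_matrix[K - 1] Sigma;     // illustrative prior covariance
}
parameters {
  vector[K - 1] stick_slices;  // the "raw" unconstrained parameters
}
transformed parameters {
  vector[K] log_pi = log_logistic_simplex_lp(stick_slices);
}
model {
  // logistic-normal-style prior expressed directly on the raw scale
  stick_slices ~ multi_normal(mu, Sigma);
  // ... log_pi then feeds into the likelihood,
  // e.g. target += categorical_logit_lpmf(y | log_pi);
}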

To make these more accessible, there has been discussion in the Stan GitHub organization about composable transforms; see the issues there.

Feel free to add to those discussions.


AFAIK, there is some reasonably low-hanging fruit that would enable Stan to automatically select the “best” transformation, which will depend on the posterior.

Do you have any specific application in mind?

Hi both, thanks for the answers! This is very interesting to know (in particular, the discussion about the better unit vector). I will go over the other linked discussions.

I was thinking that it would be interesting to test different transforms (at least for the common case of bounded variables). One direction would be to move towards automated/adaptive transforms, as has been done in other contexts (see e.g. Snoek et al., 2014). But even before getting to that, a benchmark of a non-adaptive set of transforms different from the standard ones might be useful.
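
Just to illustrate the kind of alternative I have in mind (illustrative names, for a (0, 1)-bounded parameter), a probit-style transform written out by hand would be something like:

parameters {
  real y_raw;  // unconstrained parameter
}
transformed parameters {
  real x = Phi(y_raw);  // probit: maps to (0, 1) via the standard normal CDF
}
model {
  // log |dx / dy_raw| = log standard normal density at y_raw
  target += std_normal_lpdf(y_raw);
  // ... prior on x and likelihood would go here ...
}

The main practical difference from the logit version is in the tails (the normal CDF has lighter tails than the logistic), which is the kind of property a benchmark could compare.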

I don’t have a single specific application in mind. My group has been developing inference methods and we stumbled (again) upon this issue, so I thought I’d ask here, as it seems a topic obviously relevant to Stan.
