Student T distribution: A question on scale

This is partly a question of aesthetics/style. Stan uses the usual definition of the Student-T distribution, so that in code like:

parameters {
    ...
    real<lower=2> nu;
    real<lower=0> scale;
    ...
}
model {
    ... 
    Y ~ student_t(nu, loc, scale);
    ...
}

one is actually describing a distribution whose standard deviation is not scale but scale * sqrt(nu / (nu - 2)), which is strictly greater than scale (and finite only for nu > 2). I understand that the Student-T so defined is meant to be the fat-tailed cousin of a Normal distribution with standard deviation equal to scale. But I never really understood why the convention is set up so that shrinking nu (with scale held constant) does two things at once: the tails get fatter relative to the core of the distribution, and the actual standard deviation of the distribution increases.
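
To make the size of that effect concrete, here is a minimal Python sketch of the formula above. The helper name `student_t_sd` is hypothetical (it is not a Stan or library function), and it assumes nu > 2 so that the variance is finite.

```python
import math

def student_t_sd(nu: float, scale: float) -> float:
    """Standard deviation of a Student-t with `nu` degrees of freedom and the given scale.

    Hypothetical helper for illustration; valid only for nu > 2.
    """
    if nu <= 2:
        raise ValueError("the variance is finite only for nu > 2")
    return scale * math.sqrt(nu / (nu - 2))

# With scale held at 1, the standard deviation grows as nu shrinks toward 2.
for nu in (30.0, 10.0, 5.0, 3.0, 2.5):
    print(f"nu = {nu:>4}: sd = {student_t_sd(nu, 1.0):.3f}")
```

The loop only evaluates the closed-form expression; it does not sample from the distribution.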

To that end, I’m always tempted to write this instead as:

parameters {
    ...
    real<lower=2> nu;
    real<lower=0> scale;
    ...
}
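transformed parameters {
    // Student-t scale chosen so that the implied standard deviation equals `scale`
    // (relies on nu > 2, where the variance is finite).
    real<lower=0> norm_equiv_scale;
    norm_equiv_scale = scale / sqrt(nu / (nu - 2));
}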
model {
    ... 
    Y ~ student_t(nu, loc, norm_equiv_scale);
    ...
}

In this way, once my parameters are done fitting, I wind up with what feels like a more easily-interpreted meaning for scale (namely that it should match up with the standard deviation of the data) and a more easily-interpreted meaning for nu (namely that it is just a shape-of-distribution parameter that doesn’t have much to do with observed standard deviation).
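
As a sanity check on that interpretation, the following Python sketch mirrors the transformed-parameters step and confirms that the implied standard deviation comes back equal to scale for any nu > 2. The function names are hypothetical stand-ins for the Stan expressions above, not part of any library.

```python
import math

def norm_equiv_scale(nu: float, scale: float) -> float:
    """Student-t scale whose implied standard deviation equals `scale` (needs nu > 2)."""
    if nu <= 2:
        raise ValueError("requires nu > 2")
    return scale / math.sqrt(nu / (nu - 2))

def implied_sd(nu: float, t_scale: float) -> float:
    """Standard deviation of a Student-t with scale `t_scale` and nu > 2."""
    return t_scale * math.sqrt(nu / (nu - 2))

target_sd = 2.0
for nu in (2.5, 4.0, 10.0, 100.0):
    t_scale = norm_equiv_scale(nu, target_sd)
    # The scale passed to student_t changes with nu, but the implied sd does not.
    print(f"nu = {nu:>5}: student_t scale = {t_scale:.4f}, implied sd = {implied_sd(nu, t_scale):.4f}")
```

This only shows that the algebra round-trips; whether the reparameterization samples better is a separate, empirical question.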

Put differently, I can imagine that scatter plots of scale vs nu in the top case could show negative correlation between the fitted parameters (because increasing nu is compatible with shrinking the modeled standard deviation in that case). But I would think that same scatter plot in the second formulation should show less correlation between the parameters.

What do you think?


It’s always tricky to compare the “scale” of a heavy-tailed distribution to that of a light-tailed one. I suspect that the Student T scale parameter was simply chosen so as to make the probability density function take a simple form.
However, note that the Student T distribution is defined for all \nu > 0, but its variance is infinite for 1 < \nu \leq 2 and undefined for \nu \leq 1. At small \nu you need something other than the standard deviation to measure the scale. The usual parametrization approximates the width of the central bump. See for example the plot on Wikipedia.


All the plotted distributions have the same “scale” per the common definition, but even the blue line has a roughly 30% larger standard deviation than the black one, because the extra probability mass in the tails pulls the standard deviation up strongly.
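
As a rough check of the 30% figure, assuming the blue curve in the Wikipedia plot is \nu = 5 and the black curve is the normal limit \nu \to \infty (an assumption, since the image is not reproduced here), the ratio of standard deviations at equal scale is:

$$
\frac{\mathrm{sd}_{\nu = 5}}{\mathrm{sd}_{\nu \to \infty}} = \sqrt{\frac{\nu}{\nu - 2}}\,\Bigg|_{\nu = 5} = \sqrt{\frac{5}{3}} \approx 1.29
$$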

I think what you’re doing is fairly reasonable – it is common (at least, I like to do it) to re-parameterize distributions in terms of interpretable statistics like the mean and variance. However, in Stan, the parameterization that you choose can have an effect on the convergence and speed of sampling, so you can’t do it blindly.

The reason it is parameterized that way, and the reason the parameter is called “scale”, is the following property: if F(x; \mu, \nu, \sigma) is the CDF of the Student-T distribution, then F(x; \mu, \nu, \sigma) = F(x / \sigma; \mu / \sigma, \nu, 1). Similarly for the “location” parameter \mu: F(x; \mu, \nu, \sigma) = F(x - \mu; 0, \nu, \sigma). \nu is called a “shape” parameter because it doesn’t satisfy any “nice” relationship like this.

You can actually add location and scale parameters to any distribution by just taking its CDF F and defining F(x; \mu, \sigma) = F(\frac{x - \mu}{\sigma}) for \sigma > 0.
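
Here is a small Python sketch of that construction. It uses the standard Student-T with \nu = 2 as the base distribution, purely because its CDF has a simple closed form; the helper names are hypothetical and any base CDF would work the same way.

```python
import math
from typing import Callable

def standard_t2_cdf(z: float) -> float:
    """CDF of the standard Student-t with nu = 2 (location 0, scale 1)."""
    return 0.5 + z / (2.0 * math.sqrt(2.0 + z * z))

def with_location_scale(
    base_cdf: Callable[[float], float], mu: float, sigma: float
) -> Callable[[float], float]:
    """Build F(x; mu, sigma) = base_cdf((x - mu) / sigma) from a standard CDF."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    def cdf(x: float) -> float:
        return base_cdf((x - mu) / sigma)

    return cdf

mu, sigma = 1.0, 3.0
F = with_location_scale(standard_t2_cdf, mu, sigma)
F_centered = with_location_scale(standard_t2_cdf, 0.0, sigma)

x = 2.5
# Location property: shifting x by mu is the same as setting the location to 0.
print(F(x), F_centered(x - mu))
# Scale property: dividing both x and mu by sigma is the same as setting the scale to 1.
print(F(x), with_location_scale(standard_t2_cdf, mu / sigma, 1.0)(x / sigma))
```

Both printed pairs should agree up to floating-point rounding, which is exactly the pair of identities stated for the location and scale parameters.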


Hi, I have another question about the scale parameter \sigma, which connects this thread to priors in brms.
Apologies in advance if I am repeating a question already asked in this thread.
Consider a simple linear model y\sim N(1+x,\tau^2). The prior for the intercept that I got from brms is student_t(3, 84.7, 31.3), where \nu=3, \mu=84.7, and the scale \sigma=31.3. However, the sd of y is 25.6, so I expected \sigma to be 14.78 according to sd=\sqrt{\frac{\nu}{\nu-2}}\sigma .
Did I misunderstand the formula? Thank you all for your time.
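
For reference, the conversion itself works out as follows for \nu = 3. This only checks the arithmetic; it assumes the prior scale was meant to reproduce the sd of y, which nothing in this thread establishes.

$$
\sigma = \frac{\mathrm{sd}}{\sqrt{\nu / (\nu - 2)}} = \frac{25.6}{\sqrt{3}} \approx 14.78
$$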