Piecewise Heavy-Tailed Distributions in Stan

John M. Hall, CFA
PGIM Quantitative Solutions

Published: October 13, 2022

This case study reimplements the construction of a semi-parametric cumulative distribution function (CDF) described in Baker (2007) using Stan.

Introduction

Stock prices are notoriously volatile. When an extreme event occurs (e.g. October 29, 1929 or October 19, 1987), it is much more extreme than would be expected if returns were normally distributed.

A Student's t-distribution can do a better job of capturing the fat-tailed behavior of stock returns, but it does not quite match the properties of the empirical distribution. The t-distribution is a bit of a blunt instrument. Low degrees of freedom make the tails fatter, but they also make the distribution less highly peaked. That is, extreme events become more common, while non-extreme events become less common. Empirically, however, non-extreme stock returns are fairly common. But when stock returns are extreme, they are very extreme. The t-distribution does not quite capture this behavior.

In addition, there is an asymmetry to stock returns whereby extreme declines are more common than extreme gains. The t-distribution cannot account for this asymmetry[1].

To better estimate market risk, Baker (2007) introduces a semi-parametric approach to construct a cumulative distribution function. Each tail (the top and bottom 10%[2]) is fit separately with a Generalized Pareto distribution, while the middle is constructed using Gaussian kernel smoothing. By using a parametric approach for the tails, the piecewise distribution can be extrapolated to “quantiles outside the historical record”, which is “invaluable for risk management applications.”

A similar piecewise approach can be used in Stan, albeit with some modification. Stan does not have functionality for Gaussian kernel smoothing. Instead, the middle part of the distribution can be fit with a truncated normal distribution.

Simulating Data

To test this procedure, the model is first fit to simulated (fake) data, which helps verify that Stan is estimating the model properly and that the results are appropriate.

To create the fake data, a piecewise distribution can be constructed as in the figure below, whereby the lower and upper tails are fit with Generalized Pareto distributions[3], while the middle of the distribution is fit with a truncated normal distribution.

The distribution above is calibrated such that it has statistical properties similar to those of the monthly log returns of the S&P 500 total return index[4] from January 1970 to August 2022. Asymptotically, the mean and standard deviation are close, while the kurtosis is larger than what is observed in practice and the skewness is a bit more negative.
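As a rough sketch of how such fake data can be generated, the piecewise distribution can be sampled with the inverse-CDF method applied segment by segment. Below is a minimal R sketch; the parameter values are illustrative placeholders, not the calibration used for the figures in this case study.


r_piecewise <- function(n, p = 0.10, mu = 0, sigma = 1,
                        d = -1.2, u = 1.2,            # truncation points (illustrative)
                        lam_dn = 0.5, alpha_dn = 4,   # lower-tail GPD scale / shape
                        lam_up = 0.4, alpha_up = 5) { # upper-tail GPD scale / shape
    us <- runif(n)
    z <- numeric(n)
    lower <- us < p
    upper <- us > 1 - p
    mid <- !lower & !upper
    # Lower tail: mirrored Pareto type 2 (Lomax) draw below d
    z[lower] <- -(abs(d) + lam_dn * ((1 - us[lower] / p)^(-1 / alpha_dn) - 1))
    # Upper tail: Pareto type 2 (Lomax) draw above u
    z[upper] <- u + lam_up * ((1 - (us[upper] - (1 - p)) / p)^(-1 / alpha_up) - 1)
    # Middle: truncated normal quantile on [d, u]
    a <- pnorm((d - mu) / sigma)
    b <- pnorm((u - mu) / sigma)
    z[mid] <- mu + sigma * qnorm(a + (us[mid] - p) / (1 - 2 * p) * (b - a))
    z
}

set.seed(1)
z_fake <- r_piecewise(631)  # same number of observations as the monthly sample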

The plot below compares the density of a large number of simulations from this piecewise distribution with that of comparable normal (left) and Student's t (right) distributions. To facilitate comparisons, both are sampled in such a way that their simulated mean and standard deviation match the true mean and standard deviation. The degrees of freedom parameter for the t-distribution is set to 5.2.

The piecewise distribution is more highly peaked than the normal. Though it is difficult to see, the density on the lower tail is higher for the piecewise distribution than the normal (differences are small, yet non-zero). The upper tail density is a little larger as well, suggesting that simulating from a normal distribution might oversample returns in that region.

It is harder to tell the difference between the piecewise distribution and the t-distribution, but there tend to be larger differences in the tails.

Nevertheless, for the purpose of fitting the model, the number of observations is limited to the number of monthly observations over this period (631 months), and thus these values can differ for any particular simulation (though, again, the data are sampled such that the simulated mean and standard deviation match the true mean and standard deviation). See the figure below for a histogram representing the simulated distribution.

Stan Model

Some complications emerge when attempting to code this into a Stan model. While Stan has built-in functionality to model the truncated normal distribution and the Generalized Pareto distribution separately, it is up to the user to combine these pieces into the form needed.

One difficulty is specifying the bounds for the parts of the distribution that fit the tails. Specifying the bounds in advance, either by setting them such that the left and right tails each constitute a fixed percentage of the data or by using constant bounds, helps speed up estimation.
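For reference, the transformed data block in the appendix standardizes the returns and sets the truncation points at interpolated empirical quantiles of the standardized data. A minimal R sketch of the same calculation, assuming a 10% tail share and a vector y of raw returns:


z <- (y - mean(y)) / sd(y)                                  # standardize the raw returns
trunc_pts <- quantile(z, probs = c(0.10, 0.90), type = 7)   # linear interpolation between order statistics
trunc_pt_down <- trunc_pts[[1]]
trunc_pt_up <- trunc_pts[[2]]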

The key piece of the Stan code is shown below:


model {
    ...
    for (i in 1:T) {
        if (z[i] <= trunc_pt_up && z[i] >= trunc_pt_down) {
            z[i] ~ normal(mu_z_mid, sd_z_mid) T[trunc_pt_down, trunc_pt_up];
        } else if (z[i] > trunc_pt_up) {
            z[i] ~ pareto_type_2(trunc_pt_up, lambda_z_up, alpha_z_up);
        } else {
            (-z[i]) ~ pareto_type_2(fabs(trunc_pt_down), lambda_z_down, alpha_z_down);
        }
    }
    ...
}

In this model, the original data is standardized to have a mean of 0 and a standard deviation of 1 to improve fitting. If the transformed data are between trunc_pt_down and trunc_pt_up, then they are fit with a truncated normal distribution with parameters mu_z_mid and sd_z_mid, corresponding to the mean and standard deviation.

If the values are above trunc_pt_up, then they are fit with a Generalized Pareto distribution (correspondingly with parameters trunc_pt_up, lambda_z_up, and alpha_z_up). Recall that for the Generalized Pareto distribution in Stan, the first parameter (trunc_pt_up) corresponds to the lower bound of the tail. The second parameter (lambda_z_up) is a scale parameter such that larger values spread more probability mass further into the tail. The final parameter (alpha_z_up) is a shape parameter such that larger (smaller) values reduce (increase) the probability of extreme events in the right tail.
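For reference, the Pareto type 2 (Lomax) density used here, with location mu, scale lambda, and shape alpha, is

$$
f(y \mid \mu, \lambda, \alpha) = \frac{\alpha}{\lambda}\left(1 + \frac{y - \mu}{\lambda}\right)^{-(\alpha + 1)}, \qquad y \ge \mu,
$$

so the density at the truncation point itself is alpha/lambda, which matters for the continuity discussion below.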

Alternatively, if values are below trunc_pt_down, then they are fit with a separate Generalized Pareto distribution with parameters and inputs adjusted so that they are positive (parameters trunc_pt_down, lambda_z_down, and alpha_z_down).
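Putting the three pieces together, with tail share p (10% here), truncation points d and u, and g denoting the Pareto type 2 density above, the implied density of the standardized return is (a sketch; in the full code in the appendix, the constant factors p and 1 - 2p enter the target through the log_share_* terms):

$$
f(z) =
\begin{cases}
p \, g(-z \mid |d|, \lambda_{\text{down}}, \alpha_{\text{down}}), & z < d, \\[4pt]
(1 - 2p) \, \dfrac{\phi\big((z - \mu)/\sigma\big)}{\sigma \big[\Phi\big((u - \mu)/\sigma\big) - \Phi\big((d - \mu)/\sigma\big)\big]}, & d \le z \le u, \\[4pt]
p \, g(z \mid u, \lambda_{\text{up}}, \alpha_{\text{up}}), & z > u,
\end{cases}
$$

where mu and sigma correspond to mu_z_mid and sd_z_mid, and phi and Phi are the standard normal PDF and CDF.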

While this simple approach generally fits the resulting model quickly, with a large number of effective samples and no divergences, the resulting PDF will often be discontinuous. To be continuous, the parameters of the Generalized Pareto distribution require additional restrictions based on the parameters of the truncated normal distribution[5].
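One way to express such a restriction (a sketch; it is not imposed in the code shown in the appendix): matching the density just below and just above the upper truncation point u requires

$$
p \, \frac{\alpha_{\text{up}}}{\lambda_{\text{up}}}
= (1 - 2p) \, \frac{\phi\big((u - \mu)/\sigma\big)}{\sigma \big[\Phi\big((u - \mu)/\sigma\big) - \Phi\big((d - \mu)/\sigma\big)\big]},
$$

with an analogous condition at the lower truncation point d. Solving for lambda_z_up (and lambda_z_down) in terms of the truncated normal parameters would then make the scale parameters derived quantities rather than free parameters.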

The printed output below contains details on the posterior mean and uncertainty in the estimates.

Inference for Stan model: 6751a8121f149ab4a4373c5596b82e3e.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

               mean se_mean   sd  2.5%  25%   50%   75% 97.5% n_eff Rhat
mu_z_mid       0.04    0.00 0.05 -0.06 0.01  0.04  0.07  0.14  1753    1
sd_z_mid       0.87    0.00 0.10  0.72 0.80  0.86  0.93  1.10  1316    1
lambda_z_up    4.81    0.03 1.35  2.60 3.83  4.65  5.62  7.84  1576    1
lambda_z_down  6.56    0.05 1.87  3.54 5.19  6.34  7.66 10.75  1323    1
alpha_z_up    10.88    0.07 2.71  6.30 8.96 10.63 12.60 16.54  1566    1
alpha_z_down   8.18    0.06 2.07  4.74 6.73  7.98  9.40 13.08  1305    1

Samples were drawn using NUTS(diag_e) at Fri Oct 14 08:06:06 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Recall that the Stan model is expressed in terms of the standardized data with a mean of zero and a standard deviation of one. As a result, the parameters associated with the truncated normal distribution (mu_z_mid and sd_z_mid) will tend to be close to zero and one. Since the original simulated data has fatter tails than a normal distribution, however, the estimated standard deviation of the truncated normal will tend to be less than one since the tails are capturing a greater portion of the variability. This helps contribute to the more highly peaked nature of the resulting probability density function.

For the parameters associated with the Generalized Pareto distribution (lambda_z_up, lambda_z_down, alpha_z_up, and alpha_z_down), the scale parameter associated with the left tail (lambda_z_down) tends to be larger than that of the right tail (lambda_z_up), which helps account for downside risks being greater than upside risks. Similarly, the shape parameter is lower for the left tail (alpha_z_down) than for the right tail (alpha_z_up), which reflects a greater likelihood of an extreme negative return versus an extreme positive return.

The plots below present the information above graphically.

The next chart compares the actual probability density function to the average computed from the estimates above. In this case, the PDF is constructed from each posterior draw and then averaged, rather than evaluated at the mean value of each parameter. As can be seen in the chart, it is very hard to distinguish between the actual PDF and the estimated PDF.

The charts below show the posterior predictive distribution of both the standardized (left) and raw, non-standardized (right) returns. Comparing them to the original probability density function and the histogram of simulated returns above, the posterior predictive distribution appears to capture similar behavior, with a highly peaked center, fat tails, and asymmetric tail behavior.

Model Comparison

After fitting this model, it is useful to fit a normal distribution and a Student's t-distribution as alternative benchmark models.

Using the loo package to perform leave-one-out cross-validation, the piecewise distribution tends to have the best expected log pointwise predictive density (ELPD), though the t-distribution tends to be within around 1.5 to 2.5 standard errors of the piecewise distribution (when re-running the code multiple times).

                   elpd_diff se_diff
Model 1: Piecewise   0.0       0.0  
Model 3: T          -6.9       3.9  
Model 2: Normal    -44.6      15.9  
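For reference, below is a minimal sketch of how this comparison can be run with the loo package, assuming the three fitted rstan objects are named fit_piecewise, fit_t, and fit_normal (names not from the original); the log_lik values are produced by each model's generated quantities block.


library(loo)
library(rstan)

# Pointwise log-likelihood arrays (iterations x chains x observations)
ll_pw <- extract_log_lik(fit_piecewise, merge_chains = FALSE)
ll_t  <- extract_log_lik(fit_t, merge_chains = FALSE)
ll_n  <- extract_log_lik(fit_normal, merge_chains = FALSE)

# PSIS-LOO for each model, using relative effective sample sizes
loo_pw <- loo(ll_pw, r_eff = relative_eff(exp(ll_pw)))
loo_t  <- loo(ll_t, r_eff = relative_eff(exp(ll_t)))
loo_n  <- loo(ll_n, r_eff = relative_eff(exp(ll_n)))

# Rank the models by expected log pointwise predictive density (ELPD)
loo_compare(loo_pw, loo_t, loo_n)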

The charts below compare the posterior distributions of the mean (left) and standard deviation (right) depending on the type of model (the black lines represent the true values). The posterior distribution of the mean tends to overlap the true value for each of the models with a little bit more variation for the Student’s t-distribution. Nevertheless, the standard deviation of the posterior distribution of the mean tends to be smaller for the piecewise distribution than for the normal distribution, suggesting that it is putting less weight on outliers and getting a better estimate.

The posterior distribution of the standard deviation also tends to be close to the true value across all three of the models.

It is also possible to compute the expected shortfall of each of the models, either analytically through the probability density function or empirically from simulations of the posterior predictive distribution. The table below shows the results comparing the original probability density function to the normal, Student’s t, and piecewise models.

Expected Shortfall (1% & 5%)

                                 1%      5%
True Model                   -19.4%  -11.6%
True Model Simulation        -17.8%  -10.4%
Normal Model                 -11.0%   -8.3%
Normal Model Simulation      -10.8%   -8.5%
T Model                      -14.4%   -9.0%
T Model Simulation           -13.8%   -8.9%
Piecewise Model              -17.3%   -9.6%
Piecewise Model Simulation   -19.8%  -11.5%

The “True Model” and “True Model Simulation” rows reflect analytical and empirical calculations from the probability density function and simulations from it, respectively. As expected, these results are effectively the same. By contrast, the normal and Student's t models underpredict downside risks at both the 1% and 5% levels. The piecewise model does a little better, while its simulation is even closer.
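For reference, the “Simulation” rows can be computed directly from posterior predictive draws. A minimal R sketch, assuming draws of y_sim have been extracted from the fitted piecewise model (the object name fit_piecewise is illustrative):


# Empirical expected shortfall: the mean return conditional on falling at or
# below the q-th quantile of the draws
expected_shortfall <- function(x, q) {
    mean(x[x <= quantile(x, q)])
}

y_draws <- as.vector(rstan::extract(fit_piecewise, pars = "y_sim")$y_sim)
es_1pct <- expected_shortfall(y_draws, 0.01)
es_5pct <- expected_shortfall(y_draws, 0.05)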

Nevertheless, what is reported in the table above for the “Model” results is the mean of the posterior expected shortfall. The full posterior distributions are shown below:

For the 1% expected shortfall, the wide distributions (notice the scale of the x-axis) reflect the smaller number of data points and correspondingly higher estimation error. Nevertheless, the piecewise distribution tends to be shifted more to the left, consistent with capturing greater downside risks. For the 5% expected shortfall, the distributions tend to be more similar, with a smaller range on the x-axis. However, at both the 1% and 5% levels, all the models tend to underpredict the true risks captured by the dark black line.

Last Words

Stan can be a useful tool for capturing the uncertainty associated with financial data. Nevertheless, Bayesian estimation is not without its challenges. This case study focuses on one particular way to characterize a non-normal distribution. Baker (2007) also incorporated autoregressive and GARCH features, as well as complex multivariate relationships between the series (through a t-copula).

Stan's documentation provides sample models that include autoregressive and GARCH features. While autoregression can typically be handled well by Stan, GARCH models can sometimes be slow to estimate because the conditional volatility is a parameter and the parameters are correlated with nearby values. By contrast, an ARCH model (which does not make use of this feature) can be estimated more quickly. As a general principle when estimating models in Stan, it is a good idea to limit the number of parameters you are sampling, particularly if they are highly correlated.
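For reference, the difference between the two is the recursive term in the conditional variance: in an ARCH(1) model the conditional variance depends only on observed data, while a GARCH(1,1) model also depends on the previous conditional variance,

$$
\text{ARCH(1):}\quad \sigma_t^2 = \alpha_0 + \alpha_1 (y_{t-1} - \mu)^2,
\qquad
\text{GARCH(1,1):}\quad \sigma_t^2 = \alpha_0 + \alpha_1 (y_{t-1} - \mu)^2 + \beta_1 \sigma_{t-1}^2 .
$$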

By contrast, a t-copula would be difficult to implement in Stan, since its likelihood requires the inverse of the cumulative distribution function. Nevertheless, it matters more that the model being fitted captures the relevant behavior of the data than whether it matches models commonly used in maximum likelihood estimation.
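To see why, note that the density of a t-copula with correlation matrix R and degrees of freedom nu evaluates the multivariate and univariate t densities at the quantile of each margin,

$$
c(u_1, \dots, u_d) = \frac{f_{\nu, R}\big(t_\nu^{-1}(u_1), \dots, t_\nu^{-1}(u_d)\big)}{\prod_{i=1}^{d} f_\nu\big(t_\nu^{-1}(u_i)\big)},
$$

so every likelihood evaluation requires the Student's t quantile function, which is not available as a built-in Stan function.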

References

Baker, Rick. 2007. “Modeling Market Risk Using Extreme Value Theory and Copulas.” MathWorks. https://www.mathworks.com/company/newsletters/articles/modeling-market-risk-using-extreme-value-theory-and-copulas.html

Appendix: Full Stan Code

Piecewise Distribution


data {
    int<lower=1> T;
    vector[T] y;
    real<lower=0, upper=1> trunc_percent;
}
transformed data {
    vector[T] z;
    real mu_y0 = mean(y);
    real sd_y0 = sd(y);
    real trunc_pt_down;
    real trunc_pt_up;
    real count_low = 0;
    real count_mid = 0;
    real count_high = 0;
    real share_low;
    real share_mid;
    real share_high;
    real log_share_low;
    real log_share_mid;
    real log_share_high;

    // Standardize the data
    for (i in 1:T) {
        z[i] = (y[i] - mu_y0) / sd_y0;
    }

    // Set the truncation points at interpolated empirical quantiles of the
    // standardized data (linear interpolation between order statistics)
    {
        vector[T] z_sort = sort_asc(z);
        real j_down = floor(T * trunc_percent + 1 - trunc_percent);
        real j_up = floor(T * (1 - trunc_percent) + 1 - (1 - trunc_percent));
        real gamma_down = T * trunc_percent + 1 - trunc_percent - j_down;
        real gamma_up = T * (1 - trunc_percent) + 1 - (1 - trunc_percent) - j_up;
        int j_down_int = 0;
        int j_up_int = T;
        while (j_down_int < j_down) {
            j_down_int = j_down_int + 1;
        }
        while (j_up_int > j_up) {
            j_up_int = j_up_int - 1;
        }
        if (gamma_down > 0 && j_down < T) {
            trunc_pt_down = (1 - gamma_down) * z_sort[j_down_int] + gamma_down * z_sort[j_down_int + 1];
        } else {
            trunc_pt_down = z_sort[j_down_int];
        }
        if (gamma_up > 0 && j_up < T) {
            trunc_pt_up = (1 - gamma_up) * z_sort[j_up_int] + gamma_up * z_sort[j_up_int + 1];
        } else {
            trunc_pt_up = z_sort[j_up_int];
        }
    }

    // Count how many observations fall in each segment (the truncation points
    // must be defined before this loop runs)
    for (i in 1:T) {
        if (z[i] <= trunc_pt_up && z[i] >= trunc_pt_down) {
            count_mid += 1;
        } else if (z[i] > trunc_pt_up) {
            count_high += 1;
        } else {
            count_low += 1;
        }
    }

    share_low = trunc_percent;
    share_mid = 1 - 2 * trunc_percent;
    share_high = trunc_percent;
    log_share_low = log(share_low);
    log_share_mid = log(share_mid);
    log_share_high = log(share_high);
}
parameters {
    real mu_z_mid;
    real<lower=0> sd_z_mid;
    real<lower=0> lambda_z_up;
    real<lower=0> lambda_z_down;
    real<lower=2> alpha_z_up; // constraint ensures that variance exists
    real<lower=2> alpha_z_down; // constraint ensures that variance exists
}
model {
    mu_z_mid ~ normal(0, 0.1);
    sd_z_mid ~ cauchy(1, 0.25);
    lambda_z_up ~ gamma(5, 1);
    lambda_z_down ~ gamma(5, 1);
    alpha_z_up ~ gamma(10, 1);
    alpha_z_down ~ gamma(10, 1);

    for (i in 1:T) {
        if (z[i] <= trunc_pt_up && z[i] >= trunc_pt_down) {
            z[i] ~ normal(mu_z_mid, sd_z_mid) T[trunc_pt_down, trunc_pt_up];
        } else if (z[i] > trunc_pt_up) {
            z[i] ~ pareto_type_2(trunc_pt_up, lambda_z_up, alpha_z_up);
        } else {
            (-z[i]) ~ pareto_type_2(fabs(trunc_pt_down), lambda_z_down, alpha_z_down);
        }
    }
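    // Add the (constant) log segment weights: each observation also contributes
    // the log of the fixed probability share of the segment it falls in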
    target += count_low * log_share_low + count_mid * log_share_mid + count_high * log_share_high;
}
generated quantities {
    real trunc_output_up = trunc_pt_up;
    real trunc_output_down = trunc_pt_down;
    real mu_z;
    real sd_z;
    real mu_y;
    real sd_y;
    real z_sim;
    real y_sim;
    vector[T] log_lik;

    {
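        // Implied mean and standard deviation of the piecewise distribution,
        // combining truncated normal moments for the middle segment with
        // Pareto type 2 (Lomax) moments for the two tails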
        real normal_cdf_diff;
        real a_z0;
        real b_z0;
        real mu_z0;
        real mu_z1;
        real mu_z2;
        real var_z0;
        real var_z1;
        real var_z2;
        real E2_z0;
        real E2_z1;
        real E2_z2;
        real var_z;

        normal_cdf_diff = normal_cdf(trunc_pt_up, mu_z_mid, sd_z_mid) - normal_cdf(trunc_pt_down, mu_z_mid, sd_z_mid);
        a_z0 = (trunc_pt_down - mu_z_mid) / sd_z_mid;
        b_z0 = (trunc_pt_up - mu_z_mid) / sd_z_mid;
        mu_z0 = mu_z_mid + sd_z_mid * (exp(std_normal_lpdf(a_z0)) - exp(std_normal_lpdf(b_z0))) / normal_cdf_diff;
        mu_z1 = trunc_pt_up + lambda_z_up / (alpha_z_up - 1);
        mu_z2 = -(fabs(trunc_pt_down) + lambda_z_down / (alpha_z_down - 1));

        mu_z = share_mid * mu_z0 + share_high * mu_z1 + share_low * mu_z2;

        var_z0 = 1.0;
        var_z0 = var_z0 - (b_z0 * exp(std_normal_lpdf(b_z0)) - a_z0 * exp(std_normal_lpdf(a_z0))) / normal_cdf_diff;
        var_z0 = var_z0 - ((exp(std_normal_lpdf(b_z0)) - exp(std_normal_lpdf(a_z0))) / normal_cdf_diff) ^ 2;
        var_z0 = var_z0 * (sd_z_mid ^ 2);
        var_z1 = ((lambda_z_up / alpha_z_up) ^ 2) / ((1 - 1 / alpha_z_up) ^ 2 * (1 - 2 / alpha_z_up));
        var_z2 = ((lambda_z_down / alpha_z_down) ^ 2) / ((1 - 1 / alpha_z_down) ^ 2 * (1 - 2 / alpha_z_down));

        E2_z0 = var_z0 + mu_z0 ^ 2;
        E2_z1 = var_z1 + mu_z1 ^ 2;
        E2_z2 = var_z2 + mu_z2 ^ 2;

        var_z = (share_mid * E2_z0 + share_high * E2_z1 + share_low * E2_z2) - (mu_z ^ 2);
        sd_z = sqrt(var_z);
    }
    {
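        // Posterior predictive draw of the standardized return: a uniform draw
        // selects the segment, then z_sim is generated from the corresponding
        // tail or middle distribution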
        real u_sim;
        u_sim = uniform_rng(0, 1);
        if (u_sim < share_low) {
            z_sim = -(fabs(trunc_pt_down) + lambda_z_down * (pow(1.0 - u_sim / share_low, -1.0 / alpha_z_down) - 1));
        } else if (u_sim < (1 - share_high)) {
            z_sim = inv_Phi(u_sim) * sd_z_mid + mu_z_mid;
        } else {
            z_sim = trunc_pt_up + lambda_z_up * (pow(1.0 - (u_sim - (1 - share_high)) / share_high, -1.0 / alpha_z_up) - 1);
        }
    }
    mu_y = mu_z * sd_y0 + mu_y0;
    sd_y = sd_z * sd_y0;
    y_sim = z_sim * sd_y0 + mu_y0;
    
    // Pointwise log likelihood (including the log segment weights), saved for
    // model comparison with the loo package
    for (i in 1:T) {
        if (z[i] <= trunc_pt_up && z[i] >= trunc_pt_down) {
            log_lik[i] = normal_lpdf(z[i] | mu_z_mid, sd_z_mid);
            log_lik[i] -= log_diff_exp(normal_lcdf(trunc_pt_up | mu_z_mid, sd_z_mid),
                                       normal_lcdf(trunc_pt_down | mu_z_mid, sd_z_mid));
            log_lik[i] += log_share_mid;
        } else if (z[i] > trunc_pt_up) {
            log_lik[i] = pareto_type_2_lpdf(z[i] | trunc_pt_up, lambda_z_up, alpha_z_up);
            log_lik[i] += log_share_high;
        } else {
            log_lik[i] = pareto_type_2_lpdf(-z[i] | fabs(trunc_pt_down), lambda_z_down, alpha_z_down);
            log_lik[i] += log_share_low;
        }
    }
}

Normal Distribution


data {
    int<lower=1> T;
    vector[T] y;
}
transformed data {
    vector[T] z;
    real mu_y0 = mean(y);
    real sd_y0 = sd(y);
    for (i in 1:T) {
        z[i] = (y[i] - mu_y0) / sd_y0;
    }
}
parameters {
    real mu_z;
    real<lower=0> sd_z;
}
model {
    mu_z ~ normal(0, 0.1);
    sd_z ~ normal(1, 0.1);
    for (i in 1:T) {
        z[i] ~ normal(mu_z, sd_z);
    }
}
generated quantities {
    real mu_y;
    real sd_y;
    real z_sim;
    real y_sim;
    vector[T] log_lik;
    
    mu_y = mu_z * sd_y0 + mu_y0;
    sd_y = sd_z * sd_y0;
    z_sim = normal_rng(mu_z, sd_z);
    y_sim = z_sim * sd_y0 + mu_y0;
    for (i in 1:T) {
        log_lik[i] = normal_lpdf(z[i] | mu_z, sd_z);
    }
}

Student’s t-distribution


data {
    int<lower=1> T;
    vector[T] y;
}
transformed data {
    vector[T] z;
    real mu_y0 = mean(y);
    real sd_y0 = sd(y);
    for (i in 1:T) {
        z[i] = (y[i] - mu_y0) / sd_y0;
    }
}
parameters {
    real mu_z;
    real<lower=0> sd_z0;
    real<lower=2> nu_z;
}
model {
    mu_z ~ normal(0, 0.1);
    sd_z0 ~ normal(1, 0.1);
    nu_z ~ gamma(2, 0.1);
    for (i in 1:T) {
        z[i] ~ student_t(nu_z, mu_z, sd_z0);
    }
}
generated quantities {
    real sd_z;
    real mu_y;
    real sd_y;
    real z_sim;
    real y_sim;
    vector[T] log_lik;
    
    sd_z = sd_z0 * sqrt(nu_z / (nu_z - 2));
    mu_y = mu_z * sd_y0 + mu_y0;
    sd_y = sd_z * sd_y0;
    z_sim = student_t_rng(nu_z, mu_z, sd_z0);
    y_sim = z_sim * sd_y0 + mu_y0;
    for (i in 1:T) {
        log_lik[i] = student_t_lpdf(z[i] | nu_z, mu_z, sd_z0);
    }
}

Footnotes

  1. A skew-t distribution could account for the asymmetry.

  2. Of the residuals from other statistical procedures.

  3. Referred to in Stan as a Pareto Type 2 distribution. This distribution is preferred to the Pareto distribution since it allows sufficient degrees of freedom to both control the tail and ensure that the piecewise distribution is smooth. In practice, there can be some issues with divergences in Stan when forcing continuity in the probability density function.

  4. Using data provided by FactSet.

  5. In testing, this can have the effect of causing a small number of divergences after warmup (typically fewer than 5), particularly if the tails of the distribution are assumed to be small (meaning they have less data available to be estimated on), though without an obvious impact on the results. Though a small number of divergences may not be indicative of a significant problem, the focus is kept on the version that generally avoids issues with divergences.