Stochastic volatility and loo

Hi, I was trying to fit the example stochastic volatility model (SVM) from section 2.5, “Stochastic volatility models”, of the Stan User’s Guide

// SVM.stan
data {
  int<lower=0> T;   // # time points (equally spaced)
  vector[T] y;      // mean corrected return at time t
}
parameters {
  real mu;                     // mean log volatility
  real<lower=-1,upper=1> phi;  // persistence of volatility
  real<lower=0> sigma;         // white noise shock scale
  vector[T] h_std;  // std log volatility time t
}
transformed parameters {
  vector[T] h = h_std * sigma;  // now h ~ normal(0, sigma)
  h[1] /= sqrt(1 - phi * phi);  // rescale h[1]
  h += mu;
  for (t in 2:T)
    h[t] += phi * (h[t-1] - mu);
}
model {
  phi ~ uniform(-1, 1);
  sigma ~ cauchy(0, 5);
  mu ~ cauchy(0, 10);
  h_std ~ std_normal();
  for (t in 1:T)
    y[t] ~ normal(0, exp(h[t] / 2));
}
generated quantities { // this part is mine
  vector[T] log_lik;  // pointwise log likelihood, one entry per observation
  for (t in 1:T)
    log_lik[t] = normal_lpdf(y[t] | 0, exp(h[t] / 2));
}

to some example financial data:

library(quantmod)
library(cmdstanr)
getSymbols("AAPL") # load the AAPL prices
y <- diff(log(as.vector(AAPL$AAPL.Close))) * 100 # log returns, in percent
y[1] <- 0 # set the first return to 0
SVM <- cmdstan_model("SVM.stan") # load the example model
svm <-  SVM$sample(data = list(T = length(y), y = y), chains=4, parallel_chains = 4)

In general, with this data or other similar data, the loo results are poor:

svm$loo()

Computed from 4000 by 3713 log-likelihood matrix

         Estimate    SE
elpd_loo  -7235.1  59.4
p_loo       360.6  15.7
looic     14470.2 118.8
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     3431  92.4%   453       
 (0.5, 0.7]   (ok)        201   5.4%   103       
   (0.7, 1]   (bad)        73   2.0%   20        
   (1, Inf)   (very bad)    8   0.2%   5         
See help('pareto-k-diagnostic') for details.
Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

It must be said that the equivalent GARCH models do just fine and show no problems with loo on the same data, or any other similar data.
Maybe I am calculating the posterior predictive log likelihood wrong in the generated quantities block?

Thank you

What do you mean by “results are poor”? Do you mean the diagnostic warnings, or is there something else you interpret as a poor result?

As there are T log volatility parameters and T observations, it is very likely that some observations are highly influential for the corresponding log volatility parameter. See “Interpreting p_loo when Pareto k is large” in the LOO package glossary (loo-glossary in the loo documentation).

If you used rstan instead of cmdstanr, you could try moment matching (see the loo vignette “Avoiding model refits in leave-one-out cross-validation with moment matching”).

With cmdstanr, you could refit the model for some of the high-k cases and use the subsampling idea to estimate what the result would be if you refit for all of the high khats (see the loo vignette “Using Leave-one-out cross-validation for large data”).
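As a rough sketch of the first step, finding which observations would need a refit, the loo package can list the indices whose Pareto k exceeds a threshold. The log-likelihood matrix below is a toy stand-in (a simple normal model with one deliberate outlier), not the stochastic volatility model, and the 0.7 threshold is taken from the "bad" band in the diagnostic table above.

```r
library(loo)

set.seed(1)
S <- 1000   # posterior draws (assumed: 4 chains of 250)
N <- 50     # observations

# Toy stand-in for the model's log_lik: y ~ normal(theta, 1) with fake
# posterior draws of theta. With a real fit, use the log_lik draws instead.
y <- c(rnorm(N - 1), 6)                      # last point is a deliberate outlier
theta <- rnorm(S, mean(y), 1 / sqrt(N))
log_lik <- sapply(y, function(yi) dnorm(yi, theta, 1, log = TRUE))  # S x N matrix

# Relative efficiency needs to know which draw came from which chain.
r_eff <- relative_eff(exp(log_lik), chain_id = rep(1:4, each = S / 4))
loo_fit <- loo(log_lik, r_eff = r_eff)

k <- pareto_k_values(loo_fit)                      # one k-hat per observation
high_k <- pareto_k_ids(loo_fit, threshold = 0.7)   # indices with k-hat > 0.7
print(high_k)
```

With the fit from this thread, `svm$loo()` already returns a loo object, so `pareto_k_ids(svm$loo(), threshold = 0.7)` should give the candidate observations to refit.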

Thanks. I forgot to mention that I get loo warnings even when fitting fake data, for example:

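N <- 500
mu <- -1.02    # mean log volatility
phi <- 0.95    # persistence of volatility
sigma <- 0.25  # white noise shock scale
h <- numeric(N)  # latent log volatilities
y <- numeric(N)  # simulated returns
h[1] <- rnorm(1, mu, sigma / sqrt(1 - phi^2))  # draw from the stationary distribution
y[1] <- 0  # first return is fixed at 0 rather than simulated
for (i in 2:N) {
  h[i] <- rnorm(1, mu + phi * (h[i-1] - mu), sigma)
  y[i] <- rnorm(1, 0, exp(h[i] / 2))
}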

svm <-  SVM$sample(data = list(T = length(y), y = y), parallel_chains = 4)
svm$loo()
Computed from 4000 by 500 log-likelihood matrix

         Estimate   SE
elpd_loo   -579.4 17.3
p_loo        27.5  2.6
looic      1158.8 34.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     483   96.6%   1278      
 (0.5, 0.7]   (ok)        15    3.0%   316       
   (0.7, 1]   (bad)        2    0.4%   175       
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

which seems quite unusual to me. How can I get highly influential observations from fake data generated directly from the generative model?

The LOO glossary that I mentioned in my reply says (with some bolding added here)

If p_loo < p and the number of parameters p is relatively large compared to the number of observations (e.g., p>N/5 ), it is likely that the model is so flexible or the population prior so weak that it’s difficult to predict the left out observation (even for the true model). This happens, for example, in the simulated 8 schools (in VGG2017), random effect models with a few observations per random effect, and Gaussian processes and spatial models with short correlation lengths.

Your model corresponds to a “random effect model” with one observation per “random effect”.
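The glossary's condition can be checked with the numbers already posted for the AAPL run. This is only a small arithmetic sketch; it counts the parameters as the T entries of h_std plus mu, phi, and sigma, which is my reading of the Stan program above.

```r
# Numbers copied from the AAPL loo output earlier in the thread.
N <- 3713        # observations (columns of the log-likelihood matrix)
p <- N + 3       # parameters: h_std[1..T] plus mu, phi, sigma
p_loo <- 360.6   # effective number of parameters reported by loo

many_params <- p > N / 5   # 3716 > 742.6
below_p <- p_loo < p       # 360.6 < 3716

cat("p =", p, " N/5 =", N / 5, "\n")
cat("p > N/5:", many_params, " p_loo < p:", below_p, "\n")
```

Both conditions hold, and the same check with N = 500 and p_loo = 27.5 from the fake-data run gives the same answer.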