I am trying to conduct bayesian inference in a multidimensional setting. I’m using a multivariate normal prior for my prior mean and an inverse wishart distribution for my prior covariance matrix. When I test my code in a low-dimensional setting (10-30), it works fine. But in higher dimensional settings (50+) it fails because of what appears to be a poorly specified prior.
I’m using ScalaStan and here is my code:
val sampleSize = 1000
val assetCount: Int = 500
val dataset = List.fill(sampleSize)(0.0)
.map(x => List.fill(assetCount)(x + math.min(math.max(scala.util.Random.nextGaussian(), -3), 3)))
OutputCell.HIDDEN
val muPriorMu: List[Double] = List.fill(assetCount)(0)
val muPriorSigma: List[List[Double]] = DenseMatrix.eye[Double](assetCount)
.toArray
.toList
.map { case 0 => 0.25; case x => x }
.grouped(assetCount)
.map(x => x.toList)
.toList
val sigmaPriorNu: Double = assetCount + 1
val sigmaPriorSigma: List[List[Double]] = muPriorSigma
class multiVarNormalClass(
dataset: List[List[Double]],
muPriorMu: List[Double],
muPriorSigma: List[List[Double]],
sigmaPriorNu: Double,
sigmaPriorSigma: List[List[Double]],
sampleSize: Int,
assetCount: Int
) extends ScalaStan {
val n = data(int(lower = 0))
val y = data(vector(assetCount)(n))
val muPrior_mu = data(vector(assetCount))
val muPrior_sigma = data(covMatrix(assetCount))
val sigmaPrior_nu = data(real(lower = assetCount - 1.0))
val sigmaPrior_sigma = data(covMatrix(assetCount))
val mu = parameter(vector(assetCount))
val sigma = parameter(covMatrix(assetCount))
val model = new Model {
mu ~ stan.multi_normal(muPrior_mu, muPrior_sigma)
sigma ~ stan.inv_wishart(sigmaPrior_nu, sigmaPrior_sigma)
for (i <- range(1, n)) {
y(i) ~ stan.multi_normal(mu, sigma)
}
}
val results = model
.withData(n, sampleSize)
.withData(y, dataset)
.withData(muPrior_mu, muPriorMu)
.withData(muPrior_sigma, muPriorSigma)
.withData(sigmaPrior_nu, sigmaPriorNu)
.withData(sigmaPrior_sigma, sigmaPriorSigma)
.run(chains = 5, method = RunMethod.Sample())
val muPosteriorMean = results
.mean(mu)
.toArray
val sigmaPosteriorMean = results
.mean(sigma)
.map(_.toArray)
.toArray
def mahDistance(residualVector: Array[Double]): Double = {
val mahDist = new MahalanobisDistance(
sigmaPosteriorMean
)
mahDist
.d(residualVector, muPosteriorMean)
}
def getMu: Array[Double] = muPosteriorMean
def getSigma: Array[Array[Double]] = sigmaPosteriorMean
}
val myClass = new multiVarNormalClass(
dataset,
muPriorMu,
muPriorSigma,
sigmaPriorNu,
sigmaPriorSigma,
sampleSize,
assetCount
)
This code generates the following Stan program:
data {
vector[500] muPrior_mu;
cov_matrix[500] muPrior_sigma;
real<lower=499.0> sigmaPrior_nu;
cov_matrix[500] sigmaPrior_sigma;
int<lower=0> n;
vector[500] y[n];
}
parameters {
vector[500] mu;
cov_matrix[500] sigma;
}
model {
mu ~ multi_normal(muPrior_mu,muPrior_sigma);
sigma ~ inv_wishart(sigmaPrior_nu,sigmaPrior_sigma);
for(ss_v1 in 1:n) {
y[ss_v1] ~ multi_normal(mu,sigma);
}
}
I declare the dimensionality of the space with the variable “assetCount”, so in the code I’ve pasted above the dimensionality is 500. I’m creating four priors:
muPriorMu: assetCount x 1 vector, all 0s
muPriorSigma: assetCount x assetCount covariance matrix, 1s on the diagonal and 0.25s off the diagonal
sigmaPriorNu: double, assetCount x 2
sigmaPriorSigma: same as muPriorSigma
The dataset is simulated draws from independent univariate gaussians capped at +/-3, but my interpretation of my error is that the dataset isn’t relevant here.
When I set assetCount to say 10, the program runs fine. At around 30, it runs but slowly. At say 500, I get the following error (this is the tail end of the 100 attempts to initialize):
2018-10-18 19:14:21:700 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ - Exception: inv_wishart_lpdf: LDLT_Factor of random variable is not positive definite. last conditional variance is 1.11022e-15. (in '/home/beakerx/.scalastan/7e4871a2f86b6a32dd8e19055afced54fb565d44-1/model.stan' at line 15)
2018-10-18 19:14:21:700 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ -
2018-10-18 19:14:21:787 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ - Rejecting initial value:
2018-10-18 19:14:21:787 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ - Error evaluating the log probability at the initial value.
2018-10-18 19:14:21:787 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ - Exception: inv_wishart_lpdf: LDLT_Factor of random variable is not positive definite. last conditional variance is 1.55431e-15. (in '/home/beakerx/.scalastan/7e4871a2f86b6a32dd8e19055afced54fb565d44-1/model.stan' at line 15)
2018-10-18 19:14:21:787 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ -
2018-10-18 19:14:21:787 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ -
2018-10-18 19:14:21:787 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ - Initialization between (-2, 2) failed after 100 attempts.
2018-10-18 19:14:21:787 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ - Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
2018-10-18 19:14:21:788 +0000 [ForkJoinPool-1-worker-11] INFO CommandRunner$ - Initialization failed.
2018-10-18 19:14:21:791 +0000 [ForkJoinPool-1-worker-11] ERROR CmdStanRunner - model returned 70
This appears to be a problem with the inverse wishart prior. My first attempted remediation was to go from a pure identity matrix to a matrix with off diagonal values of 0.25, but this doesn’t fix anything. My next was to constrain the dataset to be only between +/-3, but I don’t think the dataset has anything to do with the problem. Any pointers would be much appreciated!