CmdStanPy Variational - How to get density parameters?

Hi! Below is a minimal reproducible example. I am working with the following fully conjugate normal model:

Y \sim \mathcal{N}(\mu, \sigma^2)\\ \mu \sim \mathcal{N}(\mu_0, \sigma_0^2)

In my toy model, I am setting

\mu_0 = 0, \sigma_0 = 1

and creating a toy dataset by drawing 1000 samples from a

\mathcal{N}(0.3, 10^2)

model (mean 0.3, standard deviation 10). I am trying to compare the true posterior density implied by my toy data and prior with the posterior density output by Stan's variational run on this data.
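For reference, with \sigma known this posterior is available in closed form, which is what I want to compare the ADVI output against:

\mu \mid y \sim \mathcal{N}\!\left(\mu_n, \sigma_n^2\right), \qquad \sigma_n^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad \mu_n = \sigma_n^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2}\right)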

I am able to get the posterior mean directly from the CmdStanVB object, but there does not seem to be any way to get the posterior variance other than calling np.var on the posterior samples the object returns. I know that the sample variance of those draws will differ slightly, by sampling randomness, from the variance of the converged ADVI approximation, but I need the ADVI posterior variance at convergence, not an estimate of it.

How can I get the actual ADVI posterior variance? I am eventually going to need this for other fully conjugate models and I’m concerned after learning that this was possibly removed 6 years ago.

Here is my reproducible code:



```stan
// The input data is a vector 'y' of length 'N'.
data {
  int<lower=0> N;
  vector[N] y;
  real<lower=0> sigma;
  real mu_0;
  real<lower=0> sigma_0;
}

// The parameters accepted by the model.
// The only free parameter is 'mu'; 'sigma' is supplied as data.
parameters {
  real mu;
}

// The model to be estimated. We model the output
// 'y' as normally distributed with mean 'mu'
// and (known) standard deviation 'sigma'.
model {
  mu ~ normal(mu_0, sigma_0);
  y ~ normal(mu, sigma);
}
```

```python
import os
from cmdstanpy import CmdStanModel
import numpy as np

# placeholder: path to the Stan file above
STAN_MODEL = 'path_to_stan_model.stan'

# set the parameters we will work with
n = 1000
mu = 0.3
sigma = 10
mu_0 = 0
sigma_0 = 1
NP_SEED = 10

stan_model = CmdStanModel(stan_file=STAN_MODEL)

# fix np seed and draw data
np.random.seed(NP_SEED)
y = np.random.normal(loc=mu, scale=sigma, size=n)

stan_data = {
    'N': n,
    'y': y,
    'sigma': sigma,
    'mu_0': mu_0,
    'sigma_0': sigma_0,
}

vb_fit = stan_model.variational(data=stan_data)
# vb_fit.mu returns the posterior mean, but nothing returns the posterior variance.
```


One of the reasons I'm choosing variational inference is that it directly gives me a density approximating the posterior rather than MCMC samples. I fear, however, that CmdStanPy's variational inference implementation is not going to work in that regard. Is there any way to get the ADVI posterior variance other than calling np.var on the posterior samples?

Thanks in advance!

We would need to add a summary method to the CmdStanVB object, which would use CmdStan's stansummary utility to estimate the variance. Note that VB underestimates the true variance; by how much depends on the dimensionality and the amount of correlation between parameters.

If I understand your question, the answer is no, this information is not currently preserved anywhere. Even @mitzimorris’ suggestion would be more or less equivalent to what you describe of calling np.var on everything.
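For concreteness, that workaround looks like the following sketch, continuing the script from the original post. It uses the mean=False keyword of stan_variable() discussed later in this thread, so whether that argument is accepted depends on the CmdStanPy version:

```python
import numpy as np

# Continuing the script above: estimate the mean and variance of the
# variational approximation from its own draws (the current workaround).
draws = vb_fit.stan_variable('mu', mean=False)  # all draws for 'mu'

print(np.mean(draws))  # Monte Carlo estimate of the ADVI posterior mean
print(np.var(draws))   # Monte Carlo estimate of the ADVI posterior variance
```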

If you're interested in the approximate distribution (not the samples), we're only giving you the mean, not the correlations. We could save this information; it would be similar to the recent-ish features that save the Hessian from the Laplace approximation, or the improvement to how the metric from HMC is saved.

What would need to be written out depends on whether you're doing meanfield or fullrank.
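Roughly speaking, the two families are parameterized differently on the unconstrained scale, so what would need to be saved differs. The snippet below is a hypothetical illustration of those family parameters, not an actual CmdStan output format:

```python
import numpy as np

# Hypothetical illustration of the family parameters ADVI would need to save.
D = 3  # number of unconstrained parameters

# meanfield: a mean vector and one log standard deviation per coordinate,
# so the implied covariance is diagonal (no correlations).
mf_mu = np.zeros(D)
mf_omega = np.zeros(D)                  # log standard deviations
mf_cov = np.diag(np.exp(2 * mf_omega))  # diagonal covariance

# fullrank: a mean vector and a lower-triangular Cholesky factor L,
# so the implied covariance is dense and carries correlations.
fr_mu = np.zeros(D)
fr_L = np.eye(D)
fr_cov = fr_L @ fr_L.T
```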

I see. So if I'm understanding correctly, my best bet is either to alter the C++ code to return the covariance, or to take the np.var of the posterior variational samples and use that as my approximation of the density's variance.

To clarify, the available posterior variational samples are all drawn from the converged density parameters, correct?

Thank you to you both for these detailed answers! I have a much better understanding of what is happening with cmdstanpy variational now.

Correct. As far as modifying the C++, I’ve opened a ticket to request that functionality: ADVI: Option to save family parameters to JSON · Issue #3315 · stan-dev/stan · GitHub

Note that as @mitzimorris said, use these estimates with care, as they’re more or less known to be incorrect for wide families of problems.

Yes. After optimization, the algorithm generates the requested number of draws from a standard normal and then transforms them using the learned mean and covariance.
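In numpy terms, a rough sketch of that procedure looks like the following; mu_hat and L_hat are hypothetical names standing in for the learned mean and Cholesky factor, and the real draws additionally go through the model's constraining transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of how the output draws are produced after optimization.
# For meanfield, L_hat is diagonal; for fullrank it is a dense Cholesky factor.
D, S = 3, 1000
mu_hat = np.zeros(D)
L_hat = np.eye(D)

z = rng.standard_normal((S, D))        # standard normal draws
unconstrained = mu_hat + z @ L_hat.T   # shift/scale by the learned parameters
# each row is then pushed through the model's constraining transforms
# (e.g. exp() for a <lower=0> parameter) before being written out
```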


Why do you want the parameters of the approximate density? They’re not directly useful for inference because they are on the unconstrained scale. For example, a parameter sigma declared with a <lower=0> constraint is going to be log transformed in the approximate density, and a covariance matrix will be replaced with its Cholesky factor with a log-transformed diagonal, and so on.

The sample returned by ADVI will be created by (1) taking normal draws on the unconstrained scale, and (2) applying the constraining transform. If you want to perform downstream inference, you can just use these in a plug-in manner. If you want to try to use the unconstrained normal approximation for inference, you will need to do the transform, but any non-linear transform won’t transform expectations, just draws, because f(\mathbb{E}[\theta \mid y]) \neq \mathbb{E} [f(\theta) \mid y] if f is non-linear.
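As a quick numerical illustration of that last point, take a <lower=0> parameter whose unconstrained approximation is a standard normal, so the constraining transform is exp:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unconstrained approximation for a <lower=0> parameter: standard normal.
z = rng.normal(loc=0.0, scale=1.0, size=100_000)

print(np.exp(np.mean(z)))  # f(E[theta]): exp(0) = 1.0
print(np.mean(np.exp(z)))  # E[f(theta)]: about exp(0.5) = 1.65 (lognormal mean)
```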

I like the idea of making the inferred posterior (co)variances savable somehow. While they may not be useful for inference, they still provide additional information about the variational inference output. One follow-up question:

If my CmdStanVB object is named vb_fit, there is a note on the stan_variable() function that the default behavior will soon be changed to mean=False. I have noticed that

np.mean(vb_fit.stan_variable('mu', mean=False))

and

vb_fit.stan_variable('mu', mean=True)

are returning different values. Is a weighted mean being computed for the mean=True parameter?

Also, to clarify for myself, I thought that variational Bayes would return updated posterior values for my mu_0 and sigma_0 prior parameters, but instead I'm getting 1000 "variational samples." Why is that? What specifically is being Monte Carlo approximated to give me sample outputs?

Thank you again for this enlightening discussion!

I’m curious as to what you’d use it for. The reason it’s not in Stan is that we can’t figure out what it would be useful for. Maybe comparing two runs of variational inference? It wouldn’t be that hard to instrument our VI methods to save this, but we’d need some motivation.

Our doc for stan_variable() in VB is confusing. I would not trust using mean=True, which indicates that you get the approximate variational mean. We don't know how to extract a variational mean for a constrained parameter, so I have no idea what this is doing, nor would I trust it for parameters that have constraints. The mean of a non-linear transform is not the non-linear transform of the mean.

We are, actually, trying to compare multiple runs of variational inference (something something variational inference theory something something). The issue I'm running into is that the Stan implementation of variational inference relies on Monte Carlo approximations that I have yet to find clear explanations of or justification for, beyond discovering via the reference manual and trial and error that this is what is happening.

I agree that the doc is confusing. What, then, is stan_variable with mean=True returning? I also agree that non-linear transforms cause issues when taking expectations of transformed parameters.

It's just giving you the value from the first row of the output CSV file.

It's also entirely expected that taking the mean of the rest of the rows (which are samples centered at that point) would give you something different, due to sampling noise.
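Concretely, continuing the script from the original post, the two calls discussed above map onto those two things; exact keyword behavior depends on the CmdStanPy version:

```python
import numpy as np

# Continuing the script above: the two calls discussed in this thread.
point_estimate = vb_fit.stan_variable('mu', mean=True)   # the first CSV row
draws = vb_fit.stan_variable('mu', mean=False)           # the draws below it

print(point_estimate)  # the algorithm's reported mean for 'mu'
print(np.mean(draws))  # sample mean of the draws: close, but not identical
```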

I'll channel @andrewgelman and suggest using \widehat{R}. The draws should probably give you about as much accuracy as you need for that. I agree that having the approximate normal used in the fit would be more elegant, but it would only let you measure the difference in the unconstrained parameters, whereas what you probably want is to measure the difference in the constrained space.
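For what it's worth, here is a minimal numpy sketch of that idea, treating each independent variational() run as one "chain" and computing a basic split-\widehat{R} over the constrained draws. A library implementation (e.g. ArviZ's rhat) would be the more robust choice, and the vb_fits list below is hypothetical:

```python
import numpy as np

def split_rhat(chains):
    """Basic split R-hat for draws of one scalar, shape (n_runs, n_draws)."""
    half = chains.shape[1] // 2
    # split each run in half so within-run drift is also detected
    split = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    n = split.shape[1]
    within = split.var(axis=1, ddof=1).mean()       # W: within-run variance
    between = n * split.mean(axis=1).var(ddof=1)    # B: between-run variance
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)

# e.g. with draws of 'mu' from several independent variational() runs:
# runs = np.stack([fit.stan_variable('mu', mean=False) for fit in vb_fits])
# print(split_rhat(runs))
```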

Here are the two papers that explain all the details of our two variational inference systems:

Both of them use the standard Monte Carlo estimate of the ELBO (negative KL divergence plus a constant). Variance in this estimator is one of the main reasons for differing behavior across runs of variational inference. It can be reduced by using a larger sample size for the Monte Carlo estimate, but that's expensive unless you're running on a GPU (which Stan cannot do).
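Schematically, with \zeta the unconstrained parameters, T the constraining transform, and q_\phi the Gaussian approximation (whose entropy \mathbb{H}[q_\phi] is available in closed form), that estimator is

\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi}\!\left[\log p\big(y, T(\zeta)\big) + \log\left|\det J_T(\zeta)\right|\right] + \mathbb{H}[q_\phi] \approx \frac{1}{S}\sum_{s=1}^{S}\Big(\log p\big(y, T(\zeta_s)\big) + \log\left|\det J_T(\zeta_s)\right|\Big) + \mathbb{H}[q_\phi], \qquad \zeta_s \sim q_\phi.

In CmdStan the relevant Monte Carlo sample sizes are controlled by the grad_samples and elbo_samples arguments to the variational method, if I have the argument names right.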

Good question. I have no idea. The only way we have to compute the posterior mean on the constrained scale we care about is to generate draws on the unconstrained scale, transform, and calculate means. The problem is that the transform of a mean is not the mean of a transform when the transform is non-linear.