Feedback requested on possible subtle pathology of typical multivariate specification (or maybe bug?)

I first noticed this in real data, and finally have a fairly minimal reproducible example, as well as a fix, so I thought I’d post for folks to see and provide feedback. It would not surprise me if either I’ve made a mistake or if this is a known pathology. Attaching two models and R code here:
mvn_reliability.R (19.7 KB)
mvn.stan (1.7 KB)
pairwise.stan (2.7 KB)
(for the R code to work, put the stan files in a folder named stan_code)

The context is models that use ~ multi_normal (and the Cholesky and non-centered variants thereof) to express that multiple parameters/observables may be related.

Take, for example, multiple “things” I’ll call coefs (because I originally observed this in a hierarchical model, and lazily didn’t rename things for this non-hierarchical minimal example), each observed from multiple human subjects. These could be things like height, weight, income, etc., measured once for each subject. To model the relationships among the coefs, it’s common to just use a multivariate normal, with Stan code being:

data{

	// num_subj: number of subj
	int<lower=1> num_subj ;

	// num_coef: number of coefficients modelled as intercorrelated
	int<lower=1> num_coef ;

	// subj_coef_array: array of values for each subj & coef
	vector[num_coef] subj_coef_array[num_subj] ;

	// num_cors: number of correlations implied by the number of coefficients; user-supplied bc Stan is strict in not permitting conversion from real to int.
	//   Must be: ((num_coef^2)-num_coef)/2
	int num_cors ;

}
parameters{

	// mean_coef: mean (across subj) for each coefficient
	vector[num_coef] mean_coef ;

	// sd_coef: sd (across subj) for each coefficient
	vector<lower=0>[num_coef] sd_coef ;

	// chol_corr: population-level correlations (on cholesky factor scale) amongst within-subject predictors
	cholesky_factor_corr[num_coef] chol_corr ;

}
model{

	////
	// Priors
	////

	// normal(0,1) priors on all means
	mean_coef ~ std_normal() ;

	// weibull(2,1) priors on all sds (peaked at ~.8)
	sd_coef ~ weibull(2,1) ;

	// flat prior across sets of correlation matrices
	chol_corr ~ lkj_corr_cholesky(1) ;

	////
	// Likelihood
	////

	// subj_coef_array as multivariate normal
	subj_coef_array ~ multi_normal_cholesky(
		rep_array(mean_coef,num_subj)
		, diag_pre_multiply(sd_coef,chol_corr)
	) ;

}
generated quantities{

	// cor_vec: the lower-tri of the correlation matrix, flattened to a vector for efficient storage
	vector[num_cors] cor_vec ;
	{ // local env to avoid saving `cor`
		matrix[num_coef,num_coef] cor = multiply_lower_tri_self_transpose(chol_corr) ;
		int k = 0 ;
		// extract lower-tri in column-major order
		for(this_cor_col in 1:(num_coef-1)){
			for(this_cor_row in (this_cor_col+1):num_coef){
				k += 1 ;
				cor_vec[k] = cor[this_cor_row,this_cor_col] ;
			}
		}
	}

}
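
For reference, the data list this model expects from R would look something like this (just a sketch; the values are placeholders matching the scenario below, and the matrix rows map to the array-of-vectors dimension):

stan_data <- list(
	num_subj = 100
	, num_coef = 32
	# rows = subjects, columns = coefs; maps to vector[num_coef] subj_coef_array[num_subj]
	, subj_coef_array = matrix(rnorm(100 * 32), nrow = 100, ncol = 32)
	, num_cors = (32^2 - 32) / 2
)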

Generating data for a scenario with 100 subjects and 32 coefs but all true correlations equal to zero, this model seems to be well-calibrated. The following plot shows a representation of the posterior for each of the (32*32-32)/2 = 496 unique correlations, stacked vertically, with a colored bar showing the posterior interquartile range. The color of the bar indicates whether that correlation’s posterior 80% quantile interval includes the true value of zero (blue) or not (pink); we’d expect about 20% of the 80% intervals to exclude the true value, and that’s about what we get:

In fact, looking at the 50% and 80% coverage rates quantitatively, they’re both a bit high:

> #coverage of 50% intervals
> binom.test(
+ 	x = sum(odd_mvn_cor_vec_info$fifty_covers_true)
+ 	, n = nrow(odd_mvn_cor_vec_info)
+ 	, p = .5
+ )

	Exact binomial test

data:  sum(odd_mvn_cor_vec_info$fifty_covers_true) and nrow(odd_mvn_cor_vec_info)
number of successes = 275, number of trials = 496, p-value = 0.01724
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5094720 0.5987465
sample estimates:
probability of success 
             0.5544355

> #coverage of 80% intervals
> binom.test(
+ 	x = sum(odd_mvn_cor_vec_info$eighty_covers_true)
+ 	, n = nrow(odd_mvn_cor_vec_info)
+ 	, p = .8
+ )

	Exact binomial test

data:  sum(odd_mvn_cor_vec_info$eighty_covers_true) and nrow(odd_mvn_cor_vec_info)
number of successes = 431, number of trials = 496, p-value =
6.483e-05
alternative hypothesis: true probability of success is not equal to 0.8
95 percent confidence interval:
 0.8360367 0.8973839
sample estimates:
probability of success 
             0.8689516
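
(For reference, the coverage flags fed to those tests can be computed from the posterior draws along these lines — just a sketch, where `cor_draws` is a hypothetical iterations-by-correlations matrix of posterior draws and `true_cor` the vector of true values:)

fifty_bounds <- apply(cor_draws, 2, quantile, probs = c(.25, .75))
eighty_bounds <- apply(cor_draws, 2, quantile, probs = c(.1, .9))
odd_mvn_cor_vec_info <- data.frame(
	fifty_covers_true = (true_cor >= fifty_bounds[1,]) & (true_cor <= fifty_bounds[2,])
	, eighty_covers_true = (true_cor >= eighty_bounds[1,]) & (true_cor <= eighty_bounds[2,])
)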

A little weird, but let’s continue (though note I only noticed this latter bit while writing this up, so it may signal a bug that in turn explains the results below as well!).

What if most of the true correlations are zero, but some are quite large? Here’s the same calibration plot for this mostly-zero-but-some-large scenario, this time with dots showing the true values of the non-zero correlations:

It’s clear that all of the non-zero correlations are dramatically under-estimated!

An alternative model, which uses the same data declarations, is:

data{

	// num_subj: number of subj
	int<lower=1> num_subj ;

	// num_coef: number of coefficients modelled as intercorrelated
	int<lower=1> num_coef ;

	// subj_coef_array: array of values for each subj & coef
	vector[num_coef] subj_coef_array[num_subj] ;

	// num_cors: number of correlations implied by the number of coefficients; user-supplied bc Stan is
	// strict in not permitting conversion from real to int.
	//   Must be: ((num_coef^2)-num_coef)/2
	int num_cors ;
}
transformed data{

	// subj_coef_mat: just a reformat of subj_coef_array for convenience
	matrix[num_subj,num_coef] subj_coef_mat ;
	for(this_subj in 1:num_subj){
		subj_coef_mat[this_subj] = transpose(subj_coef_array[this_subj]) ;
	}

	// corz_se: standard error for correlations on Fisher's Z scale
	real corz_se = 1/sqrt(num_subj-3) ;

	// obs_z: observed correlations, on Fisher's Z scale
	vector[num_cors] obs_z ;
	{
		int k = 0 ;
		vector[num_cors] obs_r ;
		// looping over lower-tri to compute correlation between each column in subj_coef
		for(this_coef in 1:(num_coef-1)){
			for(this_other_coef in (this_coef+1):num_coef){
				k += 1 ;
				obs_r[k] = (
					sum(
						( subj_coef_mat[,this_coef] - mean(subj_coef_mat[,this_coef]) )
						.* ( subj_coef_mat[,this_other_coef] - mean(subj_coef_mat[,this_other_coef]) )
					)
				) / (
					(num_subj-1) * sd(subj_coef_mat[,this_coef]) * sd(subj_coef_mat[,this_other_coef])
				) ;
			}
		}
		obs_z = atanh(obs_r) ;
	}
}
parameters{

	// mean_coef: mean (across subj) for each coefficient
	vector[num_coef] mean_coef ;

	// sd_coef: sd (across subj) for each coefficient
	vector<lower=0>[num_coef] sd_coef ;

	// corz_vec: population-level correlations (on Fisher's-Z scale) among within-subject predictors
	vector[num_cors] corz_vec ;

}
model{

	////
	// Priors
	////

	// normal(0,1) priors on all means
	mean_coef ~ std_normal() ;

	// weibull(2,1) priors on all sds (peaked at ~.8)
	sd_coef ~ weibull(2,1) ;

	// normal(0,.74) priors on all corz
	// yields flat prior on abs(cor)<.7, and diminishing to zero by abs(cor)==1
	// view in R via: hist(tanh(abs(rnorm(1e7)*.74)),br=1e4,xlim=c(0,1))
	corz_vec ~ normal(0,.74) ;


	////
	// Likelihood
	////

	// looping over coefs to express structure that each is normally distributed
	// (could this part be replaced with separate expressions for the sampling distribution of the mean and SD?)
	for(this_coef in 1:num_coef){
		subj_coef_mat[,this_coef] ~ normal(mean_coef[this_coef],sd_coef[this_coef]);
	}

	// expressing the structure for the pairwise correlations
	obs_z ~ normal( corz_vec , corz_se ) ;

}
generated quantities{

	// cor_vec: the unique correlations (same pair ordering as the first model's lower-tri extraction), flattened to a vector for efficient storage
	vector[num_cors] cor_vec = tanh(corz_vec) ;

}

And with this model we get the calibration we expect on even the high correlations:

(omitted for brevity: plots of the no-high-correlations scenario and binomial tests of calibration, all showing good performance of this second model)


If there’s a question of a bug, then we should probably do some sort of SBC thing (directions here).

I’m looking at the data-generating code. It looks like the likelihood is a multivariate normal:

rmvnorm2(
		n = 1e5 #lots!
		, Mu = sim_pars$coef_means
		, sigma = sim_pars$coef_sds
		, Rho = sim_pars$cor_mat
	)

Are there priors on the parameters we’re estimating? SBC will need those.

That’s what the first image and binomial tests represent.

I see these priors in the Stan model but do not see them in the R data generation.

mean_coef ~ std_normal() ;
sd_coef ~ weibull(2, 1) ;
chol_corr ~ lkj_corr_cholesky(1) ;

On further consideration, my assertion that the results presented are sufficient SBC fails to consider the possibility that estimating all the correlations in one model might violate the independence requirement.

The data are generated from a single sample of parameters consistent with the prior, but it would be better to run it many times.

I do note that the second model passes the weak test of calibration that the single pass I posted reflects, so I suspect that this result will persist.

Yeah, to diagnose something like this, SBC requires generating new parameters and data each time, and the generating process has to exactly match the model we’re trying to fit. If there are any differences there, then any biases we see might just be the result of the wrong model. So we’d do that test to feel good about Stan doing the right thing.
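
Schematically, one SBC replication looks something like this (a generic sketch for a single correlation, not one of the models in this thread; all the names are hypothetical, and the prior draw has to match whatever prior the Stan model puts on rho):

one_sbc_rep <- function(stan_model, N = 100) {
	# draw the parameter from its prior (this matches the lkj(2) marginal for a 2x2 matrix)
	rho_true <- 2 * rbeta(1, 2, 2) - 1
	# simulate data given that draw
	y <- MASS::mvrnorm(N, mu = c(0, 0), Sigma = matrix(c(1, rho_true, rho_true, 1), 2, 2))
	# fit the model to the simulated data
	fit <- rstan::sampling(stan_model, data = list(N = N, y = y), refresh = 0)
	rho_draws <- rstan::extract(fit, pars = "rho")$rho
	# scaled rank of the true value within the posterior draws
	mean(rho_draws < rho_true)
}
# repeated many times (e.g. ranks <- replicate(200, one_sbc_rep(my_model)); hist(ranks)),
# these ranks should look uniform on [0, 1] if everything is implemented correctly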

And it’s totally likely there’s a non-standard model that is more appropriate for the types of correlations you’re fitting. (and I am hoping this is true – that there isn’t a Stan bug and you have found a new way to fit your model that solves your problem with the correlations :P)


The first data/model is dead-standard: a multivariate normal with all true means and SDs set near the prior mode and an identity true correlation matrix. And yet the posterior correlations appear to be biased toward zero (hence a greater percentage include zero than expected).

Ok! I did SBC using rstan.

First things first: rstan::sbc is not functioning quite correctly, unless I did something wrong. The sbc S3 class only has two methods: print and plot. The plot function is not working because it relies on sbc$ranks[[1]] having column names, which it does not have, so the plot function fails.

Now onto SBC:

#######
# SBC #
#######

stan_code <- "
functions {
  vector get_lower_tri(matrix mat) {
    int P = rows(mat);
    int unique = P * (P - 1) / 2;
    vector[unique] lower;
    {

      int index = 1;
      for(r in 1:P) {
        for(c in 1:(r - 1)) {
          lower[index] = mat[r,c];
          index += 1;
        }
      }
    }

    return(lower);
  }
}
data {
  int N;
  int P;
}

transformed data {
  int N_unique = (P * (P - 1) / 2);
  vector[P] mus = rep_vector(0.0, P);
  vector[P] y[N];
  cholesky_factor_corr[P] corL_ = lkj_corr_cholesky_rng(P, 1);
  matrix[P,P] cor_ = multiply_lower_tri_self_transpose(corL_);
  vector[N_unique] corLower_ = get_lower_tri(cor_);
  for(n in 1:N) {
    y[n] = multi_normal_cholesky_rng(mus, corL_);
  }
}

parameters {
  cholesky_factor_corr[P] corL;
}

model {
  corL ~ lkj_corr_cholesky(1);
  y ~ multi_normal_cholesky(mus, corL);
}

generated quantities {
  vector[P] y_[N] = y;
  vector[N_unique] pars_ = corLower_;
  vector[N_unique] lower_posterior = get_lower_tri(multiply_lower_tri_self_transpose(corL));
  int<lower=0,upper=1> ranks_[N_unique];
  for(p in 1:N_unique) {
    ranks_[p] = lower_posterior[p] > corLower_[p];
  }

}

"

sbc_model <- stan_model(model_code = stan_code)

sbc_data <- list(N = 500, P = 30)

sbc_out <- sampling(sbc_model, sbc_data)
sbc_out <- sbc(sbc_model, sbc_data, M = 100, refresh = 250, iter = 500)

plot.sbc2 <- function(x, thin = 3, ...) {
    thinner <- seq(from = 1, to = nrow(x$ranks[[1]]), by = thin)
    u <- t(sapply(x$ranks, FUN = function(r) 1L + colSums(r[thinner, 
        , drop = FALSE])))
    valid_names <- paste0("r_", seq_len(ncol(u)))
    parameter <- as.factor(rep(valid_names, each = nrow(u)))
    d <- data.frame(u = c(u), parameter)
    (ggplot2::ggplot(d) + ggplot2::geom_freqpoly(ggplot2::aes(x = u), 
        ...) + ggplot2::facet_wrap("parameter"))
    ## d
}

d <- plot.sbc2(sbc_out)

That yields a nearly indecipherable, but definitely-not-uniform plot of all unique lower-triangular correlations.

EDIT: I forgot; this also had an enormous number of divergences.

> sbc_out
82 chains had divergent transitions after warmup
there were a total of 11588 divergent transitions across all chains


Is it possible to pull out what the ESSs were? These horns are a sign of autocorrelation. I was going to suggest you try some thinning to confirm this hypothesis, but you’re running (relatively) few iterations and it might make things unstable.

Maybe the ESS and traceplot for just one replication so that we can gauge if there’s a lot of autocorrelation going on - shouldn’t be a problem with Stan, but one never knows.
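
For a single replication fit directly with sampling(), something like this would pull the ESS and a traceplot (just a sketch; `fit` stands in for that stanfit and the parameter name is only an example):

fit_summary <- summary(fit)$summary
# smallest effective sample sizes across all parameters
sort(fit_summary[, "n_eff"])[1:5]
# traceplot for one correlation parameter
traceplot(fit, pars = "corL[2,1]")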

EDIT:

Well, nevermind then =D Divergences take precedence over looking at other diagnostics.

The LKJ prior really doesn’t like high correlations. It’s definitely not flat, but some complex notion of uniform on a strange space, as best as I understand.
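
For what it’s worth, my understanding is that the implied marginal prior on any single correlation r under lkj_corr(eta) for a K x K matrix is a Beta(eta - 1 + K/2, eta - 1 + K/2) rescaled from (0,1) to (-1,1), so for large K even the “flat” eta = 1 prior concentrates a lot of mass near zero. A quick R check of that (a sketch):

K <- 32
eta <- 1
alpha <- eta - 1 + K / 2
# marginal prior draws for a single correlation: (r + 1)/2 ~ Beta(alpha, alpha)
r <- 2 * rbeta(1e5, alpha, alpha) - 1
quantile(r, c(.025, .975))  # tightly concentrated around zero for K = 32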


I don’t think any of the above is about the LKJ prior used, as I’ve also run a bunch of variants with no prior on the correlations at all and get the same pathologies (poor calibration in the context of all-zero true correlations, and underestimation of large correlations in the context of mostly-zero correlations). @Stephen_Martin has hypothesized that the latter, at least, is a consequence of the PSD requirement.


Omitting a prior declaration on the Cholesky-factor correlation matrix yields… a sorta weird object, with different implied non-generative priors on different correlations, many of which are somewhat “regularized” towards zero. For example:

data{
  int num_coef;
  
  // num_cors: number of correlations implied by the number of coefficients; user-supplied bc Stan is strict in not permitting conversion from real to int.
	//   Must be: ((num_coef^2)-num_coef)/2
	int num_cors ;
}
parameters{
	cholesky_factor_corr[num_coef] chol_corr;
}
generated quantities{
	// cor_vec: the lower-tri of the correlation matrix, flattened to a vector for efficient storage
	vector[num_cors] cor_vec ;
	{ // local env to avoid saving `cor`
		matrix[num_coef,num_coef] cor = multiply_lower_tri_self_transpose(chol_corr) ;
		int k = 0 ;
		// extract lower-tri in column-major order
		for(this_cor_col in 1:(num_coef-1)){
			for(this_cor_row in (this_cor_col+1):num_coef){
				k += 1 ;
				cor_vec[k] = cor[this_cor_row,this_cor_col] ;
			}
		}
	}
}

# R code to compile & sample the above
# (assumes the Stan code is saved as, e.g., "stan_code/prior_only_corr.stan"; the filename is hypothetical)
corr_test <- cmdstanr::cmdstan_model("stan_code/prior_only_corr.stan")
nc <- 5
corr_test_data <- list(num_coef = nc, num_cors = ((nc^2)-nc)/2)
corr_samples <- corr_test$sample(data = corr_test_data, parallel_chains = 3, chains = 3)
print(corr_samples$summary(), n=36)

I don’t know how to predict which elements of the correlation matrix will be more or less strongly regularized, or how it varies with the size of the correlation matrix.


Just an update that with a proper SBC procedure (thanks for the reminder @bbbales2!), I actually do observe good calibration for the standard multivariate normal case. So no need to panic about the core multivariate normal stuff having a bug!

That said, the phenomenon that started me on all this persists: the standard multivariate model cannot handle scenarios where most correlations are near zero but some are very high. In contrast, the new “pairwise” method I’ve developed achieves good calibration in both the all-near-zero and most-near-zero cases, notably with a pretty dramatic ESS-per-second performance improvement to boot! (For example, with 1000 subjects and a 32x32 correlation matrix, the standard MVN model takes 10 minutes while the pairwise model takes 5 seconds!)

I’m re-running the SBC after tweaking a few things and will post the latest results supporting the above claims tomorrow once they’re finished. In the interim, here’s a version of the pairwise model that’s slightly faster than the one posted earlier.
pairwise_sufficient.stan (2.9 KB)

(I worked out what I think is akin to the sufficient statistics trick but for normally distributed variables.)
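
To give a flavour of what I mean (this is just a sketch of the idea, not the exact contents of pairwise_sufficient.stan): for iid normal data the log-likelihood depends on the data only through the sample mean and sample variance, with the sample mean distributed normal(mu, sigma/sqrt(N)) and the sample variance following a gamma distribution, so the per-observation normal statements can be replaced by two statements on those summaries. A quick R check that the two forms differ only by a constant in (mu, sigma):

set.seed(1)
x <- rnorm(50, 1, 2)
N <- length(x) ; xbar <- mean(x) ; s2 <- var(x)
loglik_full <- function(mu, sigma) sum(dnorm(x, mu, sigma, log = TRUE))
loglik_suff <- function(mu, sigma) {
	dnorm(xbar, mu, sigma / sqrt(N), log = TRUE) +
	dgamma(s2, shape = (N - 1) / 2, rate = (N - 1) / (2 * sigma^2), log = TRUE)
}
loglik_full(0, 1) - loglik_suff(0, 1)  # these two differences are identical,
loglik_full(2, 3) - loglik_suff(2, 3)  # i.e. the gap does not depend on (mu, sigma)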


Does this approach ensure positive definiteness? Looks to me like it won’t but I might be missing something.

It doesn’t. Indeed, I believe that the positive-definiteness requirement is the source of the underestimation in the sparse case. That is, the PSD requirement in the typical multivariate model carries a possibly under-appreciated implied assumption that the correlation matrix is not sparse.
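
A tiny toy illustration of that constraint (not from my original code): with three variables, two large correlations plus one exact zero can be jointly impossible as a correlation matrix:

R <- matrix(
	c( 1, .9, .9
	, .9,  1,  0
	, .9,  0,  1 )
	, nrow = 3, byrow = TRUE
)
eigen(R)$values  # one eigenvalue is negative (~ -0.27), so R is not positive definite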

Interesting. Then I guess as soon as we’re talking about correlations of parameters this will break, but I’m curious to see further developments :)

I don’t follow; under what scenario are you expecting the new method to break? To clarify, the above demonstrates (I believe) that the standard multivariate normal model implies a subtle assumption that is violated by a relatively common use case (at least, common enough that I came to discover this in my applied work). The above also describes an alternative to the multivariate normal that still has correlations but in a less stringently constrained manner than the standard multivariate normal. I wouldn’t be surprised if my proposed alternative has some drawback, but it at least doesn’t manifest the pathology I’ve observed with the multivariate normal.

I think you’re right, in that typical priors for MVNs don’t seem to cope well with the scenario you pose, and that because of the way variables tend to be selected and data gathered, this scenario occurs more often than might be expected under some kind of agnostic / naive view of the data-gathering process, for which things like a uniform prior on some correlation space might seemingly make sense.

But I expect your proposal to break down / struggle if you apply it to a scenario where the elements whose correlations you want to estimate are themselves being estimated – e.g. instead of passing the data in, you are sampling it. I might turn out to be wrong, in which case I’m very curious :)

I’m still verifying the performance in the hierarchical case, but indeed your intuition impressively anticipated that it would still struggle. It’s definitely still better than the multivariate normal in that case, but it is still manifesting a bias toward zero. I’m trying to work out whether this is a limitation of the model vs a limitation of the math of measurement.


Hey folks, here’s a slightly-belated follow-through on my promise to post code/visuals of the checks I’ve run. First, a reminder that we’re dealing with a data scenario with K outcomes measured in each of N units of observation, which is normally modelled with a multivariate normal in which we have as parameters K means, K sds, and a correlation matrix with \frac{K^{2}-K}{2} unique entries.

First a check that the models are implemented correctly using data generated from the priors across multiple values for N & K:

On recovery:
Recovery is my term for whether a given posterior quantile interval on a parameter contains that parameter’s true value. When the data are generated from the prior, we expect a posterior X% interval to contain the data-generating parameter value X% of the time. Full SBC would actually derive the rank of the data-generating parameter value in each posterior, yielding a distribution of ranks that should be uniform and thereby enabling a variety of visual and quantitative checks on fit-to-uniform that can inform on precisely why any misfit is occurring. But the tool I used for this exploration couldn’t implement full SBC in this manner, so I instead chose to assess recovery for both 50% and 80% intervals:

In this plot, the grey bands show the 89% band of the binomial sampling distribution given true rates of .5 and .8 and sampling effort equal to the number of simulations actually run. If points fall either frequently or substantially outside these ranges, we should be worried. There does seem to be some over-recovery in the case of the pairwise model for the sd and correlation parameters when N is low. One might naively consider greater-than-expected recovery to be a good thing, but as discussed below, this is not the case.
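
(For reference, those bands can be computed along these lines; n_sims is a hypothetical number of simulation replicates:)

n_sims <- 200  # placeholder; use the actual number of simulations run
# central 89% band for the observed recovery proportion under true rates of .5 and .8
band_50 <- qbinom(c(.055, .945), size = n_sims, prob = .5) / n_sims
band_80 <- qbinom(c(.055, .945), size = n_sims, prob = .8) / n_sims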

On Bias:
The typical central location of the posterior for a given parameter relative to the true data-generating value can inform on whether a model/implementation is “biased” to yield posteriors shifted either above or below the true value. Here is the bias for each parameter type (note that for the correlation variables, bias was computed after first converting both the posterior samples and the true values to Fisher’s Z scale via atanh()):

I may have to run more iterations of the simulation to discern whether there truly is any non-negligible bias for the mean and correlation parameters, but certainly it looks like both model types yield biased estimation of the sds. I’m not quite sure what to make of this!

On Uncertainty:
The typical width (defined, for example, by the inter-quartile range, as used here) of the posterior for a given parameter can inform on how confident one might be expected to be following analysis of a given volume of data with a given model. Here’s the uncertainty for each parameter (note that for the correlation variables, uncertainty was computed after first converting the posterior samples to Fisher’s Z scale via atanh()):

Here we see the expected pattern whereby, for both models, more data (greater N) yields less posterior uncertainty (note the different axis scales). The two models appear to differ, however, in their posterior uncertainty for both the sd and correlation parameters (possibly the mean parameters too at low N), such that the pairwise model is consistently less certain than the MVN model.

This is consistent with the earlier observation of over-recovery for these variables; the posterior distributions on the pairwise parameters are wider, leading to a higher-than-expected rate of capturing the true data-generating values. In this case, this counts as a point against the pairwise model, because it means we learn less from a given volume of data, with the corollary that the credibility of a given non-true parameter value remains higher after seeing the data than it should. As an example, this would mean that a posterior decision framework comparing a given correlation against zero (or a region of practical equivalence to zero) would be expected to have a lower rate of rejecting zero under circumstances where the true parameter value genuinely is non-zero.

On time:


In this plot, each ribbon shows the interquartile range of the full warmup+sampling time for each model across the values of K and for each value of N. The y-axis has a log scale (though labels are in seconds), showing a pretty dramatic difference between the models as the size of the correlation matrix grows. So possibly folks with very large correlation matrices to estimate would accept the greater uncertainty of the pairwise model in return for its much-reduced compute time. (I specifically have plans to explore soon whether this might be applied to speed up Gaussian Processes.)

Intermission:
Ok, so under the conditions where the multivariate model is a perfect match to the data-generation process, we generally get expected results, and we observe a time-uncertainty tradeoff for the pairwise model.

But this exploration was kicked off by observations of a different data-generating process, so let’s look at that next. We generate data similarly to above, but with the tweak that after generating a first set of K outcomes across N units of observation, we don’t model these directly but instead use them as a basis to generate two new sets of data, each with columns that differ from the original set by a small degree. Within each resulting data set, the columns should still correlate roughly in a manner close to the prior, but if we bind both together to form an N\times(K\times2) data matrix, the i^{th} column should correlate strongly with the (i+K)^{th} column. I’ll call these expected-to-be-high correlations “reliabilities”/“rel” below, using a term from the context in which I originally observed this phenomenon, whereby this kind of data is generated in experiments seeking to measure test-retest reliability.
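
A minimal R sketch of that kind of generator (the names and noise scale are just placeholders; the real generator draws the first set of outcomes from the prior rather than independently):

N <- 100 ; K <- 8
base <- matrix(rnorm(N * K), N, K)  # stand-in for the first set of K outcomes
noise_sd <- .3  # small perturbation, so the implied "reliabilities" are high
y1 <- base + matrix(rnorm(N * K, sd = noise_sd), N, K)
y2 <- base + matrix(rnorm(N * K, sd = noise_sd), N, K)
dat <- cbind(y1, y2)  # N x (2K); column i correlates strongly with column i + K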

On recovery:


For the pairwise model, we see a similar over-recovery as earlier, but with the traditional MVN we see under-recovery of seemingly all of the model parameters, even at high N.

On bias:


Here we see a reason for the MVN’s under-recovery: it is yielding strongly biased posteriors even at high N. The new pairwise model is also yielding biased posteriors, but often substantially less so.

On uncertainty:


As before, we see that across the board the Pairwise approach yields greater posterior uncertainty than the traditional MVN.

On Compute Times:


As before, we see a pretty substantial speedup for computation of the Pairwise model relative to the traditional MVN. (If comparing against the prior timing plot, remember that thanks to the new data-generation process, there are many more parameters in the models here than before.)

Summary:
This exploration has illuminated a (to me) subtle incompatibility between the traditional MVN model and a specific-but-not-rare data scenario. I was initially enthused by the performance of the pairwise trick I came up with as a fix, but was even then wary of how non-generative it felt. After a few days of sitting with it (and particularly going through the process of creating a data generator for the above), I realized there’s a super straightforward SEM-style alternative that is far more generative in spirit and that I hope will yield better performance than my pairwise hack. I’ll post a new presentation of that model’s performance when I find the time to implement & evaluate it (hopefully in the next couple weeks), but I wanted to get this out at least as a warning to folks using the traditional MVN approach for data like the test-retest reliability scenario that inspired this exploration.
