"Pairwise" alternative to multivariate normal isn't behaving for the hierarchical case; help?

So I posted a while back on my dissatisfaction with the increasing constraint to zero that the multivariate normal imposes in high-dimensional scenarios, and I also posted an alternative that I thought was somewhat promising, if more in the “sufficient stats” spirit than generative. It seemed to work well in the non-hierarchical case I presented then, but just this weekend I started working on applying it to the hierarchical case and discovered that it’s even worse in it’s bias of correlations to zero! As a sanity check (that I should have started with) I coded up a “sample from the prior” model only to discover that I’ve clearly done something fundamentally wrong that I’m hoping someone can point out.

To back up a bit, the standard multivariate description of the data would be that we have a latent set of n samples from a k dimensional multivariate normal, and we observe each of the n \times k latent values some number of times with some degree of measurement noise, yielding the observation level data. If we want to check that stan was producing the priors we intended for just the latent parts of this model, we’d run simply:

#include helper_functions.stan // for flatten_lower_tri()
data{
int<lower=1> n ;
int<lower=1> k ;
}
transformed data{
int num_r = (k*(k-1)) %/% 2 ; // number of correlations implied by k
}
parameters{
row_vector[k] v[n] ; // latent values
row_vector[k] v_m ; // mean for each k
vector<lower=0>[k] v_s ; // sd for each k
cholesky_factor_corr[k] r_ ; // correlations among k's (on Cholesky-factor scale)
}
model{
// priors
v_m ~ std_normal() ;
v_s ~ weibull(2,1) ;
r_ ~ lkj_corr_cholesky(1) ;
// "centered" parameterization of v
v ~ multi_normal_cholesky(
v_m
, diag_pre_multiply(v_s,r_)
) ;
}
generated quantities{
// r: the upper-tri of the correlation matrix, flattened to a vector for efficient storage
vector[num_r] r = flatten_lower_tri( multiply_lower_tri_self_transpose(r_) ) ;
}


Which we could sample then inspect that we get out the expected distributions for the parameters given their priors.

Ok, now the pairwise model seeks to avoid using the multi_normal structure to inform on the correlations but instead iterates over the pairs of k, computes the “empirical” correlation and uses the expected sampling distribution for a correlation to increment the target, which I’ve implemented in a “check the priors” model as:

#include helper_functions.stan // for flatten_lower_tri()
data{
int<lower=1> n ;
int<lower=1> k ;
}
transformed data{
int num_r = (k*(k-1)) %/% 2 ; // number of correlations implied by k
}
parameters{
matrix[k,n] v_ ; // helper for non-centered latent values
vector[k] v_m ; // mean for each k
vector<lower=0>[k] v_s ; // sd for each k
vector<lower=-1,upper=1>[num_r] r ; // correlations among k's
}
transformed parameters{
matrix[k,n] v ; // latent values
for(i_k in 1:k){
v[i_k] = v_m[i_k] + v_[i_k] * v_s[i_k] ;
}
}
model{
// Priors
v_m ~ std_normal() ;
v_s ~ weibull(2,1) ;
r ~ uniform(-1,1) ;
// non-centered parameterization for v
to_vector(v_) ~ std_normal() ; // implies v[i_k] ~ normal(v_m[i_k],v_s[i_k])
// Now for the correlations
// First transform each row in v by the *empirical* mean & sd
matrix[k,n] v_transformed ;
for(i_k in 1:k){
v_transformed[i_k] = (v[i_k]-mean(v[i_k]))/sd(v[i_k]) ;
}
// Next iterate over pairs computing the *empirical* correlation
vector[num_r] empirical_r ;
int r_index = 0 ;
for(i_k in 1:(k-1)){
for(j_k in (i_k+1):k){
r_index = r_index + 1 ;
empirical_r[r_index] = (
dot_product(
v_transformed[i_k]
,v_transformed[j_k]
)/(n-1)
) ;
}
}
// Finally, increment target using the sampling distribution of
// an emprical correlation:
atanh(empirical_r) ~ normal( atanh(r), 1/sqrt(n-3) ) ;
}


However, while sample runs pass the standard diagnostics, the resulting distribution of the correlations are far from the expected uniform:

So I’m obviously doing something wrong.

My first thought is that these correlations seem to be consistent with the sampler exploring values for v that aren’t informed at all by the uniform prior for the correlation, instead sampling v as if it had an identity correlation matrix, yet the samples for r get informed by the values in v and thereby cluster to around zero. That is, while r and v should mutually inform, there seems to be a strong direction of information from v to r but weak or completely absent direction of information from r to v. But even if this is an accurate insight as to what’s going on, I have no idea how to fix it. Anyone have any ideas?

R code for those that want to explore:

library(tidyverse)

mod = cmdstanr::cmdstan_model(
'stan/hierarchical_pairwise_prior_check.stan'
, include = 'stan'
)
iter_warmup = 2e3 # for good measure
iter_sampling = 1e3
parallel_chains = parallel::detectCores()/2
fit = mod$sample( data = list(n=100,k=4) , chains = parallel_chains*2 # for good measure , parallel_chains = parallel_chains , iter_warmup = iter_warmup , iter_sampling = iter_sampling , seed = 1 , refresh = (iter_warmup+iter_sampling)/10 , init = .1 ) # check diagnostics fit$cmdstan_diagnose()

# viz the correlations
(
fit\$draws(
variables = 'r'
, format = 'data.frame'
)
%>% pivot_longer(
cols = c(-.chain,-.iteration,-.draw)
)
%>% ggplot()
+ facet_grid(name~.)
+ geom_histogram(
mapping = aes(x=value)
)
+ scale_x_continuous(
limits=c(-1,1)
, expand = c(0,0)
, breaks = seq(-1,1,by=.2)
)
)

1 Like

I’ve made a smidge of progress on this. At least, I backed off my attempts at optimization a bit and instead of attempting to use sufficient stats for the correlation structure, I use multi-normal, but (in contrast to the standard approach) do so in a pairwise fashion. This seems to be working decently, but I’ve only worked out the “center parameterization” case, and now I’m a bit stuck on the non-centered version, which is pretty critical given that’s the version we generally expect to need for most data (i.e. as I understand, non-centered is better when there isn’t much data).

I’ll include the full-inference (c.f. priors-only above) center-parameterized model below, which also includes my reduced redundant computations trick for hierarhchical models, but hopefully anyone coming here to help will see that the core bit that needs tweaking is the ~ multi_normal() part in the Mid-Hierarchy Structure section:

created this center-parameterized pairwise-multinormal model which seems to behave well:

data{

// o: number of trials
int<lower=1> o ;

// obs: obs on each trial
vector[o] obs ;

// n: number of subj
int<lower=1> n ;

// kw: num of within coefficients
int<lower=1> kw ;

// w: unique entries in the within predictor matrix
matrix[kw,kw] w ;

// obs_index: index of each obs in flattened subject-by-condition value matrix
int obs_index[o] ;

// is_intercept: binary indicator of columns that reflect intercept parameters
int<lower=0,upper=1> is_intercept[kw] ;

}
transformed data{

// num_r: number of correlations implied by kw
int num_r = (kw*(kw-1)) %/% 2 ;

// obs_mean: mean of obs vector
real obs_mean = mean(obs) ;

// obs_sd: sd of obs vector
real obs_sd = sd(obs) ;

// obs_: observations scaled to have zero mean and unit variance
vector[o] obs_ = (obs-obs_mean)/obs_sd ;

//w_n: an array where each element is a row from w repeated for n rows
matrix[n,kw] w_n[kw] ;
for(i_kw in 1:kw){
w_n[i_kw] = rep_matrix(w[i_kw],n) ;
}

}
parameters{

//for parameters below, trailing underscore denotes that they need to be un-scaled in generated quantities

// noise_: observation-level measurement noise
real<lower=0> noise_ ;

// z_m_: mean (across subj) for each coefficient
row_vector[kw] z_m_ ;

// z_s_: sd (across subj) for each coefficient
vector<lower=0>[kw] z_s_ ;

// z_: latent values
matrix[n,kw] z_ ;

// r: population-level correlations among within-subject predictors
vector<lower=-1,upper=1>[num_r] r ;

}
model{

// prior structure ----

// low-near-zero prior on measurement noise
noise_ ~ weibull(2,1) ; // weibull(2,1) is peaked around .8

// normal(0,1) priors on all means
z_m_ ~ std_normal() ;

// normal(0,1) priors on all sds
z_s_ ~ weibull(2,1) ;

// uniform prior on all correlations
r ~ uniform(-1,1) ;

// mid-hierarchy structure ----

// looping over all pairs of columns in z_
int k = 0 ;
for(i_kw in 1:(kw-1)){
for(j_kw in (i_kw+1):kw){
k += 1 ;
array[n] row_vector[2] pairz_ ;
for(i_n in 1:n){
pairz_[i_n] = [ z_[i_n,i_kw] , z_[i_n,j_kw] ] ;
}
pairz_ ~ multi_normal(
[ z_m_[i_kw] , z_m_[j_kw] ]
to_matrix([1,r[k],r[k],1],2,2)
, [ z_s_[i_kw] , z_s_[j_kw] ]
)
) ;
}
}

// Loop to compute the loc for each subject in each condition implied by the design matrix
matrix[n,kw] m ;
for(i_kw in 1:kw){
m[,i_kw] = rows_dot_product( z_ , w_n[i_kw]	) ;
}

// observation-level structure ----

obs_ ~ normal( to_vector(m)[obs_index] , noise_ ) ;

}
generated quantities{

// noise: *unscaled* measurement noise
real noise = noise_ * obs_sd ;

// z_s: *unscaled* sd (across subj) for each coefficient
vector[kw] z_s = z_s_ * obs_sd ;

// z_m: *unscaled* mean (across subj) for each coefficient
row_vector[kw] z_m = z_m_ * obs_sd ;

// z: *unscaled* coefficients for each subject
matrix[n,kw] z = z_ * obs_sd ;

// add the mean to any intercept coefficients:
for(i_kw in 1:kw){
if(is_intercept[i_kw]==1){
z_m[i_kw] = z_m[i_kw] + obs_mean ;
z[,i_kw] = z[,i_kw] + obs_mean ;
}
}

}


I haven’t really gotten a handle on the code, but is the pairwise formulation in the first post above not missing a Jacobian adjustment on the atanh?

Edit: The empirical correlation no longer transformed data, but rather a parameter, right? And actually, the transformation isn’t just the atanh, it’s the dot product too, which seems like it’ll pose problems…

Correct on both accounts.

Also, I’ve just seen OpenCL symmetric eigensolver · Issue #2498 · stan-dev/math · GitHub.

If the Stan gets a fast enough eigendecomposition, I wonder how it would behave in practice to implement a prior on the covariance matrix by putting some arbitrary (symmetric) elementwise prior on the elements, and then doing something like this linear algebra - How to find the nearest/a near positive definite from a given matrix? - Computational Science Stack Exchange

To be clear, I don’t really understand in what sense the resulting covariance matrix would be “close” to the original, nor to what extent doing this would ameliorate the issues that originally set you on this path, nor whether there’s any hope of the eigendecomposition being fast enough to be practical.

You can already do this in Stan

2 Likes

@spinkney this is really cool! But I don’t think it accomplishes quite the same thing. I think that your formulation will just cause the model to see the parts of the prior that where valid correlation matrices are least common as being really unlikely a priori, right? Like the prior predictive distribution for r_rawisn’t (even close to) what you’d simulate from beta_rng(exponential_rng(1), exponetial_rng(1), right? Edit to add: (more relevantly, the predictive distribution for r_raw also isn’t similar to the truncated part of what you’d simulate above that sits inside the bounds of the “samosa” that contains all of the permissible correlation matrices).

My goal above was to put a prior directly on the covariances, and then find the closest valid covariance matrix that attempts to respect my actual choice of prior as nearly as possible. That is, I want to upweight valid covariance matrices that are near the regions of my prior where valid covariance matrices are rare or impossible.

To be clear, even if this can be implemented technically, I have no confidence that it would yield useful inference in practice. But it seems like one way to escape the fact that the LKJ prior sees large correlations as being really unlikely.

Not sure if you meant “this” to refer to my goals, or something related to @jsocolar’s ideas; certainly I don’t think it’s pertinent to my goals since the whole point it to get away from the PSD constraint, which so far as I understand is the source of the bias-to-zero-correlations behaviour of the multivariate normal in high dimensions.

1 Like

So what @jsocolar says is that he wants a nearest PSD but you’re saying you don’t.

From @jsocolar’s post, I understood his suggestion as finding the nearest PSD. Say you have a sample pairwise covariance matrix as a prior and you want the nearest valid PSD covariance matrix using Stan. You can flatten the sample covariance matrix, declare a covariance matrix, and add priors to each element of this covariance matrix with the values from the sample pairwise covariance matrix using a normal soft constraint. The covariance matrix will be PSD by construction and then sample pairwise priors will attempt to get your valid PSD covariance close to those elements. Not sure if this works well or at all for really high dimensions.

@mike-lawrence, if you entertain this idea of putting strong prior info using a nearest PSD and the sample pairwise correlations, does this help?

Suppose I put uniform priors on the elements of a correlation matrix. Consider what happens in a region of parameter space that is within the “samosa-shaped” hull that bounds valid correlation matrices, but is in neighborhood where valid matrices are especially sparse.

Unless I’m missing something big (always possible!) @spinkney’s method will see this neighborhood as quite unlikely a priori, because the constraints associated with the correlation matrix “flow up” to influence the prior. My method (which again, might be crazy or not-useful) is an attempt to break this “flowing up”. The idea is to sample uniformly, ensuring that we see this neighborhood as likely as any other, and even if we end up in a region where valid correlation matrices are rare, we find one nearby (or as nearby as possible) and use it.

@mike-lawrence the idea here (which may not work, and even if it does may fail to be what you want/need) is to respect PSD and do generative modeling, but with a prior that strongly upweights the neighborhoods that LKJ regularizes strongly against.

1 Like

Just FYI, I’m still working on cleaning up my reprex, which I’ll post shortly so you guys can generate the kind of data I’m talking here, but in the interim here’s the inference performance of a standard multivariate normal amid most true correlations=0 and some =.9:

So you can see the ones that are .9 are biased low. And here’s the performance of my pairwise approach:

So, some greater uncertainty on the ones with true values of zero, but no obvious bias for the ones that have true values of .9.

(the above being from the center-parameterized models with concomittant high-information data, so both sample slowly but pass all diagnostics)

Reprex posted here.

2 Likes

Oh, though it might be more pleasant an experience if you try out the aria branch, which uses my workflow-demo package, which is in a “works on my system” state at the moment.

Just want to show a little bit of tinkering. I might have time next week to test this a bit further including on @mike-lawrence’s example, but not for a few days.

Let’s take a look at the prior predictive distributions for a few ten-dimensional correlation matrices.

Here’s one with no prior declaration:

data{
int<lower=1> N;
}

parameters {
corr_matrix[N] V;
}


Here’s the LKJ prior:

data{
int<lower=1> N;
}

parameters {
corr_matrix[N] V;
}

model {
V ~ lkj_corr(1);
}


Here are explicit uniform priors with an intermediate level of soft-constrained normal sampling. I think this is related to @spinkney’s approach.

functions {
vector lower_elements(matrix M){
int n = rows(M);
int tri_size = n*(n-1)/2;
int counter = 1;
vector[tri_size] out;
for (i in 2:n){
for (j in 1:(i - 1)) {
out[counter] = M[i, j];
counter += 1;
}
}
return out;
}
}

data{
int<lower=2> N;
int<lower=1> tri_size;
}

parameters {
corr_matrix[N] V;
vector<lower = -1, upper = 1>[tri_size] lower_elements_desired; // implied uniform prior
}

model {
lower_elements(V) ~ normal(lower_elements_desired, .01);
}


So far, nothing sees high correlations as likely at all; i.e. nothing ameliorates @mike-lawrence’s concern.

Now I’ll fully embrace the non-generative approach (that’s already being used above with all priors except the LKJ). In particular, I’ll just go in and directly upweight large correlations. Nothing that follows is well tested or remotely recommended! In particular, I don’t understand the first thing about what the implied joint prior over the correlations is, nor (in the final example) do I understand what is being implied about the joint prior over the correlations and the variances. But if this seems promising, maybe it would be worth the legwork to actually characterize the joint prior.

Ok, so with that said, one idea is to just straight-up upweight correlation matrices with high absolute values:

functions {
vector lower_elements(matrix M){
int n = rows(M);
int tri_size = n*(n-1)/2;
int counter = 1;
vector[tri_size] out;
for (i in 2:n){
for (j in 1:(i - 1)) {
out[counter] = M[i, j];
counter += 1;
}
}
return out;
}
}

data{
int<lower=1> N;
real z;
}

parameters {
corr_matrix[N] V;
}

model {
lower_elements(V) ~ uniform(-1,1);
target += dot_self(lower_elements(V)')^z;
}


Interestingly, with z=1 we don’t get enough upweighting:

But with z=1.23 (this number is probably specific to the 10-dimensional case) we appear to be in business:

So maybe there’s something useful there.

Here’s one final different approach:
The idea is to sample the “desired” correlations from the prior, then find the nearest valid correlation matrix, without letting the density of correlation matrices in the neighborhood propagate back up to influence the sample of “desired” correlations. Problem is, there’s no simple algorithm to find the nearest correlation matrix to a given choice of correlations. There is, however, a simple algorithm to find the nearest positive semi-definite matrix, so we can easily get a “close” covariance matrix for any desired set of variances and covariances. Then we can extract the implied correlation matrix. Here’s an example of that approach in action, but again I don’t necessarily recommend the approach in practice because it induces extra prior dependencies between the correlations and the variances that I don’t understand or know how to effectively characterize.

functions {
matrix pseudo_cov_mat_from_lower(vector lower_tri, vector diag, int n){
int tri_size = rows(lower_tri);
matrix[n, n] out;

int counter = 1;
for (i in 2:n){
for (j in 1:(i - 1)) {
out[i, j] = lower_tri[counter];
out[j, i] = lower_tri[counter];
counter += 1;
}
}
for (i in 1:n){
out[i,i] = diag[i];
}
return(out);
}
}

data{
int<lower=2> N;
int<lower=1> tri_size;
}

parameters {
vector[tri_size] desired_s;
vector<lower=0>[N] diag;
}

transformed parameters {
matrix[N, N] pcm = pseudo_cov_mat_from_lower(desired_s, diag, N);
vector[N] delta = eigenvalues_sym(pcm);
vector[N] delta_new;
for (i in 1:N) {
if(delta[i] > 0){
delta_new[i] = delta[i];
} else {
delta_new[i] = 0.01;
}
}
matrix[N, N] diag_delta_new = diag_matrix(delta_new);
matrix[N, N] Q = eigenvectors_sym(pcm);
cov_matrix[N] S = Q*diag_delta_new*(Q');

vector[N] Var = diagonal(S);
matrix[N, N] d_inv = diag_matrix(Var.^-.5);
corr_matrix[N] V = d_inv*S*d_inv;
}

model{
desired_s ~ normal(0,5);
diag ~ std_normal();
}


Which results in marginal priors on off-diagonal elements of the correlation matrix like these:

Again, I can’t emphasize enough that I don’t understand the implied joint prior in the slightest and this could introduce all kinds of issues that would need to be ruled out before using! OTOH, it seems like the LKJ prior also introduces issues that are relatively poorly understood by the average user.

3 Likes