PSA: Marginally-Student-t multivariate (df-varying-by-dimension)

I’ve long wished for a way to model multivariate data that flexibly accommodates heavy-tailedness to a different degree in each dimension, in contrast to the multivariate student-t, whose df parameter (nu in the Stan parameterization) is common to all dimensions. The solution was given to me long ago, but I only recently came to understand it and validate a full implementation, so I thought I’d share it more widely. Technically speaking it isn’t a “multivariate student-t” anymore, but rather a multivariate distribution whose marginals are student-t, hence the thread title. And as shown below, the extra flexibility comes at the computational cost of one auxiliary parameter per observation per dimension of the multivariate.
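
For intuition, here’s a minimal R sketch of the generative recipe the models below implement (variable names and values are mine, for illustration only): draw a correlated normal, then rescale each dimension by its own sqrt(nu/chi_square(nu)) factor, so that each marginal is student-t with its own df.

# minimal R sketch of the construction (illustrative, not the Stan code below)
set.seed(1)
K <- 3
nu <- c(3, 10, 25)                     # per-dimension degrees of freedom
mu <- c(0, 1, -1)
sigma <- c(1, 2, 0.5)
R <- matrix(0.3, K, K); diag(R) <- 1   # an assumed correlation matrix
L <- t(chol(R))                        # lower-triangular Cholesky factor
z <- rnorm(K)                          # standard-normal draw
chi <- rchisq(K, df = nu)              # one chi-square draw per dimension
y <- mu + sqrt(nu / chi) * as.vector(sigma * (L %*% z))
# each y[k] is marginally student_t(nu[k], mu[k], sigma[k])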

Here’s the hierarchical case:

data{
	int N ; // number of multivariate samples
	int K ; // number of dimensions of the multivariate
	int M ; // number of total observations
	vector[M] Y ; // observation vector
	array[M] int<lower=1,upper=N> N_for_Y ; // index of which level of N is associated with each observation in Y
	array[M] int<lower=1,upper=K> K_for_Y ; // index of which level of K is associated with each observation in Y
}
transformed data{
	// derive an index in to_vector(P) for each observation in Y
	array[M] int Y_index_in_flattened_P ;
	for(i in 1:M){
		Y_index_in_flattened_P[i] = (
			(N_for_Y[i]-1)
			* K
			+ K_for_Y[i]
		) ;
	}
}
parameters{
	// standard mvn parameters
	vector[K] mu ;
	vector<lower=0>[K] sigma ;
	cholesky_factor_corr[K] cholcor ;
	matrix[K,N] P_stdnorms ;
	// parameters for marginal-student-t
	vector<lower=0,upper=1>[K] tau ; // nu = 2+28*tau
	matrix<lower=0>[K,N] nu_chi ;
	// observation-level noise
	real<lower=0> noise ;
}
transformed parameters{
	vector[K] nu = 2+28*tau ; 
	matrix[K,N] P = (
		// standard mvn transforms for scale & correlations
		(
			diag_pre_multiply(sigma,cholcor)
			* P_stdnorms
		)
		// transform for marginal-student-t
		.* sqrt(
			rep_matrix(nu,N)
			./ nu_chi
		)
		// standard mvn shift for the mean
		+ rep_matrix(mu,N)
	);
}
model{
	// priors for standard multivariate parameters
	mu ~ std_normal() ;
	sigma ~ weibull(2,1) ;
	to_vector(P_stdnorms) ~ std_normal() ;
	cholcor ~ lkj_corr_cholesky(1) ;
	// priors for marginal-student-t parameters
	tau ~ beta(2,2) ;
	for(i_K in 1:K){
		nu_chi[i_K] ~ chi_square(nu[i_K]) ;
	}
	// prior for observation-level noise
	noise ~ weibull(2,1) ;
	// likelihood
	Y ~ normal(
		to_vector(P)[Y_index_in_flattened_P]
		, noise
	) ;
}

And the non-hierarchical case:

data{
	int N ; // number of multivariate samples
	int K ; // number of dimensions of the multivariate
	matrix[K,N] Y ; // observations
}
transformed data{
	vector[K] K_zeroes = rep_vector(0,K) ;
}
parameters{
	// standard mvn parameters
	vector[K] mu ;
	vector<lower=0>[K] sigma ;
	corr_matrix[K] cor ;
	// parameters for marginal-student-t
	vector<lower=0,upper=1>[K] tau ; // nu = 2+28*tau
	matrix<lower=0>[K,N] nu_chi ;
}
transformed parameters{
	vector[K] nu = 2+28*tau ; 
}
model{
	// priors for standard multivariate parameters
	mu ~ std_normal() ;
	sigma ~ weibull(2,1) ;
	cor ~ lkj_corr(1) ;
	// priors for marginal-student-t parameters
	tau ~ beta(2,2) ;
	for(i_K in 1:K){
		nu_chi[i_K] ~ chi_square(nu[i_K]) ;
	}
	// likelihood
	{
		// subtract the mean and divide by the student-t helper, so each column
		// is ~ multi_normal(0, quad_form_diag(cor,sigma))
		matrix[K,N] Y_std = (
			(Y - rep_matrix(mu,N))
			./ sqrt(
				rep_matrix(nu,N)
				./ nu_chi
			)
		) ;
		matrix[K,K] Sigma = quad_form_diag(cor,sigma) ;
		for(i_N in 1:N){
			Y_std[,i_N] ~ multi_normal(K_zeroes, Sigma) ;
		}
	}
}

Notes:

  • The non-hierarchical case involves a likelihood expression that includes a transform of the data involving parameters, but I’m pretty sure neither the subtraction nor the division requires a subsequent Jacobian correction.
  • Given that the student-t begins to approximate the normal by nu=30, and that nu<2 has some pathological properties (the expectation for the marginal variance involves division by nu-2), I thought a generic concave prior from 2 to 30, achieved by tau~beta(2,2); nu=2+28*tau, made sense (see the sketch just below).
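
For reference, a quick R sketch of the prior on nu implied by tau ~ beta(2,2) with nu = 2+28*tau (a scaled beta on (2,30)):

tau <- rbeta(1e5, 2, 2)
nu <- 2 + 28 * tau
quantile(nu, c(.025, .5, .975))   # symmetric around nu = 16, hard-bounded in (2, 30)
hist(nu, breaks = 50)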

How have you found this working out in practice? I tried something similar recently to fit a GARCH model to returns decomposed by P-PCA.


Cool

As large nu values correspond to distributions that are quite similar to each other, a decreasing prior would make more sense. See an excellent explanation in Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors. The Stan Prior Choice wiki provides an easy approximation to the PC prior.
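
For intuition on why large nu values are hard to tell apart, a quick R check (illustrative, not from the paper) of tail quantiles at nu = 30 versus the standard normal:

qt(c(.975, .995), df = 30)   # ~2.04, ~2.75
qnorm(c(.975, .995))         # ~1.96, ~2.58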


I think this is a perfectly reasonable definition for a multivariate student-t; there is no unique construction for such an object. For example, here is a screenshot from the table of contents of Multivariate t-distributions and their applications:

Any of these could be considered reasonable. There might be some similarities between your construction and the one introduced in example 4.1 of https://www.sciencedirect.com/science/article/pii/S0047259X01920172.


If I understand it correctly, it is individual Student-t distributions tied together with a Gaussian copula?


It’s all semantics. The multivariate t that people typically think about is defined as having a single degrees-of-freedom parameter.

I agree on the similarity to the Fang construction. Their method links the t marginals with a multivariate-t copula. I’m curious whether this construction is equivalent to having student-t marginals with different df and a Gaussian copula (I see @StaffanBetner just beat me to this!).
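
For concreteness, a minimal R sketch of the construction being asked about (Gaussian copula with per-dimension student-t marginals; names and values are illustrative):

K <- 3
nu <- c(3, 10, 25)
R <- matrix(0.3, K, K); diag(R) <- 1      # assumed correlation matrix
z <- as.vector(t(chol(R)) %*% rnorm(K))   # one MVN(0, R) draw
u <- pnorm(z)                             # Gaussian copula: uniform margins, MVN dependence
y <- qt(u, df = nu)                       # per-dimension student-t marginals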


I’m not savvy enough with copulas to venture an opinion here, but would be keen to hear from experts whether this is the case, as it would help me understand the copula concept better!

Ah, yes, I guess the hard bound at 30 in my parameterization isn’t consistent with the recommendation to use soft rather than hard bounds. I do like that the prior peaks away from the lowest values, but I admit that preference hasn’t really been motivated by deep experience playing with things yet.

Edit: and now I see that the recommended nu ~ gamma(2, 0.1) does peak away from zero, as I vaguely thought might be important.
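
A quick R check of that recommended prior (the mode of gamma(shape = 2, rate = 0.1) is (2-1)/0.1 = 10):

nu <- rgamma(1e5, shape = 2, rate = 0.1)
quantile(nu, c(.025, .5, .975))   # roughly 2.4, 17, 56
hist(nu, breaks = 100)            # peaks near 10, long right tail toward large nu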

Can’t really speak much to the impact on predictive performance with real-world data yet; so far the extent of my validation has been to confirm that the marginal samples are indistinguishable from samples drawn from an explicit student-t distribution. Code for that here, starting with Stan then R:

data{
	int N ;
	int K ;
	vector<lower=2>[K] nu ;
}
parameters{
	real dummy ;
}
model{
	dummy ~ std_normal() ;
}
generated quantities{
	vector[K] mu ;
	vector<lower=0>[K] sigma ;
	cholesky_factor_corr[K] cholcor ;
	matrix[K,N] z ;
	matrix<lower=0>[K,N] u ;

	cholcor = lkj_corr_cholesky_rng(K,1) ;
	for(iK in 1:K){
		mu[iK] = std_normal_rng();
		sigma[iK] = weibull_rng(2,1);
		for(iN in 1:N){
			z[iK,iN] = std_normal_rng();
			u[iK,iN] = chi_square_rng(nu[iK]) ;
		}
	}

	matrix[K,N] Y_sampled ;
	matrix[K,N] Y_rng_marginal1 ;
	matrix[K,N] Y_rng_marginal2 ;
	{
		matrix[K,K] L = diag_pre_multiply( sigma, cholcor ) ;
		matrix[K,N] nu_mat = rep_matrix(nu,N) ;
		matrix[K,N] q = sqrt( nu_mat ./ u	) ;
		matrix[K,N] Lz = L * z ;
		Y_sampled = q .* Lz + rep_matrix(mu,N) ;
		for(iN in 1:N){
			for(iK in 1:K){
				Y_rng_marginal1[iK,iN] = student_t_rng(nu[iK],mu[iK],sigma[iK]) ;
				Y_rng_marginal2[iK,iN] = student_t_rng(nu[iK],mu[iK],sigma[iK]) ;
			}
		}
	}
}
library(tidyverse) # for lst(), the pipe, and the pivot/group verbs used below
mod = cmdstanr::cmdstan_model('stan/multi_student_t_rng.stan')
fit = mod$sample(
	data = lst(N=1000,K=5,nu=c(2,4,8,16,32))
	, adapt_engaged = F 
	, iter_warmup = 0 
	, iter_sampling = 1e3
	, refresh = 0
	, chains = 1
)
(
	fit$draws(c('Y_sampled','Y_rng_marginal1','Y_rng_marginal2'),format = 'df')
	%>% as_tibble()
	%>% select(-.chain,-.iteration)
	%>% pivot_longer(-.draw)
	%>% separate(
		name
		, sep = '[\\[,]'
		, into = c('name','iK')
		, extra = 'drop'
	)
	%>% group_by(.draw,iK,name)
	%>% mutate(
		value = sort(value)
		, rank = 1:n()
	)
	%>% pivot_wider()
) ->
	draws_sorted

#compare the sampled & rngs wrt rank-wise mse
(
	draws_sorted
	%>% group_by(iK,.draw)
	%>% summarise(
		sampled_mse_at_rank = mean((Y_sampled-Y_rng_marginal1)^2)
		, rng_mse_at_rank = mean((Y_rng_marginal2-Y_rng_marginal1)^2)
		, mse_diff = sampled_mse_at_rank - rng_mse_at_rank
		, .groups = 'drop_last'
	)
	%>% summarise(
		as_tibble(t(quantile(mse_diff,c(.025,.25,.5,.75,.975))))
	)
)