Reduced rank regression


#1

Hello Stan Community,

I have been attempting to fit a reduced-rank regression model with Stan. Briefly, reduced-rank regression takes the multivariate regression Y=XB+e with p predictors (in X) and q outcomes (in Y), and decomposes the p-by-q coefficient matrix B as B = UV', with U being p-by-k and V being q-by-k, where k < min(p, q).

This general form introduces a couple of complications, in that different U and V can result in identical likelihoods. More specifically, flipping the sign on both U and V and permuting the columns of U and V produce equivalent models.

I have managed to get a model where k=1 to work to my satisfaction [it recovers parameters of simulated data and produces interpretable results on real data] by using a transformed parameter that forces the signed sum of squares of V to be positive. It would be simpler to constrain just one entry of V to be positive, but I have no good reason to expect a particular value of V to be positive, and with the data I have so far, none of the posterior distributions of the entries in V are far enough from zero to not fold some non-trivial part of the tail over the zero line.

functions {
    int sign(real x) {
      if (x == 0)
        return 0;
      else if (x < 0)
        return -1;
      else
        return 1;
    }
}
data {
    int N; //number of rows
    int P; //number of predictors
    int C; //number of outcomes
    real<lower=0> sig; //sigma of prior t
    real<lower=1> nuf; //nu of priors on coefficients
    matrix[N, C] y; //outcome
    matrix[N, P] X; //design matrix
}
parameters {
    real<lower=1> nu; //nu of noise student t distribution
    vector[P] urU;
    vector[C] urV;
    vector<lower=0>[C] sigma;
}
transformed parameters {
    vector[P] U;
    vector[C] V;
    { //scope hider
        //point V in a positive direction
        real s;
        real sgn;
        s = 0;
        for (i in 1:C) {
          s += sign(urV[i])*urV[i]^2;
        }
        sgn = sign(s);
        for (i in 1:C)
          V[i] = urV[i] * sgn;
        for (j in 1:P)
          U[j] = urU[j] * sgn;
    } // end scope hider
}
model {
    nu ~ gamma(2, 0.1);
    sigma ~ cauchy(0, 1);
    urU ~ student_t(nuf, 0, sig);
    urV ~ student_t(nuf, 0, sig);
    {
        matrix[N,C] mu;
        mu = X*U*V';
        for (j in 1:C)
            y[:,j] ~ student_t(nu, mu[:,j], sigma[j]);
    }
}

Anyway, things really fall apart when k>1. I suspect it is because of the column-ordering issue, but there may be some additional problem that I am missing. I have tried introducing a k by k diagonal matrix S (i.e. Y=XUSV') with the diagonal entries constrained to be positive and ordered, but that hasn’t really helped. A while ago I looked at orthogonalizing the columns of U with the Gram-Schmidt process, but I have learned a lot about Stan since then, so I should perhaps revisit that idea. Here is the Stan model for k>1 (that doesn’t really work and doesn’t use the ordered S). Apologies for changing variable names, here Y=XBA'+e, and B and A have qstar columns.

functions {
  int sign(real x) {
    if (x == 0)
      return 0;
    else if (x < 0)
      return -1;
    else
      return 1;
  }
}
data {
  int N; //number of rows
  int P; //number of predictors
  int<lower=2> C; //number of outcomes
  int<upper=min(C,P)> qstar; //ceiling on rank
  matrix[N, C] y; //outcome
  matrix[N, P] X; //design matrix
}
parameters {
  matrix[P, qstar] B0;
  matrix[C, qstar] A0;
  vector<lower=0>[C] sigma;
  matrix<lower=0>[P, qstar] lam;
  positive_ordered[qstar] tau;
  real<lower=0> nu;
}
transformed parameters {
  matrix[P, qstar] B;
  matrix[C, qstar] A;
  for (j in 1:P) {
    for (h in 1:qstar) {
      B[j,h] = B0[j,h]*lam[j,h]*tau[h];
    }
  }
  A = A0;
  //point A in a generally positive direction
  for (c in 1:qstar) {
      real ss = 0;
      for (r in 1:C)
          ss += sign(A[r, c])*A[r, c]*A[r, c];
      A[:,c] = A[:,c]*sign(ss);
      B[:,c] = B[:,c]*sign(ss);
  }
}
model {
  matrix[N, C] mu;
  nu ~ gamma(2, 0.1);
  sigma ~ cauchy(0, 1); // different from reference paper
  to_vector(A0) ~ normal(0, 1);
  to_vector(B0) ~ normal(0, 1);
  to_vector(lam) ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);
  mu = X*B*A';
  for (i in 1:C)
    y[:,i] ~ student_t(nu, mu[:,i], sigma[i]);
}

As a rank newbie to Stan (and Bayesian modelling in general), I would appreciate any help figuring out how to get something like rank-reduced regression working. I know there are a lot of similar approaches with different names, but in the papers I have found describing related methods, I get lost before I understand how the sign and column permutation problems get solved.

I’d also welcome any general tips or advice on my models. Thanks!


#2

We don’t usually recommend this other than in doing something like a QR decomposition of the predictors.

If you do want to do this, then I would worry about correcting for sign after the fact. You probably only care about preiction anyway with a technique like this.

You can simplify:

real s = sign(urV) .* square(urV);
vector[C] V = urV * sgn;
vector[C] U = urU * sgn;

If what you want to do is a kind of variable selection, then I’d recommend the recent work of Piironen and Vehtari on horsehoe priors.


#3

Reduced rank regression is not something you see Bayesians do very often, but in order to get around problems you mention you can use a rectangular Cholesky factor for \mathbf{U}, as in

data {
  cholesky_factor_cov[p, k] U;
  ...

The columns cannot be swapped without destroying the lower-trapezoidal structure and the signs cannot be flipped without making the diagonal elements negative.


#4

Thanks for the thoughtful replies!

So it sounds like my approach is pretty unusual, but I suspect that my problem is not that unusual, so maybe there is a more conventional approach to the problem:

I have a dataset from an experiment with about 10 outcome measures per case and a number of possible predictors/covariates that approaches the number of cases. The expectation is that there will be correlations among the predictors and among the outcomes, and I expect that some groups of the correlated predictors will be related to some groups of correlated outcomes. My goal is to identify relationships between predictors and outcomes so I might use those relationships in future studies and practical applications.

For example, I want to be able to say that if you care about outcome measure 4, then you should find people who score high on these three predictors and low on these other two, and by the way, you can then expect for outcomes 5-7 to be relatively high, but 8 will probably be low.

So the question is whether there exists a conventional approach to solve problems like this in a Bayesian framework.

Their papers, along with the related case study of Michael Betancourt, Bayes Sparse Regression have been helpful. The examples (unless I’m missing something) seem focused on univariate models, and it isn’t clear to me how to extend them to multivariate cases. If I assume there is one pattern or group of predictors that accounts for one pattern or group of outcomes, then the results I get make sense, but I suspect I might be missing possible explanations for the other outcomes that don’t fall in that pattern.

If I understand this suggestion, I would be constraining the model such that the ordering of the columns of X now becomes an important decision. The second and subsequent latent factors in X would be prohibited from having any contribution from column 1, because of the zeros in the upper triangular part of U. I might be able to come up with a reason to justify some particular column order for X. I’ll certainly give it a try.

It might be the case that there is no perfect solution, here. Thanks again for the suggestions so far (and the vectorization tip).


#5

@avehtari?


#6

For rows 2 … k, there are exclusion restrictions in the previous columns, so yes the order is relevant. But I don’t get the sense that people who are doing reduced rank regression are particularly interested in interpreting particular coefficients. The elements of V\prime can adjust to still fit the first k variables well.


#7

But it seems that’s not the case?

Our projection predictive variable selection code does not currently take into account the correlations, although if used with latent variable models it might be usable for making some latent variables to shrink near zero.

Let’s be careful with word multivariate as many would say that many covariates and one target is also multivariate model. We demonstrate only with univariate targets, but it’s easy to extend to multivariate targets, but it doesn’t matter if you are more interested in the reduced dimensional latent linear model than variable selection.

So it seems I can’t help with the reduced rank part, and it seems you don’t need yet the variable selection part.


#8

Reframing your problem, you are essentially trying to infer a factor model of your latent regression weight matrix. I spent this spring trying to get factor models in Stan to work. Due to the non-identifiability UV^T=U(R^{-1}R)V^T, you need to impose a lot of constraints, as you have already started doing. I would say that you can get it to work if all you want is a low-rank decomposition i.e. some arbitrary, but consistent, pair \hat{U} and \hat{V}, but I would say that it is almost impossible to get e.g. PCA vectors with uncertainty quantification unless their effective singular values (your S) are very well-separated.

The gold standard is probably triangular constraints like used in this blog post by Rick Farouni. This article details how you can construct a proper prior for that setting, equivalent to the standard Gaussian choice.

If you want to try something with orthogonal matrices, I can dig around for the forum entry where we discuss a suitable design using Cayley transforms that works in Stan (Gram-Schmidt not necessary). But I did not really find them to be all that helpful personally since there is still the permutation issue.

Ironically, it appears that Stan’s strength makes it ``harder’’ to sample from these models since it can effortlessly jump between symmetry modes :)


#9

Thanks to the helpful suggestions above, I have arrived at something that appears to work on simulated data most of the time [if my simulation pulls from the tails of the cauchy priors, that sometimes louses things up]. It doesn’t seem to be anything novel, but I’m going to post my model and some comments to maybe save others (and future me) some time. If anyone spots any problems or improvements on this method, I would be very happy to hear about it. The model and some code to simulate data are at the end of the post, if you want to skip right to that.

I want interpretable reduced rank regression. Here I explain what I think that means.

To restate the goal, here: I’ve got some predictors/covariates in a matrix X, and some outcomes of interest in matrix Y. I’m interested in patterns of predictors that are related to patterns of outcomes, and I want to interpret the coefficients that describe these patterns (i.e. I’m not exclusively interested in predicting new cases). This is described in the literature as reduced rank regression, and I think partial least squares regression is a special case.

The approach is to fit a model like Y=XBA'+e where B and A are matrices with p and c rows, respectively and k columns, where p is the number of predictors and c is the number of outcomes. To try to get a sparser solution, I used a horseshoe prior for B. In order to get a stable solution to this problem with Stan, we need several constraints on the coefficients A and/or B.

I went with the gold standard. As the blog post indicates, this introduces a complication in that if you constrain B to be lower trapezoidal, then the column order of X matters, since the first column of X only contributes to the first latent factor, and the second column only to the first two, etc. And it turns out that this constraint only successfully forces the model to be unimodal if the coefficients on the diagonal are large relative to the rest of the values in the column. If the model fits about as well whether a particular coefficient is positive or negative (because its true value is close to zero), then constraining that coefficient doesn’t really constrain the model.

It also seems like it is critical to set some reasonable initial values, and that each chain needs to start with approximately the same initial values. At first, I cheated by using the simulated values as the initial values, and that works but is implausible in real data.

The answer to the column ordering and initial value problems are sort of related, in that if you can get a good initial guess, you can re-order X to give yourself a better chance at finding a stable solution.

So I came up with a way to find an approximate solution with about the right shape. Find the least-squares estimate of \beta (where Y=X\beta) and use QR decomposition to decompose \beta' to a unitary square matrix and a triangular matrix, and then take the first k rows of each to get \hat{A} and \hat{B} (transposed). Now, \hat{A} is constrained by the QR decomposition to have L2 norm = 1, which is not part of the constraint in our model. If we use the standard normal prior for A, then the expected L2 norm of A is \sqrt{c*k}, so just multiply \hat{A} by that and divide \hat{B} by the same.

Now I can inspect \hat{B} and rearrange the columns of X so that coefficients with relatively large absolute values would be moved to the diagonals of the rearranged \hat{B}. After this re-arrangement, I can do the above approximation again and confirm the new decomposition has large values on the diagonals. This second round of estimates can then serve as the initial values for the chains in Stan.

Here is the stan code I’m using. I’ve only really tested it with smallish p and c, and k<=3:

data {
  int n;
  int p;
  int c;
  int k;
  real<lower=0> sigma;
  matrix[n, p] X;
  matrix[n, c] Y;
}
transformed data {
  int ntrap = p*k - k*(k-1)/2 - k;
}
parameters {
  vector<lower=0>[k] diags;
  vector<lower=0>[k] diag_sig;
  vector[ntrap] lowtrap;
  vector<lower=0>[ntrap] lowtrap_sig;
  real<lower=0> global_shrink;
  matrix[c, k] A;
}
transformed parameters {
  matrix[p,k] B;
  { //hide integer
  int idx;
  idx=0;
  B = rep_matrix(0, p, k);
  for (col in 1:k) {
    B[col, col] = diags[col]*diag_sig[col]*global_shrink;
    for (r in (col+1):p) {
      idx+=1;
      B[r, col] = lowtrap[idx]*lowtrap_sig[idx]*global_shrink;
    }
  }
  } //close hider
}
model {
  matrix[n, c] mu;
  lowtrap_sig ~ cauchy(0,1);
  diag_sig ~ cauchy(0,1);
  global_shrink ~ cauchy(0,1);
  lowtrap ~ normal(0,1);
  diags ~ normal(0,1);
  to_vector(A) ~ normal(0,1);
  mu = X*B*A';
  for (i in 1:c)
    Y[:,i] ~ normal(mu[:,i], sigma);
}
generated quantities {
  vector[n] log_lik = rep_vector(0, n);
  { //hide mu to not clutter output
    matrix[n, c] mu;
    mu = X*B*A';
    for (row in 1:n)
      log_lik[row] = normal_lpdf(Y[row,:] | mu[row,:], sigma);
  }
}

And some MATLAB (2017a) code to generate data and initial values:

%% Simulate Y from known X and parameters (B, A)
p = 8; %number of columns in X
c = 6; %number of columns in Y
k = 2; %number of columns in B and A
n = 93;%number of rows in X and Y
NC = 12; %number of chains we're going to run
noise_sig = 0.1; % fixed for now.

init_sd = 0.05; % initializations use this to jitter the initial values for different chains.

% some convenient indeces
diag_idx = find(eye(p),k);
trap_idx = tril(true(p),-1);
trap_idx = trap_idx(:, 1:k);

% Select sigma B and sigma G (global)
tdist = makedist('tlocationscale', 'mu', 0, 'sigma', 1, 'nu', 1);
B_sig = abs(tdist.random(p,k));
G_sig = abs(tdist.random(1));

% give X some interesting structure so the method has something to find
sigma = vine(p,1); % a random correlation matrix, from https://stats.stackexchange.com/questions/2746/how-to-efficiently-generate-random-positive-semidefinite-correlation-matrices
X = mvnrnd(zeros(1,p), sigma, n);

% A is simple
A = normrnd(0,1,c, k);
% Simulate B as lower-trapezoidal with positive diagonal
B = normrnd(0,1,p,k); %actually full, we'll apply constraints later
% Apply shrinkage
B = B.*B_sig*G_sig;

% massage B here, so it will generate a problem solvable within constraints
% This is cheating, but I think I can use ML methods to cheat in the same
% way in the real problem.
ridx = zeros(1,k);
for col = 1:k
    % Rearrange B so column col has the largest absolute value not already
    % claimed by a previous column
    tmp = B(:,col);
    tmp(ridx(ridx~=0)) = 0;
    [~, idx] = max(abs(tmp));
    ridx(col) = idx;
end
new_idx = [ridx setdiff(1:p, ridx)];

%% re-arrange B and B_Sig to remain consistent
B = B(new_idx,:);
B_sig = B_sig(new_idx,:);

%% Apply constraints to B
for icol = 1:k
    B(icol,icol) = abs(B(icol,icol));
    B(1:(icol-1), icol) = 0;
end

% Add some random noise
noise = normrnd(0, noise_sig, n, c);
y = X*B*A'+noise;

%% Initial values seem to matter an awful lot.
%
%  The idea is to initialize the chains at some approximate maximum
%  likelihood estimate.

% get a maximum likelihood estimate of beta
beta = X\y; 

% decompose int B*A';
[Q,L] = qr(beta'); 

% This produces iA with l-2 norm of 1. Actual A does not have norm = 1.
% When A is N(0,1), then expected norm is sqrt(numel(A)).
iB = L(1:k, :)'./sqrt(numel(A));
iA = Q(:,1:k).*sqrt(numel(A));

% flip iB so diags are positive
sgn = sign(iB(diag_idx))';
iB = iB.*sgn; %this will only work in recent versions of MATLAB.
iA = iA.*sgn;

% Add some jitter so each chain starts in a slightly different spot.
for i = 1:NC
    idiags = abs(iB(diag_idx) + normrnd(0, init_sd, k,1));
    ilowtrap = abs(iB(trap_idx) + normrnd(0, init_sd, sum(trap_idx(:)), 1));
    ml_init(i) = struct('diags', idiags, 'lowtrap', ilowtrap, 'A', iA); %#okAGROW
end

Thanks again for everyone’s help so far, and I’m happy to get further feedback.


#10

For reference, the prior I mentioned in the context of the blog post is specifically designed to get around the column-ordering problems. It shows what prior a Gaussian prior is transformed to after applying a QR transform (i.e. transforming it to the trapezoidal form).


#11

Thanks for the reminder. I didn’t mean to blow past this part of the suggestion, but I’m having trouble figuring out how the logic in Leung & Drton maps onto this problem. I’ll try to briefly summarize my understanding of what they’re doing, in hopes it will expose some error in my thinking.

They are working with the form y=\beta f+e [1.1], where f is a k-vector of latent factors and \beta is m-by-k. The first thing they do is integrate out f, so now y\sim N(0,\Sigma), where \Sigma=\Omega+\beta \beta' [1.2]. This is great, because if you fix the order of y, you’ve uniquely specified \beta \beta', and if you shuffle the order of y, you can get the same answer by shuffling the rows and columns of \Sigma in the same way. But if we work backwards, re-ordering f has no effect on \beta \beta'. In latent factor analysis, that’s fine, because we want to be invariant to rearrangement of f. Here, though, we can think of the rows of XB as analogous to f, but I don’t think I want to integrate out XB to arrive at some covariance matrix with the order invariance I’m after.

The prior Leung & Drton explicate for the diagonals of L really makes the distribution of L into the LQ decomposition of \beta. What I’m doing is not that, and I probably should if I want to be fair to the model in simulations, but I don’t see how that would remove the need to be careful about the column order of X. I would love to be wrong about that, though.

Anyway, the prior they indicate is L_{i,i}\sim x^{k-i}exp\{-\frac{1}{2C_0}x^2\}. Since I’m modeling each entry in the lower trapezoidal part of B as coming from a normal distribution with (non-identical) variance, I don’t have a convenient C_0 for the distribution of the diagonal. I could back off from the assumption that I need diag_sig, lowtrap_sig and global_shrink in my model, or I could dig into the math and figure out how to adapt that prior. The former isn’t hard, so if it works, it might indicate that the latter option is worth pursuing.


#12

Let’s see if I am on the right track myself. The original non-identifiability arises as XB=(XO)(O^{-1}B) for any invertible O. Getting rid of the non-identifiability corresponds to finding a way to pick out O uniquely. The triangular constraint corresponds to picking the orthogonal matrix O=Q^{T} matching the LQ factorization of X, such that the equation becomes XB=L(QB). Now, if B has a uniform Gaussian distribution then \hat{B}=QB has the same distribution, and we can write XB=L\hat{B} with \hat{B} having the same prior as the original B. The problem now is to pick a prior on L . Any choice of prior then maps back to X=LQ^{T}, so many priors will induce a weird skewed distribution on the original X. Leung and Drton show that if X has a standard Gaussian prior, then L must have the particular prior they detail.

There is an issue, as you note, with imposing irregular variances: if B is not row-wise uniform (column-wise should be fine), then QB will not be identically distributed. Likewise, the Leung and Drton result only holds if X is i.i.d. in its elements. We can effectively change the row-wise variance of X, though, and column-wise variance of B by multiplying with scaling matrices: D_1 XB D_2. So we have a bit of a control still.


#13

Thanks again for taking the time to help me out. Earlier, I was thinking about B and A^T as analogues to the LQ factors of \beta, where Y=X\beta. In that context, I understand why I should use the Leung & Drton priors on the diagonals of L.

But now we’re talking about LQ factorization of the design/predictor matrix X. I can see how that would eliminate worries about the order of columns of X. So now that L and Q are specified by X, I don’t really know what it means for them to have priors. Bringing it back to Stan, I could do the LQ factorization of X in the transformed data block, but I’m not sure what it then means to put a sampling statement for L in the model block.


#14

Ah, sorry, I should have done my due diligence and double-checked the notation you used. In my notation X corresponds to your B and B to your A', i.e. I am solely considering the factor model part of the problem. For your convenience, here is my previous text with the necessary swaps:

edit: with respect to your comment, the key is not to match L and Q to B and A^{T}, but rather to decompose B using LQ decomposition.


#15

Thanks for the updated notation. The idea that B=LQ was indeed what I was missing. So at the end of the day, I don’t have a posterior sample on B and A^T, I have a sample from L and QA^T. I don’t see a direct way to recover B and A. Since that’s a matter of interpretation rather than modeling, I might just be satisfied with L and QA^T, or I could do some other decomposition after-the-fact.

Simulations with standard normal priors on B and A are working great, even with relatively wide X predictor matrices. To get a stable solution, it’s still important that the k diagonals of L are not too close to zero, and this can be achieved by permuting the columns of X. In practice, it looks like I can get a decent ordering by doing a QRP factorization on \beta^T, where \beta is a least-squares estimate in the form Y=X\beta. The ideal ordering would come from QRP factorization of B, but of course B is not easy to estimate.

I’m going to try to tackle the scaling matrix idea next, and assuming that works I’ll post the full model.


#16

Yeah, it is my impression that it will work with Gaussian priors as well (I assume you mean with the triangular constraints?), the issue is that it is quite hard to interpret the prior because of the ordering effect.

Since B is not really uniquely defined, what are you hoping to retrieve with your post hoc analysis? Which of the latents do you need for your further analysis?


#17

Right, I phrased that confusingly. In my simulations I select B and A from standard normal to generate the simulated data, so I model L as lower trapezoidal with the Leung & Drton priors on L.

I’ve actually realized that my proposed column re-ordering approach yields a distribution of the diagonals of L that is not what L&D describe. It makes sense; I’m manipulating the situation to try to get bigger values on the diagonals, so of course it will change the distribution. Maybe that’s what you meant about the prior being hard to interpret. Now, the L&D prior is still pretty reasonable, so maybe that’s not a problem.

Well, the first thing I thought to do was SVD on LQA^T, since that gets me something shaped like the output of partial least squares regression. I don’t really have a particular justification I can articulate for wanting B and A^T instead of L and QA^T. I think it’s just that the zeros in the upper triangle are harder for me to intuitively grasp. Nothing is usually exactly zero, but there those zeros are providing the uniqueness of the solution.


#18

LQA^T is good because it is an identifiable quantity. Doing SVD on each sample should then give you the distribution over singular vectors and singular values, but when I tried it I found it to not be as easy as I had hoped. First off, the SVD is also non-identifiable since you can do sign switches of each left and right vector simultaneously. This is a minor type of ambiguity, but quickly gets annoying to resolve once you take the uncertainty into account. Next, think about what happens when the eigenvalues are identical, then you can also apply simultaneous permutations to the vectors. While it is unlikely that you have perfectly equal eigenvalues, once you start looking across samples the ordering will become inconsistent as singular values switch spaces.

I agree completely with your misgivings about the forced triangular structure, but recall that due to the non-identifiability it is mostly a cosmetic issue. I feel that the nicest way to interpret it is by noting that if each column a of QA^T is Gaussian with \mathcal{N}(\mu,\Sigma), then x=La+e is Gaussian with x\sim\mathcal{N}(L\mu,LL^T+D) for e\sim\mathcal{N}(0,D) and both a and e marginalized. So LL^T is the low-rank covariance structure of each column of x, and it has no pesky zeroes or non-identifiabilities. The triangular bit of L is functionally just a cholesky decomposition of the covariance for the first few elements.