Marginalization of high-dimensional Gaussian

I have N vectors of dimensionality D that I assume are drawn from a multi_normal with full covariance, but some of these vectors are only partially observed, with an irregular pattern of missing values. Assume that D is big enough that manually implementing the marginalization, as in the manual, is unrealistic.

Representing the missing values is already a bit of a hassle: in other programming languages I would have used a boolean mask and boolean indexing. I guess the Stan equivalent is an integer mask for each of the N vectors, stored in an array so it can be indexed, and then converted into index lists inside a loop?
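
Concretely, something like this is what I have in mind (just a sketch of the data layout, with placeholder names):

data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] y[N];                    // missing entries hold a dummy value
  int<lower=0,upper=1> miss[N, D];   // the "boolean" mask: 1 = missing, 0 = observed
}
transformed data {
  int D_miss[N];    // number of missing entries per row
  int N_miss = 0;   // total number of missing entries
  for (n in 1:N) {
    D_miss[n] = sum(miss[n]);
    N_miss += D_miss[n];
  }
}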

Since the marginal covariance is given by the matching submatrix, is my best bet to form, for each vector, a list of the indices of its observed features and then index into the mean and covariance matrix with it?
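
i.e. something along these lines (a sketch only, assuming a mean mu and a full cov_matrix[D] Sigma as parameters, and reusing the mask and counts from above):

model {
  for (n in 1:N) {
    int obs_idx[D - D_miss[n]];  // indices of the observed entries of row n
    int pos = 1;
    for (d in 1:D) {
      if (!miss[n, d]) {
        obs_idx[pos] = d;
        pos += 1;
      }
    }
    // the marginal is just the matching sub-vector of mu and submatrix of Sigma
    y[n, obs_idx] ~ multi_normal(mu[obs_idx], Sigma[obs_idx, obs_idx]);
  }
}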

No. Combine the observed values (from the data block) with the missing values (from the parameters block) into a set of N local vectors, each of size D, in the model block and evaluate multi_normal_cholesky_lpdf on all of them jointly.

Ben has helped me out with this before:

data{
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0> N_miss;
  vector[K] y_obs[N];  // my data were censored at zero, so non-positive entries mark missing values
}
parameters{
  real y_miss[N_miss];
  vector[K] mu;
  cholesky_factor_cov[K] L_Sigma;
}
model{
  vector[K] y[N];
  int pos = 1;

  // Combine missing and observed data
  for(i in 1:N) {
    for (j in 1:K) {
      if (y_obs[i, j] > 0) {  
        y[i, j] = y_obs[i, j];
      }
      else {
        y[i, j] = y_miss[pos];
        pos = pos + 1;
      }
    }
  }
  
  y ~ multi_normal_cholesky(mu, L_Sigma);
}

I was doing that in a previous implementation, but only because I actually needed the imputed values. Is this really the preferred solution in general? It seems like it must at the very least incur a computational penalty, and for high degrees of missingness the number of extra latent variables that need to be sampled over could be considerable.

What about the effect on MAP solutions? I am guessing the marginalized and non-marginalized posteriors can have significantly different MAP solutions.
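
For intuition (a standard funnel-style toy example, nothing like my actual model): take

$$
\theta \sim \mathrm{N}(0, 3^2), \qquad z \mid \theta \sim \mathrm{N}(0, e^{\theta}),
\qquad
p(\theta, z) \propto \exp\!\Big(-\tfrac{\theta^2}{18}\Big)\, e^{-\theta/2} \exp\!\Big(-\tfrac{z^2}{2 e^{\theta}}\Big).
$$

Maximizing the joint gives z = 0 and then theta = -9/2, whereas the marginal over z is just N(0, 3^2) with mode theta = 0, so the two can indeed sit far apart.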

edit: @aornugent, that looks to be a pretty nice template, so unless an alternative method comes up I will likely use that, thanks.

edit: thinking a bit more about the implementation, one key drawback of manual marginalization is that it makes it almost impossible to reuse Cholesky decompositions across rows, which could be a major drain. I am still a bit worried about MAP and the explosion in the number of parameters, though I guess there is no free lunch.
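
To make that concrete for my case, where the covariance is low-rank plus diagonal (script below): for row $n$ with observed index set $o_n$, the marginal is

$$
y_{n, o_n} \sim \mathrm{N}\!\big(\mu_{o_n},\; B_{o_n} B_{o_n}^{\top} + \operatorname{diag}(\sigma_{o_n}^2)\big),
$$

so the $K \times K$ matrix $I_K + B_{o_n}^{\top} \operatorname{diag}(\sigma_{o_n}^{-2}) B_{o_n}$ that the Woodbury trick has to factorize differs for every missingness pattern, while the imputation route factorizes a single $K \times K$ matrix shared by all rows.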

If I do the imputation, is there still an advantage to vectorizing it, e.g. by using segment on y_obs and y_miss in @aornugent's script? My gut says it should beat doing everything element-wise, but Stan's loop vs. vectorization performance has confounded me before.

At the moment the imputation strategy incurs a significant computational cost. Sampling goes from 3 seconds with N=35 fully observed rows to 138 seconds with N=35 plus N_corrupt=31 partially observed rows. Simply doubling the fully observed dataset to N=70 costs 7 seconds, so the extra cost is clearly due to the imputation.

edit: using segment takes the cost down from the absurd 100+ seconds to a much more manageable 30 seconds.

Here is my current script:

functions {
    // Log density of a multivariate normal with covariance diag(sigma^2) + B * B',
    // given L = cholesky_decompose(I + B' * diag(1 ./ square(sigma)) * B).
    // Uses the Woodbury identity and the matrix determinant lemma, so only the
    // K x K factor L is ever needed, never a P x P decomposition.
    real multi_normal_lowrank_cholesky_lpdf(vector x, vector mu, vector sigma, matrix B, matrix L) {
        int lowrank = cols(B);
        vector[lowrank] v = mdivide_left_tri_low(L, B' * ((x - mu) ./ square(sigma)));
        return normal_lpdf(x | mu, sigma) + dot_self(v) / 2. - sum(log(diagonal(L)));
    }

    // Convenience wrapper that builds L from B and sigma.
    real multi_normal_lowrank_lpdf(vector x, vector mu, vector sigma, matrix B) {
        int lowrank = cols(B);
        matrix[rows(B),lowrank] Bscaled = diag_pre_multiply(inv(sigma), B);
        matrix[lowrank,lowrank] L = cholesky_decompose(diag_matrix(rep_vector(1,lowrank)) + crossprod(Bscaled));
        return multi_normal_lowrank_cholesky_lpdf(x | mu, sigma, B, L);
    }
}

data {
    int include_prior;       // flag: include the prior terms
    int include_likelihood;  // flag: include the likelihood terms
    int N;  // number of fully observed rows
    int K;  // number of factors
    int P;  // number of features

    vector[P] features[N];   // fully observed rows

    int N_corrupt;                          // number of partially observed rows
    vector[P] corrupt_features[N_corrupt];  // partially observed rows; missing entries are placeholders
    int<lower=0,upper=1> iscorrupt[N_corrupt,P];  // 1 = missing, 0 = observed
}

transformed data {
    int P_corrupt[N_corrupt];  // number of missing entries per corrupted row
    int total_corrupt = 0;     // total number of missing entries
    for (n in 1:N_corrupt) {
        P_corrupt[n] = 0;
        for (p in 1:P) {
            P_corrupt[n] += iscorrupt[n,p];
        }
        total_corrupt += P_corrupt[n];
    }
}



parameters {
    cholesky_factor_cov[K,K] B_tri;   // constrained lower-triangular block of the loadings
    matrix[P-K,K] B_tilde;            // unconstrained remainder of the loadings
    vector<lower=0>[P] sigma;         // residual scales

    vector[total_corrupt] missing_features;  // one free parameter per missing entry
}

transformed parameters {
    matrix[P,K] B = append_row(B_tri, B_tilde);
    matrix[P,K] Bscaled = diag_pre_multiply(inv(sigma), B);
    // K x K Cholesky factor shared by every row (Woodbury)
    matrix[K,K] L = cholesky_decompose(diag_matrix(rep_vector(1,K)) + crossprod(Bscaled));

    vector[P] imputed_features[N_corrupt];

    {
        int pos = 1;  // running index into missing_features across all corrupted rows
        for (n in 1:N_corrupt) {
            for (p in 1:P) {
                if (iscorrupt[n,p]) {
                    imputed_features[n,p] = missing_features[pos];
                    pos += 1;
                } else {
                    imputed_features[n,p] = corrupt_features[n,p];
                }
            }
        }
    }
}

model { 
    if (include_prior) {
        //Leung and Drton prior on B_tri
        for (l in 1:K) {
            for (k in l:K) {
                if (k==l) {
                    B_tri[k,l] ~ normal(0.,1.) T[0,];
                    target += (K-k)*log(B_tri[k,l]);
                } else {
                    B_tri[k,l] ~ normal(0,1);
                }
            }
        }
        to_vector(B_tilde) ~ normal(0,1);
        sigma ~ normal(0,1); 
    }

    if (include_likelihood) {
        vector[N] llk;  // per-row log likelihood terms (a local variable, not a parameter)

        // fully observed rows
        for (n in 1:N) {
            llk[n] = multi_normal_lowrank_cholesky_lpdf(features[n] | rep_vector(0,P), sigma, B, L);
        }
        target += sum(llk);

        // partially observed rows, with the missing entries imputed as parameters
        for (n in 1:N_corrupt) {
            target += multi_normal_lowrank_cholesky_lpdf(imputed_features[n] | rep_vector(0,P), sigma, B, L);
        }
    }
}

It is basically a marginalized factor model. The weird-looking prior on B_tri is the Leung and Drton prior, designed so that B has the distribution of the triangular factor of a Gaussian matrix, so just ignore it.
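
To spell out what I mean by marginalized: per row, the latent factor scores are integrated out analytically,

$$
y_n = B z_n + \varepsilon_n, \quad z_n \sim \mathrm{N}(0, I_K), \quad \varepsilon_n \sim \mathrm{N}\big(0, \operatorname{diag}(\sigma^2)\big)
\;\;\Longrightarrow\;\;
y_n \sim \mathrm{N}\big(0,\; B B^{\top} + \operatorname{diag}(\sigma^2)\big),
$$

and the custom lpdf evaluates that density via the Woodbury identity and the matrix determinant lemma, so only a K x K matrix ever needs to be factorized.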

edit: here is the segment-based imputation strategy (it replaces the element-wise loop in transformed parameters):

    vector[P] imputed_features[N_corrupt];
    if (N_corrupt > 0) {
        int pos = 1;  // running offset into missing_features
        for (n in 1:N_corrupt) {
            int observed[P - P_corrupt[n]];  // indices of observed entries in row n
            int corrupted[P_corrupt[n]];     // indices of missing entries in row n
            int pos_corrupt = 1;
            int pos_obs = 1;
            for (p in 1:P) {
                if (iscorrupt[n,p]) {
                    corrupted[pos_corrupt] = p;
                    pos_corrupt += 1;
                } else {
                    observed[pos_obs] = p;
                    pos_obs += 1;
                }
            }
            // copy observed values and splice in the matching block of parameters
            imputed_features[n,observed] = corrupt_features[n,observed];
            imputed_features[n,corrupted] = segment(missing_features, pos, P_corrupt[n]);
            pos += P_corrupt[n];
        }
    }