Sampling from sparse models with symmetries

fitting-issues

#1

I have been running into some trouble getting sparsely loaded factor models to work.

My model features an ordinal likelihood, but I think the problem would persist in the continuous setting:

functions { 
  /* sratio-probit log-PDF for a single response 
   * Args: 
   *   y: response category 
   *   mu: linear predictor 
   *   thres: ordinal thresholds 
   *   disc: discrimination parameter 
   * Returns: 
   *   a scalar to be added to the log posterior 
   */ 
   real sratio_probit_lpmf(int y, real mu, vector thres, real disc) { 
     int ncat = num_elements(thres) + 1; 
     vector[ncat] p; 
     vector[ncat - 1] q; 
     for (k in 1:(ncat - 1)) { 
       q[k] = 1 - Phi(disc * (thres[k] - mu)); 
       p[k] = 1 - q[k]; 
       for (kk in 1:(k - 1)) p[k] = p[k] * q[kk]; 
     } 
     p[ncat] = prod(q); 
     return categorical_lpmf(y | p); 
   }
} 


data {
    int<lower=1> D;                             // number of groups (one set of thresholds each)
    int<lower=1> N;                             // number of columns of y
    int<lower=1> S;                             // number of rows of y
    int<lower=1> K;                             // number of latent factors
    int testmin;                                // smallest response category
    int testmax;                                // largest response category
    int<lower=1,upper=D> group[N];              // group index for each column of y
    int<lower=testmin, upper=testmax> y[S,N];   // ordinal responses
    real<lower=1> p0;                           // Expected number of large slopes
}

transformed data {
  real slab_scale = 1;    // Scale for large slopes
  real slab_scale2 = square(slab_scale);
  real slab_df = 25;      // Effective degrees of freedom for large slopes
  real half_slab_df = 0.5 * slab_df;
  real tau0 = (p0 / (K - p0)) * (1. / sqrt(1.0 * S));
  real disc = 1;          // discrimination of the ordinal likelihood (fixed)
}

parameters {
    matrix<lower=0>[K,N] W_tilde;       // non-centred loading magnitudes (constrained positive)
    vector[testmax-2] gamma_tilde[D];   // free ordinal thresholds (first threshold fixed at 0)
    unit_vector[S] B_tilde[K];          // unit-vector factors
    
    //horseshoe
    matrix<lower=0>[K,N] lambda;        // local scales
    real<lower=0> tau_tilde;            // global scale: tau = tau0 * tau_tilde
    real<lower=0> c2_tilde;             // slab width: c2 = slab_scale2 * c2_tilde
}

transformed parameters {
    matrix[K,N] W;
    matrix[S,N] F;
    real llk = 0;
    matrix<lower=0,upper=1>[K,N] shrinkage;
    
    {
    matrix[S,K] B;
    vector[testmax-1] gamma[D];
    
    matrix[K,N] lambda_tilde;

    //horseshoe    
    real tau = tau0 * tau_tilde;
    real c2 = slab_scale2 * c2_tilde;
    lambda_tilde = sqrt( c2 * square(lambda) ./ (c2 + square(tau) * square(lambda)) );
    shrinkage = (tau/sqrt(c2))*lambda_tilde;
   
    //factors
    for (k in 1:K) {
        B[,k] = B_tilde[k];
    }

    //ordinal likelihood weights
    for (d in 1:D) {
        gamma[d,1] = 0.;
        gamma[d,2:] = gamma_tilde[d];
    }
    
    W = tau * lambda_tilde .* W_tilde;
    F = (B*W);
    for (n in 1:N) {
        for (s in 1:S) {
            llk += sratio_probit_lpmf(y[s,n] | F[s,n], gamma[group[n]], disc);
        }
    }    
    }
}

model {
    //basis vectors
    for (k in 1:K) {
        B_tilde[k] ~ normal(0.,1.);
    }
    //loadings
    to_vector(W_tilde) ~ normal(0.,1.);
    
    //sparsity
    to_vector(lambda) ~ cauchy(0., 1.);
    tau_tilde ~ cauchy(0., 1.);
    c2_tilde ~ inv_gamma(half_slab_df, half_slab_df);
    
    //thresholds
    for (d in 1:D)
        gamma_tilde[d] ~ normal(0.,1.);
    
    target += llk;
}

The key quantities are the unit-vector factors B and the “Finnish horseshoe” sparse loadings W. I have followed Betancourt’s horseshoe implementation and adapted it to the factor-model setting.
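For reference, the loading construction in transformed parameters is the regularised horseshoe: with local scales $\lambda_{kn}$, global scale $\tau = \tau_0\,\tilde{\tau}$, and slab variance $c^2 = s^2\,\tilde{c}^2$ (where $s$ is slab_scale), it computes

$$
\tilde{\lambda}_{kn}^2 = \frac{c^2 \lambda_{kn}^2}{c^2 + \tau^2 \lambda_{kn}^2},
\qquad
W_{kn} = \tau\,\tilde{\lambda}_{kn}\,\tilde{W}_{kn},
$$

so a loading behaves like an ordinary horseshoe draw when $\tau \lambda_{kn} \ll c$ and is softly truncated to the slab scale $c$ when $\tau \lambda_{kn} \gg c$.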

My problem seems to be one of non-identifiability. The traceplots of the local sparsity scales lambda feature spiking behaviour: each variable is predominantly zero but shows several tall spikes, so that no element remains uniformly sparse or unshrunk throughout the trace. Furthermore, the (non-sparse) factor elements are all almost symmetrically centered around zero, indicating that the factors are either irrelevant or unidentified. One hypothesis is that something similar to the butterfly geometry of http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html is at play, where the heavily shrunk weights make it possible for factor permutations to occur.
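To spell out the symmetry I have in mind: for any $K \times K$ permutation matrix $P$,

$$
F = B W = (B P)\,(P^{\top} W),
$$

so permuting the factors together with the corresponding rows of W leaves the likelihood untouched, and since the priors on B_tilde, W_tilde and lambda are exchangeable across k, nothing in the model breaks this symmetry. Moreover, whenever an entire row of W is shrunk towards zero, the corresponding unit vector B_tilde[k] is essentially unconstrained by the data, which would be consistent with the symmetric, zero-centered factor elements I am seeing.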


#2

I don’t really understand these models, but why do you think it’s non-identifiability? What is the “spiking behaviour” of traceplots? Non-identifiability typically shows up in pairs plots when two parameters are highly correlated (additive non-identifiability) or form a banana (multiplicative non-identifiability).
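For concreteness: with a likelihood like $y \sim \mathrm{normal}(\alpha + \beta, \sigma)$ only the sum $\alpha + \beta$ is informed by the data, so the pairs plot of $(\alpha, \beta)$ is a narrow ridge along a line of constant $\alpha + \beta$; with $y \sim \mathrm{normal}(\alpha \beta, \sigma)$ only the product is informed, which gives the banana-shaped (hyperbola-like) pairs plot.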

You don’t need to write 0. and 1. in Stan. Integers are automatically promoted to real values.


#3

I tried to describe the “spiking behaviour”, but I really just mean that the sampler does not settle into a particular sparsity pattern; rather, it switches individual elements of W off and on over the course of the run. This manifests as, e.g., a trace that is mostly zero except for a small set of samples where it deviates considerably from zero.

The model certainly has a permutation non-identifiability, as you can switch around the factors and their corresponding loadings. (I think the classical rotation non-identifiability of e.g. PPCA is fixed by the sparse, ICA-like prior on the loadings.) But usually these permutation modes should be distant enough not to cause trouble. My worry is that sparsity forces the permutation modes closer to zero, making transitions between them easier.

The behaviour I expected was that the loadings W would have a fairly concentrated posterior, with most elements being consistently either on or off across all samples. In turn this should induce a concentrated posterior for the factors B. Instead, I get a B whose elements all seem to be distributed symmetrically around 0, while W has an undetermined sparsity pattern.
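
One way I could check this more directly (just a sketch, I have not run it) would be to record a per-draw on/off indicator from the shrinkage matrix already computed in transformed parameters; the 0.5 cutoff is arbitrary:

generated quantities {
  // Per-draw indicator of which loadings escape shrinkage:
  // shrinkage[k, n] is near 1 for a loading sitting in the slab and
  // near 0 for a loading pulled to zero.
  matrix[K, N] active;
  for (k in 1:K)
    for (n in 1:N)
      active[k, n] = shrinkage[k, n] > 0.5;
}

A stable sparsity pattern should then give posterior means of active[k, n] close to 0 or 1, whereas the switching I am describing would show up as means near 0.5.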