Bayesian Neural Network with Categorical Likelihood: efficient implementation

Hello,

I am coding up some Bayesian Neural Networks in Stan. Below I provide the simplest model (no hidden layers) to make my question easier to follow.

I am wondering if there is an alternative to my code that would make things more efficient. I have been going through the Stan docs and it seems there are some things I could do, but in practice the model does not compile. My questions are below the code.

data {

      // It is useful to set lower or upper bounds for the variables. This improves readability
      int<lower=0> N;     // number of observations (iid draws from the model)
      int<lower=2> C;     // number of classes. For simplicity, also assume the input has the same dimensionality
      matrix[N,C]  x;     // iid observations. x is a matrix with rows being samples and columns the features
      int<lower=1> y[N];  // iid observations. y is an array of length N containing the class labels

      // definition of the prior p(w|0,I), passed as argument
      real          mu_w; 
      real<lower=0> std_w;

}

parameters {
      matrix[C,C] W; // the parameter we want to infer
}

model {

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w);
      
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood
      
      for (n in 1:N)
        y[n] ~ categorical_logit(p[n]');
   
}     

  1. First question: I was wondering if there is a way to vectorize the likelihood evaluation. From the Stan docs it seems we can (12.5 Categorical Distribution | Stan Functions Reference), but from the Stan examples it seems we can't (1.6 Multi-Logit Regression | Stan User's Guide). I was thinking of doing something like this:

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w);
      
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood
      
     // reshape to work with batched categorical
      vector[C] p_vec[N];
      for (n in 1:N)
          p_vec[n] = p[n]';
        
      y ~ categorical_logit(p_vec);

But this code does not compile, which is strange because, from what I understand, categorical_logit can work with an int array y and an array of vectors p_vec, as stated in section 12.5.4 here: 12.5 Categorical Distribution | Stan Functions Reference. I am not sure if I am interpreting the documentation correctly. In any case, which of the two options I describe is more efficient? I am unsure, since the latter requires copying elements in memory, but I have also understood from the documentation that evaluating log probabilities with vectorized operations is much faster.

Following my previous observation, I have also tried the following code, which seems to compile and sample:

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w);
      
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood
      
      y ~ categorical_logit(to_vector(p)); 

but I am not sure why this works. From what I understand, to_vector(p) will create a column vector with N*C elements. How does that fit with the fact that y has length N? Does Stan automatically split the vector p into vectors matching the number of elements in y?

  2. Second question: to place a prior that goes beyond the standard normal, say one with some correlation, and since there is no matrix normal distribution implemented in Stan yet, I guess the only way to do that would be through reshaping? For example:
parameters {
      matrix[C,C] W; // the parameter we want to infer
}

model {

      // prior over the weights p(W): vectorize W and use a correlated normal prior.
      // Here mu_w would be a vector of length C*C and L_w the Cholesky factor of a
      // (C*C) x (C*C) covariance; whether the correlations are read row-major or
      // column-major depends on how to_vector lays out W (column-major in Stan).
      to_vector(W) ~ multi_normal_cholesky(mu_w, L_w);

      // likelihood p(t|NNet_w(x))
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood

}

I think the only way vectorization works on the categorical_logit is if it’s the same set of parameters for every output. Since your parameters change for every output, I think the for loop is the best you can do there.
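Concretely (just a sketch, with beta standing in for a single shared logit vector):

// vectorization over y only works when every observation shares the same logits, e.g.
//   y ~ categorical_logit(beta);   // beta is one vector[C] used for all of y
// with per-observation logits p[n], the explicit loop is what's left:
for (n in 1:N)
  target += categorical_logit_lpmf(y[n] | p[n]');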

y ~ categorical_logit(to_vector(p)) is doing the wrong thing in this case. It is saying every output comes from the same distribution with probabilities defined by the to_vector(p) variable, which is not what you want.

Yeah the only way to do a matrix normal right now is manually.

If you have performance questions though, best to start with profiling. It’s a thing that just got added recently. There’s an example for using it with cmdstanr over here. You can figure out pretty quickly where the work is going.
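If you're on a recent enough Stan (profiling landed in 2.26), it's basically just wrapping pieces of the model block in profile statements, something like this sketch using the model from the first post:

model {
  profile("prior") {
    to_vector(W) ~ normal(mu_w, std_w);
  }
  profile("likelihood") {
    matrix[N, C] p = x * W;   // logits for the categorical likelihood
    for (n in 1:N)
      y[n] ~ categorical_logit(p[n]');
  }
}

The profiling output then shows how much time (forward and reverse-mode autodiff) goes into the prior versus the likelihood.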


Thanks for the answer. I have a small follow-up question. In the case where you answered:

y ~ categorical_logit(to_vector(p)) is doing the wrong thing in this case. It is saying every output comes from the same distribution with probabilities defined by the to_vector(p) variable, which is not what you want.

What is happening under the hood? I mean, if y has length N and to_vector(p) has length N*C, how does Stan broadcast here?

Separately, I am trying to speed this up using the reduce_sum function; I will post what I come up with.

For each integer y[n], it increments the target log density as if y[n] came from the categorical distribution with probabilities softmax(to_vector(p)).

So like:

for(n in 1:N) {
  y[n] ~ categorical_logit(to_vector(p));
}

Thank you so much. Since there is no parallel version of the categorical log likelihood, I have figured out the following Stan code, which attempts to parallelize the code provided above. It makes use of the reduce_sum function:

// ** Juan Maroñas **
// An example of Stan to perform inference in a Bayesian Neural Network

functions {

  // partial sum of the categorical-logit log likelihood over the slice y[start:end]
  real partial_sum(int[] y_slice, int start, int end, matrix p)
  {
    real lp     = 0;  // accumulated log probability (Stan identifiers must start with a letter, so no "__target")
    int counter = 1;  // y_slice is already sliced, so it is indexed from 1
    for (n in start:end)
    {
      lp      += categorical_logit_lpmf(y_slice[counter] | p[n]');
      counter += 1;
    }
    return lp;
  }

}

data {

      // It is useful to set lower or upper bounds for the variables. This improves readability
      int<lower=0> N;     // number of observations (iid draws from the model)
      int<lower=2> C;     // number of logits, i.e. number of classes. We assume the input dimension is also C
      matrix[N,C]  x;     // iid observations. x is a matrix containing the data
      int<lower=1> y[N];  // iid observations. y is an array of length N containing the class labels

      // definition of the prior p(w|0,I), passed as argument
      real          mu_w; 
      real<lower=0> std_w;

}

parameters {
      matrix[C,C] W; // the parameter we want to infer
}

model {

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w); 
     
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood

      // parallelize the computation
      int grainsize = 1;
      target += reduce_sum(partial_sum, y, grainsize, p);
}

By compiling CmdStan with the STAN_THREADS flag, the above code can run in parallel. The only thing to keep in mind is the for loop in the partial_sum function: since reduce_sum already provides (through y_slice) the y variable sliced over start:end, the index into y_slice has to start from 1. I would appreciate any feedback to make this faster or to correct any possible errors (I am currently working out how to run this on a GPU).
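Equivalently, the counter variable could be dropped by computing the slice index directly, something like:

    // y_slice[1:(end - start + 1)] corresponds to y[start:end],
    // so y_slice[n - start + 1] pairs with row p[n]
    for (n in start:end)
      lp += categorical_logit_lpmf(y_slice[n - start + 1] | p[n]');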

For matrix normal, the easiest way would probably be to extend the multivariate normal non-centered parameterization to work with the left and right covariance matrices.
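Something along these lines (just a sketch with assumed names; M is the mean matrix and L_row, L_col are Cholesky factors of the left and right covariance matrices, here passed in as data):

data {
  int<lower=2> C;
  matrix[C, C] M;                  // prior mean matrix
  cholesky_factor_cov[C] L_row;    // Cholesky factor of the row (left) covariance
  cholesky_factor_cov[C] L_col;    // Cholesky factor of the column (right) covariance
}
parameters {
  matrix[C, C] Z;                  // standard normal innovations
}
transformed parameters {
  // W ~ MatrixNormal(M, L_row * L_row', L_col * L_col')
  matrix[C, C] W = M + L_row * Z * L_col';
}
model {
  to_vector(Z) ~ std_normal();
  // ... likelihood using W as before ...
}

Since Z has iid standard normal entries, W = M + L_row * Z * L_col' follows the matrix normal with row covariance L_row * L_row' and column covariance L_col * L_col', and you never have to build the full (C*C) x (C*C) covariance.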
