Bayesian Neural Network with Categorical Likelihood: efficient implementation

Hello,

I am coding up some Bayesian Neural Networks in Stan. Below I provide the simplest model (no hidden layers) to make my question easier to follow.

I am wondering if there is an alternative to my code that would make things more efficient. I have been going through the Stan docs and it seems there are some things I could do, but in practice the model does not compile. My questions are below the code.

data {

      // It is useful to set lower or upper bounds for the variables. This improves readability
      int<lower=0> N;     // number of observations (iid draws from the model)
      int<lower=2> C;     // number of classes. For simplicity, also assume the input has the same dimensionality
      matrix[N,C]  x;     // iid observations. x is a matrix with rows being samples and columns the features
      int<lower=1> y[N];  // iid observations. y is an array of length N containing the class labels

      // definition of the prior p(w|0,I), passed as argument
      real          mu_w; 
      real<lower=0> std_w;

}

parameters {
      matrix[C,C] W; // the parameter we want to infer
}

model {

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w);
      
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood
      
      for (n in 1:N)
        y[n] ~ categorical_logit(p[n]');
   
}     

  1. First question: I was wondering if there is a way to vectorize the likelihood evaluation. From the Stan docs it seems we can (12.5 Categorical Distribution | Stan Functions Reference), but from the Stan examples it seems we can't (1.6 Multi-Logit Regression | Stan User's Guide). I was thinking of doing something like this:

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w);
      
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood
      
     // reshape to work with batched categorical
      vector[C] p_vec[N];
      for (n in 1:N)
          p_vec[n] = p[n]';
        
      y ~ categorical_logit(p_vec);

But this code does not compile, which is strange because, from what I understand, categorical_logit can work with an int array y and an array of vectors p_vec, as stated in section 12.5.4 here: 12.5 Categorical Distribution | Stan Functions Reference. I am not sure if I am interpreting the documentation correctly. In any case, which of the two options I describe is more efficient? I am unsure, since the latter requires copying elements in memory, but I have also understood from the documentation that evaluating log probabilities with vectorized operations is much faster.

Following my previous observation, I have also tried the following code, which seems to compile and sample:

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w);
      
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood
      
      y ~ categorical_logit(to_vector(p)); 

but I am not sure why this works. From what I understand, to_vector(p) will create a column vector with N*C elements. How does that fit with the fact that y has length N? Does Stan automatically split the vector p into vectors matching the number of elements in y?

  2. Second question: to place a prior that goes beyond the standard normal, say one with some correlation, and since there is no matrix normal distribution implemented in Stan yet, I guess the only way to do that would be through reshaping? For example:
parameters {
      matrix[C,C] W; // the parameter we want to infer
}

model {

      // prior over the weights p(W): vectorize W and use a correlated normal prior.
      // Here mu_w would be a vector of length C*C and L_w the Cholesky factor of a
      // (C*C) x (C*C) covariance; whether the correlations are read row-major or
      // column-major depends on how to_vector lays out W (column-major in Stan).
      to_vector(W) ~ multi_normal_cholesky(mu_w, L_w);

      // likelihood p(t|NNet_w(x))
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood

}

I think the only way vectorization works on the categorical_logit is if it’s the same set of parameters for every output. Since your parameters change for every output, I think the for loop is the best you can do there.
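Concretely (just a sketch, with beta standing in for a single shared logit vector):

// vectorization over y only works when every observation shares the same logits, e.g.
//   y ~ categorical_logit(beta);   // beta is one vector[C] used for all of y
// with per-observation logits p[n], the explicit loop is what's left:
for (n in 1:N)
  target += categorical_logit_lpmf(y[n] | p[n]');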

y ~ categorical_logit(to_vector(p)) is doing the wrong thing in this case. It is saying every output comes from the same distribution with probabilities defined by the to_vector(p) variable, which is not what you want.

Yeah the only way to do a matrix normal right now is manually.

If you have performance questions though, best to start with profiling. It’s a thing that just got added recently. There’s an example for using it with cmdstanr over here. You can figure out pretty quickly where the work is going.
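If you're on a recent enough Stan (profiling landed in 2.26), it's basically just wrapping pieces of the model block in profile statements, something like this sketch using the model from the first post:

model {
  profile("prior") {
    to_vector(W) ~ normal(mu_w, std_w);
  }
  profile("likelihood") {
    matrix[N, C] p = x * W;   // logits for the categorical likelihood
    for (n in 1:N)
      y[n] ~ categorical_logit(p[n]');
  }
}

The profiling output then shows how much time (forward and reverse-mode autodiff) goes into the prior versus the likelihood.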


Thanks for the answer. I have a small follow-up question. In the case where you answered:

y ~ categorical_logit(to_vector(p)) is doing the wrong thing in this case. It is saying every output comes from the same distribution with probabilities defined by the to_vector(p) variable, which is not what you want.

What is happening under the hood? I mean, if y has length N and to_vector(p) has length N*C, how does Stan broadcast here?

Separately, I am trying to speed this up using the reduce_sum function; I will post what I come up with.

For each integer y[n], it increments the target log density as if y[n] came from the categorical distribution with probabilities softmax(to_vector(p)).

So like:

for(n in 1:N) {
  y[n] ~ categorical_logit(to_vector(p));
}

Thank you so much. Since there is no parallel version of the categorical log likelihood, I have figured out the following Stan code, which attempts to parallelize the code provided above. It makes use of the reduce_sum function:

// ** Juan Maroñas **
// An example of Stan to perform inference in a Bayesian Neural Network

functions {

  // partial sum of the categorical-logit log likelihood over the slice y[start:end]
  real partial_sum(int[] y_slice, int start, int end, matrix p)
  {
    real lp     = 0;  // accumulated log probability (Stan identifiers must start with a letter, so no "__target")
    int counter = 1;  // y_slice is already sliced, so it is indexed from 1
    for (n in start:end)
    {
      lp      += categorical_logit_lpmf(y_slice[counter] | p[n]');
      counter += 1;
    }
    return lp;
  }

}

data {

      // It is useful to set lower or upper bounds for the variables. This improves readability
      int<lower=0> N;     // number of observations (iid draws from the model)
      int<lower=2> C;     // number of logits, i.e. number of classes. We assume the input dimension is also C
      matrix[N,C]  x;     // iid observations. x is a matrix containing the data
      int<lower=1> y[N];  // iid observations. y is an array of length N containing the class labels

      // definition of the prior p(w|0,I), passed as argument
      real          mu_w; 
      real<lower=0> std_w;

}

parameters {
      matrix[C,C] W; // the parameter we want to infer
}

model {

      // prior over the weights p(w)
      to_vector(W) ~ normal(mu_w,std_w); 
     
      // likelihood p(t|NNet_w(x)) 
      matrix[N,C] p = x*W; // get parameters from Categorical Likelihood

      // parallelize the computation
      int grainsize = 1;
      target += reduce_sum(partial_sum, y, grainsize, p);
}

By compiling CmdStan with the STAN_THREADS flag, the above code can run in parallel. The only thing to keep in mind is the for loop in the partial_sum function: since reduce_sum already provides (through y_slice) the y variable sliced over start:end, the index into y_slice has to start from 1. I would appreciate any feedback to make this faster or to correct any possible errors (I am currently working out how to run this on a GPU).
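Equivalently, the counter variable could be dropped by computing the slice index directly, something like:

    // y_slice[1:(end - start + 1)] corresponds to y[start:end],
    // so y_slice[n - start + 1] pairs with row p[n]
    for (n in start:end)
      lp += categorical_logit_lpmf(y_slice[n - start + 1] | p[n]');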

For matrix normal, the easiest way would probably be to extend the multivariate normal non-centered parameterization to work with the left and right covariance matrices.
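Something along these lines (just a sketch with assumed names; M is the mean matrix and L_row, L_col are Cholesky factors of the left and right covariance matrices, here passed in as data):

data {
  int<lower=2> C;
  matrix[C, C] M;                  // prior mean matrix
  cholesky_factor_cov[C] L_row;    // Cholesky factor of the row (left) covariance
  cholesky_factor_cov[C] L_col;    // Cholesky factor of the column (right) covariance
}
parameters {
  matrix[C, C] Z;                  // standard normal innovations
}
transformed parameters {
  // W ~ MatrixNormal(M, L_row * L_row', L_col * L_col')
  matrix[C, C] W = M + L_row * Z * L_col';
}
model {
  to_vector(Z) ~ std_normal();
  // ... likelihood using W as before ...
}

Since Z has iid standard normal entries, W = M + L_row * Z * L_col' follows the matrix normal with row covariance L_row * L_row' and column covariance L_col * L_col', and you never have to build the full (C*C) x (C*C) covariance.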
