Large memory overhead for multinomial regression

Operating System: CentOS x86_64 GNU/Linux
Interface Version: pystan
Compiler/Toolkit: stan

I’ve been tinkering around with Stan fairly recently – so far I have found it to be an incredibly expressive language. It is truly amazing that these resources are under active development.

Right now, I’m trying to implement a hierarchical multinomial-normal regression model. It actually seems to be working with really small examples. The code is below

data {
  int<lower=0> N;    // number of samples
  int<lower=0> D;    // number of dimensions
  int<lower=0> r;    // rank of partition matrix
  matrix[D-1, D] psi;// partition matrix for ilr transform
  real g[N];         // covariate values
  int x[N, D];       // observed counts
parameters {
  vector[D-1] beta0;
  vector[D-1] beta1;
  matrix[N, D-1] x_; // ilr transformed abundances
  vector<lower=0>[D-1] sigma;
  matrix[D-1, r] W; // factor for covariance matrix
transformed parameters{
  cov_matrix[D-1] Sigma;
  matrix[D-1, D-1] I;
  I = diag_matrix(rep_vector(1.0, D-1));
  Sigma = W * W' + diag_matrix(sigma) + 0.001*I;

model {
  // setting priors ...
  for (i in 1:D-1){
    beta0[i] ~ normal(0, 15); 
    beta1[i] ~ normal(0, 15); 
    sigma[i] ~ lognormal(0, 1); 

  for (i in 1:D-1){
    for (j in 1:r){
      W[i, j] ~ normal(0, 5);

  for (j in 1:N){
    x_[j] ~ multi_normal(g[j] * beta1 + beta0, Sigma);
    // perform ilr inverse and multinomial sample
    x[j] ~ multinomial( softmax(to_vector( x_[j] * psi )) );

The problem that I am currently running into is trying to scale this on larger datasets. I am able to get this to run on a N=40, D=256 dataset using ADVI. However, when N=40, D=512, the program crashes with a memory overhead of 9GB.

The largest data structure in this program is the covariance matrix, which would be on the order of 511 x 511. I’d think that would be the size of a few MB, not GB right?

Not sure what is going on with the memory footprint, or how to best diagnose this. Any insights will be greatly appreciated!!

1 Like

It’s not the size of the matrices, it’s the size of the expression graphs that are produced when evaluating the log density. That includes all of the functions and operations applied, which is usually 24 bytes plus 8 bytes per operand. Factoring a covariance matrix as required by multi-normal is a cubic operation.

This is a fairly elaborate model with a sparse covariance matrix based on factors.

Sigma = W * W' + diag_matrix(sigma) + 0.001*I;

This isn’t an efficient way to do this in Stan. You want to use

Sigma = multiply_self_transpose(W);
for (d in 1:D - 1)
  Sigma[d, d] += sigma[d] + 0.001;

That way, you don’t create two matrices full of zeroes. But you might find everything works better with a dense covariance matrix represented in terms of a Cholesky factor. I have no idea how good a simple factor model like this is for covarianceor how big the diag matrices need to be to ensure positive definiteness.

We really need a multinomial-logit distribution to parallel categorical-logit so you don’t have to softmax. There’s also a performance issue with K coefficient vectors vs. K - 1. The latter is identified, but the former is easier to work with in terms of priors.

@Matthijs just wrote a bunch of great GLM functions that reduce a lot of this memory internally. So that might get down the multinomial part of this if he or someone else keeps going and adds the multivariate GLMs like multi-logit.


I forgot to mention that you want to vectorize all this, as in

to_vector(W) ~ normal(0, 5);

and you can just use things ike

sigma ~ lognormal(0, 1);


I’d like to implement some multivariate GLMs like multinomial regression in the not so far future. It’s not that much work and the payoff seems to be pretty good.


Multinomial Regression would be fantastic!