Large memory overhead for multinomial regression

mortonjt · November 7, 2017, 9:58pm

Operating System: CentOS x86_64 GNU/Linux
Interface Version: pystan 2.17.0.0
Compiler/Toolkit: stan 2.17.0.0

I’ve been tinkering around with Stan fairly recently – so far I have found it to be an incredibly expressive language. It is truly amazing that these resources are under active development.

Right now, I’m trying to implement a hierarchical multinomial-normal regression model. It actually seems to be working with really small examples. The code is below

data {
  int<lower=0> N;    // number of samples
  int<lower=0> D;    // number of dimensions
  int<lower=0> r;    // rank of partition matrix
  matrix[D-1, D] psi;// partition matrix for ilr transform
  real g[N];         // covariate values
  int x[N, D];       // observed counts
}
parameters {
  vector[D-1] beta0;
  vector[D-1] beta1;
  matrix[N, D-1] x_; // ilr transformed abundances
  vector<lower=0>[D-1] sigma;
  matrix[D-1, r] W; // factor for covariance matrix
}
transformed parameters{
  cov_matrix[D-1] Sigma;
  matrix[D-1, D-1] I;
  I = diag_matrix(rep_vector(1.0, D-1));
  Sigma = W * W' + diag_matrix(sigma) + 0.001*I;
}

model {
  // setting priors ...
  for (i in 1:D-1){
    beta0[i] ~ normal(0, 15); 
    beta1[i] ~ normal(0, 15); 
    sigma[i] ~ lognormal(0, 1); 
  }

  for (i in 1:D-1){
    for (j in 1:r){
      W[i, j] ~ normal(0, 5);
    }
  }

  for (j in 1:N){
    x_[j] ~ multi_normal(g[j] * beta1 + beta0, Sigma);
    // perform ilr inverse and multinomial sample
    x[j] ~ multinomial( softmax(to_vector( x_[j] * psi )) );
  }
}

The problem that I am currently running into is trying to scale this on larger datasets. I am able to get this to run on a N=40, D=256 dataset using ADVI. However, when N=40, D=512, the program crashes with a memory overhead of 9GB.

The largest data structure in this program is the covariance matrix, which would be on the order of 511 x 511. I’d think that would be the size of a few MB, not GB right?

Not sure what is going on with the memory footprint, or how to best diagnose this. Any insights will be greatly appreciated!!

Bob_Carpenter · November 9, 2017, 12:29am

It’s not the size of the matrices, it’s the size of the expression graphs that are produced when evaluating the log density. That includes all of the functions and operations applied, which is usually 24 bytes plus 8 bytes per operand. Factoring a covariance matrix as required by multi-normal is a cubic operation.

This is a fairly elaborate model with a sparse covariance matrix based on factors.

Sigma = W * W' + diag_matrix(sigma) + 0.001*I;

This isn’t an efficient way to do this in Stan. You want to use

Sigma = multiply_self_transpose(W);
for (d in 1:D - 1)
  Sigma[d, d] += sigma[d] + 0.001;

That way, you don’t create two matrices full of zeroes. But you might find everything works better with a dense covariance matrix represented in terms of a Cholesky factor. I have no idea how good a simple factor model like this is for covarianceor how big the diag matrices need to be to ensure positive definiteness.

We really need a multinomial-logit distribution to parallel categorical-logit so you don’t have to softmax. There’s also a performance issue with K coefficient vectors vs. K - 1. The latter is identified, but the former is easier to work with in terms of priors.

@Matthijs just wrote a bunch of great GLM functions that reduce a lot of this memory internally. So that might get down the multinomial part of this if he or someone else keeps going and adds the multivariate GLMs like multi-logit.

Bob_Carpenter · November 9, 2017, 12:30am

mortonjt:

for (i in 1:D-1){
    beta0[i] ~ normal(0, 15); 
    beta1[i] ~ normal(0, 15); 
    sigma[i] ~ lognormal(0, 1); 
  }

  for (i in 1:D-1){
    for (j in 1:r){
      W[i, j] ~ normal(0, 5);
    }
  }

I forgot to mention that you want to vectorize all this, as in

to_vector(W) ~ normal(0, 5);

and you can just use things ike

sigma ~ lognormal(0, 1);

directly.

Matthijs · November 10, 2017, 5:06pm

I’d like to implement some multivariate GLMs like multinomial regression in the not so far future. It’s not that much work and the payoff seems to be pretty good.

jsilve24 · November 11, 2017, 3:49pm

Multinomial Regression would be fantastic!

Topic		Replies	Views
Parallelization in cmdstanr: using reduce_sum() on a multinomial likelihood Modeling cmdstanr , multinomial-response	14	1177	January 9, 2024
Simple Multinomial Logistic Regression Performance Modeling performance	5	1002	September 30, 2020
Troubleshooting a "MaxDiff" hierarchical multinomial regression RStan	4	1860	April 17, 2020
Ram problem when pytan is running with very large files General	3	40	October 30, 2024
Hierarchical Multinomial Model Optimization/Specification Modeling performance , multinomial-response , hierarchical-model	15	1420	May 7, 2020

Large memory overhead for multinomial regression

Related topics