Bayesian Poisson Tensor Factorization optimization

Hello Stan community,

I am looking for ways to computationally speed up my Bayesian Poisson tensor decomposition model, which I fit with mean-field variational inference. The model is heavily based on the Schein et al. (2015) paper. Here is my model specification:

$$
\begin{aligned}
X_{sbm} &\sim \text{Poisson}\bigl(A_{sb}\,\lambda_{sbm}\bigr) \\
\lambda_{sbm} &= \sum_{k=1}^{K} \sigma_{sk}\,\beta_{bk}\,\mu_{mk} \\
\sigma_{sk} &\sim \text{Gamma}\bigl(a,\; a\,b^{(\sigma)}\bigr) \\
\beta_{bk} &\sim \text{Gamma}\bigl(a,\; a\,b^{(\beta)}\bigr) \\
\mu_{mk} &\sim \text{Gamma}\bigl(a,\; a\,b^{(\mu)}\bigr)
\end{aligned}
$$

X is an observed tensor of size $(S, M, B)$ and A is an observed matrix of size $(S, B)$. $\sigma$, $\beta$, and $\mu$ are the factor matrices I’m interested in.

Here is my Stan code (I’ve tried adapting vectorization methods mentioned in this Stan forum post):

data {
    int<lower=0> S;
    int<lower=0> M;
    int<lower=0> B;
    int<lower=0> K;

    array[S, M, B] int<lower=0> X;       // observed count tensor
    array[S] row_vector<lower=0>[B] A;   // observed scaling matrix

    real<lower=0> a;
    real<lower=0> b_sigma;
    real<lower=0> b_mu;
    real<lower=0> b_beta;
}

parameters {
    array[S] row_vector<lower=0>[K] sigma;
    array[M] row_vector<lower=0>[K] mu;
    matrix<lower=0>[K, B] beta;
}

model {
    // Gamma priors on the factor matrices.
    for (s in 1:S) {
        sigma[s] ~ gamma(a, b_sigma);
    }
    for (m in 1:M) {
        mu[m] ~ gamma(a, b_mu);
    }
    to_vector(beta) ~ gamma(a, b_beta);

    // Likelihood: one row of B counts per (s, m) pair.
    for (s in 1:S) {
        for (m in 1:M) {
            row_vector[B] lambda = (sigma[s] .* mu[m]) * beta .* A[s];
            // Small offset keeps the rate strictly positive.
            row_vector[B] lambda_eps = lambda + rep_row_vector(0.000001, B);
            X[s,m] ~ poisson(lambda_eps);
        }
    }
}

I’m compiling and running it using CmdStanPy:

import cmdstanpy as cs

stan_data = dict(
    S=S,
    M=M,
    B=B,
    K=K,
    X=X.astype(int),
    A=A,
    a=0.1,
    b_sigma=3,
    b_mu=1,
    b_beta=1,
)

model = cs.CmdStanModel(stan_file='models/3D_V1.stan')

vi = model.variational(
    data=stan_data,
    algorithm="meanfield",
    grad_samples=1,
    iter=100,
    draws=1,
    seed=11,
    require_converged=False,
    show_console=True,
)

The problem I’m facing is that with my data tensor of dimensions (536, 96, 680), the model fits very slowly and prints the following message:

Chain [1] 1000 transitions using 10 leapfrog steps per transition would take 1.05478e+06 seconds.
Chain [1] Adjust your expectations accordingly!

What I have already tested / observations:

  • Running with a small tensor of size (10, 96, 15) lets the model finish with no problem.
  • Removing the line row_vector[B] lambda_eps = lambda + rep_row_vector(0.000001, B); makes the model reject the initial values, with the log probability evaluating to log(0), i.e. negative infinity.
  • During adaptation, the target sometimes jumps to very large negative numbers (like -1.56092e+171). This happens right after the gamma prior statements, before the Poisson likelihood is reached.

Any comments would be welcome! I appreciate the effort and patience of those replying to my post.

Andrey

I can’t give more than guesses on the optimization of the code itself. I’m hoping that @Bob_Carpenter shows up to educate us both.

But I do use CmdStanPy very often with a model with gamma and Poisson distributions. I have found that initialization issues can be solved either by passing explicit inits or by using Pathfinder to find reasonable initial values. There’s an example of the latter approach in the CmdStanPy docs: Variational Inference using Pathfinder — CmdStanPy 1.2.5 documentation
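
For what it’s worth, here is a minimal, untested sketch of the Pathfinder-for-inits approach with the model from this thread (it assumes CmdStanPy >= 1.2, where pathfinder() and create_inits() are available):

import cmdstanpy as cs

model = cs.CmdStanModel(stan_file='models/3D_V1.stan')

# Run Pathfinder to get a cheap approximation of the posterior.
pf = model.pathfinder(data=stan_data, seed=11)

# Draw one set of initial parameter values from that approximation.
inits = pf.create_inits(chains=1)

# Start meanfield ADVI from those inits instead of random ones.
vi = model.variational(
    data=stan_data,
    algorithm="meanfield",
    inits=inits,
    seed=11,
)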


I believe Stan will always show that message: you can set up any valid model and, if it is correctly implemented, it will run; it may just be that each iteration involves a lot of computation, so the fit takes a very long time, hence “adjust your expectations accordingly”.

How does the number of operations scale with that increase in size? Does the run time scale proportionally, or is it blowing up faster? There is such a thing as too heavy a per-iteration calculation, which makes the whole fit infeasible.
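
For a rough sense of scale, using the dimensions from the post: the likelihood loop computes one rate per cell of the tensor, so moving from the small test tensor to the full one multiplies the per-iteration work by roughly

$$
\frac{536 \times 96 \times 680}{10 \times 96 \times 15} = \frac{34{,}990{,}080}{14{,}400} \approx 2430,
$$

and each cell’s rate also costs a length-K dot product. A per-iteration cost that is harmless at (10, 96, 15) can easily become infeasible at (536, 96, 680).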

The Poisson distribution only accepts rates greater than or equal to zero, and a rate of exactly zero is tricky in practice: the likelihood of any count other than zero would be zero, so the log probability would be minus infinity. So even if you are not expecting rates below zero, it is possible they are fluctuating to something vanishingly small but still negative. If that is happening, my suggestion would be to clamp those values (to zero, or to a tiny positive floor) before passing them to the distribution. What you are doing now also works, as long as no true rate is smaller than 0.000001 and adding that constant doesn’t distort the model, i.e. it is in practice equivalent to zero.
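
As a sketch of that clamping idea in the likelihood loop of the original model (untested; fmax applies elementwise in recent Stan versions, and the 1e-6 floor is just as arbitrary as the constant already in the post):

row_vector[B] lambda = (sigma[s] .* mu[m]) * beta .* A[s];
// fmax floors each rate at a small positive value instead of adding an offset.
X[s,m] ~ poisson(fmax(lambda, 1e-6));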

I’m not sure what you mean by this: are you debugging by printing the probabilities at each step? It may well be that some proposal has very low probability and ends up rejected (or is accepted but simply has a low value), but is it a large discrepancy from the other proposals? If you have several thousand observations, it is entirely possible that adding up all of their small log probabilities gives you a large negative number. The question is whether that is unexpected for some reason.

Without going into your model in detail, maybe some of those loops can be parallelized or eliminated altogether.

I haven’t done this in a while, but I think for a vector the distribution parameters either match its size or are scalars, in which case they apply to all of its elements. So the prior loops could probably be collapsed to just sigma ~ gamma(a, b_sigma), for instance. I’m not sure that alone will save computation on the order you need, so maybe start by looking at which loops are really necessary, and then at what can be replaced with more efficient vector/matrix multiplication as well as parallelization functions.
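
To make that concrete, here is an untested sketch of what collapsed priors and a matrix-algebra likelihood could look like, assuming sigma and mu are re-declared as matrix[S, K] and matrix[M, K] instead of arrays of row vectors (everything else as in the original post):

model {
    // Priors: one vectorized statement per factor matrix, no loops.
    to_vector(sigma) ~ gamma(a, b_sigma);
    to_vector(mu) ~ gamma(a, b_mu);
    to_vector(beta) ~ gamma(a, b_beta);

    // Likelihood: build the whole M x B rate matrix for slice s at once;
    // lambda[m, b] = sum_k sigma[s, k] * mu[m, k] * beta[k, b], scaled by A[s, b].
    for (s in 1:S) {
        matrix[M, B] lambda = (mu * diag_pre_multiply(sigma[s], beta))
                              .* rep_matrix(A[s], M);
        // to_vector(lambda') flattens row-major, matching to_array_1d(X[s]).
        to_array_1d(X[s]) ~ poisson(to_vector(lambda') + 1e-6);
    }
}

This replaces the S × M sampling statements with S of them, which shrinks the autodiff graph; that is usually where the per-iteration time goes.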
