Stan model with ~700K parameters: Troubleshooting slow, stuck chains

Hi!
I’m fairly new to Stan. I have a large Supply-Use Table balancing problem with ~700k parameters. For small data, Stan works fine, but when I scale to 5–10% of the real data, sampling becomes extremely slow (8+ hours for 500 warmup iterations) and the chains appear stuck. I tried reduce_sum, raising adapt_delta, etc. Code below. Suggestions on reparameterizations, tuning, or alternatives are welcome!

Some background: I want to balance an economic Supply-Use Table (SUT). SUTs consist of activities (e.g. pizza production) and products (e.g. pizza, cheese, …). Each activity has one (or several) output products (e.g. pizza) and several product inputs (e.g. cheese, tomatoes, dough). All flows of products from/to activities are recorded in tonnes.

The goal of my model is to make sure that:

  1. The sum of all inputs into an activity equals the sum of all outputs (activity balance)
  2. The sum of all products supplied equals the sum of all products used (product balance)
  3. (… other constraints that don’t matter for now)

That’s my Stan code:


functions {
  // penalize relative difference in supply vs. use
  real apply_rel_diff_constraint(vector v1, vector v2, real epsilon, real rel_tol) {
    int N = num_elements(v1);
    vector[N] mean_val = (v1 + v2) / 2 + epsilon;
    vector[N] rel_diff = (v1 - v2) ./ mean_val;
    return normal_lpdf(rel_diff | 0, rel_tol);
  }
}

data {
  int<lower=1> N_sup;
  int<lower=1> N_use;
  
  int<lower=1> N_pr; // Number of unique product-region combinations
  int<lower=1> N_ar; // Number of unique activity-region combinations
  
  // "Observed" means and stdevs on log-scale
  vector[N_sup] mu_log_sup;  
  vector<lower=0>[N_sup] sigma_log_sup;
  
  vector[N_use] mu_log_use;  
  vector<lower=0>[N_use] sigma_log_use;
  
  // Sparse matrix data for use
  vector[N_use] w_pr_use;
  array[N_use] int v_pr_use;
  array[N_pr + 1] int u_pr_use;
  
  vector[N_use] w_ar_use;
  array[N_use] int v_ar_use;
  array[N_ar + 1] int u_ar_use;
  
  // Sparse matrix data for supply
  vector[N_sup] w_pr_sup;
  array[N_sup] int v_pr_sup;
  array[N_pr + 1] int u_pr_sup;
  
  vector[N_sup] w_ar_sup;
  array[N_sup] int v_ar_sup;
  array[N_ar + 1] int u_ar_sup;
  
  // Tolerances and other constants
  real<lower=0> rel_tol_product;
  real<lower=0> rel_tol_activity;
  int<lower=0> grainsize;
  real<lower=0> epsilon;
  
  // Indices of constraints (as they do not apply to all products/activities)
  int<lower=1> N_p_constr;
  array[N_p_constr] int<lower=1, upper=N_pr> idx_p_constr;
  
  int<lower=1> N_a_constr;
  array[N_a_constr] int<lower=1, upper=N_ar> idx_a_constr;
}


parameters {
  // parameterize supply & use in log-space
  vector[N_sup] x_log_sup;
  vector[N_use] x_log_use;
}

transformed parameters {
  // Exponentiate to get supply/use on the original absolute scale
  vector<lower=0>[N_sup] x_sup = exp(x_log_sup);
  vector<lower=0>[N_use] x_use = exp(x_log_use);
  
  // Compute total use by product-region
  vector[N_pr] total_use_pr;
  total_use_pr = csr_matrix_times_vector(N_pr, N_use, w_pr_use, v_pr_use, u_pr_use, x_use);
  
  // Compute total supply by product-region 
  vector[N_pr] total_sup_pr;
  total_sup_pr = csr_matrix_times_vector(N_pr, N_sup, w_pr_sup, v_pr_sup, u_pr_sup, x_sup);
  
  // Compute total input by activity-region
  vector[N_ar] total_input_ar;
  total_input_ar = csr_matrix_times_vector(N_ar, N_use, w_ar_use, v_ar_use, u_ar_use, x_use);
  
  // Compute total output by activity-region
  vector[N_ar] total_output_ar;
  total_output_ar = csr_matrix_times_vector(N_ar, N_sup, w_ar_sup, v_ar_sup, u_ar_sup, x_sup);
}

model {
  // 1) Supply log-likelihood 
  target += normal_lpdf(x_log_sup | mu_log_sup, sigma_log_sup);
  // 2) Use log-likelihood 
  target += normal_lpdf(x_log_use | mu_log_use, sigma_log_use);
  
  // 3) Product-balance constraints (on the absolute scale of x_sup, x_use)
  vector[N_p_constr] sup_sub = total_sup_pr[idx_p_constr];
  vector[N_p_constr] use_sub = total_use_pr[idx_p_constr];
  target += apply_rel_diff_constraint(sup_sub, use_sub, epsilon, rel_tol_product);
  
  // 4) activity-balance constraints
  vector[N_a_constr] output_sub = total_output_ar[idx_a_constr];
  vector[N_a_constr] input_sub = total_input_ar[idx_a_constr];
  target += apply_rel_diff_constraint(output_sub, input_sub, epsilon, rel_tol_activity);
}

In my case, products and activities have a location (i.e. “pizza production” in Italy uses “cheese” from Switzerland). In the code, product-region pairs are noted as *_pr and activity-region pairs as *_ar. I model the two constraints as “soft constraints”, with a relative tolerance.

My full data set currently has the following dimensions: N_sup = 5E4, N_use = 7E5, and is supposed to be larger in the end.

I run the model with cmdstanr on a cluster with 4 cores. For the most recent runs I set the following configurations:

  • max_treedepth = 15 (as I got warnings about max_treedepth being reached)
  • adapt_delta = 0.95
  • iter_warmup = 1000
  • iter_sampling = 500
  • rel_tol_product = 0.01, rel_tol_activity = 0.1 (as the activities show a much higher prior imbalance)
  • I set the initial values close to the priors/observations (with some random noise).

The good thing: the model does what it is supposed to do for small toy data (around 1% of the full data). However, when I run the model with larger data (i.e. more “parameters” in my case), say 5–10% of the full data, I notice the following issues:

  • chains get stuck (they more or less stay at their initial state)
  • sampling takes very long (8 hours for 500 warm-up samples). Most of the time is spent on the “use log-likelihood”, followed by “compute total use/inputs …”.

My questions:

  1. Does anyone have an idea how to reformulate/reparameterize the model to increase sampling speed and/or make a more “favourable geometry”? (I tried threading for the “use log-likelihood” using reduce_sum but did not observe any gains in speed)
  2. What are the most promising configuration settings I can try out to “tune” the sampling algorithm? (further increase warmup/adapt_delta/max_treedepth?, any other?)
  3. Or is my problem just “too big” for Stan/MCMC? If so, do you have any suggestions for alternatives to try out, e.g. Pathfinder, ADVI, etc.? (I haven’t yet looked into those at all.)

Please let me know if you need more information on background, diagnostics, etc.! Thanks for your help!

Did you start small and build up the model from an extremely basic one? That’s the user guide advice, because we are building software products when we compile Stan models. I’ve always stuck strictly to that, and it highlights where the bottlenecks are as one adds complications one at a time. I suspect your constraints are responsible, but it could also be the size of the data + parameters. I really don’t wish to sound condescending here, so please don’t feel under attack, but if you are new to Stan and your first task is a 7e5 parameter model with n=1.2e6 and OR-style constraints, I think you have been put into a very difficult position and would benefit from stepping back and building up slowly. It’s hard to say anything more specific. But don’t give up!

2 Likes

N_pr, N_use, w_pr_use, v_pr_use, and u_pr_use are all data. Consequently, a lot of the work that csr_matrix_times_vector is doing just repeats the same thing each iteration, reconstructing the sparse matrix. The only thing that changes is the multiplication of the resulting matrix and x_use. I’m not familiar with the efficiency of csr_matrix_times_vector but I suspect you could speed this up a lot by pre-computing the matrix in the transformed data block.

transformed data{
    matrix[N_pr, N_use] region_use = csr_to_dense_matrix(N_pr, N_use, w_pr_use, v_pr_use, u_pr_use);
}
...
transformed parameters{
    vector[N_pr] total_use_pr = region_use * x_use;
}

There are some crucial details about your model that I’m not sure about. It looks like there are only some cases where the supply-use equality constraint must apply (based on the comment about “Indices of constraints”). Could you instead use simplexes to enforce the constraints? I see that there’s a complex relationship between the individual counts and the total use/supply. But that might be one approach that enforces equality rather than shrinking towards it.
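For concreteness, here’s a minimal sketch of that idea for a single product-region (the names and dimensions are invented for illustration, not taken from the model above): the total is a free positive parameter and the individual uses are its simplex shares, so the balance holds exactly by construction rather than being shrunk towards.

data {
  int<lower=1> K; // number of uses drawing on this product-region (illustrative)
}
parameters {
  real<lower=0> total_sup; // total supply of the product-region
  simplex[K] use_share;    // how that total is split across the uses
}
transformed parameters {
  // the uses sum to total_sup exactly, so this balance cannot be violated
  vector[K] x_use = total_sup * use_share;
}

Whether this extends to the product and activity balances simultaneously is another question, but it shows the “enforce rather than shrink” pattern.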

2 Likes

Thanks for your suggestions.

The reason I use csr_matrix_times_vector is that the matrices (which map the individual use/supply flows to product-region/activity-region totals) are very large (700k X 50k) and extremely sparse. Pre-computing them as dense matrices would just blow up my memory.

Yes, indeed. The constraints apply to most products/activities but not all (e.g. activities also “supply” CO2 emissions or “use” resources, both of which are omitted from the product balance but count in the activity balance). And yes, in principle the simplexes are a good idea, but in my understanding this could only work for one constraint at a time, not for both constraints at the same time? But I might have to think it through.

1 Like

Thanks for your comment and the encouragement (I put myself into that difficult position :) ). Yes, indeed I started with a very small, simple toy data set, for which the model works fine (it finds a solution even for very high prior imbalances). The problem really seems to arise when increasing the size of the data.

Is it the goal of the model to ensure that or to evaluate whether it holds? In a simpler case, suppose I have a time series and a simple autoregressive model. I can parameterize the AR model to ensure that this leads to a stationary distribution. Or I can fit the unconstrained model to test whether the data lead to a stationary time series or not. If you really want to ensure this constraint, then you need to come up with a parameterization that guarantees it holds. It looks like you’re trying to enforce this with a “soft” constraint (quotes because how soft it is depends on tolerance).
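To make the AR example concrete, here’s a minimal sketch of the “ensure” version (illustrative only): constraining the coefficient to (-1, 1) guarantees stationarity by construction, whereas declaring it as an unconstrained real gives the “evaluate” version.

data {
  int<lower=2> T;
  vector[T] y;
}
parameters {
  real alpha;
  real<lower=-1, upper=1> phi; // the declared constraint is what guarantees stationarity
  real<lower=0> sigma;
}
model {
  y[2:T] ~ normal(alpha + phi * y[1:T - 1], sigma);
}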

How small are your tolerances here? If they’re very small, you’ll induce some seriously problematic regions of high curvature into the model. Stan has trouble with that because we use the standard first-order Hamiltonian integrator (the leapfrog algorithm), which only evaluates gradients at discrete steps of a fixed step size. So while conceptually this might make sense, it won’t fit.

What you really want to do is parameterize in such a way that the constraint is guaranteed to hold. For example, if you’re dividing a pie into N pieces that have to add up to one pie, you can use a simplex. Such parameterizations are sometimes simple and sometimes really complicated or even impossible.

You can parameterize directly with x_sup and x_use by declaring:

parameters {
  vector<lower=0>[N_sup] x_sup;
  ...
model {
   x_sup ~ lognormal(mu_log_sup, sigma_log_sup);

I would also like to second what everyone else is saying, which is to start with something simpler and start with a subset of the data, then scale up. It’ll lead to much faster development iterations. As we say in CS, it’s much easier to improve something that works than to debug something that doesn’t (especially with Stan’s poor debugging tools). Fitting nearly a million parameters is really pushing what Stan will be able to do in a model this complicated.

The fact that you need to raise max_treedepth is showing that the model has extreme curvature—the adaptation has chosen a very small step size to try to compensate. This will turn out to be a losing battle if you’re really using that tolerance to impose a tight constraint. On the other hand, I don’t know how to formulate a constraint based on a sparse matrix multiply—someone who’s better at linear algebra might be able to reduce it to an easier problem to solve.

2 Likes

Thanks @Bob_Carpenter for your thoughts, really appreciate it!

The goal is indeed to ensure it. Our “unbalanced” data come from many sources and show discrepancies (e.g., country A reports 100 t of pizza exports to B, but B reports 50 t imports). My model aims to reconcile those into a balanced table that respects physical conservation of mass.

Initially I used a soft constraint because others did so for similar tasks [1,2] and I assumed it might be easier for sampling. But what I got from your comments is that finding a direct parameterization that cannot violate balance might be more efficient, right?

I tried out values between 0.1 and 0.01, which might be too tight already?

Using your example, my problem is that I have cake pieces of different heights and colors, and the pieces have to add up to a whole cake both when sorted by height and when sorted by color. So it’s more complicated than a single simplex. If anyone has examples or references on enforcing multi-dimensional constraints in Stan, I’d really appreciate them!

Oh yes, this looks cleaner, I will try.

@spinkney has a parameterization of doubly stochastic matrices in the works, per this post

1 Like

Do you mean you have a matrix where the rows have constrained sums and the columns have constrained sums?

As @jsocolar noted, @spinkney is about to roll out a matrix type where both rows and columns sum to zero and values are allowed to be negative or positive—it would presumably be adaptable to sums other than zero! We can do the same and make a doubly stochastic matrix where rows and columns are all simplexes. It’s not out in Stan yet, though.

@Bob_Carpenter I thought a doubly stochastic matrix is one where all rows and columns are simplexes? Doubly stochastic matrix - Wikipedia

But now I’m not totally sure which one @spinkney is working on: a sum-to-zero matrix or the Wikipedia version.

The sum-to-zero matrix, where both the rows and the columns sum to zero, will be in the next version of Stan. It can be an M x N matrix, so the number of rows can differ from the number of columns.

In addition, I have a doubly stochastic matrix parameterization where the rows and columns are both simplexes. The matrix must be square in this case. There’s a doubly stochastic parameterization @bgoodri put together that you can use in the meantime. I suspect it doesn’t have great numerical properties because it depends on the stick-breaking process. The one I’m putting together should be better and faster. It doesn’t use the sum-to-zero matrix, but it does use the sum-to-zero vector. Basically it’s a bunch of inverse ILR simplex transforms with some other things. I’m finishing up other stuff, so I hope to put this in late summer, in time for the fall release.

1 Like

Wow, thank you all for your time and your valuable inputs!

That might indeed help me. By treating use/inputs as negative entries and supply/outputs as positive, I can write down my problem as an M x N matrix where each row and column sums to zero (M = number of products, N = number of activities).

One naive question, though: Will it be possible to place different priors (or constraints) on individual elements of such a sum-to-zero matrix? In my case, some entries have very narrow prior uncertainty, while others are essentially unconstrained, and a few elements are required to be exactly zero.

Yes, but it’s a bit complicated to work out the implications.

By default, every parameter gets a uniform prior on its support in Stan with appropriate change-of-variables adjustments so that you can put further distributions on the resulting parameters. This is how a positive-constrained parameter can be assigned a lognormal prior. We first do the adjustment for the mapping from (-inf, inf) to (0, inf) via exponentiation, then the user can add further priors, such as a lognormal.
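As a tiny sketch of that mechanism (nothing specific to the model in this thread):

parameters {
  real<lower=0> x; // the declared constraint supplies the log transform and its Jacobian
}
model {
  x ~ lognormal(0, 1); // then this is just an ordinary prior on the constrained value
}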

The complication that arises is that the components of the constrained parameters are usually not themselves uniformly distributed. For example, if I declare a simplex[5], then I get a 5-vector with non-negative values that sum to 1. But they’re heavily correlated in that if I add to one, I have to subtract somewhere else to maintain the sum. So putting a prior on one component of a simplex has a knock-on effect on the other parameters. You can write a model with just your prior and simulate from it using Stan to see how the prior you place on components works out on the joint distribution. For example, if I have a sum-to-zero constraint and apply a standard normal prior, the resulting variables have a slightly less than unit variance.
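Concretely, a prior-only sketch like the following (assuming a Stan version that has the sum_to_zero_vector type) lets you check that claim directly: the marginal standard deviations of the sampled z come out slightly below 1.

data {
  int<lower=2> N;
}
parameters {
  sum_to_zero_vector[N] z;
}
model {
  z ~ normal(0, 1);
}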

2 Likes