Monitor subset of parameters in cmdstanr?

Hi,

Is there a way to only monitor & save a subset of parameters in cmdstanr?

The models I am fitting have a lot of parameters (which cannot be declared locally since they have constraints), and so it takes an extremely long time to load the fitted model after sampling finishes.

Hi Enzo,

There's not currently a way to only save a subset of parameters to disc, but you can select a subset when reading them back into R. The draws(), summary(), and I think some other methods have a variables argument for that. We're also working on speeding up the CSV reading, which should also help.

Just wondering if there has been any update on this, or whether it's being looked at at all for future releases? It would be really useful for models with lots of nuisance parameters.

@morganbale and I have just switched to cmdstanr. We are running out of disc space saving nuisance parameters. Any updates on this?

Unfortunately that behaviour would have to be implemented in CmdStan as an option. It's been discussed before but does not look likely to have an implementation on the horizon: stan-dev/cmdstan#553

@andrjohns thanks for the answer and the link to the GitHub issue. Is there any documentation on how to work around this? I don't have a good mental model for the mapping between where parameters are declared in Stan and whether they are saved. We have a number of parameters declared in transformed parameters that we don't really need to save. Thanks for the help.

The transformed parameters that are stored in the CSV files are those declared at the top level of the transformed parameters block.

In this example:

transformed parameters {
  vector[N] a;
  vector[N] b;

  if (something) {
    vector[N] c = ...;
    b = ...;
  } else {
    b = ...;
  }
  {
    vector[N] d = ...;
  }
}

Only the vectors a and b would be written to the output, while the vectors c and d would not, as they are declared in a "local scope".

Another way to "solve" the nuisance parameter issue is to allow parameters with bounds as local variables - maybe this would be easier to implement than only monitoring a subset of parameters?

Note that bounds on transformed parameters don't have the same impact as bounds on parameters; they simply act as error checks.

In other words, if you correctly specify your constraints and likelihood, then the bounds on transformed parameters are redundant.
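A minimal sketch of the difference (the variable names here are purely illustrative):

parameters {
  // The bound changes how the parameter is sampled: Stan works on an
  // unconstrained scale, transforms back, and adds a Jacobian term.
  real<lower=0> sigma;
}
transformed parameters {
  // Here the bound is only a validation check: if the computed value
  // ever violates it, the proposal is rejected with a warning.
  real<lower=0> sigma_sq = square(sigma);
}
model {
  sigma ~ normal(0, 1);  // half-normal prior, given the lower bound
}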

Yes, sorry, I was referring to the parameters block.

A ā€œlocalā€ parameter in the parameters block wouldnā€™t make sense unfortunately - since it would not exist outside of the parameters block and could not be sampled from or used in transformations

My solution is to use local variables in the model block, which don't get saved. For example, here is a version that uses transformed parameters:

data {
  ...
}

parameters {
  real<lower=0> inv_alpha;
  real<lower=0> r_over_a;
}

transformed parameters {
  real alpha = inv(inv_alpha);
  real r = r_over_a * alpha;
}

model {
  inv_alpha ~ normal(0, 1);
  r_over_a ~ normal(0, 1);

  // ... likelihood involving r and alpha ...
}

Instead, you can compute the same quantities as local variables in the model block:

data {
  ...
}

parameters {
  real<lower=0> inv_alpha;
  real<lower=0> r_over_a;
}

model {
  real alpha = inv(inv_alpha);
  real r = r_over_a * alpha;

  inv_alpha ~ normal(0, 1);
  r_over_a ~ normal(0, 1);

  // ... likelihood involving r and alpha ...
}

Hope this helps.

Nice, but this doesn't work if you have lots of nuisance parameters with bounds, e.g. in data augmentation models. I'm surprised that monitoring a subset of parameters isn't one of the Stan dev team's top priorities, as it would probably make these models a lot faster.

One would think so (I certainly have thought so as well), but we have done performance checks in the past, and the performance hit is actually not as big as one would think: writing 1000 samples for 50k parameters to the CSV file takes around 20 seconds (so 20 ms per sample - per sample, not per gradient), and with 10k parameters it's 4 seconds.

The way I currently see it, for a model with 50k parameters/quantities, 20 seconds of additional runtime is essentially negligible, and a few GBs of disk space is generally not a big problem. This is not an official stan-dev position - this is how I see it, and these are the reasons I haven't invested a lot of development time in improving the current state of our I/O - and I have invested a substantial amount of time improving it in the past.

I am definitely open to changing my stance and to hearing about your use case. For now, though, I see several other avenues to work on that offer a better return on development time investment before writing to CSV files becomes the bottleneck.

I acknowledge that writing to disk could be problematic if you write directly to a network disk. In that case, writing the CSV files could be much slower, so you should avoid doing that if possible. The other case is if you are running many really simple models with 200k+ parameters.

Can you give an example of such a model, or just a part of the parameters / transformed parameters blocks?

@rok_cesnovar

multivariate probit

I agree with @CerulloE. There are many examples where one is forced to create "raw" versions of the parameters (e.g., vectors with varying bounds, non-centered parameterizations, etc.). These raw versions are required in order for the MCMC sampler to run efficiently, but we are typically only interested in the non-raw parameters.

While 20 seconds doesn't seem so bad, some of us must run simulations where we test a particular method on, say, 10,000 data sets to explore operating characteristics (e.g., bias, MSE, credible interval coverage). Thus, it would take over 55 hours of run time (10,000 x 20 s) to do such a simulation with 50k parameters, as opposed to roughly 11 hours in the 10k-parameter case.

@tarheel just a minor note that in both of the examples you mention (vectors with varying bounds, non-centered parameterizations) it is possible to avoid monitoring the raw parameters, using the <lower, upper> and <offset, multiplier> syntax respectively. This isn't to distract from your overall point, but in case it's useful to you.
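For instance, a minimal sketch along those lines (assuming a reasonably recent Stan version; the names and priors are placeholders, not taken from your model):

data {
  int<lower=1> N;
  vector[N] L;               // element-wise lower bounds
  vector[N] mu;              // prior locations
  vector<lower=0>[N] tau;    // prior scales
}
parameters {
  // per-element lower bounds, no separate "raw" parameter to monitor
  vector<lower=L>[N] theta;
  // non-centered parameterization via offset/multiplier: the sampler
  // works on the standardized scale, but only theta_nc is written out
  vector<offset=mu, multiplier=tau>[N] theta_nc;
}
model {
  theta ~ normal(mu, tau);      // implicitly truncated below at L
  theta_nc ~ normal(mu, tau);
}

In both cases the transformation to and from the unconstrained scale happens under the hood, so the intermediate values never show up in the CSV output.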

Thanks! But I was thinking of the case where the bounds of each component of the parameter are (possibly) different, e.g., if \theta_1 > l_1 and \theta_2 > l_2. Also, if the parameters are a priori correlated, I do not think using multiplier will help either since, e.g., if Z \sim N(0, I) then \mu + LZ \sim N(\mu, \Sigma), where L is the Cholesky factor of \Sigma, so multiplier would only work if \Sigma were a diagonal matrix.

It's possible to pass vectors of bounds via <lower, upper>. Unless I'm being dense, I think that solves that particular issue? :)

I see your point with the correlated parameters! I think that multiplier should still be sufficient to remove funnel geometries, but of course warmup and sampling will likely be more efficient if the raw parameters are uncorrelated (in particular, sampling will be more efficient on uncorrelated parameters if you use a diagonal mass matrix, and warmup will be more efficient and/or require fewer iterations if you use a dense mass matrix).

Apologies, I missed that you said you can pass vectors. That would indeed work in the case that l_1, l_2 are fixed. But if they are parameters in the model, I think you would have to declare raw parameters.

Agreed. The issue that I'm currently facing is that I have a problem where parameters are correlated and I have vectors with varying random bounds. I tried programming it directly and the posterior results are pretty terrible.

At any rate, this is a very good discussion!
