Monitor subset of parameters in cmdstanr?

Hi,

Is there a way to monitor and save only a subset of parameters in cmdstanr?

The models I am fitting have a lot of parameters (which cannot be declared locally since they have constraints), so it takes an extremely long time to load the results after fitting is done.

Hi Enzo,

There’s not currently a way to save only a subset of parameters to disk, but you can select a subset when reading them back into R. The draws() and summary() methods, and I think some others, have a variables argument for that. We’re also working on speeding up the CSV reading, which should also help.
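
A minimal sketch of selecting variables at read time. It uses the small logistic regression bundled with cmdstanr (via cmdstanr_example()) as a stand-in, so the names alpha and beta belong to that example and are not from the model in the question. Running it assumes a working CmdStan installation.

```r
library(cmdstanr)

# Stand-in fit from the bundled example (assumption: in practice this
# object would come from your own mod$sample() call).
fit <- cmdstanr_example("logistic")

# Read only the variables of interest rather than every saved column.
draws_sub <- fit$draws(variables = c("alpha", "beta"))

# summary() accepts the same argument.
fit$summary(variables = c("alpha", "beta"))

# When reading the CSV files directly, read_cmdstan_csv() can also
# restrict what is loaded.
csv_sub <- read_cmdstan_csv(fit$output_files(), variables = "alpha")
```

Note that this only reduces what is loaded into R; the CSV files on disk still contain every saved quantity.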

Just wondering if there has been any update on this, or whether it’s being considered for future releases? It would be really useful for models with lots of nuisance parameters.

@morganbale and I have just switched to cmdstanr. We are running out of disk space saving nuisance parameters. Any updates on this?

Unfortunately that behaviour would have to be implemented in cmdstan as an option. It’s been discussed before but does not look likely to have an implementation on the horizon: stan-dev/cmdstan#553

@andrjohns thanks for the answer and the link to the GitHub issue. Is there any documentation on how to work around this? I don’t have a good mental model for the mapping between where parameters are declared in Stan and whether they are saved. We have a number of parameters declared in the transformed parameters block that we don’t really need to save. Thanks for the help.

The transformed parameters that are stored in the CSV files are those that are declared at the top level of the block.

In this example:

transformed parameters {
  vector[N] a;
  vector[N] b;

  if (something) {
    vector[N] c = ...
    b = ...
  } else {
    b = ...
  }
  {
    vector[N] d = ...
  }
}

only the vectors a and b would be output to the results, while vectors c and d would not be in the output as they are declared in a “local scope”.
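
One way to check this scoping rule yourself is to fit a toy model and list the variables that end up in the output. The sketch below is only an illustration: the Stan program, its variable names (sigma, a, b, d), and the sampler settings are all made up for the purpose, and it assumes CmdStan is installed.

```r
library(cmdstanr)

# Hypothetical toy program: a and b are declared at the top level of
# transformed parameters, d is declared inside a nested local block.
stan_code <- "
parameters {
  real<lower=0> sigma;
}
transformed parameters {
  real a = square(sigma);   // top level
  real b;                   // top level
  {
    real d = 2 * sigma;     // local scope
    b = d + 1;
  }
}
model {
  sigma ~ normal(0, 1);
}
"

mod <- cmdstan_model(write_stan_file(stan_code))
fit <- mod$sample(chains = 1, iter_sampling = 200, refresh = 0)

# List the names of the saved variables.
vars <- posterior::variables(fit$draws())
print(vars)
```

The printed names are expected to include a and b but not d, since d only exists inside the local block.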

Another way to “solve” the nuisance parameter issue is to allow parameters with bounds as local variables - maybe this would be easier to implement than only monitoring a subset of parameters?

Note that bounds on transformed parameters don’t have the same impact as bounds on parameters; they simply act as error checks.

In other words, if you correctly specify your constraints and likelihood, then the bounds on transformed parameters are redundant.

Yes, sorry, I was referring to the parameters block.

A “local” parameter in the parameters block wouldn’t make sense, unfortunately, since it would not exist outside of the parameters block and could not be sampled from or used in transformations.

My solution is to use local variables in the model block, since they don’t get saved. For example, the version using the transformed parameters block is:

data{
    .....
}

parameters{
  
  real<lower=0> inv_alpha;
  real<lower=0> r_over_a;

}
  
transformed parameters{

  real alpha = inv(inv_alpha);
  real r = r_over_a * alpha;

}
  
  
model{
  
  inv_alpha ~ normal(0,1);
  r_over_a ~ normal(0,1);
  
  //...likelihood involving r and alpha ...
 
}

Instead, you can declare them as local variables in the model block:

data{
    .....
}

parameters{
  
  real<lower=0> inv_alpha;
  real<lower=0> r_over_a;

}
  
model{

  real alpha = inv(inv_alpha);
  real r = r_over_a * alpha;
  
  inv_alpha ~ normal(0,1);
  r_over_a ~ normal(0,1);
  
  //...likelihood involving r and alpha ...
  
}

Hope this helps.

Nice, but this doesn’t work if you have lots of nuisance parameters with bounds, e.g. in data augmentation models. I’m surprised that monitoring a subset of parameters isn’t one of the Stan dev team’s top priorities, as it would probably make these models a lot faster.

One would think so (I certainly have thought so as well), but we have done performance checks in the past, and the performance hit is actually not as big as one would think: writing 1000 samples for 50k parameters to the CSV file takes around 20 seconds (so 20ms per sample - not per gradient, per sample), and with 10k parameters it’s 4 seconds.
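
Dividing the two timings quoted above by the number of values written gives the same per-value cost in both cases, which suggests that write time scales roughly linearly with the number of saved quantities:

```latex
\frac{20\ \text{s}}{1000 \times 50{,}000\ \text{values}} = 0.4\ \mu\text{s per value},
\qquad
\frac{4\ \text{s}}{1000 \times 10{,}000\ \text{values}} = 0.4\ \mu\text{s per value}
```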

The way I currently see it, for a model with 50k parameters/quantities, 20 seconds of additional runtime is essentially negligible, and a few GBs of disk space is generally not a big problem. This is not an official stan-dev position. It is how I see it, and these are the reasons I haven’t invested a lot of development time in improving the current state of our I/O, although I have invested a substantial amount of time in improving it in the past.

I am definitely open to changing my stance and to hearing about your use case. For now, I see several other avenues that offer a better return on development time before writing to CSV files becomes the bottleneck.

I acknowledge that writing to a disk could be problematic if you write directly to a network disk. In that case, writing to CSV files could be much slower, so you should avoid doing that if possible. The other case is if you are running many really simple models with 200k+ parameters.

Can you give an example of such a model, or just a part of the parameters/transformed parameters blocks?

multivariate probit