Generated Quantities returns error "Mismatch between model and fitted_parameters csv file"

I’m attempting to use the generate_quantities function to predict on some new data. I’ve fit my initial model using cmdstanr

fit <- model$sample(model_data)

and have written a new model to perform the generated quantities, as in the documentation for the function here.

pred_model$generate_quantities(fit, data = pred_data)

When I do this, my chains finish unexpectedly with the error “Mismatch between model and fitted_parameters csv file”.

What does this mean? Where does this mismatch typically occur?

Here are the associated stan models

First, the model I use to fit my data

data{
  int n; //Total number of observations
  int subjectids[n]; //Subject identification number as an integer.  Mine go from 1 - 36
  int n_subjectids; //How many unique subjects do I have?
  vector[n] time; //time at which subjects were observed?  Length N
  real yobs[n]; //Observed concentrations
  
  //Covars
  vector[n] sex;
  vector[n] weight;
  vector[n] creatinine;
  vector[n] age;
  vector[n] D;


}
parameters{
  
  real<lower=0>  mu_cl;
  real<lower=0> s_cl;                                                   
  vector[n_subjectids] z_cl;
  
  real<lower=0> mu_tmax;
  real<lower=0> s_t;
  vector[n_subjectids] z_t;

  real<lower=0, upper=1> phi;
  real<lower=0, upper=1> kappa;
  vector<lower=0, upper=1>[n_subjectids] delays;
  
  real<lower=0> sigma;
  
  real mu_alpha;
  real<lower=0> s_alpha;
  vector[n_subjectids] z_alpha;
  
  real beta_cl_sex;
  real beta_cl_weight;
  real beta_cl_creatinine;
  real beta_cl_age;
  
  
}
transformed parameters{
  vector<lower=0>[n] Cl = exp(mu_cl + z_cl[subjectids]*s_cl + beta_cl_sex*sex + beta_cl_weight*weight + beta_cl_creatinine*creatinine + beta_cl_age*age);
  vector<lower=0>[n] t = exp(mu_tmax + z_t[subjectids]*s_t);
  vector<lower=0, upper=1>[n] alpha = inv_logit(mu_alpha + z_alpha[subjectids]*s_alpha);
  vector<lower=0>[n] ka = log(alpha)./(t .* (alpha-1));
  vector<lower=0>[n] ke = alpha .* log(alpha)./(t .* (alpha-1));
  vector<lower=0>[n] delayed_time = time - 0.5*delays[subjectids];
   
  vector<lower=0>[n] C = (0.5*D ./ Cl) .* (ke .* ka) ./ (ke - ka) .* (exp(-ka .* delayed_time) -exp(-ke .* delayed_time));
}
model{
  mu_tmax ~ normal(log(3.3), 0.25);
  s_t ~ gamma(10, 100);
  z_t ~ normal(0,1);
  
  mu_cl ~ normal(log(3.3),0.15);
  s_cl ~ gamma(15,100);
  z_cl ~ normal(0,1);
  
  
  mu_alpha ~ normal(0,1);
  s_alpha ~ gamma(10, 100);
  z_alpha ~ normal(0,1);
  
  
  phi ~ beta(20,20);
  kappa ~ beta(20,20);
  delays ~ beta(phi/kappa, (1-phi)/kappa);
  
  beta_cl_sex ~ student_t(3,0,2.5);
  beta_cl_weight ~ student_t(3,0,2.5);
  beta_cl_creatinine ~ student_t(3,0,2.5);
  
  sigma ~ lognormal(log(0.1), 0.2);
  yobs ~ lognormal(log(C), sigma);
}

And now, the model I use for generated quantities


data{
  int n; //Total number of observations
  int subjectids[n]; //Subject identification number as an integer.  Mine go from 1 - 36
  int n_subjectids; //How many unique subjects do I have?
  vector[n] time; //time at which subjects were observed?  Length N
  real yobs[n]; //Observed concentrations
  
  //Covars
  vector[n] sex;
  vector[n] weight;
  vector[n] creatinine;
  vector[n] age;
  vector[n] D;


}
parameters{
  
  real<lower=0>  mu_cl;
  real<lower=0> s_cl;                                                   
  vector[n_subjectids] z_cl;
  
  real<lower=0> mu_tmax;
  real<lower=0> s_t;
  vector[n_subjectids] z_t;

  real<lower=0, upper=1> phi;
  real<lower=0, upper=1> kappa;
  vector<lower=0, upper=1>[n_subjectids] delays;
  
  real<lower=0> sigma;
  
  real mu_alpha;
  real<lower=0> s_alpha;
  vector[n_subjectids] z_alpha;
  
  real beta_cl_sex;
  real beta_cl_weight;
  real beta_cl_creatinine;
  real beta_cl_age;
  
  
}

generated quantities{
  vector<lower=0>[n] Cl = exp(mu_cl + beta_cl_sex*sex + beta_cl_weight*weight + beta_cl_creatinine*creatinine + beta_cl_age*age);
  real<lower=0> t = exp(mu_tmax);
  real<lower=0> alpha = inv_logit(mu_alpha);
  real<lower=0> ka = log(alpha)/(t * (alpha-1));
  real<lower=0> ke = alpha * ka;
  vector<lower=0>[n] delayed_time = time - 0.5*phi;
   
  vector<lower=0>[n] C = (0.5*D ./ Cl) * (ke * ka) / (ke - ka) .* (exp(-ka * delayed_time) -exp(-ke * delayed_time));
}

I haven’t had a chance to look into this, but I wonder if it doesn’t like that you’ve used the same names as the transformed parameters in the original model. What happens if you use different names in generated quantities?

In the second model, I’ve changed the generated quantities block so that variables now have a p at the end of their name


generated quantities{
  vector<lower=0>[n] Clp = exp(mu_cl + beta_cl_sex*sex + beta_cl_weight*weight + beta_cl_creatinine*creatinine + beta_cl_age*age);
  real<lower=0> tp = exp(mu_tmax);
  real<lower=0> alphap = inv_logit(mu_alpha);
  real<lower=0> kap = log(alphap)/(tp * (alphap-1));
  real<lower=0> kep = alphap * kap;
  vector<lower=0>[n] delayed_timep = time - 0.5*phi;
   
  vector<lower=0>[n] Cp = (0.5*D ./ Clp) * (kep * kap) / (kep - kap) .* (exp(-kap * delayed_timep) -exp(-kep * delayed_timep));
}

This results in the same error.

Thanks for trying. I’m glad that’s not the problem but also sorry there’s still a problem. Maybe @mitzimorris or @rok_cesnovar will know what’s up, but if not then is it possible to share model_data and pred_data (or fake versions if you need to keep those private) so that we can reproduce this?

@mitzimorris @rok_cesnovar Any ideas? Also, if possible we should update that error message (I think that needs to be done in CmdStan not CmdStanR, right?).

Models here

Data here

You can run these via

library(cmdstanr)

model = cmdstan_model('models/model.stan')
model_data = readRDS('fit_data/model_data.RDS')
fit = model$sample(model_data)


pred_model = cmdstan_model('models/pred_model.stan')
pred_data = readRDS('fit_data/pred_data.RDS')
pred_model$generate_quantities(fit, pred_data)

Thanks!

the model is fit to model_data: fit <- model$sample(model_data)
but the predictions use pred_data - is the value for data variable n_subjectids the same in both datasets?

No. There are 300-some in the former and 100 in the latter. I’ve linked the models and data in a comment above.

the problem is that in the 2nd model, the parameters block declares vectors such as z_t and z_alpha whose length n_subjectids comes from pred_data (100), but the fit from the 1st model contains vectors of length 300-some, so the sizes don’t match and CmdStan objects.

since the generated quantities block doesn’t use z_t or z_alpha, if you remove them from the 2nd model, you should be OK.

So I’ve updated the model to

data{
  int n; //Total number of observations
  vector[n] time; //time at which subjects were observed?  Length N
  vector[n] sex;
  vector[n] weight;
  vector[n] creatinine;
  vector[n] age;
  vector[n] D;
  
  
}
parameters{
  
  real mu_cl;
  real mu_tmax;
  real phi;
  real mu_alpha;
  
  real beta_cl_sex;
  real beta_cl_weight;
  real beta_cl_creatinine;
  real beta_cl_age;
  
}

generated quantities{
  vector<lower=0>[n] Clp = exp(mu_cl + beta_cl_sex*sex + beta_cl_weight*weight + beta_cl_creatinine*creatinine + beta_cl_age*age);
  real<lower=0> tp = exp(mu_tmax);
  real<lower=0> alphap = inv_logit(mu_alpha);
  
  real<lower=0> kap = log(alphap)/(tp * (alphap-1));
  real<lower=0> kep = alphap * kap;
  vector<lower=0>[n] delayed_timep = time - 0.5*phi;
  
  vector<lower=0>[n] Cp = (0.5*D ./ Clp) * (kep * kap) / (kep - kap) .* (exp(-kap * delayed_timep) -exp(-kep * delayed_timep));
}

And the same error occurs. I played around with the code last night and find even this very simple model will return the same error

data{
  int n; //Total number of observations
  vector[n] time; //time at which subjects were observed?  Length N
  vector[n] sex;
  vector[n] weight;
  vector[n] creatinine;
  vector[n] age;
  vector[n] D;
  
  
}
parameters{
  
  real mu_cl;
  real mu_tmax;

  
}

generated quantities{
  real x = mu_cl+1;
  real xx = mu_tmax+1;
}

OK, CmdStan implementation is too stupid/brittle.

what should work is a pair of models, {data, pred}, where the variables in the parameters block are arranged such that all variables needed by the prediction model are declared first in the data model, and the order in which they are declared is the same.
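
as a sketch of that arrangement for the very simple prediction model above (untested, and going only by the description in this post): it assumes the fitting model’s parameters block is edited so that mu_cl and mu_tmax are its first two declarations, with the same constraints, and everything else follows them.

```stan
// pred_model.stan (sketch): declares only what generated quantities needs.
// Assumption: the fitting model's parameters block has been reordered to start with
//   real<lower=0> mu_cl;
//   real<lower=0> mu_tmax;
// and then declares s_cl, z_cl, s_t, z_t, ... as before.
parameters {
  real<lower=0> mu_cl;    // 1st parameter in both models
  real<lower=0> mu_tmax;  // 2nd parameter in both models
}
generated quantities {
  real x = mu_cl + 1;
  real xx = mu_tmax + 1;
}
```

no data block is needed here because nothing in this model reads data.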

this was an oversight/lack of imagination on my part when implementing this feature - I will file an issue on CmdStan to remove the ordering constraint (however, the size constraint will remain).

filed issue: https://github.com/stan-dev/cmdstan/issues/927

So to confirm, this is a problem with cmdstan? Can you make a high level comment on what the issue is?

yes - it’s a problem with CmdStan.

currently, CmdStan takes a shortcut in that it assumes that the parameters blocks in the data fitting model and in the add’l quantities-of-interest generating model are the same:

  • parameter variables declared in the same order
  • variables have same sizes, i.e. if the size is passed in as a dimension, the size must be the same

it looks like in your initial report, the parameter blocks matched, but the sizes were different. but then you tried models where the parameters blocks weren’t the same.
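
for the initial report, a minimal sketch of a standalone prediction model that satisfies both constraints (untested, and an assumption about how to apply them): the parameters block is copied unchanged from the fitting model, and n_subjectids in the prediction data is set to the value used when fitting rather than to the number of new subjects, so the vectors have the same sizes as in the fitted_params csv. n and the covariates describe the new observations.

```stan
data {
  int<lower=0> n;             // number of new observations to predict
  int<lower=1> n_subjectids;  // must equal the value in the fitting data
  vector[n] time;
  vector[n] sex;
  vector[n] weight;
  vector[n] creatinine;
  vector[n] age;
  vector[n] D;
}
parameters {
  // identical to the fitting model: same order, constraints, and sizes
  real<lower=0> mu_cl;
  real<lower=0> s_cl;
  vector[n_subjectids] z_cl;
  real<lower=0> mu_tmax;
  real<lower=0> s_t;
  vector[n_subjectids] z_t;
  real<lower=0, upper=1> phi;
  real<lower=0, upper=1> kappa;
  vector<lower=0, upper=1>[n_subjectids] delays;
  real<lower=0> sigma;
  real mu_alpha;
  real<lower=0> s_alpha;
  vector[n_subjectids] z_alpha;
  real beta_cl_sex;
  real beta_cl_weight;
  real beta_cl_creatinine;
  real beta_cl_age;
}
generated quantities {
  // population-level prediction, as in the thread's generated quantities block
  vector<lower=0>[n] Clp = exp(mu_cl + beta_cl_sex * sex + beta_cl_weight * weight
                               + beta_cl_creatinine * creatinine + beta_cl_age * age);
  real<lower=0> tp = exp(mu_tmax);
  real<lower=0, upper=1> alphap = inv_logit(mu_alpha);
  real<lower=0> kap = log(alphap) / (tp * (alphap - 1));
  real<lower=0> kep = alphap * kap;
  vector<lower=0>[n] delayed_timep = time - 0.5 * phi;
  // scalar factor kept separate so the expression does not depend on
  // the relative precedence of * and .*
  real scalep = kep * kap / (kep - kap);
  vector<lower=0>[n] Cp = scalep * ((0.5 * D ./ Clp)
                          .* (exp(-kap * delayed_timep) - exp(-kep * delayed_timep)));
}
```

since generated quantities never indexes by subject, subjectids and yobs are dropped from the data block here.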

the fix requires adding more logic to CmdStan so that instead of just slicing out the block of columns from the fitted_params csv file, it picks and chooses which columns to use. I’m not sure how easy this is to do in Eigen - investigating.

I’ve just added the above to the CmdStan issue.

I misspoke - this isn’t a shortcut - it’s the only robust and efficient way to code this. trying to re-order and/or subset the fitted parameters to match the ordering expected by the standalone generated quantities model would be slow and inefficient. every change made to the parameters block is an opportunity for error. so perhaps requiring the parameter blocks to match is a kind of shortcut - but it’s a good one.

the use case reported here is totally valid - the solution is to rewrite the model along the lines described in the Stan User’s Guide, section 1.14 “Prediction, forecasting, and backcasting”, under “Predictions as Generated Quantities”:

  • the data passed in consists of both datasets
  • the parameter block is the same as in the data fitting model
  • the new data is used in the generated quantities block

does this outline make sense?
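
a sketch of that outline applied to this model, in the style of the User’s Guide example (untested; n_pred and the *_pred names are made up for illustration, and the array syntax is that of recent Stan releases, where older ones write int subjectids[n] as earlier in the thread). fitting and prediction happen in one program, so only model$sample is called and the predictions come back with the fit:

```stan
data {
  // original data used for fitting
  int<lower=1> n;
  int<lower=1> n_subjectids;
  array[n] int<lower=1, upper=n_subjectids> subjectids;
  vector[n] time;
  vector[n] yobs;
  vector[n] sex;
  vector[n] weight;
  vector[n] creatinine;
  vector[n] age;
  vector[n] D;
  // new data to predict for (hypothetical names)
  int<lower=0> n_pred;
  vector[n_pred] time_pred;
  vector[n_pred] sex_pred;
  vector[n_pred] weight_pred;
  vector[n_pred] creatinine_pred;
  vector[n_pred] age_pred;
  vector[n_pred] D_pred;
}
parameters {
  // unchanged from the fitting model
  real<lower=0> mu_cl;
  real<lower=0> s_cl;
  vector[n_subjectids] z_cl;
  real<lower=0> mu_tmax;
  real<lower=0> s_t;
  vector[n_subjectids] z_t;
  real<lower=0, upper=1> phi;
  real<lower=0, upper=1> kappa;
  vector<lower=0, upper=1>[n_subjectids] delays;
  real<lower=0> sigma;
  real mu_alpha;
  real<lower=0> s_alpha;
  vector[n_subjectids] z_alpha;
  real beta_cl_sex;
  real beta_cl_weight;
  real beta_cl_creatinine;
  real beta_cl_age;
}
transformed parameters {
  vector<lower=0>[n] Cl = exp(mu_cl + z_cl[subjectids] * s_cl + beta_cl_sex * sex
                              + beta_cl_weight * weight
                              + beta_cl_creatinine * creatinine + beta_cl_age * age);
  vector<lower=0>[n] t = exp(mu_tmax + z_t[subjectids] * s_t);
  vector<lower=0, upper=1>[n] alpha = inv_logit(mu_alpha + z_alpha[subjectids] * s_alpha);
  vector<lower=0>[n] ka = log(alpha) ./ (t .* (alpha - 1));
  vector<lower=0>[n] ke = alpha .* ka;
  vector<lower=0>[n] delayed_time = time - 0.5 * delays[subjectids];
  vector<lower=0>[n] C = ((0.5 * D) ./ Cl) .* ((ke .* ka) ./ (ke - ka))
                         .* (exp(-(ka .* delayed_time)) - exp(-(ke .* delayed_time)));
}
model {
  // priors and likelihood copied from the thread's fitting model
  // (beta_cl_age has no explicit prior there; left as is)
  mu_tmax ~ normal(log(3.3), 0.25);
  s_t ~ gamma(10, 100);
  z_t ~ normal(0, 1);
  mu_cl ~ normal(log(3.3), 0.15);
  s_cl ~ gamma(15, 100);
  z_cl ~ normal(0, 1);
  mu_alpha ~ normal(0, 1);
  s_alpha ~ gamma(10, 100);
  z_alpha ~ normal(0, 1);
  phi ~ beta(20, 20);
  kappa ~ beta(20, 20);
  delays ~ beta(phi / kappa, (1 - phi) / kappa);
  beta_cl_sex ~ student_t(3, 0, 2.5);
  beta_cl_weight ~ student_t(3, 0, 2.5);
  beta_cl_creatinine ~ student_t(3, 0, 2.5);
  sigma ~ lognormal(log(0.1), 0.2);
  yobs ~ lognormal(log(C), sigma);
}
generated quantities {
  // population-level predictions for the new data only
  vector<lower=0>[n_pred] Cl_pred = exp(mu_cl + beta_cl_sex * sex_pred
                                        + beta_cl_weight * weight_pred
                                        + beta_cl_creatinine * creatinine_pred
                                        + beta_cl_age * age_pred);
  real<lower=0> t_pred = exp(mu_tmax);
  real<lower=0, upper=1> alpha_pred = inv_logit(mu_alpha);
  real<lower=0> ka_pred = log(alpha_pred) / (t_pred * (alpha_pred - 1));
  real<lower=0> ke_pred = alpha_pred * ka_pred;
  vector<lower=0>[n_pred] delayed_time_pred = time_pred - 0.5 * phi;
  real scale_pred = ke_pred * ka_pred / (ke_pred - ka_pred);
  vector<lower=0>[n_pred] C_pred = scale_pred * (((0.5 * D_pred) ./ Cl_pred)
      .* (exp(-ka_pred * delayed_time_pred) - exp(-ke_pred * delayed_time_pred)));
}
```

the data list passed to the sampler would then hold both model_data and the new covariates under the *_pred names, and the parameter sizes depend only on the original n_subjectids.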

note: closed CmdStan issue; added docs issue instead: “CmdStan - generate-quantities - document how to use new data for prediction” (stan-dev/docs issue #269)
