Post-Hoc Prediction: Use samples from fitted model in a prediction task

@avehtari question for you (or others associated with Bridgestan) -

Is there a "size" limit (I am unsure if that is the best way to describe it) on the data input in the StanModel$new() call? I have combed through the BridgeStan R and C++ source code and can see where the error is thrown, but cannot figure out a way around it, if there is one.

mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = to_stan_json(standati),
                        seed = 67134)

I have a far more complex model that I can sample in Stan, but when I try some testing with BridgeStan I get the error below. I am fairly sure it is related to the "size" or precision of the input. If I increase the "digits" input in your JSON function I can see more of the data, but still not the whole JSON; the full JSON is also pasted below. These are draws from a fitted model that I am trying to push through aghq.

Error in initialize(...) : construct({
  "N": 1,
  "M": 3,
  "y1": 91.94,
  "y2": 77.33,
  "y3": 75.21,
  "y4": 76.5,
  "y5": 62.11,
  "y6": 70.99,
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "HDL_constant": 44.7564,
  "HDL_exponent": 0.599894,
  "HDL_offset": 58.3632,
  "HDL_noise_intercept": 4.57218,
  "HDL_noise_slope": 0.00113608,
  "RDL_constant": 32.1942,
  "RDL_exponent": 0.607234,
  "RDL_offset": 47.3041,
  "RDL_noise_intercept": 3.67682,
  "RDL_noise_slope": 1.54867e-05,
  "UDL_constant": 35.409,
  "UDL_exponent": 0.604277,
  "UDL_offset": 54.1493,
  "UDL_noise_intercept": 4.22831,
  "UDL_noise_slo

And the full JSON (printed with more digits):

{
  "N": 1,
  "M": 3,
  "y1": 91.94,
  "y2": 77.33,
  "y3": 75.21,
  "y4": 76.5,
  "y5": 62.11,
  "y6": 70.99,
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "HDL_constant": 44.7564,
  "HDL_exponent": 0.599894,
  "HDL_offset": 58.3632,
  "HDL_noise_intercept": 4.57218,
  "HDL_noise_slope": 0.00113608,
  "RDL_constant": 32.1942,
  "RDL_exponent": 0.607234,
  "RDL_offset": 47.3041,
  "RDL_noise_intercept": 3.67682,
  "RDL_noise_slope": 1.54867e-05,
  "UDL_constant": 35.409,
  "UDL_exponent": 0.604277,
  "UDL_offset": 54.1493,
  "UDL_noise_intercept": 4.22831,
  "UDL_noise_slope": 8.70097e-05,
  "L": [
    [1, 0, 0, 0, 0, 0],
    [0.732028, 0.675949, 0, 0, 0, 0],
    [0.652455, 0.421771, 0.606169, 0, 0, 0],
    [0.569638, 0.0885797, 0.0922823, 0.799, 0, 0],
    [0.521716, 0.239545, 0.189282, 0.361423, 0.688776, 0],
    [0.517301, 0.258034, 0.192246, 0.295687, 0.481027, 0.518619]
  ]
}

I don’t know where the error comes from, but to reduce the options, can you try whether you get the same error if you use write_stan_json()?

file <- tempfile(fileext = ".json")
cmdstanr::write_stan_json(data, file)
mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = file,
                        seed = 67134)

I can confirm it is something to do with size / precision: write_stan_json() does not work either.
However, if I drop to fewer response variables (3 instead of 6), StanModel$new() will run. Now to figure out how to do this with more variables…

{
  "N": 1,
  "M": 3,
  "y1": [91.94],
  "y2": [77.33],
  "y3": [75.21],
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "L": [
    [1, 0, 0],
    [0.732028, 0.675949, 0],
    [0.652455, 0.421771, 0.606169]
  ]
}
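As a side check, the file written by cmdstanr::write_stan_json() can be validated with a standalone JSON parser to rule out malformed output as the cause (data.json is a placeholder filename here):

```shell
# Validate the JSON file independently of BridgeStan; json.tool exits nonzero on parse errors
python3 -m json.tool data.json > /dev/null && echo "valid JSON"
```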

I guess then it’s a problem with L having too few digits, so that it doesn’t pass the check for being a valid cholesky_factor_corr. The probability of failing this check increases with higher dimension and fewer digits. Can you try with more digits?
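If to_stan_json() wraps jsonlite::toJSON(), something like the following sketch should keep full precision (digits = NA is jsonlite's maximum-precision setting; that to_stan_json() works this way is an assumption about your helper):

```r
# Serialize with maximum numeric precision so L still passes the
# cholesky_factor_corr validity check after round-tripping through JSON
json <- jsonlite::toJSON(standati, digits = NA, auto_unbox = TRUE)
```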

I can confirm it is more an issue with the JSON size as input to StanModel$new(), even if I adjust the model to take the correlation matrix instead of L. For example, if I run the line below I get a complete JSON (this is with the correlations instead of L):

data = to_stan_json(standati)

> data
{
  "N": 1,
  "M": 6,
  "y1": 91.94,
  "y2": 77.33,
  "y3": 75.21,
  "y4": 76.5,
  "y5": 62.11,
  "y6": 70.99,
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "HDL_constant": 44.7564,
  "HDL_exponent": 0.599894,
  "HDL_offset": 58.3632,
  "HDL_noise_intercept": 4.57218,
  "HDL_noise_slope": 0.00113608,
  "RDL_constant": 32.1942,
  "RDL_exponent": 0.607234,
  "RDL_offset": 47.3041,
  "RDL_noise_intercept": 3.67682,
  "RDL_noise_slope": 1.54867e-05,
  "UDL_constant": 35.409,
  "UDL_exponent": 0.604277,
  "UDL_offset": 54.1493,
  "UDL_noise_intercept": 4.22831,
  "UDL_noise_slope": 8.70097e-05,
  "corr_mat": [
    [1, 0.732028, 0.652455, 0.569638, 0.521716, 0.517301],
    [0.732028, 1, 0.775104, 0.48837, 0.555404, 0.564089],
    [0.652455, 0.775104, 1, 0.487538, 0.576895, 0.586152],
    [0.569638, 0.48837, 0.487538, 1, 0.641185, 0.590982],
    [0.521716, 0.555404, 0.576895, 0.641185, 1, 0.832786],
    [0.517301, 0.564089, 0.586152, 0.590982, 0.832786, 1]
  ]
} 

But if I run this line:

mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = to_stan_json(standati),
                        seed = 67134)

I get the same "initialize construct" error, so it must be something to do with how the data is passed in here. Again, this works if I have, say, a 3×3 matrix instead of a 6×6.

However, I have larger issues to rectify… I attach the model code below. It is a Gaussian copula based on some code from spinkney, andrjohns, and others. If I try this with fewer variables (say 3 instead of 6), I can at least get the model to initialize.

> standati
$N
[1] 1

$M
[1] 3

$y1
[1] 91.94

$y2
[1] 77.33

$y3
[1] 75.21

$FDL_constant
[1] 65.9283

$FDL_exponent
[1] 0.613656

$FDL_offset
[1] 64.7095

$FDL_noise_intercept
[1] 6.881

$FDL_noise_slope
[1] 2.62499e-05

$TDL_constant
[1] 50.7127

$TDL_exponent
[1] 0.641788

$TDL_offset
[1] 57.3204

$TDL_noise_intercept
[1] 6.36881

$TDL_noise_slope
[1] 2.65672e-05

$FBDL_constant
[1] 52.5393

$FBDL_exponent
[1] 0.626294

$FBDL_offset
[1] 52.2559

$FBDL_noise_intercept
[1] 6.7143

$FBDL_noise_slope
[1] 0.000138265

$cors
         [,1]     [,2]     [,3]
[1,] 1.000000 0.732028 0.652455
[2,] 0.732028 1.000000 0.775104
[3,] 0.652455 0.775104 1.000000

mod2bs2_lib <- compile_model("D:/stan/MDCGC_publication/stan_models/copula_age_3var.stan")

mod2bs2 <- StanModel$new(lib = mod2bs2_lib,
                        data = to_stan_json(standati),
                        seed = 67134)

However, if I try to use quadrature I get a singular-Hessian error:

  Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'forceSymmetric': Lapack routine dgesv: system is exactly singular: U[1,1] = 0

Or if I try to use slice sampling, I get an error that normal_lpdf goes to infinity. I recognize the models are a little verbose, but I can sample either one relatively successfully in cmdstanr. Now to just get this downstream task to cooperate… (Note: I recognize these 3 variables are continuous and could be fit with a MVN; I initially tried this with mixed discrete data as well, but wanted to get this running first.)

functions{
  
  // Gaussian Copula Log Probability Density
  
  real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
    int N = rows(U);
    int J = cols(U);
    matrix[J, J] Gammainv = chol2inv(L);
    return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));
  }
  
  // Prepare data for LPDF
  
  real centered_gaussian_copula_cholesky_(array[,] matrix marginals, matrix L) {
    // Extract dimensions of final outcome matrix
    int N = rows(marginals[1][1]);
    int J = rows(L);
    matrix[N, J] U;
  
    // Iterate through marginal arrays, concatenating the outcome matrices by column
    // and aggregating the log-likelihoods (from continuous marginals) and jacobian
    // adjustments (from discrete marginals)
    real adj = 0;
    int pos = 1;
    for (m in 1 : size(marginals)) {
      int curr_cols = cols(marginals[m][1]);
    
      U[ : , pos : (pos + curr_cols - 1)] = marginals[m][1];
    
      adj += sum(marginals[m][2]);
      pos += curr_cols;
    }
  
    // Return the sum of the log-probability for copula outcomes and jacobian adjustments
    return multi_normal_cholesky_copula_lpdf(U | L) + adj;
  }
  

  // Continuous Marginal Distribution (Normal)
  
  array[] matrix normal_marginal(array[] real y, array[] real mu, array[] real sigma, int N) {
    array[2] matrix[N, 1] rtn; // empty 2D array to return
    // Initialise the jacobian adjustments to zero, as vectorised lpdf will be used
    rtn[2] = rep_matrix(0, N, 1);

    for (n in 1 : N) {
      rtn[1][n, 1] = (y[n] - mu[n]) / sigma[n]; // center RV
      rtn[2][n, 1] = normal_lpdf(y[n] | mu[n], sigma[n]); // "jacobian"
    }
    return rtn;
  }

}
data{
  
  // Global Variables
  
  int N; // total number of observations (rows)
  int M; // total number of response variables (columns)
  
  // Continuous Variables
  
  array[N] real y1;
  array[N] real y2; 
  array[N] real y3;

  real FDL_constant; // mean function
  real FDL_exponent; // mean function
  real FDL_offset; // mean function 
  real FDL_noise_intercept; // sd function
  real FDL_noise_slope; // sd function
  
  real TDL_constant; // mean function
  real TDL_exponent; // mean function
  real TDL_offset; // mean function 
  real TDL_noise_intercept; // sd function
  real TDL_noise_slope; // sd function

  real FBDL_constant; // mean function
  real FBDL_exponent; // mean function
  real FBDL_offset; // mean function 
  real FBDL_noise_intercept; // sd function
  real FBDL_noise_slope; // sd function

  matrix[M,M] cors;

}
parameters{
  
  array[N] real<lower=0, upper=20> X;
  
}
transformed parameters{

  array[N] real<lower=0> FDL_mean;
  array[N] real<lower=0> FDL_noise;
  array[N] real<lower=0> TDL_mean;
  array[N] real<lower=0> TDL_noise;
  array[N] real<lower=0> FBDL_mean;
  array[N] real<lower=0> FBDL_noise;


  for(n in 1:N){
    
    FDL_mean[n] = FDL_constant*pow(X[n],FDL_exponent) + FDL_offset;
    FDL_noise[n] = FDL_noise_intercept*(1 + X[n]*FDL_noise_slope);
    TDL_mean[n] = TDL_constant*pow(X[n],TDL_exponent) + TDL_offset;
    TDL_noise[n] = TDL_noise_intercept*(1 + X[n]*TDL_noise_slope);
    FBDL_mean[n] = FBDL_constant*pow(X[n],FBDL_exponent) + FBDL_offset;
    FBDL_noise[n] = FBDL_noise_intercept*(1 + X[n]*FBDL_noise_slope);

   }

}
model{

  X ~ normal(0,7);
  
  target += centered_gaussian_copula_cholesky_(
    {normal_marginal(y1, FDL_mean, FDL_noise, N),
     normal_marginal(y2, TDL_mean, TDL_noise, N),
     normal_marginal(y3, FBDL_mean, FBDL_noise, N)}, cholesky_decompose(cors));

}

Strange, because that is not a big JSON file.

ping @WardBrian

Could I get some copy-pastable code to try this locally? I can’t seem to tease apart which data goes with which piece of Stan code in this thread.

Oh, this is probably the old “R truncates your error messages at an arbitrary length and doesn’t warn you!” issue.

If you try options(warning.length = 8000L), then the above code will actually print the error message to the end, which will tell you that the JSON you’re supplying has some problems (like the name cors vs. corr_mat).
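For example, rerunning the original call after raising the limit (the default is 1000 bytes; 8000 is close to R's maximum of 8170):

```r
# Raise R's truncation limit for warning/error messages, then rerun
# so the full JSON parse error is visible
options(warning.length = 8000L)
mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = to_stan_json(standati),
                        seed = 67134)
```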


Well, I feel like an idiot!
When I created the more complex (i.e., 6-variable) model, I added an extra space into the data names… It was indeed a truncated warning. Sorry for all the confusion.
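For future reference, a quick sketch of a sanity check on the list that feeds the JSON serializer; the regex and trimws() here are just one way to catch stray whitespace in the names:

```r
# Flag any data names containing whitespace (the culprit here),
# then strip leading/trailing whitespace outright
names(standati)[grepl("\\s", names(standati))]
names(standati) <- trimws(names(standati))
```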

Now if I can figure out how to get the log density and associated gradients and Hessians to cooperate in the downstream quadrature or slice-sampling analyses…


We could probably do a better job in BridgeStan. We can’t control R’s odd behavior, but we could try to detect it and give a warning.


Thanks for the note @cwolfe. @WardBrian is fixing this upstream to make the error message less confusing:


You all work fast. Thanks @Bob_Carpenter and @WardBrian!
