Post-Hoc Prediction: Use samples from fitted model in a prediction task

@avehtari question for you (or others associated with Bridgestan) -

Is there a "size" limit (I am unsure if that is the best way to describe it) on the data input in the StanModel$new() call? I have combed through the BridgeStan R and C++ source code and can see where the error is thrown, but cannot figure out a way around it, if there is one.

mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = to_stan_json(standati),
                        seed = 67134)

I have a far more complex model that I can sample in Stan, but when I try some testing with BridgeStan I get the error below. I am fairly sure it is related to the "size" or precision of the input. If I increase the "digits" input in your JSON function I can see more of the data, but still not the whole JSON; the full JSON is also pasted below. These are draws from a fitted model that I am trying to push through aghq.

Error in initialize(...) : construct({
  "N": 1,
  "M": 3,
  "y1": 91.94,
  "y2": 77.33,
  "y3": 75.21,
  "y4": 76.5,
  "y5": 62.11,
  "y6": 70.99,
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "HDL_constant": 44.7564,
  "HDL_exponent": 0.599894,
  "HDL_offset": 58.3632,
  "HDL_noise_intercept": 4.57218,
  "HDL_noise_slope": 0.00113608,
  "RDL_constant": 32.1942,
  "RDL_exponent": 0.607234,
  "RDL_offset": 47.3041,
  "RDL_noise_intercept": 3.67682,
  "RDL_noise_slope": 1.54867e-05,
  "UDL_constant": 35.409,
  "UDL_exponent": 0.604277,
  "UDL_offset": 54.1493,
  "UDL_noise_intercept": 4.22831,
  "UDL_noise_slo

And the full JSON (printed with more digits):

{
  "N": 1,
  "M": 3,
  "y1": 91.94,
  "y2": 77.33,
  "y3": 75.21,
  "y4": 76.5,
  "y5": 62.11,
  "y6": 70.99,
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "HDL_constant": 44.7564,
  "HDL_exponent": 0.599894,
  "HDL_offset": 58.3632,
  "HDL_noise_intercept": 4.57218,
  "HDL_noise_slope": 0.00113608,
  "RDL_constant": 32.1942,
  "RDL_exponent": 0.607234,
  "RDL_offset": 47.3041,
  "RDL_noise_intercept": 3.67682,
  "RDL_noise_slope": 1.54867e-05,
  "UDL_constant": 35.409,
  "UDL_exponent": 0.604277,
  "UDL_offset": 54.1493,
  "UDL_noise_intercept": 4.22831,
  "UDL_noise_slope": 8.70097e-05,
  "L": [
    [1, 0, 0, 0, 0, 0],
    [0.732028, 0.675949, 0, 0, 0, 0],
    [0.652455, 0.421771, 0.606169, 0, 0, 0],
    [0.569638, 0.0885797, 0.0922823, 0.799, 0, 0],
    [0.521716, 0.239545, 0.189282, 0.361423, 0.688776, 0],
    [0.517301, 0.258034, 0.192246, 0.295687, 0.481027, 0.518619]
  ]
}

I don’t know where the error comes from, but to reduce the options, can you try whether you get the same error if you use write_stan_json()?

file <- tempfile(fileext = ".json")
cmdstanr::write_stan_json(data, file)
mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = file,
                        seed = 67134)

I can confirm it is something to do with size / precision: write_stan_json() does not work either.
However, if I drop to fewer response variables (3 instead of 6), StanModel$new() will run. Now to figure out how to do this with more variables…

{
  "N": 1,
  "M": 3,
  "y1": [91.94],
  "y2": [77.33],
  "y3": [75.21],
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "L": [
    [1, 0, 0],
    [0.732028, 0.675949, 0],
    [0.652455, 0.421771, 0.606169]
  ]
}
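As a side check, the file written by cmdstanr::write_stan_json() can be validated with a standalone JSON parser to rule out malformed output as the cause (data.json is a placeholder filename here):

```shell
# Validate the JSON file independently of BridgeStan; json.tool exits nonzero on parse errors
python3 -m json.tool data.json > /dev/null && echo "valid JSON"
```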

I guess then it’s a problem with L having too few digits, so that it doesn’t pass the check for being a valid cholesky_factor_corr. The probability of failing this check increases with higher dimension and fewer digits. Can you try with more digits?
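If to_stan_json() wraps jsonlite::toJSON(), something like the following sketch should keep full precision (digits = NA is jsonlite's maximum-precision setting; that to_stan_json() works this way is an assumption about your helper):

```r
# Serialize with maximum numeric precision so L still passes the
# cholesky_factor_corr validity check after round-tripping through JSON
json <- jsonlite::toJSON(standati, digits = NA, auto_unbox = TRUE)
```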

I can confirm it is more an issue with the JSON size as input to StanModel$new(), even if I adjust the model to take the correlation matrix instead of L. For example, if I run the line below I get a complete JSON (this is with the correlations instead of L):

data = to_stan_json(standati)

> data
{
  "N": 1,
  "M": 6,
  "y1": 91.94,
  "y2": 77.33,
  "y3": 75.21,
  "y4": 76.5,
  "y5": 62.11,
  "y6": 70.99,
  "FDL_constant": 65.9283,
  "FDL_exponent": 0.613656,
  "FDL_offset": 64.7095,
  "FDL_noise_intercept": 6.881,
  "FDL_noise_slope": 2.62499e-05,
  "TDL_constant": 50.7127,
  "TDL_exponent": 0.641788,
  "TDL_offset": 57.3204,
  "TDL_noise_intercept": 6.36881,
  "TDL_noise_slope": 2.65672e-05,
  "FBDL_constant": 52.5393,
  "FBDL_exponent": 0.626294,
  "FBDL_offset": 52.2559,
  "FBDL_noise_intercept": 6.7143,
  "FBDL_noise_slope": 0.000138265,
  "HDL_constant": 44.7564,
  "HDL_exponent": 0.599894,
  "HDL_offset": 58.3632,
  "HDL_noise_intercept": 4.57218,
  "HDL_noise_slope": 0.00113608,
  "RDL_constant": 32.1942,
  "RDL_exponent": 0.607234,
  "RDL_offset": 47.3041,
  "RDL_noise_intercept": 3.67682,
  "RDL_noise_slope": 1.54867e-05,
  "UDL_constant": 35.409,
  "UDL_exponent": 0.604277,
  "UDL_offset": 54.1493,
  "UDL_noise_intercept": 4.22831,
  "UDL_noise_slope": 8.70097e-05,
  "corr_mat": [
    [1, 0.732028, 0.652455, 0.569638, 0.521716, 0.517301],
    [0.732028, 1, 0.775104, 0.48837, 0.555404, 0.564089],
    [0.652455, 0.775104, 1, 0.487538, 0.576895, 0.586152],
    [0.569638, 0.48837, 0.487538, 1, 0.641185, 0.590982],
    [0.521716, 0.555404, 0.576895, 0.641185, 1, 0.832786],
    [0.517301, 0.564089, 0.586152, 0.590982, 0.832786, 1]
  ]
} 

But if I run this line:

mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = to_stan_json(standati),
                        seed = 67134)

I get the same "initialize construct" error, so it must be something to do with how the data is passed in here. Again, this works if I have, say, a 3×3 matrix instead of a 6×6.

However, I have larger issues to rectify… I attach the model code below. It is a Gaussian copula based on some code from spinkney, andrjohns, and others. If I try this with fewer variables (say 3 instead of 6), I can at least get the model to initialize.

> standati
$N
[1] 1

$M
[1] 3

$y1
[1] 91.94

$y2
[1] 77.33

$y3
[1] 75.21

$FDL_constant
[1] 65.9283

$FDL_exponent
[1] 0.613656

$FDL_offset
[1] 64.7095

$FDL_noise_intercept
[1] 6.881

$FDL_noise_slope
[1] 2.62499e-05

$TDL_constant
[1] 50.7127

$TDL_exponent
[1] 0.641788

$TDL_offset
[1] 57.3204

$TDL_noise_intercept
[1] 6.36881

$TDL_noise_slope
[1] 2.65672e-05

$FBDL_constant
[1] 52.5393

$FBDL_exponent
[1] 0.626294

$FBDL_offset
[1] 52.2559

$FBDL_noise_intercept
[1] 6.7143

$FBDL_noise_slope
[1] 0.000138265

$cors
         [,1]     [,2]     [,3]
[1,] 1.000000 0.732028 0.652455
[2,] 0.732028 1.000000 0.775104
[3,] 0.652455 0.775104 1.000000

mod2bs2_lib <- compile_model("D:/stan/MDCGC_publication/stan_models/copula_age_3var.stan")

mod2bs2 <- StanModel$new(lib = mod2bs2_lib,
                        data = to_stan_json(standati),
                        seed = 67134)

However, if I try to use quadrature I get a singular-Hessian error:

  Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'forceSymmetric': Lapack routine dgesv: system is exactly singular: U[1,1] = 0

Or if I try to use slice sampling, I get an error that normal_lpdf goes to infinity. I recognize the models are a little verbose, but I can sample either one relatively successfully in cmdstanr. Now to just get this downstream task to cooperate… (Note: I recognize these 3 variables are continuous and could be fit with a MVN; I initially tried this with mixed discrete data as well, but wanted to get this running first.)

functions{
  
  // Gaussian Copula Log Probability Density
  
  real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
    int N = rows(U);
    int J = cols(U);
    matrix[J, J] Gammainv = chol2inv(L);
    return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));
  }
  
  // Prepare data for LPDF
  
  real centered_gaussian_copula_cholesky_(array[,] matrix marginals, matrix L) {
    // Extract dimensions of final outcome matrix
    int N = rows(marginals[1][1]);
    int J = rows(L);
    matrix[N, J] U;
  
    // Iterate through marginal arrays, concatenating the outcome matrices by column
    // and aggregating the log-likelihoods (from continuous marginals) and jacobian
    // adjustments (from discrete marginals)
    real adj = 0;
    int pos = 1;
    for (m in 1 : size(marginals)) {
      int curr_cols = cols(marginals[m][1]);
    
      U[ : , pos : (pos + curr_cols - 1)] = marginals[m][1];
    
      adj += sum(marginals[m][2]);
      pos += curr_cols;
    }
  
    // Return the sum of the log-probability for copula outcomes and jacobian adjustments
    return multi_normal_cholesky_copula_lpdf(U | L) + adj;
  }
  

  // Continuous Marginal Distribution (Normal)
  
  array[] matrix normal_marginal(array[] real y, array[] real mu, array[] real sigma, int N) {
    array[2] matrix[N, 1] rtn; // empty 2D array to return
    // Initialise the jacobian adjustments to zero, as vectorised lpdf will be used
    rtn[2] = rep_matrix(0, N, 1);

    for (n in 1 : N) {
      rtn[1][n, 1] = (y[n] - mu[n]) / sigma[n]; // center RV
      rtn[2][n, 1] = normal_lpdf(y[n] | mu[n], sigma[n]); // "jacobian"
    }
    return rtn;
  }

}
data{
  
  // Global Variables
  
  int N; // total number of observations (rows)
  int M; // total number of response variables (columns)
  
  // Continuous Variables
  
  array[N] real y1;
  array[N] real y2; 
  array[N] real y3;

  real FDL_constant; // mean function
  real FDL_exponent; // mean function
  real FDL_offset; // mean function 
  real FDL_noise_intercept; // sd function
  real FDL_noise_slope; // sd function
  
  real TDL_constant; // mean function
  real TDL_exponent; // mean function
  real TDL_offset; // mean function 
  real TDL_noise_intercept; // sd function
  real TDL_noise_slope; // sd function

  real FBDL_constant; // mean function
  real FBDL_exponent; // mean function
  real FBDL_offset; // mean function 
  real FBDL_noise_intercept; // sd function
  real FBDL_noise_slope; // sd function

  matrix[M,M] cors;

}
parameters{
  
  array[N] real<lower=0, upper=20> X;
  
}
transformed parameters{

  array[N] real<lower=0> FDL_mean;
  array[N] real<lower=0> FDL_noise;
  array[N] real<lower=0> TDL_mean;
  array[N] real<lower=0> TDL_noise;
  array[N] real<lower=0> FBDL_mean;
  array[N] real<lower=0> FBDL_noise;


  for(n in 1:N){
    
    FDL_mean[n] = FDL_constant*pow(X[n],FDL_exponent) + FDL_offset;
    FDL_noise[n] = FDL_noise_intercept*(1 + X[n]*FDL_noise_slope);
    TDL_mean[n] = TDL_constant*pow(X[n],TDL_exponent) + TDL_offset;
    TDL_noise[n] = TDL_noise_intercept*(1 + X[n]*TDL_noise_slope);
    FBDL_mean[n] = FBDL_constant*pow(X[n],FBDL_exponent) + FBDL_offset;
    FBDL_noise[n] = FBDL_noise_intercept*(1 + X[n]*FBDL_noise_slope);

   }

}
model{

  X ~ normal(0,7);
  
  target += centered_gaussian_copula_cholesky_(
    {normal_marginal(y1, FDL_mean, FDL_noise, N),
     normal_marginal(y2, TDL_mean, TDL_noise, N),
     normal_marginal(y3, FBDL_mean, FBDL_noise, N)}, cholesky_decompose(cors));

}

Strange, because that is not a big JSON file.

ping @WardBrian

Could I get some copy-pastable code to try this locally? I can’t seem to tease apart which data goes with which piece of Stan code in this thread.

Oh, this is probably the old “R truncates your error messages at an arbitrary length and doesn’t warn you!” issue.

If you try options(warning.length = 8000L), then the above code will actually print the error message to the end, which will tell you that the JSON you’re supplying has some problems (like the name cors vs. corr_mat).
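For example, rerunning the original call after raising the limit (the default is 1000 bytes; 8000 is close to R's maximum of 8170):

```r
# Raise R's truncation limit for warning/error messages, then rerun
# so the full JSON parse error is visible
options(warning.length = 8000L)
mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = to_stan_json(standati),
                        seed = 67134)
```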


Well, I feel like an idiot!
When I created the more complex (i.e., 6-variable) model, I added an extra space into the data names… It was indeed a truncated warning. Sorry for all the confusion.
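For future reference, a quick sketch of a sanity check on the list that feeds the JSON serializer; the regex and trimws() here are just one way to catch stray whitespace in the names:

```r
# Flag any data names containing whitespace (the culprit here),
# then strip leading/trailing whitespace outright
names(standati)[grepl("\\s", names(standati))]
names(standati) <- trimws(names(standati))
```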

Now if I can figure out how to get the log density and associated gradients and Hessians to cooperate in the downstream quadrature or slice-sampling analyses…


We could probably do a better job in BridgeStan. We can’t control R’s odd behavior, but we could try to detect it and give a warning.


Thanks for the note @cwolfe. @WardBrian is fixing this upstream to make the error message less confusing:


You all work fast. Thanks @Bob_Carpenter and @WardBrian!
