Post-Hoc Prediction: Use samples from fitted model in a prediction task

Is it possible to use previously fit posterior samples in a downstream prediction task?

Let’s say I fit a multivariate normal model to K continuous response variables. Each response y_k is conditional on an observed continuous predictor x. The relationship between y and x can be expressed as follows:

  1. \mu(x) = a*x^r + b
  2. \sigma(x) = \kappa_1*(1+x*\kappa_2)

I fit the model with somewhat reasonable priors (see the code below for an example) and everything looks good. Now suppose I return later with a new dataset from a similar context. Here, y is observed, but x is unobserved. I want to predict x given y (and the parameters \theta). This also means I want to put a prior on x in this new scenario.

Is there a way, using cmdstanr's generate_quantities() method, to complete this task? Can I use samples from one model to inform this prediction task? Or do I need to do this in R, for instance? I think where I am stuck is placing this prior on x. It's easy enough to do posterior inference with previous samples, but I am unsure how to incorporate this prior information into the secondary prediction task. On one hand, this task could be seen as solving a system of equations, although the higher K is, the more difficult that may be in, say, optim() (nor is it Bayesian).

TLDR - I am trying to do Bayesian age estimation using samples from a model previously fit where x was observed.

Hope this makes sense and thank you for the help.

data{
  int N; // # of individuals
  int K; // # of responses
  array[N] vector[K] y; // vector of growth responses per individual
  array[N] real X; // age
}
parameters{
  vector<lower=0>[K] a; // multiplicative constant
  vector<lower=0>[K] r; // scaling parameter
  vector[K] b; // offset
  vector<lower=0>[K] s_scale; //constant noise
  vector<lower=0>[K] kappa; // slope/gradient of linear noise function
  cholesky_factor_corr[K] L_Omega; // cholesky factor the corr matrix for efficiency
}
transformed parameters{
  array[N] vector[K] mu; // mean array of size N containing vectors of length K
  array[N] vector[K] s; // sd array of size N containing vectors of length K
  array[N] matrix[K,K] Sigma; // array of N Cholesky factors of the K x K covariance matrices
  for(i in 1:N){
    for(k in 1:K){
      mu[i,k] = a[k]*X[i]^(r[k])+b[k]; // mean function
      s[i,k] = s_scale[k]*(1 + kappa[k]*X[i]); // sd function
      }
  }
  for(i in 1:N){
    Sigma[i] = diag_pre_multiply(s[i], L_Omega); // combining the cholesky corr matrix with the noise
  }
}
model{
  //priors
  a ~ normal(0,10); 
  r ~ normal(0,1); 
  b ~ normal(0,10);
  kappa ~ normal(0,1); 
  s_scale ~ exponential(2); 
  L_Omega ~ lkj_corr_cholesky(2);

  for(i in 1:N){
    y[i] ~ multi_normal_cholesky(mu[i], Sigma[i]); // likelihood
  }
}


Yes, you can run what we call standalone generated quantities.

It looks different in all of our interfaces.

I don’t see a y in what follows. There’s a, b, x^r, kappa_1, x, and kappa_2, but no y.

Why do you want to place a prior on a continuous predictor? Do you mean a covariate? (It's confusing because sometimes people take "linear predictor" to mean the coefficients times the covariate.)

I’m guessing from this post that what you really want to do is fit sequentially. Let’s say you have a model with likelihood p(y \mid \theta) and prior p(\theta) and now you observe data y_1 . This gives you a posterior p(\theta \mid y_1) that maybe you want to use as a prior with the same likelihood when fitting the data y_2 . That is, you look at p(y_2 \mid \theta) \cdot p(\theta \mid y_1), the posterior for which is equivalent to p(\theta \mid y_1, y_2), which is what you’d get if you just fit y_1 and y_2 at the same time.

There’s no general way to do this because we can’t specify p(\theta \mid y_1) as a prior as it doesn’t have an analytic form. People sometimes try to approximate the posterior p(\theta \mid y_1), say with a multivariate normal approximation.
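For example, a rough sketch of that approximation in R (fit1 and the parameter names are placeholders, and this ignores any constraints on the parameters, which you'd want to handle by working on the unconstrained scale):

library(posterior)
# draws of the parameters from the first fit (fit1 is a placeholder)
draws <- as_draws_matrix(fit1$draws(variables = c("a", "r", "b")))
prior_mean <- colMeans(draws)   # location of the multivariate normal approximation
prior_cov  <- cov(draws)        # covariance of the multivariate normal approximation
# then pass prior_mean and prior_cov as data to the second Stan program and use
#   theta ~ multi_normal(prior_mean, prior_cov);
# as the (approximate) prior when fitting y_2.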

@Bob_Carpenter thank you for the response.

I should be more precise. That is my fault…

Let \mathbf{Y} be a multivariate response vector of dimension K such that \mathbf{Y} \sim MVN(\mathbf{\mu}, \Sigma). Here, \Sigma is decomposed into the noise \sigma and correlation (cholesky) component. Both \mu and \sigma are conditional on the relationship between an observed value x and unknown parameters that we find posterior samples for.

Essentially, y is some measurement and x is a continuous predictor (independent variable) of this measurement that shares a proportional relationship akin to a power law (\mu), with heteroskedastic noise (\sigma). I fit the model and return samples of all unknown (random) variables. This includes the parameters of the mean and noise functions and the covariance structure.

Now, let's say I have a held-out test point where I have again observed several Y values. But this time, I do not know x. I want to predict x given the observed y and the parameters I have previously learned. In biological anthropology this is an age estimation problem, where y is some trait and x is age. We want p(x \mid y, \theta).

I have used cmdstanr's generated quantities to predict missing y with conditional correlations. But here I want x. What's more, I want to use prior information on x. I am attempting to perform Bayesian age estimation. A lot of work usually finds \theta via MLE and then combines those values with some prior on x to get a 'Bayesian' age estimate. Instead of this, I want to go fully Bayesian, such that \theta are posterior samples from the previously fit model.

  • If you initially make only the model p(y|x) and x is univariate, you could sample from p(x|y) in generated quantities by using inverse cdf sampling (for each parameter posterior draw from the earlier sampling).
  • If you initially make only the model p(y|x) and x is multivariate, you would probably need to write another Stan model where x is a multivariate parameter, and you would sample x for a bunch of parameter posterior draws (from the earlier sampling), in a similar way to how multiple imputation is usually handled. This other model may be simpler than your original model if, conditional on the parameters, some parts of the model carry no information about x.
  • You could initially build a model p(y, x) = p(y|x)p(x), that is, include the potentially missing x, and then run the inference for both theta and x at the same time (not using the draws from the earlier sampling).

@avehtari

  • If you initially make only the model p(y|x) and x is univariate, you could sample from p(x|y) in generated quantities by using inverse cdf sampling (for each parameter posterior draw from the earlier sampling).

This is exactly what I want to do. y is a vector of continuous response variables and x is a univariate continuous value (chronological age). I want the posterior probability p(x | y, theta). This should be proportional to the likelihood p(y | x, theta) times the prior. The initial model I fit is p(y | theta, x) where x is known. This model describes the relationship between x, y, and the associated parameters. Now I'd like to use this information to predict x when it is unknown (this assumes the missing x follows the same relationships as in the previous model). I am trying to conduct Bayesian chronological age estimation.

using inverse cdf sampling (for each parameter posterior draw from the earlier sampling).

Do you have an example of how this may be coded up? Let's assume I have 1 individual with 5 response variables (K = 5) but x is missing. I want to predict x given the responses and the associated posterior samples from the previous model.

  • You could initially build a model p(y, x) = p(y|x)p(x), that is include potentially missing x, and then run the inference for both theta and x at the same time (not using the draws from earlier sampling)

This I have done. I essentially treat x as a missing variable, throw it in the parameters block, apply an appropriate prior, and run the inference. However, this can be computationally inefficient in higher dimensions of y. Further, we generally only have, say, 1 missing value of x, so N = 1. We want to use a previously trained model to predict this 1 value. I suppose I could fit a model each time, but I feel as though in that case the prior would swamp the likelihood. This would make more sense if we assume x is missing across a larger dataset. I think?

No. And thinking a bit more, it would be a bit cumbersome, with repeated computation, in Stan generated quantities. If you really want to do it in generated quantities, you could do it by first computing the normalization term for p(y | x, theta) p(x) using integrate_1d, and then doing the inverse CDF with a binary search and repeated calls to integrate_1d. Computationally much more efficient would be slice sampling (which you could also implement in Stan, but probably don't want to), for which you could use some ready-made implementation, e.g. in R, and get fast log-density evaluations with bridgestan.
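If you instead pull things into R, the same inverse-CDF idea is only a few lines. A rough sketch, where log_post_unnorm(x) is a hypothetical function returning log p(y | x, theta) + log p(x) for one posterior draw of theta, and x is assumed to live on (0, 21):

# unnormalized posterior density of x (subtract a constant inside exp() if underflow is an issue)
post_unnorm <- function(x) exp(log_post_unnorm(x))
# normalizing constant
Z <- integrate(Vectorize(post_unnorm), lower = 0, upper = 21)$value
# normalized CDF
cdf <- function(x) integrate(Vectorize(post_unnorm), lower = 0, upper = x)$value / Z
# one inverse-CDF draw; uniroot() does the root finding in place of a hand-rolled binary search
u <- runif(1)
x_draw <- uniroot(function(x) cdf(x) - u, interval = c(0, 21))$root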

Thank you for all of this. I am still trying to wrap my head around how exactly I can complete this task… In theory, the idea of inverse transform sampling is most parsimonious, although doing this conditional upon information from y and theta is throwing me off.

You are right - the more I think about this problem, the more it makes sense to pull the samples into R and complete the posterior inference (prediction) that way. I am not necessarily married to doing it in generated quantities. The question is - how…

I am beginning to see now why MUCH of the research surrounding ‘Bayesian’ age estimation in forensic / biological anthropology uses MLE to find the parameters of the likelihood (e.g., parameters of the mean and sd/covariance) and then combines this information with p(x) to estimate the posterior age p(x | theta, y).

Weirdly enough, the joint density p(y, x) is ‘easy’ to sample from. It is combining the information from p(y | x) to predict p(x | y) that is killing me. I’ve toyed with the idea of treating it as a root-finding problem, such that I have a system of equations and I solve for x. I would do this M times, where M is the number of posterior samples. But this can become cumbersome and especially inefficient in higher dimensions of y. Further, it may give me an overly precise estimate of x without a proper accounting of the uncertainty (related to the standard deviation / covariance structure).

If you can make a simplified example with

  • data or data generation in R
  • Stan model which has x and y as data and the parameters as parameters, and check that you can sample from the posterior
  • Stan model which has the parameters and y as data and x as a parameter

Then I can show how to do slice sampling with bridgestan and some other R package implementing slice sampling. Including bridgestan preparation and iteration over the posterior draws of parameters, it will be about 10 lines of code.


Thank you for this.

I attach 2 .stan files and an R script to simulate data and run the analysis. Note, because I am working in multivariate space, I am using a very general MVN model. If easier, I can send along a univariate example, but given my use case I started with this. Note that there is an inherent inefficiency here: there are n covariance matrices, because the standard deviation is a function of x.

The simulation uses cmdstanr and additional packages (randcorr and MASS).

Let \mathbf{y} be a multivariate response variable of K = 3 dimensions with n = 500 simulated samples - y is therefore a [500,3] matrix. y \sim MVN(\mu(x), \Sigma(x)) where elements of \mu and \sigma are conditional on a continuous predictor x and associated parameters \theta.

The file “multi_sim.stan” samples the log probability such that x and y are data. The file “multi_sim_missx.stan” samples the log probability such that the parameters and y are data and x is a parameter. In both instances I can recover the ‘true’ parameter values.
simulation_test_missingx.R (2.0 KB)
multi_sim_missx.stan (996 Bytes)
multi_sim.stan (1.4 KB)

Great. I’m busy this week, but I’ll get back to you.

Aki - that is totally okay. This conversation alone is helpful and I look forward to the additional help next week.

Relatedly, I have been thinking of additional ways to possibly complete this task and I thought I’d share here. The truth of the matter is I have simplified the problem to hopefully scale back up once I get a grasp on the best way to complete the task.

That said - I am working with a Gaussian copula. The lpdf (as defined by @spinkney here) that I will eventually work towards is the following, where U is an N x K matrix of marginal calculations and L is the Cholesky factor of the correlation matrix. I can implement this function outside of Stan, in R, for ease.

real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
  int N = rows(U);
  int J = cols(U);
  matrix[J, J] Gammainv = chol2inv(L);
  return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));
}
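
For reference, a direct (untested) R translation of that lpdf would be something like this, where U is the same N x K matrix of marginal calculations and L the same Cholesky factor as in the Stan version:

multi_normal_cholesky_copula_lpdf <- function(U, L) {
  # U: N x K matrix of marginal calculations (as in the Stan version)
  # L: lower-triangular Cholesky factor of the K x K correlation matrix
  N <- nrow(U)
  K <- ncol(U)
  Gammainv <- chol2inv(t(L))  # (L %*% t(L))^-1; R's chol2inv() expects the upper-triangular factor
  -N * sum(log(diag(L))) - 0.5 * sum((Gammainv - diag(K)) * crossprod(U))
}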

Maybe we do what you suggested early on about taking it out of Stan. I am thinking a procedure similar to the following pseudocode:

  1. Generate a sequence of possible x values
x <- seq(0, 21, 0.1)
  2. Loop through M posterior samples from the model for all x values
for(n in 1:length(x)){
  for(m in 1:length(samples)){
    # likelihood of y at age x[n] under posterior sample m
  }
}
  3. The above should give an m by n matrix where each value is the likelihood of a presumed x value under one posterior sample from the model. Given that we are on the log scale, I should be able to sum each column to give a total likelihood per value of x given the posterior samples. This means the likelihood vector and the sequence of x values have equal length.
lik <- colSums(likelihood)
  4. Now I want a prior on x. Let's assume it's uniform for ease
prior_x <- dunif(x, min, max, log = TRUE)
  5. The joint density of theta (and y) and x should then be:
joint <- lik + prior_x
  6. We can then put this on the probability scale:
post_prob <- exp(joint)
  7. Then we want to normalize to ensure the posterior integrates to 1:
norm_post <- post_prob / sum(post_prob) / 0.1 # 0.1 is the spacing of x

Perhaps I am crazy. But at least superficially this incorporates the posterior samples from a previously fit model, finds the likelihood at a given range of x values, combines this information with some prior on x, and possibly outputs the posterior probability of x given y and theta.
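
Putting that together, one way the sketch could look as runnable R is below. log_lik_fun(x, m) is a hypothetical helper returning the MVN log-likelihood of the observed y at candidate age x under the m-th posterior draw of theta; here I normalize within each draw and then average the per-draw posteriors over draws (rather than summing log-likelihoods across draws), so the M draws act as a Monte Carlo integration over theta:

x_grid <- seq(0, 21, by = 0.1)
dx <- 0.1
M <- 100                                 # number of posterior draws to use
log_prior <- dunif(x_grid, 0, 21, log = TRUE)
post <- matrix(NA_real_, nrow = M, ncol = length(x_grid))
for (m in 1:M) {
  log_joint <- sapply(x_grid, log_lik_fun, m = m) + log_prior
  w <- exp(log_joint - max(log_joint))   # stabilize before exponentiating
  post[m, ] <- w / (sum(w) * dx)         # normalized posterior for draw m
}
norm_post <- colMeans(post)              # average over the posterior draws
x_mean <- sum(x_grid * norm_post) * dx   # posterior mean of x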

Regardless, I look forward to your help as well, which may make a whole heck of a lot more sense than this messy pseudo-code.


This works, too. Slice sampling would be more efficient, but what you propose is simpler.

Here’s code using the bridgestan package and adaptive Gauss-Hermite quadrature from the aghq package. It computes the posterior mean in 0.005 s for one X given one set of parameter values (for simplicity, I used the true values, as you did when sampling X). For n = 500 it takes 2.5 s; multiply that by, e.g., 100 for averaging over 100 posterior draws. I guess your actual model and data are more complex. In the code I compute the CDF for one X and the posterior mean for all X's, but you can compute moments and quantiles of arbitrary functions based on your needs.

# Additional libraries
library(posterior)
library(bridgestan)
library(aghq)
library(ggplot2)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(tictoc)

# bridgestan compilation
mod2bs_lib <- compile_model("multi_sim_missx.stan")

# make json literal from Stan data list
to_stan_json <- function(data, always_decimal = FALSE) {
  if (!is.list(data)) {
    stop("'data' must be a list.", call. = FALSE)
  }

  data_names <- names(data)
  if (length(data) > 0 &&
      (length(data_names) == 0 ||
       length(data_names) != sum(nzchar(data_names)))) {
    stop("All elements in 'data' list must have names.", call. = FALSE)

  }
  if (anyDuplicated(data_names) != 0) {
    stop("Duplicate names not allowed in 'data'.", call. = FALSE)
  }

  for (var_name in data_names) {
    var <- data[[var_name]]
    if (!(is.numeric(var) || is.factor(var) || is.logical(var) ||
          is.data.frame(var) || is.list(var))) {
      stop("Variable '", var_name, "' is of invalid type.", call. = FALSE)
    }
    if (anyNA(var)) {
      stop("Variable '", var_name, "' has NA values.", call. = FALSE)
    }

    if (is.table(var)) {
      var <- unclass(var)
    } else if (is.logical(var)) {
      mode(var) <- "integer"
    } else if (is.data.frame(var)) {
      var <- data.matrix(var)
    } else if (is.list(var)) {
      var <- list_to_array(var, var_name)
    }
    data[[var_name]] <- var
  }

  # unboxing variables (N = 10 is stored as N : 10, not N: [10])
  jsonlite::toJSON(
    data,
    auto_unbox = TRUE,
    factor = "integer",
    always_decimal = always_decimal,
    digits = NA,
    pretty = TRUE
  )
}

# Test with first observation
i=1
standati <- list(
  N = 1,
  K = 3,
  y = matrix(c(y1[i],y2[i],y3[i]), nrow=1, ncol=3),
  a = c(a1, a2, a3),
  r = c(r1, r2, r3),
  b = c(b1, b2, b3),
  n1 = c(n1_1, n1_2, n1_3),
  n2 = c(n2_1, n2_2, n2_3),
  cors = randcors
)
# instantiate with data
mod2bs <- StanModel$new(lib = mod2bs_lib,
                        data = to_stan_json(standati),
                        seed = 67134)
# list of density, gradient and hessian functions
ffs <- list(fn=mod2bs$log_density,
            gr=\(x) {mod2bs$log_density_gradient(x)$gradient},
            he=\(x) {mod2bs$log_density_hessian(x)$hessian})

# adaptive Gauss-Hermite quadrature in unconstrained space
aghqi <- aghq(ffs,k=11,startingvalue=0)
# quantiles 1%,...,99% in unconstrained space
q <- seq(0.01,0.99,by=.02)
qt <- compute_quantiles(aghqi, q = q)[[1]]
# quantiles 1%,...,99% in constrained space
qx <- sapply(qt, mod2bs$param_constrain)
# plot CDF for X[1]
data.frame(qx=qx, q=q) |>
  ggplot(aes(x=qx, y=q)) +
  geom_line() +
  labs(x='X', y='CDF')

# compute posterior mean for all X
xmean <- array(dim=n)
tic()
for (i in 1:n) {
  standati$y<-matrix(c(y1[i],y2[i],y3[i]), nrow=1, ncol=3)
  suppressWarnings(mod2bs <- StanModel$new(lib = mod2bs_lib,
                                           data = to_stan_json(standati),
                                           seed = 1))
  ffs <- list(fn=mod2bs$log_density,
            gr=\(x) {mod2bs$log_density_gradient(x)$gradient},
            he=\(x) {mod2bs$log_density_hessian(x)$hessian})
  aghqi <- aghq(ffs,k=11,startingvalue=2)
  # compute posterior mean in constrained space
  xmean[i] <- compute_moment(aghqi, mod2bs$param_constrain, method="correct")
}
toc()

# compare to Stan posterior mean
data.frame(x=x, xpred=xmean) |>
  ggplot(aes(x=x_sum$mean, y=xpred)) +
  geom_point() +
  geom_abline() +
  labs(x="X posterior mean via Stan", y="X posterior mean via aghq")

Aki - Thank you for this. Truly.

I just have a few clarifying questions as I begin scaling up and playing around with the code. PS - they may be simple questions, but I just want to ensure I am grasping it all:

qt <- compute_quantiles(aghqi, q = q)[[1]] # CDF of x[1]
xmean[i] <- compute_moment(aghqi, mod2bs$param_constrain, method = "correct") # expectation of x[1:N]

The above 2 functions either a) find the CDF of 1 x value (quantile function) or b) find the expectation of x (compute_moment). As such, I could combine information from each function to plot the posterior density of x with associated expectation and say a 95% credible interval.
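
For instance, I am imagining something like this (reusing aghqi and mod2bs from your example above):

qs <- compute_quantiles(aghqi, q = c(0.025, 0.5, 0.975))[[1]] # quantiles on the unconstrained scale
ci <- sapply(qs, mod2bs$param_constrain)                      # back to the constrained (age) scale
est <- compute_moment(aghqi, mod2bs$param_constrain, method = "correct")
c(mean = est, lower = ci[1], median = ci[2], upper = ci[3])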

# compare to Stan posterior mean
data.frame(x=x, xpred=xmean) |>
  ggplot(aes(x=x_sum$mean, y=xpred)) +
  geom_point() +
  geom_abline() +
  labs(x="X posterior mean via Stan", y="X posterior mean via aghq")

What is x here? Is it the simulated values from my initial code or is it the posterior mean of x from HMC using the script multi_sim_missx.stan? It is not defined in the code, so I want to be sure.

for simplicity, I used the true values as you did when sampling X

Last question - the ultimate goal is to use the posterior samples of, say, a, r, b, n1, n2 (or any other parameter) to predict x. How would I adjust? Would I need to export the samples (e.g., M = 4000) from the initial model and loop over them? Would this give a CDF per sample? Something different? This has been the ‘hardest’ part: moving past hardcoded parameters to using information from that first model ‘multi_sim.stan’.

See the aghq doc for all the possibilities. Computing quantiles seems to be much slower than computing moments, so you may want to choose carefully what to compute.

Oops, should have been

data.frame(x=x_sum$mean, xpred=xmean) |>
ggplot(aes(x=x, y=xpred)) +
...

but that doesn’t change the plot

Yes. But I would expect M=100 would be enough in many cases.

It would give what you want (e.g. mean, variance, quantiles, CDF) per sample, and you then take the average over those. If you prefer, you could also get posterior draws by generating, for each posterior draw, a uniform random number u in (0,1) and computing the corresponding quantile (just one quantile per draw). This would be the inverse-CDF sampling approach, and in the end you would have M draws from the full marginal (marginalized over the M posterior draws). But as I said, you can also compute the desired quantities directly for each of the M draws and then average, which gives the integrated quantity with less variability, so a smaller M is sufficient.
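
A rough sketch of that inverse-CDF variant, where aghq_fit_for_draw(m) is a hypothetical helper that re-instantiates the bridgestan model and runs aghq() (as in the loop above) with the m-th posterior draw of the parameters passed in as data:

M <- 100
u <- runif(M)
x_draws <- sapply(seq_len(M), function(m) {
  fit_m <- aghq_fit_for_draw(m)                  # hypothetical helper, see the loop above
  qt <- compute_quantiles(fit_m, q = u[m])[[1]]  # one quantile per posterior draw
  mod2bs$param_constrain(qt)                     # back to the constrained (age) scale
})
# x_draws is now a sample of size M from the marginal p(x | y)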


Thanks for this. Who knows - maybe with some tinkering I'll translate this over to my copula example that has mixed continuous and discrete data. If I can do so, I'll report back here with the results.

fwiw - Efficiency may be the name of the game here. Previous approaches to age estimation in this multivariate context are very slow in higher dimensions (unsurprisingly). I am trying to balance increased efficiency with a proper accounting of uncertainty (a missing component in models that use MLE to estimate the values of the parameters). There is also a component here that is testing whether multivariate models are more ‘accurate’ and ‘precise’ as compared to univariate models. But again, this assumes an MLE estimate of the parameters. I wonder if by going fully Bayesian we get the accuracy and precision we want even in lower dimensions…

ah well. back to it.

Here’s a slice sampling approach which also loops over 100 draws from the parameter posterior. On my laptop, for 500 X's and 100 posterior draws, it takes 375 s. It can be made faster by optimizing the manipulation of the posterior draws.
The benefit over the AGHQ approach is that you don't need to know beforehand which posterior summaries you want for the X's.

uni.slice is from the HBglm package, but I had to remove the argument checks to make it work. There might be better implementations, but right now it's not dominating the time spent, so there's no need for a better one.

uni.slice <- function (x0, g, ..., w=1, m=0, lower=-Inf, upper=+Inf, gx0=NULL)
{
  # From HBglm slice_sampler.R
  # Find the log density at the initial point, if not already known.
  if (is.null(gx0)) {gx0 <- g(x0)}
  # Determine the slice level, in log terms.
  logy <- gx0 - rexp(1)
  # Find the initial interval to sample from.
  u <- runif(1,0,w)
  L <- x0 - u
  R <- x0 + (w-u)  # should guarantee that x0 is in [L,R], even with roundoff
  # Expand the interval until its ends are outside the slice, or until
  # the limit on steps is reached.
  if (is.infinite(m))  # no limit on number of steps
  { 
    repeat
    { if (L<=lower) break
      if (g(L)<=logy) break
      L <- L - w
    }
    repeat
    { if (R>=upper) break
      if (g(R)<=logy) break
      R <- R + w
    }
  }
  else if (m>1)  # limit on steps, bigger than one
  { 
    J <- floor(runif(1,0,m))
    K <- (m-1) - J
    while (J>0)
    { if (L<=lower) break
      if (g(L)<=logy) break
      L <- L - w
      J <- J - 1
    }
    while (K>0)
    { if (R>=upper) break
      if (g(R)<=logy) break
      R <- R + w
      K <- K - 1
    }
  }
  # Shrink interval to lower and upper bounds.
  if (L<lower) { L <- lower}
  if (R>upper) { R <- upper}
  # Sample from the interval, shrinking it on each rejection.
  repeat
  { 
    x1 <- runif(1,L,R)
    gx1 <- g(x1)
    if (gx1>=logy) break
    if (x1>x0) { R <- x1}
    else { L <- x1}
  }
  return (x1)
}


ths <- matrix(nrow=100,ncol=n)
tic()
rvs<- as_draws_rvars(fit$draws(variables=c("a","r","b","L_Omega")))
for (i in 1:n) {
  th<-2
  standati$y<-matrix(c(y1[i],y2[i],y3[i]), nrow=1, ncol=3)
  for (s in 1:100) {
    co <- capture.output(rvsi <- rvs |> subset_draws(draw=s), type="message")
    standati$a <- rvsi$a |> draws_of() |> as.numeric()
    standati$r <- rvsi$r |> draws_of() |> as.numeric()
    standati$b <- rvsi$b |> draws_of() |> as.numeric()
    standati$cors <- rvsi$L_Omega %*% t(rvsi$L_Omega) |> draws_of() |> matrix(nrow=3)
    suppressWarnings(mod2bs <- StanModel$new(lib = mod2bs_lib,
                                             data = to_stan_json(standati),
                                             seed = 1))
    for (r in 1:5) {
      th <- uni.slice(th, mod2bs$log_density)
    }
    ths[s,i] <- th
  }
}
toc()

ths <- sapply(ths, mod2bs$param_constrain) |> matrix(nrow=100)

# compare to Stan posterior mean
data.frame(x=x_sum$mean, xpred=colMeans(ths)) |>
  ggplot(aes(x=x, y=xpred)) +
  geom_point() +
  geom_abline() +
  labs(x="X posterior mean via Stan", y="X posterior mean via aghq")

And now 500 X's with 100 posterior draws takes about 4.5 s, and with 1000 posterior draws about 38 s. It's the same slice sampler, but instead of passing the posterior draws as data, it passes them as parameters and samples only along X. Would this be fast enough for your purposes?

S<-1000
ths <- matrix(nrow=S,ncol=n)
tic()
drs<- as_draws_matrix(thin_draws(fit$draws(variables=c("a","r","b","L_Omega")),2))
for (i in 1:n) {
  standati$y<-matrix(c(y1[i],y2[i],y3[i]), nrow=1, ncol=3)
  suppressWarnings(mod2bs2 <- StanModel$new(lib = mod2bs_lib2,
                                            data = to_stan_json(standati),
                                            seed = 1))
  th<-2
  for (s in 1:S) {
    ldfun <- \(th) mod2bs2$log_density(c(th,drs[s,]))
    if (s==1) {
      # warmup
      for (r in 1:4) {
        th <- uni.slice(th, ldfun)
      }
    }
    th <- uni.slice(th, ldfun)    
    ths[s,i] <- th
  }
}
ths <- sapply(ths, mod2bs$param_constrain) |> matrix(nrow=S)
toc()

# compare to Stan posterior mean
data.frame(x=x_sum$mean, xpred=colMeans(ths)) |>
  ggplot(aes(x=x, y=xpred)) +
  geom_point() +
  geom_abline() +
  labs(x="X posterior mean via Stan", y="X posterior mean via slice sampling")

Just adding that this is now 500 times faster than if calling mod2$sample() with 1000 different posterior draws.

This can also be parallelized over observations, so you can get a further 4-10x wall-clock speedup (assuming a laptop with 4-10 cores).

Aki-

Apologies for the delay - I was fighting with the compiler (university issue - the antivirus throws all kinds of errors when I compile and execute the stan.exe - I've had to fight to get certain folders whitelisted…)

anyways. this is awesome! A few thoughts and questions for you:

  1. My overall goal / use case here is to design, say, a Shiny app where a 3rd party could input their observed y values and the program would output an estimate with associated quantities (say the expectation and a 5th-95th credible interval). Generally, this may be 1 case at a time (n = 1). Certainly we could design it for bulk prediction of a whole sample (say a population of people), but in practice we normally estimate the single unobserved case we may have.

  2. That said do you have an opinion of the ‘validity’ or perhaps ‘reliability’ of say AGHQ vs. slice sampling?

  3. A note about AGHQ - this runs great. And as you said, single predictions are fast. In a real world scenario, we’d probably want to present a point estimate (expectation) and associated range (quantiles at say 5th and 95th).

  4. Can I ask you about the distinction between your 2 slice sampling approaches? Full caveat - the 1st approach, where you treat the draws as data, takes forever to run. I was up over an hour (32-core Threadripper; the final count was 62 minutes) - something doesn't seem right there. Perhaps we should expect this? As for the 2nd approach - it does run in about 60 seconds. Can you clarify the distinction / what you mean by the second approach sampling only along x and treating the parameters as parameters? I think I follow - but I just want to be sure.

  5. Ultimately, any of these could work in practice. This is early days here, but I am trying to scale up to essentially estimate x (age) based on both discrete/ordinal and continuous data. The prevailing theory in the field suggests that providing more information (i.e., more dimensions) will give us a more precise (smaller CI) and more accurate (closer to the truth) estimate of age. But even with this I am wondering whether we need, say, K = 20 dimensions, or whether fewer would do. Perhaps that is overkill.

I’m learning as I go. This has been great.

Ah, your simulated data having n=500 made me think the speed is a bigger problem.

Both are valid and reliable. Both can be made more accurate by adding more log density evaluations (more quadrature points in AGHQ, or more posterior draws in slice sampling). As you want to integrate over the posterior draws from the first model, both approaches include the Monte Carlo part. The choice would be based on speed and simplicity of implementation for the quantities you want to report.

Did you include the integration over the first model posterior draws?

The code was not parallelized, so the number of cores doesn't matter (you can parallelize it). I guess you are using Windows, which is known to have super slow I/O that matters when instantiating the model with data. The second approach should be faster as there are fewer instantiations.

Oh, I forgot to include the Stan file:

data{
  int N; // # of individuals
  int K; // # of responses
  matrix[N,K] y; // vector of growth responses per individual
  vector[K] n1;
  vector[K] n2;
}
parameters{
  array[N] real<lower=0, upper=18> X;
  vector[K] a;
  vector[K] r;
  vector[K] b;
  cholesky_factor_corr[K] L_Omega; // cholesky factor the corr matrix for efficiency
}
transformed parameters{
  array[N] vector[K] mu; //mean array of size N containing vectors of length K
  array[N] vector[K] s; //sd array of size N containing vectors of length K
  array[N] matrix[K,K] Sigma; // array of N Cholesky factors of the K x K covariance matrices
  for(i in 1:N){
    for(k in 1:K){
      mu[i,k] = a[k]*X[i]^(r[k])+b[k]; // mean function
      s[i,k] = n1[k]*(1 + n2[k]*X[i]); // sd function
      }
  }
  for(i in 1:N){
    Sigma[i] = diag_pre_multiply(s[i], L_Omega); // combining the cholesky corr matrix with the noise
  }
}
model{
  //priors
  X ~ normal(0,5);

  for(i in 1:N){
    y[i] ~ multi_normal_cholesky(mu[i], Sigma[i]); // likelihood
  }
}

Which is compiled with (this line was missing from the R code)

mod2bs_lib2 <- compile_model("multi_sim_missx2.stan")

Maybe this clarifies?

As long as x and parameters are continuous, my code examples do work.