Likelihood of out-of-sample data

Hi all,

I have a bit of an esoteric question so bear with me…

Let’s say I fit a multivariate cumulative probit model with @bgoodri's code (see below). y is an ordinal outcome on a 1-12 scale and x is a continuous real value (age). Everything looks great here, really. No issues.

Now, I have an out-of-sample dataset: \hat{y} is a vector of out-of-sample ordinal scores and \hat{x} is an out-of-sample age. I want to ask one simple question: given the model, what is the likelihood of this new sample occurring? I think this is p(\hat{y} | \hat{x}, \theta). Ultimately, I’d like to end up with a profile with age on the x-axis and likelihood on the y-axis to show how the likelihood changes with age. I am trying to address how rare this new data point may be in comparison to the much larger training dataset.

Is there a way to complete this step in the generated quantities block? Can I do this post hoc after I’ve already fit the model in cmdstanr? I again ‘think’ this is just dmvnorm in R (with augmented variables?).

I hope this makes sense!

data {
  int<lower=1> D;
  int<lower=0> N;
  array[N, D] int<lower=1> y;
  array [N] real X;
  int cats;
}
transformed data{
  int n_thr = cats-1;
}
parameters {
  cholesky_factor_corr[D] L;
  array[N, D] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
  array[D] ordered[n_thr] thr;
  array [D] real<lower=0> beta;
}
model {
  beta ~ normal(0,5);
  L ~ lkj_corr_cholesky(4);
  for(d in 1:D){
    for(i in 1:n_thr){
      thr[d,i] ~ normal(i+1,1);
    }
  }
  
  // implicit: u is iid standard uniform a priori
  {
    array[N] vector[D] z;
    array[N] vector[D] mu;
    for (n in 1 : N) {
      real prev = 0;
      for (d in 1 : D) {
        mu[n, d] = beta[d] * X[n];
        // Phi and inv_Phi may overflow and / or be numerically inaccurate
        real t;
        if (y[n,d] == 99) {
          z[n,d] = inv_Phi(u[n,d]);
          target += log(1);
        } else if (y[n, d] == 1) {
          real ub = Phi((thr[d,1] - (mu[n,d] + prev)) / L[d,d]);
          t = ub * u[n, d];
          z[n,d] = inv_Phi(t); // implies utility is positive
          target += log(ub); // Jacobian adjustment
        } else if (y[n,d] == n_thr + 1) {
          real lb = Phi((thr[d, n_thr] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (1 - lb) * u[n,d];
          z[n,d] = inv_Phi(t); // implies utility is negative
          target += log1m(lb); // Jacobian adjustment
        } else{
          real lb = Phi((thr[d, y[n,d] - 1] -(mu[n,d] + prev)) / L[d,d]);
          real ub = Phi((thr[d, y[n,d]    ] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (ub - lb) * u[n,d];
          z[n,d] = inv_Phi(t); // implies utility is negative
          target += log(ub - lb);
        }
        if (d < D) {
          prev = L[d + 1, 1 : d] * head(z[n], d);
        }
        // Jacobian adjustments imply z is truncated standard normal
        // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
      }
    }
  }
}
generated quantities {
  corr_matrix[D] Omega;
  Omega = multiply_lower_tri_self_transpose(L);
}

Here’s the chapter I wrote for the User’s Guide answering this question:

That would be correct if \theta were a point estimate. But we’re doing Bayesian inference, so if we have training data x, y and a test covariate vector \hat{x}, we want to use posterior predictive inference

p(\hat{y} \mid \hat{x}, y, x) = \mathbb{E}[p(\hat{y} \mid \hat{x}, \Theta) \mid x, y] = \int p(\hat{y} \mid \hat{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta.

To calculate this with MCMC draws \theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid x, y), you have

p(\hat{y} \mid \hat{x}, y, x) \approx \frac{1}{M} \sum_{m=1}^M p(\hat{y} \mid \hat{x}, \theta^{(m)}),

but we need to calculate on the log scale,

\log p(\hat{y} \mid \hat{x}, y, x) \approx -\log M + \textrm{logSumExp}_{m=1}^M \log p(\hat{y} \mid \hat{x}, \theta^{(m)}),

where logSumExp is the log-sum-of-exponentials function. (This is all just to prevent underflow in the log density calculations and keep as much arithmetic precision as possible.)

In Stan, this means calculating the log probabilities in generated quantities and then doing the log-sum-of-exponentials on the outside with the draws.
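For concreteness, here's a minimal R sketch of that last step. It assumes you've already pulled the log_lik values out of the fit into a matrix log_lik_draws with one row per posterior draw and one column per new observation; the object names are just illustrative.

# numerically stable log-sum-exp: shift by the max before exponentiating
log_sum_exp <- function(lx) {
  m <- max(lx)
  m + log(sum(exp(lx - m)))
}

# log p(y_hat | x_hat, x, y) ~= -log(M) + logSumExp over the M draws
M <- nrow(log_lik_draws)
log_pred <- apply(log_lik_draws, 2, log_sum_exp) - log(M)

matrixStats::logSumExp() does the same reduction if you'd rather not hand-roll it. Only exponentiate log_pred at the very end, if you need the result on the probability scale.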

There’s a lot of duplicated computation in your code—eliminating that is important for efficiency.


Bob - As always, this is super helpful!

Below I include code for a bare-bones model that I can use with cmdstanr’s generate_quantities() method. I used logistic approximations to Phi() because I was having numerical issues, but that aside, I am trying to figure out what’s next.

I believe log_lik as I’ve defined it is the log probability. Yes? How would I combine this with -log M + logSumExp()? I can run this code as-is and get an ‘output’, but am I missing a step?


functions {
  // logistic approximation to the standard normal CDF Phi
  real phi_logit_approx(real b) {
    return inv_logit(1.702 * b);
  }
  // logistic approximation to the standard normal inverse CDF inv_Phi
  real inv_phi_logit_approx(real b) {
    return logit(b) / 1.702;
  }
}
data {
  int<lower=1> D;
  int<lower=0> N;
  array[1,D] int<lower=1> y;
  array [N] real X;
  int cats;
}
transformed data{
  int n_thr = cats-1;
}
parameters {
  cholesky_factor_corr[D] L;
  array[N, D] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
  array[D] ordered[n_thr] thr;
  array [D] real<lower=0> beta;
}
generated quantities {
  
  vector[N] log_lik;
  
  {
    array[N] vector[D] z;
    array[N] vector[D] mu;
    for (n in 1 : N) {
      real prev = 0;
      log_lik[n] = 0; // accumulate the log probability over the D outcome dimensions
      for (d in 1 : D) {
        mu[n, d] = beta[d] * X[n];
        // logistic approximations used in place of Phi / inv_Phi,
        // which may overflow and / or be numerically inaccurate
        real t;
        if (y[1, d] == 99) {
          z[n, d] = inv_phi_logit_approx(u[n, d]);
          log_lik[n] += log(1); // missing outcome contributes nothing
        } else if (y[1, d] == 1) {
          real ub = phi_logit_approx((thr[d, 1] - (mu[n, d] + prev)) / L[d, d]);
          t = ub * u[n, d];
          z[n, d] = inv_phi_logit_approx(t); // implies utility is positive
          log_lik[n] += log(ub);
        } else if (y[1, d] == n_thr + 1) {
          real lb = phi_logit_approx((thr[d, n_thr] - (mu[n, d] + prev)) / L[d, d]);
          t = lb + (1 - lb) * u[n, d];
          z[n, d] = inv_phi_logit_approx(t); // implies utility is negative
          log_lik[n] += log1m(lb);
        } else {
          real lb = phi_logit_approx((thr[d, y[1, d] - 1] - (mu[n, d] + prev)) / L[d, d]);
          real ub = phi_logit_approx((thr[d, y[1, d]] - (mu[n, d] + prev)) / L[d, d]);
          t = lb + (ub - lb) * u[n, d];
          z[n, d] = inv_phi_logit_approx(t);
          log_lik[n] += log(ub - lb);
        }
        if (d < D) {
          prev = L[d + 1, 1 : d] * head(z[n], d);
        }
        // z is truncated standard normal, so the utility
        // --- mu + L * z --- is truncated multivariate normal
      }
    }
  }
}
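
Here’s roughly how I think the post-processing side would go in R, to make sure I have the workflow right. This is just a sketch: it assumes the original fit is the cmdstanr object fit, that the program above is saved as gq_probit.stan, and that gq_data is a list with D, N, X (an age grid of length N), y (the single new 1 x D score vector), and cats; all of these names are illustrative. One wrinkle, if I understand standalone generated quantities correctly, is that the parameters block has to line up exactly with the fitted model, so D and N (which sets the size of u) need to match the values used in the original fit.

library(cmdstanr)
library(posterior)

# compile the standalone generated-quantities program above
gq_mod <- cmdstan_model("gq_probit.stan")  # illustrative file name

# run it over the posterior draws from the original fit
gq <- gq_mod$generate_quantities(fitted_params = fit, data = gq_data, seed = 123)

# M draws x N ages of log p(y_new | age = X[n], theta^(m))
log_lik_draws <- as_draws_matrix(gq$draws("log_lik"))

# average over draws on the log scale: -log(M) + logSumExp over m
log_pred <- apply(log_lik_draws, 2, matrixStats::logSumExp) - log(nrow(log_lik_draws))

# likelihood profile: age on the x-axis, posterior predictive density on the y-axis
plot(gq_data$X, exp(log_pred), type = "l",
     xlab = "age", ylab = "p(new scores | age, training data)")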

This loo package vignette might also be helpful: "Holdout validation and K-fold cross-validation of Stan programs with the loo package".