Likelihood of out-of-sample data

Hi all,

I have a bit of an esoteric question so bear with me…

Let's say I fit a multivariate cumulative probit model with @bgoodri's code (see below). y is an ordinal outcome on a 1-12 scale and x is a continuous real value (age). Everything looks great here, really; no issues.

Now I have an out-of-sample dataset: \hat{y} is the vector of out-of-sample ordinal scores and \hat{x} is the out-of-sample age. I want to ask one simple question: given the model, what is the likelihood of this new sample occurring? I think this is p(\hat{y} | \hat{x}, \theta). Ultimately, I'd like to end up with a profile with age on the x-axis and likelihood on the y-axis to show how the likelihood changes with age. I am trying to address how rare this new data point may be in comparison to the much larger training dataset.

Is there a way to complete this step in the generated quantities block? Can I do this post hoc after I've already fit the model in cmdstanr? I again 'think' this is just dmvnorm in R (with augmented variables?).
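
To make that concrete, here is roughly what I have in mind in R for a single posterior draw, integrating the latent multivariate normal over the threshold box implied by \hat{y} (so mvtnorm::pmvnorm rather than dmvnorm). Treat it purely as a sketch: thr, beta, and Omega stand for one draw of the thresholds, slopes, and latent correlation matrix.

library(mvtnorm)

# p(y_hat | x_hat, theta) for one posterior draw: the probability that the
# latent utilities, distributed MVN(beta * x_hat, Omega), fall in the box of
# thresholds corresponding to the observed ordinal categories y_hat.
# (Assumes all D outcomes are observed, i.e., no 99 missing codes.)
lik_one_draw <- function(y_hat, x_hat, thr, beta, Omega) {
  D <- length(y_hat)
  lower <- upper <- numeric(D)
  for (d in 1:D) {
    k <- y_hat[d]
    lower[d] <- if (k == 1) -Inf else thr[d, k - 1]
    upper[d] <- if (k > ncol(thr)) Inf else thr[d, k]
  }
  pmvnorm(lower = lower, upper = upper, mean = beta * x_hat, corr = Omega)[1]
}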

I hope this makes sense!

data {
  int<lower=1> D;
  int<lower=0> N;
  array[N, D] int<lower=1> y;
  array [N] real X;
  int cats;
}
transformed data{
  int n_thr = cats-1;
}
parameters {
  cholesky_factor_corr[D] L;
  array[N, D] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
  array[D] ordered[n_thr] thr;
  array [D] real<lower=0> beta;
}
model {
  beta ~ normal(0,5);
  L ~ lkj_corr_cholesky(4);
  for(d in 1:D){
    for(i in 1:n_thr){
      thr[d,i] ~ normal(i+1,1);
    }
  }
  
  // implicit: u is iid standard uniform a priori
  {
    for (n in 1 : N) {
      array [N] vector[D] z;
      array [N] vector[D] mu;
      real prev;
      prev = 0;
      for (d in 1 : D) {
        mu[n,d] = beta[d]*X[n];
        // Phi and inv_Phi may overflow and / or be numerically inaccurate
        real t;
        if (y[n,d] == 99) {
          z[n,d] = inv_Phi(u[n,d]);
          target += log(1);
        } else if (y[n, d] == 1) {
          real ub = Phi((thr[d,1] - (mu[n,d] + prev)) / L[d,d]);
          t = ub * u[n, d];
          z[n,d] = inv_Phi(t); // implies utility is positive
          target += log(ub); // Jacobian adjustment
        } else if (y[n,d] == n_thr + 1) {
          real lb = Phi((thr[d, n_thr] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (1 - lb) * u[n,d];
          z[n,d] = inv_Phi(t); // implies utility is negative
          target += log1m(lb); // Jacobian adjustment
        } else{
          real lb = Phi((thr[d, y[n,d] - 1] -(mu[n,d] + prev)) / L[d,d]);
          real ub = Phi((thr[d, y[n,d]    ] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (ub - lb) * u[n,d];
          z[n,d] = inv_Phi(t); // implies utility is negative
          target += log(ub - lb);
        }
        if (d < D) {
          prev = L[d + 1, 1 : d] * head(z[n], d);
        }
        // Jacobian adjustments imply z is truncated standard normal
        // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
      }
    }
  }
}
generated quantities {
  corr_matrix[D] Omega;
  Omega = multiply_lower_tri_self_transpose(L);
}

Here’s the chapter I wrote for the User’s Guide answering this question:

That would be correct if \theta were a point estimate. But we’re doing Bayesian inference, so if we have training data x, y and a test covariate vector \hat{x}, we want to use posterior predictive inference

p(\hat{y} \mid \hat{x}, y, x) = \mathbb{E}[p(\hat{y} \mid \hat{x}, \Theta) \mid x, y] = \int p(\hat{y} \mid \hat{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta.

To calculate with MCMC draws \theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid x, y), you have

p(\hat{y} \mid \hat{x}, y, x) \approx \frac{1}{M} \sum_{m=1}^M p(\hat{y} \mid \hat{x}, \theta^{(m)}),

but we need to calculate on the log scale,

\log p(\hat{y} \mid \hat{x}, y, x) \approx -\log M + \textrm{logSumExp}_{m=1}^M \log p(\hat{y} \mid \hat{x}, \theta^{(m)}),

where logSumExp is the log-sum-of-exponentials function. (This is all just to prevent underflow in the log density calculations and keep as much arithmetic precision as possible.)

In Stan, this means calculating the log probabilities in generated quantities and then doing the log-sum-of-exponentials on the outside with the draws.
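
In R, assuming the generated quantities block writes those log probabilities to a variable called log_lik and fit_gq is the generated-quantities run (both names are just illustrative), the outer step looks roughly like this:

library(posterior)
library(matrixStats)

ll <- as_draws_matrix(fit_gq$draws("log_lik"))  # posterior draws (rows) by elements of log_lik (columns)
M <- nrow(ll)
lppd <- apply(ll, 2, logSumExp) - log(M)        # -log M + logSumExp over draws, per column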

There’s a lot of duplicated computation in your code—eliminating that is important for efficiency.


Bob - As always, this is super helpful!

Below I include code for a bare-bones model that I can use with cmdstanr's generate_quantities() method. I used approximations to Phi() because I was having numerical issues, but that aside, I am trying to figure out what's next.

I believe log_lik as I've described it is the log probability, yes? How would I combine this with -log M + logSumExp()? I can run this code as is and get an 'output', but I am missing a step, no?


functions {
  // Logistic approximation to the standard normal CDF Phi
  real phi_logit_approx(real b) {
    real a = inv_logit(1.702 * b);
    return a;
  }
  // Approximate inverse of Phi via the scaled logit
  real inv_phi_logit_approx(real b) {
    real a = (1 / 1.702) * logit(b);
    return a;
  }
}
data {
  int<lower=1> D;
  int<lower=0> N;
  array[1,D] int<lower=1> y;
  array [N] real X;
  int cats;
}
transformed data{
  int n_thr = cats-1;
}
parameters {
  cholesky_factor_corr[D] L;
  array[N, D] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
  array[D] ordered[n_thr] thr;
  array [D] real<lower=0> beta;
}
generated quantities {
  
  vector[N] log_lik;
  
    {
    for (n in 1 : N) {
      array [N] vector[D] z;
      array [N] vector[D] mu;
      real prev = 0;
      log_lik[n] = 0; // accumulate the log probability over the D outcome dimensions
      for (d in 1 : D) {
        mu[n,d] = beta[d]*X[n];
        // Phi and inv_Phi may overflow and / or be numerically inaccurate
        real t;
        if (y[1,d] == 99) {
          z[n,d] = inv_phi_logit_approx(u[n,d]);
          log_lik[n] += log(1); // missing outcome (coded 99) contributes nothing
        } else if (y[1,d] == 1) {
          real ub = phi_logit_approx((thr[d,1] - (mu[n,d] + prev)) / L[d,d]);
          t = ub * u[n, d];
          z[n,d] = inv_phi_logit_approx(t); // implies utility is positive
          log_lik[n] += log(ub); // log conditional probability of the lowest category
        } else if (y[1,d] == n_thr + 1) {
          real lb = phi_logit_approx((thr[d, n_thr] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (1 - lb) * u[n,d];
          z[n,d] = inv_phi_logit_approx(t); // implies utility is negative
          log_lik[n] += log1m(lb); // log conditional probability of the top category
        } else{
          real lb = phi_logit_approx((thr[d, y[1,d] - 1] -(mu[n,d] + prev)) / L[d,d]);
          real ub = phi_logit_approx((thr[d, y[1,d]    ] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (ub - lb) * u[n,d];
          z[n,d] = inv_phi_logit_approx(t); // implies utility is negative
          log_lik[n] += log(ub - lb); // log conditional probability of the observed category
        }
        if (d < D) {
          prev = L[d + 1, 1 : d] * head(z[n], d);
        }
        // Jacobian adjustments imply z is truncated standard normal
        // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
      }
    }
  }
}
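
For completeness, this is roughly how I am running the standalone generated quantities step in cmdstanr (the file and object names are from my own scripts, so treat them as placeholders):

library(cmdstanr)

gq_mod <- cmdstan_model("mv_probit_gq.stan")   # the generated-quantities program above
fit_gq <- gq_mod$generate_quantities(
  fitted_params = fit,   # CmdStanMCMC object from the original model fit
  data = gq_data         # same D, N, X, cats, with y holding the out-of-sample scores
)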

This loo package vignette might also be helpful: Holdout validation and K-fold cross-validation of Stan programs with the loo package • loo
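
In that vignette, the held-out log-likelihood matrix (posterior draws by held-out observations) comes from a generated quantities block much like yours and is then summarized with loo::elpd(), roughly like this (the matrix name is illustrative):

library(loo)

# ll_heldout: draws x held-out-observations matrix of log densities
elpd_holdout <- elpd(ll_heldout)
print(elpd_holdout)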

Apologies for bringing up an older thread; I just felt it was easier to continue from the previous train of thought.

A recap: I have fit a multivariate cumulative probit model based on @bgoodri's implementation (see original post at top). Then, per @avehtari and @Bob_Carpenter and section 30.1 in the manual, I have extracted the log-likelihood from the model using cmdstanr's generate_quantities() (see below).

functions {
  // Logistic approximation to the standard normal CDF Phi
  real phi_logit_approx(real b) {
    real a = inv_logit(1.702 * b);
    return a;
  }
  // Approximate inverse of Phi via the scaled logit
  real inv_phi_logit_approx(real b) {
    real a = (1 / 1.702) * logit(b);
    return a;
  }
}
data {
  int<lower=1> D;
  int<lower=0> N;
  array[1,D] int<lower=1> y;
  array [N] real X;
  int cats;
}
transformed data{
  int n_thr = cats-1;
}
parameters {
  cholesky_factor_corr[D] L;
  array[N, D] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
  array[D] ordered[n_thr] thr;
  array [D] real<lower=0> beta;
}
generated quantities {
  
  vector[N] lp;
  
    {
    for (n in 1 : N) {
      array [N] vector[D] z;
      array [N] vector[D] mu;
      vector[D] log_lik;
      real prev;
      prev = 0;
      for (d in 1 : D) {
        mu[n,d] = beta[d]*X[n];
        // Phi and inv_Phi may overflow and / or be numerically inaccurate
        real t;
        if (y[1,d] == 99) {
          z[n,d] = inv_phi_logit_approx(u[n,d]);
          log_lik[d] = log(1);
        } else if (y[1,d] == 1) {
          real ub = phi_logit_approx((thr[d,1] - (mu[n,d] + prev)) / L[d,d]);
          t = ub * u[n, d];
          z[n,d] = inv_phi_logit_approx(t); // implies utility is positive
          log_lik[d] = log(ub); // Jacobian adjustment
        } else if (y[1,d] == n_thr + 1) {
          real lb = phi_logit_approx((thr[d, n_thr] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (1 - lb) * u[n,d];
          z[n,d] = inv_phi_logit_approx(t); // implies utility is negative
          log_lik[d] = log1m(lb); // Jacobian adjustment
        } else{
          real lb = phi_logit_approx((thr[d, y[1,d] - 1] -(mu[n,d] + prev)) / L[d,d]);
          real ub = phi_logit_approx((thr[d, y[1,d]    ] -(mu[n,d] + prev)) / L[d,d]);
          t = lb + (ub - lb) * u[n,d];
          z[n,d] = inv_phi_logit_approx(t); // implies utility is negative
          log_lik[d] = log(ub - lb);
        }
        if (d < D) {
          prev = L[d + 1, 1 : d] * head(z[n], d);
        }
        // Jacobian adjustments imply z is truncated standard normal
        // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
        
      }
      lp[n] = sum(log_lik); // total log probability of this multivariate outcome
    }
  }
}

And R code to get the log posterior predictive density per @Bob_Carpenter above:

log_lik <- as_draws_matrix(fit_gq$draws()) 
lppd <- apply(X = log_lik, MARGIN = 2, function(x) matrixStats::logSumExp(x) - log(4000))

Here, lppd is of length N. Now my question: I have a new test point for which the array y is known but the independent variable X is unknown. Essentially, I want to know how probable this multivariate outcome is under the current construction of the model, and whether the test point is plausible given it. What I have done is plot the pointwise contributions to the lppd against the independent variable x.

Is this valid? Reading the plot, I see that the lppd is higher between ages 8-10, so perhaps this single individual is most plausibly around that age. But I am unsure whether I am interpreting this correctly. Is there a better way to ask this question in Stan / R? Comparing two models with loo and elpd is rather straightforward; it is this single-data-point approach that is throwing me.

plot(train$agey, lppd)
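
For reference, this is roughly how I am summarizing the profile (train$agey and lppd are the objects from above):

# Pair each training age with the pointwise lppd of the held-out outcome and
# look at the ages where that outcome is most plausible.
profile <- data.frame(age = train$agey, lppd = lppd)
profile[order(profile$lppd, decreasing = TRUE)[1:5], ]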