Marginalize out missing continuous data

Hi all.

I have a question about marginalization in stan. Let’s say I have missing continuous observations that are part of a larger data structure. According to the manual, with some index fiddling, I can reassemble the vector of observations (see below) by declaring a parameter ymiss and build the y vector like below and use that in the lpdf.

Now, this essentially treats the parameter as part of the joint posterior distribution of all the unknown parameters. But also, this amounts to simply imputing a value based on sampling - is this intuition correct?

My question is this, what if I want to marginalize out the missing data so they have no influence on the log probability? Can I adjust the likelihood to use only the non-missing data, as it is written, the ymiss parameters do have an influence on the joint probability distribution. In some cases this could be minimal, but it is also possible that they could have an outsized influence on the posterior distribution.

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  int<lower=1, upper=N_obs + N_mis> ii_obs[N_obs];
  int<lower=1, upper=N_obs + N_mis> ii_mis[N_mis];
  array[N_obs] real y_obs;
transformed data {
  int<lower=0> N = N_obs + N_mis;
parameters {
  array[N_mis] real y_mis;
  real<lower=0> sigma;
transformed parameters {
  array[N] real y;
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
model {
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);

Marginalization means integrating the likelihood.

p\left(y_{\mathrm{obs}}\mid\sigma\right)=\int p\left(y_{\mathrm{obs}},y_{\mathrm{mis}}\mid\sigma\right)\mathrm{d}y_{\mathrm{mis}}

It can be tricky sometimes but your model is a simple random walk and that’s straightforward (sum of gaussians is a gaussian with larger variance.)

parameters {
  real<lower=0> sigma;
model {
  sigma ~ gamma(1, 1);
  // assuming ii_obs[1] == 1
  y_obs[1] ~ normal(0, 100);
  for (i in 2:N_obs) {
    y_obs[i] ~ normal(y_obs[i-1], sqrt(ii_obs[i]-ii_obs[i-1])*sigma);

The unmarginalized model does impute the missing values in a sense—but these are better thought of as predictions of the (marginalized) model rather than something imputed before fitting the model. In particular, the posterior for sigma does not change, whether the model is marginalized or not.

1 Like

Thank you @nhuurre!

I just have a follow up so I understand the mechanics.

In the model above (from the Stan manual), y_obs is an array of reals, read in as data that make up the non-missing values. In turn, ii_obs are the indices of those values. Is the parameter ymiss and associated indices ii_mis not included or necessary?

Probably related to this, I know the sqrt() expression you have in the lpdf is related to the normal pdf (I think?). I just want to understand what this expression with the indices is doing here?

Thank you!

The missing indices ii_mis are implicitly all indices that are not included in ii_obs (and are smaller than N_obs+N_mis). The original model included the missing indices explicitly for easier vectorization but after marginalization you don’t need y_mis nor ii_mis.

The random walk model is

\epsilon_{i}\sim N\left(0,\sigma^{2}\right)

and when some points are missing you can skip ahead


The sum of gaussian distributed variables is another gaussian distributed variable whose variance is the sum of the variances

\left(\sum_{j=i-n}^{i}\epsilon_{j}\right)\sim N\left(0,n\sigma^{2}\right)

Stan parameterizes gaussian distribution by standard deviation which is the square root of the variance, in this case \sqrt{n}\sigma.


@nhuurre Thank you! Makes complete sense.

My last question related to marginalization adds some complexity. In a traditional MVN model with missing data we can do what Section 3.5 of the manual does and model the marginal distribution of the observed data if some values are missing in the response vector. This can become quite tedious if the response vector is larger in dimension and further, inefficient because it requires the use of the full covariance instead of the Cholesky (I am unsure of any way to reduce the Cholesky factor in ways that allow for missing data?).

What I am trying to do is model a Gaussian copula with normal marginals to start. You’ll see in the code I attach I use the method recommended by the manual and what I initially posted about. That is, filling in the response vector with a combination of known data and ymiss parameters. This works, fits, and is okay. But I was curious if I could marginalize the marginals here much like your example and the marginalization in an MVN.

I know there are a quite few complexities here but most of all are 1) the use of the cholesky instead of the full covariance and 2) the input to the lpdf (centered_gaussian_copula_cholesky) requires a matrix of marginal responses and the cholesky matrix and because the missing are not distributed equally across responses, the matrix would be ragged (e.g., some responses may have 10 complete observations while some have 20). Thus, I am unsure how I can get a ‘complete’ matrix of marginals to input into the lpdf function if I marginalize missing data much like your example.

I hope all this makes sense. I also fully recognize it may not be analytically possible to integrate out the missing data here, but I figured I would ask! In theory, the copula construction makes it easier because it deals with the marginals one response at a time. On the other hand, the nature of the lpdf function may make this more difficult than coding it from scratch in R? I draw my copula example from here.

  real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
    int N = rows(U);
    int J = cols(U);
    matrix[J, N] G = mdivide_left_tri_low(L, U') ;
    return -N * sum(log(diagonal(L))) -  0.5 * ( sum( G^2 ) - sum(U^2) );
  real centered_gaussian_copula_cholesky_(matrix[,] marginals, matrix L, int[] order) {
    // Extract dimensions of final outcome matrix
    int N = rows(marginals[1][1]);
    int J = rows(L);
    matrix[N, J] U;

    // Iterate through marginal arrays, concatenating the outcome matrices by column
    // and aggregating the log-likelihoods (from continuous marginals) and jacobian
    // adjustments (from discrete marginals)
    real adj = 0;
    int pos = 1;
    for (m in 1 : size(marginals)) {
      int curr_cols = cols(marginals[m][1]);

      U[ : , pos : (pos + curr_cols - 1)] = marginals[m][1];

      adj += sum(marginals[m][2]);
      pos += curr_cols;

    // Return the sum of the log-probability for copula outcomes and jacobian adjustments
    return multi_normal_cholesky_copula_lpdf(U[ : , order] | L) + adj;
  matrix[] normal_marginal(matrix y, matrix mu_glm, matrix sigma) {
    int N = rows(mu_glm);
    int J = cols(mu_glm);
    matrix[N, J] rtn[2];
    // Initialise the jacobian adjustments to zero, as vectorised lpdf will be used
    rtn[2] = rep_matrix(0, N, J);

    for (j in 1 : J) {
      rtn[1][ : , j] = (y[ : , j] - mu_glm[ : , j]) ./ sigma[ :,j];
      rtn[2][1, j] = normal_lpdf(y[ : , j] | mu_glm[ : , j], sigma[ :,j]);

    return rtn;
  int J; // # of outcomes
  int<lower=0> Nrow;
  int<lower=0> Ncol;
  int<lower=0> Ncomp; // Number of non-missing values
  int<lower=0> Nmiss; // Number of missing values
  real dat_complete[Ncomp];   // Vector of non-missing values
  int ind_pres[Ncomp, 2];     // Matrix (row, col) of non-missing value indices
  int ind_miss[Nmiss, 2];// Matrix (row, col) of missing value indices
  real X[Nrow]; // predictor (age)
transformed data{
  int order[J]; // order of responses for copula -> ordinal (binary) first
  for(i in 1:J){
    order[i] = i;
  real ymiss[Nmiss]; // missing data parameter - UNSURE
  vector<lower=0>[J] c1; // first parameter of continuous mean function
  vector<lower=0>[J] c2; // second parameter of continuous mean function
  vector[J] c3; // third parameter of continuous mean function
  vector<lower=0>[J] k1; // first parameter of continuous noise function
  vector<lower=0>[J] k2; // second parameter of continuous noise function
  cholesky_factor_corr[J] L; // cholesky factor of correlation matrix 
transformed parameters{
 matrix[Nrow,J] mu_J; // continuous
  matrix[Nrow,J] sd_J; // continuous
    matrix[Nrow,Ncol] y;   // The "data" with interpolated missing values
  // Fill y with non-missing values 
    for(n in 1:Ncomp) {
        y[ind_pres[n,1]][ind_pres[n,2]] = dat_complete[n];
    // Fill the rest of y with missing value "parameters"
    for(n in 1:Nmiss){
        y[ind_miss[n,1]][ind_miss[n,2]] = ymiss[n];

  for(i in 1:Nrow){
    for(j in 1:J){
      mu_J[i,j] = c2[j]*X[i]^c1[j] + c3[j];
      sd_J[i,j] = k1[j]*(1+k2[j]*X[i]);
  // Mean and SD Functions for Likelihood
   for (n in 1:Nmiss){
    ymiss[n] ~ normal(mu_J[ind_miss[n,1]][ind_miss[n,2]],sd_J[ind_miss[n,1]][ind_miss[n,2]]);
  // Priors
  c1 ~ normal(0,1);
  c2 ~ normal(0,1);
  c3 ~ normal(0,1);
  k1 ~ cauchy(0,1);
  k2 ~ normal(0,1);
  L ~ lkj_corr_cholesky(1); //

  // Likelihood
  target += centered_gaussian_copula_cholesky_({normal_marginal(y,mu_J,sd_J)},L,order);
generated quantities{
  corr_matrix[J] Omega = multiply_lower_tri_self_transpose(L);
  vector[Nrow] log_lik;

  for(n in 1:Nrow){
    log_lik[n] = centered_gaussian_copula_cholesky_({normal_marginal(to_matrix(y[n,],1,J),to_matrix(mu_J[n,],1,J),to_matrix(sd_J[n,],1,J))},L,order);

Yeah, marginalizing a multivariate distribution gets quite complicated and I don’t know an analytical solution.

The only change I would make to that model is remove this line:

   for (n in 1:Nmiss){
    ymiss[n] ~ normal(mu_J[ind_miss[n,1]][ind_miss[n,2]],sd_J[ind_miss[n,1]][ind_miss[n,2]]);

I think this log-prob term is already included as part of what you call “jacobian adjustments” in normal_marginal().

Ah yes, one of SEVERAL inefficiencies I need to clean up here - Thank you! (Also a misnomer on my end, the ‘jacobian’ term is just a relic from a similar function I’ve modified from the link for ordinal marginals that is an actual jacobian adjustment).

I was thinking too complex (and probably unnecessary) and analytically not possible. But just wanted confirmation!