Psis-loo with censored observations

Hi all, although I found a very similar thread on the forum (Loo: calculating point-wise log-lik when using data augmentation for censored observations), I don’t think my question has been answered.
Consider the following model Y \sim \mathcal{N}(aX + b, \sigma), where some of the observations Y are left-censored. We can integrate out the unobserved data using normal_lcdf and we get the following Stan model (C[n] == 0 indicates that the nth observation is uncensored, and C[n] == 1 indicates that it is left-censored).

// cens_reg.stan
data {
    int N;
    array[N] real X;
    array[N] real Y;
    array[N] int C;
}

parameters {
    real a;
    real b;
    real<lower=0> sigma;
}

transformed parameters {
    array[N] real log_lik;
    for ( n in 1:N ) {
        if ( C[n] == 0 ) {
            log_lik[n] = normal_lpdf(Y[n] | a * X[n] + b, sigma);
        } else {
            log_lik[n] = normal_lcdf(Y[n] | a * X[n] + b, sigma);
        }
    }
}

model {
    target += log_lik;
}
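As a cross-check, the same pointwise log-likelihood can be computed directly in Python with scipy. This is only a sketch with made-up parameter values, mirroring the Stan loop above:

```python
import numpy as np
import scipy.stats as sts

def pointwise_log_lik(Y, X, C, a, b, sigma):
    """Mirror of the Stan loop: logpdf for uncensored points (C == 0),
    logcdf for left-censored points (C == 1)."""
    mu = a * np.asarray(X) + b
    Y, C = np.asarray(Y), np.asarray(C)
    return np.where(C == 0,
                    sts.norm.logpdf(Y, loc=mu, scale=sigma),
                    sts.norm.logcdf(Y, loc=mu, scale=sigma))

# toy check: one uncensored point and one point censored at the limit -0.5
ll = pointwise_log_lik(Y=[0.3, -0.5], X=[0.0, -2.0], C=[0, 1],
                       a=0.4, b=0.1, sigma=0.2)
```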

I now want to compare this model with other models using PSIS-LOO. However, this leads to “very bad” Pareto-k values. These correspond to some of the left-censored observations, i.e. observations below the detection limit of -0.5, whose censoring probability is close to 1 and whose log-likelihood is therefore close to 0. The following figure shows the (potentially) censored observations (black) and the “ground truth” values (red), together with the fitted linear model (blue). The bottom panel is made with plot_khat from the arviz package.

My question is: Does anyone know how to resolve this? Can we just ignore such warnings and trust the loo results?

This is the python code I used to create the figure:

import cmdstanpy
import arviz as az
import matplotlib.pyplot as plt
import scipy.stats as sts
import numpy as np

sm = cmdstanpy.CmdStanModel(stan_file="cens_reg.stan")

a = 0.4
b = 0.1
sigma = 0.2
N = 100

X = sts.norm.rvs(size=N)
X.sort()
Y_unc = sts.norm.rvs(loc=X, scale=sigma)

lb = -0.5

C = np.array([0 if y >= lb else 1 for y in Y_unc])
Y = np.array([y if c == 0 else lb for y, c in zip(Y_unc, C)])

sam = sm.sample(data={"X" : X, "Y" : Y, "N" : N, "C" : C})
azsam = az.from_cmdstanpy(sam)

loo = az.loo(azsam)
print(loo)

fig, (ax, bx) = plt.subplots(2, 1, figsize=(7,7))

ax.scatter(X, Y_unc, color='r', s=1, zorder=1)
ax.scatter(X, Y, color='k', s=2, zorder=2)
ax.set_xlabel("X")
ax.set_ylabel("Y")

xs = np.linspace(np.min(X), np.max(X), 100)
a_est = sam.stan_variable("a")
b_est = sam.stan_variable("b")
yss = a_est[:, None] * xs[None, :] + b_est[:, None]  # vectorized; avoids shadowing a and b
lys, mys, uys = np.percentile(yss, axis=0, q=[2.5, 50, 97.5])
ax.plot(xs, mys)
ax.fill_between(xs, lys, uys, alpha=0.3)

az.plot_khat(loo, ax=bx, show_hlines=True)

I’m confused. Left or right?

Can you show histograms of a) log_lik values and b) 1/exp(log_lik) values for the one with worst khat?

Usually censored observations are weakly informative, although there is of course more leverage in the extreme x values.

Apologies: that should be left-censored (I edited the post). This figure shows the log_lik and 1/exp(log_lik) for observations 1,…,10 (they are sorted by X, so these are the worst).
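For anyone wanting to reproduce this kind of histogram: with cmdstanpy the pointwise draws can be pulled from the fit with sam.stan_variable("log_lik"), which returns a draws-by-N array. Below is a self-contained sketch that uses synthetic posterior draws in place of the fit; the N(-2, 0.1) mean, the 0.25 residual sd, and the -0.5 limit are made-up stand-ins:

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(0)

# in practice: log_lik = sam.stan_variable("log_lik")[:, n] for observation n;
# here, synthetic draws for a single left-censored observation
mu_draws = rng.normal(-2.0, 0.1, size=4000)   # posterior draws of a*x + b
log_lik = sts.norm.logcdf(-0.5, loc=mu_draws, scale=0.25)

ratios = 1.0 / np.exp(log_lik)                # importance ratios 1/p
counts, edges = np.histogram(ratios, bins=50)
```

All log_lik values sit just below 0 and all ratios just above 1.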


Trying to revive this thread, as I am basically in the exact same situation. Generally, the (left-)censored data points have the higher Pareto-k values in my data set, with increasing Pareto-k values for those with more extreme x-values. One additional layer of complexity in my model is that I am trying to fit a pattern consisting of an up-slope followed by a down-slope, with the peak at x = 0. However, I do not know the x-values, only the distances between them, so in addition to fitting these two slopes I am also shifting the data horizontally. Here are the histograms for the worst Pareto-k value (which is at inf):


I am also wondering if there are any issues with doing model comparison based on censored data or if it’s something else (model misspecification).
Thanks in advance!

Hi @Loni92

I have a simple hack to make the problem go away. I’m not sure if this is a bad idea, and am interested in what the Stan experts have to say about it: my solution is to add a tiny bit of jitter to the log_lik vector. Here’s the modified Stan code:

// cens_reg.stan
data {
    int N;
    array[N] real X;
    array[N] real Y;
    array[N] int C;
}

parameters {
    real a;
    real b;
    real<lower=0> sigma;
}

transformed parameters {
    array[N] real log_lik;
    for ( n in 1:N ) {
        if ( C[n] == 0 ) {
            log_lik[n] = normal_lpdf(Y[n] | a * X[n] + b, sigma);
        } else {
            log_lik[n] = normal_lcdf(Y[n] | a * X[n] + b, sigma);
        }
    }
}

model {
    target += log_lik;
}

generated quantities {
    array[N] real log_lik_jitter;
    for ( n in 1:N ) {
        log_lik_jitter[n] = log_lik[n] + normal_rng(0, 0.01);
    }
}

I then used arviz to generate the pareto k plots using log_likelihood="log_lik" and log_likelihood="log_lik_jitter":

@avehtari please let us know if this is a terrible idea or not. Thanks!!


Hi,

I’ve been on vacation and now slowly going through the pile of messages.

Which loo version are you using? We’ve made some fixes in more recent versions to avoid Inf being returned. Can you show something like the 100 smallest log_lik values? That is probably enough to figure out why you get Inf.

It’s wrong. Looking at your x vs y plot, some of the LCDF’s correspond to quite extreme tail probabilities, which are likely to get high Pareto-k’s (even without importance sampling). If you think these extreme tail probabilities are not interesting, it would probably be better to either modify the model to have thicker tails or use some other score like CRPS.

Hi Aki,

I’m sure it’s wrong, but I don’t understand your argument: these tail probabilities are all very close to 1, since the predicted value a*x + b is much smaller than the limit of detection of -0.5. The censored values close to x = -2 should not be influential observations. The fact that you mention thicker tails tells me that we’re talking past each other.

As a demonstration, I’ve added a non-censored outlier (data point 100). This has a high Pareto k in both the jittered and non-jittered case.

I’ve also changed the scale a bit to get more extreme k values. Interestingly, the Pareto k’s for the first few data points (smallest x) are close to zero.

This is done with arviz 0.22.0, by the way.

best regards,

Chris

Thanks for the reply, really appreciated! I am using CmdStan version 2.31.0 so pretty old I think. I’ll try to update soon, thanks for the tip!

I am wondering the same thing as Chris above, though: why would these data points below the limit of detection at the extreme left end be so influential? Their cumulative probability should be really high (as can be seen when looking at the corresponding log-likelihood values).

Best wishes!

1 Like

Can you share your simulation code? Just eyeballing your first figure: there are censored observations at -0.5, the predicted mean for one of them is about -2, the residual sd might be something like 0.25, and the posterior sd for the prediction about 0.1; then we get Pareto khat bigger than 0.5. You can test this with a simple simulation

posterior::pareto_khat(1/pnorm(q=-.5, mean=rnorm(n=10000,mean=-2,sd=0.1), sd=.25))

If we added posterior uncertainty in residual sd, too, we would get even bigger khat’s.
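For the Python users in the thread, Aki's one-liner can be reproduced roughly like this. Note that fitting scipy's genpareto to the excesses over the 80% quantile is only a crude stand-in for the khat estimator used in posterior/loo:

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(42)

# same setup as the R line: predictive mean ~ N(-2, 0.1), residual sd 0.25,
# detection limit -0.5; the importance ratios are 1 / P(y < -0.5)
ratios = 1.0 / sts.norm.cdf(-0.5, loc=rng.normal(-2.0, 0.1, size=10_000),
                            scale=0.25)

# crude generalized-Pareto shape estimate from the upper tail of the ratios
threshold = np.quantile(ratios, 0.8)
excesses = ratios[ratios > threshold] - threshold
khat, _, _ = sts.genpareto.fit(excesses, floc=0)
```

Even though every ratio is within about 1e-5 of 1, the relative spread within the upper tail is huge, and that is what pushes the estimated shape up.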

Which is equivalent to the observation being far in the tail.

The extreme tail probabilities (in this case very close to 1) have a nasty distribution with respect to posterior draws. This is now kind of the opposite of an outlier: the values close to x = -2 have high probability, but it is difficult to estimate that probability accurately with a finite number of MCMC draws, which is reflected in the Pareto-khat diagnostic.

Adding to the previous.

This is a nice but rare example where the Monte Carlo estimate of elpd_loo can be unstable even when the observation is non-influential and not an outlier. I’ll add a discussion of this to the CV-FAQ. Thanks for sharing the example.


Hi, this is the code I used for the last figure:

import cmdstanpy
import arviz as az
import matplotlib.pyplot as plt
import scipy.stats as sts
import numpy as np

sm = cmdstanpy.CmdStanModel(stan_file="cens_reg.stan")

a = 0.4
b = 0.1
sigma = 0.1
N = 100

X = 2*sts.norm.rvs(size=N)
X.sort()
Y_unc = sts.norm.rvs(loc=X, scale=sigma)

## add an outlier

X = np.append(X, 0)
Y_unc = np.append(Y_unc, 2.0)

lb = -0.5

C = np.array([0 if y >= lb else 1 for y in Y_unc])
Y = np.array([y if c == 0 else lb for y, c in zip(Y_unc, C)])

sam = sm.sample(data={"X" : X, "Y" : Y, "N" : len(X), "C" : C})
azsam_unp = az.from_cmdstanpy(sam, log_likelihood="log_lik")
azsam_jit = az.from_cmdstanpy(sam, log_likelihood="log_lik_jitter")

loo_unp = az.loo(azsam_unp, pointwise=True)
loo_jit = az.loo(azsam_jit, pointwise=True)

print(loo_unp)
print(loo_jit)

fig, (ax, bx, cx) = plt.subplots(3, 1, figsize=(7,10))

ax.scatter(X, Y_unc, color='r', s=1, zorder=1)
ax.scatter(X, Y, color='k', s=2, zorder=2)
ax.set_xlabel("X")
ax.set_ylabel("Y")

xs = np.linspace(np.min(X), np.max(X), 100)
a_est = sam.stan_variable("a")
b_est = sam.stan_variable("b")
yss = a_est[:, None] * xs[None, :] + b_est[:, None]  # vectorized; avoids shadowing a and b
lys, mys, uys = np.percentile(yss, axis=0, q=[2.5, 50, 97.5])
ax.plot(xs, mys)
ax.fill_between(xs, lys, uys, alpha=0.3)

az.plot_khat(loo_unp, ax=bx, show_hlines=True)
az.plot_khat(loo_jit, ax=cx, show_hlines=True)
# let bx and cx share y-limits

#bx.set(ylim=(-0.5, 2.5), title="Unjittered log lik")
#cx.set(ylim=(-0.5, 2.5), title="Jittered log lik")

ax.annotate("outlier", xy=(0, 2.0), xytext=(-1, 4),
            arrowprops=dict(arrowstyle="->", color='k'),
            fontsize=12, color='k')

fig.tight_layout()

and this is the stan model:

// cens_reg.stan
data {
    int N;
    array[N] real X;
    array[N] real Y;
    array[N] int C;
}

parameters {
    real a;
    real b;
    real<lower=0> sigma;
}

transformed parameters {
    array[N] real log_lik;
    for ( n in 1:N ) {
        if ( C[n] == 0 ) {
            log_lik[n] = normal_lpdf(Y[n] | a * X[n] + b, sigma);
        } else {
            log_lik[n] = normal_lcdf(Y[n] | a * X[n] + b, sigma);
        }
    }
}

model {
    target += log_lik;
}

generated quantities {
    array[N] real log_lik_jitter;
    for ( n in 1:N ) {
        log_lik_jitter[n] = log_lik[n] + normal_rng(0, 0.01);
    }
}

Thanks for sharing the code. I started writing the CV-FAQ addition, and then realized that you can trust the pointwise and total elpd estimates in this case, despite the high Pareto-khats. Because we are looking at extreme tail probabilities, the Monte Carlo variation causes the distribution of the ratios to have an apparently thick tail, but the variance is actually finite and very small. Instead of adding the jitter, you can just ignore the high Pareto-k’s when the cumulative probabilities in censoring are close to 1. It would be difficult to recognize this case automatically in a fool-proof way, so for the moment I’ll just add this to the CV-FAQ, but I’ll keep thinking about this.
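Aki's point that the variance is actually finite and tiny can be checked numerically in the same simulation setup (the N(-2, 0.1) mean, 0.25 residual sd, and -0.5 limit from his R line):

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(0)

# importance ratios for the hypothetical censored observation from Aki's post
ratios = 1.0 / sts.norm.cdf(-0.5, loc=rng.normal(-2.0, 0.1, size=100_000),
                            scale=0.25)

# the upper tail looks thick *relative to the spread*, but in absolute terms
# the standard deviation of the ratios is minuscule, so the Monte Carlo
# error of the corresponding elpd contribution is negligible
spread = ratios.std()
```

Here spread comes out many orders of magnitude below 1, which is why the elpd estimate is trustworthy despite the alarming khat.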
