I’m just working with my first Stan model and I have a question I have not been able to find an answer to. I want to know which of my datapoints have had the most influence/leverage on the posterior. Whenever I find some discussion that touches on the topic of identifying points with high leverage, the focus always seems to be on comparative model evaluation. But I am not interested in evaluating the out-of-sample predictive performance of my model, as I am not using it for making predictions. I just want to introspect the model to understand it.

I guess a naive strategy would be to use full-blown LOO-CV to see which datapoints, when omitted, cause the mean of the posterior to change the most. But I was hoping for a less computationally expensive approach. I have tried using the loo function from ArviZ, but the results are not what I hoped for. All of the points have a very low k value (< 0.05), and the ones with the highest values are unstable and change from Stan run to run. I’m not necessarily looking for true outliers that would make this an unreliable model, I just want to know what parts of the data have most leverage on the posterior.

My current tactic is to compute which of the observed datapoints are most improbable according to the full posterior. That is, for each datapoint I take the absolute difference between its observed value and its mean predicted value. This procedure seems to produce sensible results, but I have no idea if it is mathematically justified. My intuitive reason for it is that the most improbable datapoints are the ones that the model will have tried the hardest to accommodate, even while inevitably failing to fit them well. Does that make sense?
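In numpy terms, the tactic looks roughly like this (a self-contained sketch with synthetic stand-ins for my posterior predictive draws and observed data, not my actual model output):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_obs = 4000, 10

# Synthetic stand-ins: posterior predictive draws and observed values.
# Observations 2 and 7 are given large offsets so they stand out.
y_rep = rng.normal(loc=np.arange(n_obs), scale=1.0, size=(n_draws, n_obs))
y = np.arange(n_obs) + np.array([0.1, -0.2, 3.0, 0.0, 0.2, -0.1, 0.3, -2.5, 0.0, 0.1])

# Absolute difference between each observed value and its mean prediction.
mae_per_obs = np.abs(y - y_rep.mean(axis=0))

# Datapoints ordered from most to least "improbable" under this measure.
ranking = np.argsort(mae_per_obs)[::-1]
```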

Before I go ahead and use LOO-CV to re-fit my model 1,000 times, can anyone suggest a more efficient and mathematically justified procedure?

There are different influence measures, and leverage, for example, has a specific definition (see "Leverage (statistics)" on Wikipedia). If you want to compute leverage, then compute it as defined.

Even if you are not using it for making predictions, it can be helpful to know the predictive performance. What would you expect to learn from introspecting a model that is not predicting better than a random guess?

Pareto-k measures how much the whole posterior changes when one observation is left out, so you can use it even if you don’t care about the predictive performance. You can improve stability by running more chains or longer chains (or combine the results from the many runs you already did).

That is mean absolute error (MAE) of the point prediction, and is also a measure of predictive performance. It makes sense if that is the relevant part of the model you want to investigate.

You can use the importance weights computed by PSIS-LOO to investigate any derived quantity that depends on the posterior and the data, and how it changes if one observation is removed. That way, you can focus on the influence of one observation on the specific quantity of interest.
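As a toy sketch of the idea (plain numpy with raw unsmoothed weights rather than the actual PSIS smoothing, and all array names invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_obs = 2000, 5

# Invented inputs: pointwise log likelihood and a derived quantity
# (e.g. a posterior mean of interest) evaluated for each draw.
log_lik = rng.normal(-1.0, 0.5, size=(n_draws, n_obs))
theta = rng.normal(0.0, 1.0, size=n_draws)

full_mean = theta.mean()

# Raw LOO log weights: leaving out observation i up-weights the draws
# where observation i was poorly fit.  PSIS would smooth these tails.
log_w = -log_lik
log_w -= log_w.max(axis=0)   # stabilize before exponentiating
w = np.exp(log_w)
w /= w.sum(axis=0)           # normalize per observation

loo_means = (w * theta[:, None]).sum(axis=0)  # estimates of E[theta | y_-i]
influence = np.abs(loo_means - full_mean)     # shift caused by each observation
```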

Thank you so much for such a comprehensive and thoughtful answer! The reason it has taken me a month to reply is that I have been trying to fix my model. Your answer told me that I should be able to use PSIS-LOO for the purpose of investigating which datapoints have the most influence on the posterior, which prompted me to go back to my model to see why it was giving strange results.

I then discovered and fixed a couple of conceptual problems with my model. The results of PSIS-LOO now make sense. I am very grateful for your help in understanding my errors. I just have one problem left. The Pareto-k values are still quite unstable. When I try to further increase the number of samples, I run out of memory on my laptop, as I have over 10,000 datapoints.

The issue is that arviz.from_cmdstanpy tries to read into memory all of the log-likelihood values as computed by Stan for all samples and all datapoints. I am currently experimenting with the loo package for R, since the documentation says that it permits you to pass a function that will be called for each datapoint to return the log-likelihood for that observation. This means that I can avoid computing all of the values in Stan and compute them point by point in R. But I have never programmed in R before, so this is going a bit slowly.

Is there a better approach? If I do manage to increase the number of samples, is it certain the Pareto-k values will eventually converge to stable values? All of the values are below 0.5 and the vast majority are less than zero.

I would like to be able to report an ordered list of the 10 or 20 most influential data points, but the list keeps changing somewhat from run to run. Is there any hope that it will stabilize eventually?

Ah, now I understand why you asked. Consider what would happen if there were, e.g., 40 exactly equally influential data points: then the list would keep changing no matter how many iterations you run, as you can’t avoid having some uncertainty. If the amounts of influence are not exactly the same, you could estimate the uncertainty in the influence ranking and choose a ranking which has high probability.
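One way to sketch that uncertainty estimate (an illustration with synthetic numbers, not an established recipe): bootstrap the posterior draws, recompute the influence measure each time, and report how often each observation lands in the top k.

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws, n_obs, k, n_boot = 1000, 8, 3, 200

# Invented per-draw, per-observation influence contributions; the last
# observations are made more influential on average via the offset.
contrib = rng.normal(0.0, 1.0, size=(n_draws, n_obs)) + np.linspace(0, 1, n_obs)

top_counts = np.zeros(n_obs)
for _ in range(n_boot):
    idx = rng.integers(0, n_draws, size=n_draws)   # bootstrap the draws
    measure = np.abs(contrib[idx].mean(axis=0))    # influence per observation
    top = np.argsort(measure)[::-1][:k]            # top-k for this resample
    top_counts[top] += 1

# Probability of each observation appearing in the top k.
top_prob = top_counts / n_boot
```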

That makes perfect sense – thanks! Is there any advantage in estimating the ranking of the most influential observations by using the Pareto-k values in the manner you describe, as compared with using a method which gives more stable rankings? I have tried three other methods:

Comparing importance weights, which I calculate as the harmonic mean of the likelihood of each observation across samples.

Mean absolute error: the absolute value of the difference between the mean predicted value across samples and the observed value for each point.

Mean squared error: the square of that difference.
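For concreteness, this is roughly how I compute the three measures (shown here with synthetic stand-in arrays rather than my real fit; note that literally squaring the single point-prediction error would give exactly the same ranking as MAE, so the MSE sketch below averages the squared error over samples instead):

```python
import numpy as np
from scipy.stats import hmean

rng = np.random.default_rng(3)
n_samples, n_obs, k = 2000, 12, 5

# Synthetic stand-ins for pointwise log likelihoods, predictions, data.
log_lik = rng.normal(-1.0, 0.6, size=(n_samples, n_obs))
y_rep = rng.normal(0.0, 1.0, size=(n_samples, n_obs))
y = rng.normal(0.0, 1.5, size=n_obs)

# 1. Harmonic mean of the likelihood across samples (low = influential).
hm = hmean(np.exp(log_lik), axis=0)
rank_hm = np.argsort(hm)[:k]

# 2. Mean absolute error of the point prediction (high = influential).
mae = np.abs(y - y_rep.mean(axis=0))
rank_mae = np.argsort(mae)[::-1][:k]

# 3. Squared error averaged over samples (high = influential).
mse = ((y - y_rep) ** 2).mean(axis=0)
rank_mse = np.argsort(mse)[::-1][:k]

# How much do two of the top-k sets agree?
overlap = len(set(rank_mae) & set(rank_mse))
```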

All of these give broadly similar but slightly different rankings, and I am trying to come up with a principled reason to select one of them as the method for ranking the most important observations when presenting my results. Any suggestions on the advantages or disadvantages of these methods?

What do you mean by samples here? I’m asking because some people use it to refer to observations and some people use it to refer to posterior draws. In importance-sampling LOO, importance weights are computed for each observation and posterior draw pair, and the harmonic mean over the posterior draws corresponds to IS-LOO, but it would be better to use Pareto smoothed weights for improved stability. You can get the Pareto smoothed log likelihoods (elpds) for each observation from the output of the loo() function.

The biggest error doesn’t necessarily have the biggest influence on the posterior, so none of these are traditionally used as influence measures. There is no single best influence measure, but I recommend focusing on the quantity of interest and examining how the inference on it changes, and even better, whether any conclusions or actions would change if one (or more) of the observations is (are) removed.

Here is the code I am using to rank the importance weights. I suspect it would be possible to extract these values more cleanly from ArviZ, but I wasn’t sure how:
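In outline (with a synthetic stand-in for the pointwise log likelihood, since extracting the real array from the fit is the part I’m unsure about):

```python
import numpy as np
from scipy.stats import hmean

rng = np.random.default_rng(4)
# Stand-in for the pointwise log likelihood from the Stan fit:
# 4 chains x 500 draws flattened into 2000 samples, 10 observations.
log_lik = rng.normal(-1.0, 0.5, size=(2000, 10))

likelihood = np.exp(log_lik)
weights = hmean(likelihood, axis=0)  # harmonic mean across posterior samples
index = np.argsort(weights)[0:20]    # lowest first, i.e. most "influential"?
```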

Would it be correct to call this a version of IS-LOO?

Do you mean the pareto_k attribute of the ELPDData object returned by loo(idata, pointwise=True)? That’s the data I was referring to in my previous posts when discussing the instability of the Pareto-k values. Or do you mean something else?

That’s an excellent point. I know that there is not a single best measure, but this thread has been extremely helpful in my effort to understand the behaviour of the different possible measures of pointwise influence.

I think my next step will be to gather a set of 30-40 or so candidate points from the union of all of the sets of top-20 points proposed by the various measures of influence I have tried and then to run full-blown LOO-CV on those points. At that point, if I rank the points based on how much the posterior changes when they are left out, I can declare that to be my definitive ranking of influence. Does that seem reasonable? It will be a lot more straightforward to fit the model 30-40 times than 10,000 times!

To do that, I will need a metric to calculate divergence between the full posterior and the one-left-out posterior. Would KL divergence be a reasonable choice?

KL divergence is in an abstract sense the most canonical choice, but it’s very difficult to compute in practice. First of all, the most straightforward way involves integrating a function of both posteriors, which is difficult because both are generally non-analytic.

Stan and PSIS will give you samples from both posteriors; to compute an approximate KL divergence from those, you’ll have to discretize the posteriors and compute a summation. If any part of one distribution goes to zero where the other is non-zero, you’ll end up with an infinite KL divergence. As the dimensionality of your posterior increases, whatever discretization you choose will become increasingly sparse, and you’ll be basically guaranteed an infinite KL divergence.
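A toy 1-D illustration of that failure mode, with synthetic samples on deliberately mismatched supports:

```python
import numpy as np

rng = np.random.default_rng(5)
p_samples = rng.uniform(0.0, 2.0, size=5000)   # "full posterior" samples
q_samples = rng.uniform(0.0, 1.0, size=5000)   # "leave-one-out posterior" samples

bins = np.linspace(0.0, 2.0, 21)
p, _ = np.histogram(p_samples, bins=bins, density=True)
q, _ = np.histogram(q_samples, bins=bins, density=True)
widths = np.diff(bins)

# Discretized KL(p || q): any bin with p > 0 but q == 0 contributes +inf.
with np.errstate(divide="ignore"):
    terms = np.where(p > 0, p * np.log(p / q), 0.0) * widths
kl = terms.sum()
```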

My recommendation would be to choose a distance metric that’s meaningful for your discussion, then use a metric-dependent distance between the two posteriors.

In addition to being computationally easier, I’d wager it’s also more directly interpretable for any practical applications.

I would recommend using dimension names for the computation as much as possible. With the snippet above I am not 100% sure which dimension is being reduced (I think you end up with an array of shape (sample,), but I’d have to check), nor whether this is what is intended.

You could do something like:

import numpy as np
import arviz as az
from xarray_einstats.stats import hmean

stacked = az.extract(idata, "log_likelihood")
likelihood = np.exp(stacked.log_lik)
weights = hmean(likelihood, dims="obs_dim")  # or dims="sample"
index = np.argsort(weights)[0:20]

You can always use numpy functions on xarray DataArrays; when possible the result is a DataArray that keeps all dims and coords (this is the case for all pointwise functions like np.exp), but otherwise the returned object is a plain numpy array and the dims and coords are lost. The same happens with scipy. xarray_einstats is a small library (maintained by the ArviZ project) that wraps some numpy and scipy functions to make these calculations with xarray more natural and intuitive. ArviZ itself uses it, for example for the circular mean and std (in summary, where there is a circular option), so you should have it installed already.

You should also be able to do:

import numpy as np
from xarray_einstats.stats import hmean

log_lik = idata.log_likelihood["log_lik"]
likelihood = np.exp(log_lik)
weights = hmean(likelihood, dims=("chain", "draw"))
# equivalent to the "sample" option; otherwise you'd need to flatten weights here and it is not worth it
index = np.argsort(weights)[0:20]

If the influences are very similar, this doesn’t remove the variability completely. I feel you are using quite a lot of time and computing resources to make a ranking in which, for any threshold, the observations just outside the threshold would have very similar influence to those just within it. I recommend embracing the uncertainty and reporting the rankings with the uncertainty included. Then you don’t need to use more computing, and the reporting will make it clearer that there is no definitive ordering.

That’s all very true. I know it seems strange to be spending so much effort on this. But there is not much documentation on how to quantify influential data points for the purpose of model introspection rather than model evaluation and selection, which is why this thread has been so helpful to me.

The reason I am going to so much trouble about this stuff is that, whenever I have presented my research to people without a background in statistics, they always ask which are the most influential points, because in the domain where I am working, that is a very interesting matter. So I want to have some principled and justifiable answers to that question.

But I realize that an important part of such discussions will be to explain the arbitrariness and uncertainty of any answers I provide.

My domain is literary criticism, particularly of Latin poetry. One of the attractions of Bayesian modelling is that you can make use of all of the data, which is important when you have very little.

My observations are the word frequencies in a given ancient text. I create a hypothesis about the process which generated that text and fit a model to test it. After considering the implications of the posterior for that hypothesis, my next step is to examine which words contributed most to the shape of the posterior. The goal is to feed this quantitative information back into a qualitative discussion about the connotations of those influential words. I expect the most influential observations to be either common words (influential because of their frequency), or rare words (influential because of their distinctiveness), or a mixture of both.

Identifying and studying influential points in Bayesian models seems related to the topic of “explainable” machine learning, which is a big deal these days. So I was surprised not to find much literature on this question, beyond the identification of pathological outliers for the purposes of model evaluation. I would be very grateful for any suggestions for further investigation!

Cool! If you have an earlier paper or a new draft, I can have a look and comment. We have been thinking a bit about influential observations vs influential covariates.