Demonstrating the 'absence' of an effect for a publication

loo

#1

Please excuse the long post - my question is in essence: what and how much would you report about an effect, that the loo package doesn’t provide evidence for in the first place? It should ideally be enough to convince a sceptic reviewer that holds the other view dear.

Details:
I am working to submit my first paper based purely on Bayesian statistics. The data is experimental, with 200 measurements for each of 23 participants. For this example, let’s say that I am testing whether displaying a word in red or blue letters will influence the accuracy of them recalling that word (bimodal outcome). Assume that both views (influence or not) are equally likely a priori.

Before, I would have created 2 models based on lm4 and unless the complex model had at least a 2 point lower AIC value, I would conclude that the missing factor had no effect on the outcome. This would also typically be an accepted approach in papers. Now I am more confused as to what is appropriate.

My first intuition is to use the loo-package and using the compare_models function between two models either containing the predictor or not. Then to divide “elpd_diff” by “se” and say that any number lower that 2 is insufficient evidence for the effect. The actual results are:

elpd_diff        se 
     -0.5       0.8 

Indicating the full model is -0.61 standard errors worse than the model without the predictor. This would personally be enough for me to conclude that the burden of showing that the predictor meaningfully influences the outcome, has shifted to those that still believe so. However, I can already hear the classic ‘Absence of evidence is not evidence of absence’ being hurled back at me. ‘-0.61 standard errors worse’ simply doesn’t sound as solid as ‘a lower AIC which can be interpreted as the better model’. It may be a good thing to be less absolute, but it may not help with sceptical reviewers. Even less so, if the model maybe was 1.5 standard errors better, but still below the somewhat arbitrary cutoff of > 2.

I have been considering other options such as Savage-dickey bayes factors, which in this case does show moderate ‘evidence for the null’ (Bf01 of ~ 5.5). However, this is highly influenced by the priors and I think that this may only be better evidence than the loo function in appearance.

Another option is ROPE - region of practical equivalence. Here I find that it is 95% of the posterior inside a region of 3.3 % around 0. Is this better evidence or does it just muddy the waters?

I would really appreciate your input on this one.


#2

No problem, I think it was just fine in the amount of details.

If they accept this, then they should accept loo comparison. 2 point difference is also arbitrary and it doesn’t take into the uncertainty.

Here 2 is also arbitrary, but at least you are approximately taking into uncertainty. “se” in loo is usually under estimated, and thus the uncertainty estimate is not perfectly calibrated, but it’s still better than ignoring the uncertainty completely.

Considering that AIC is not valid for Bayesian models (except in some asymptotic cases), you should not consider any AIC result as solid. For Bayesian models you should use Bayesian LOO or WAIC, and if you care about the accuracy, reliability and diagnostics then I recommend using Bayesian LOO as implemented in loo package.

Yes, BFs are usually very sensitive to priors.

ROPE is fine if you trust your model, ie you have made proper model checking AND the parameter you are interested is independent of the other parameters. In this case ROPE is more accurate than loo (smaller uncertainty) and in this case I would report both, ROPE and loo results. If there are dependencies in the posterior then the interpretation of ROPE is difficult. See example of this in https://github.com/avehtari/BDA_R_demos/blob/master/demos_rstan/colinear.md

I answered shortly, so don’t hesitate to ask more details,


#3

I’m bit slow, and missed the important detail here. I assume you would like to infer whether this would generalize for new participants. Aim I right? If so, then I would recommend using 23-fold leave-one-participant-out cross-validation. LOO (and WAIC and AIC) assume you would be interested predicting what would happen for these same 23 participants if new word is displayed.


#4

Thanks for your response, Aki.

First of all, yes I am importantly interested to try to infer on new participants (and ideally, the population of participants in general). Does this mean that I should abandon the loo package for this part of the project or can I use it like this?

kfold_m0 <- kfold(m0, K = 23, save_fits = F)
kfold_m1a <- kfold(m1a, K = 23, save_fits = F)
compare_models(kfold_m0,kfold_m1a)

As for whether my parameters are independent, your example made me question my approach. Say I have the following features randomized: colour of the word (1/2), position of the word on the screen (1:8).
Then I have the accuracy (0/1) and the rate of how certain they were in their answer (1:4).

My main question is whether colour influences accuracy and I was, therefore, planning to do a rope on colour alone. While I have some reason to think that the position has some influence, it is randomized and therefore it should hopefully not interfere with the overall accuracy of the 2 colours.

Secondly, I am interested in which model best would predict accuracy for future studies. I was planning to use loo for feature-selection and I saw that while a model with position did predict accuracy better than a model with a random intercept alone, a model with ratings of certainty did better and a model with both certainty and position did not do better (nor did interactions). However, I guess it is likely that certainty ‘eats’ some of the predictive power of position if participants are using the ratings based on the position of the word and that they, therefore, not are independent.

I will work through you linked example in details and see if I could learn from this as to how to compare the models, but does it seem reasonable to do the ROPE as mentioned?


#5

Yes, you should use k-fold-cv. The problem is that kfold() is not making the data division by group. This feature is so commonly useful, that it has been added to the todo list as a github issue, and there is a student who is interested in implementing that, but it can take weeks before it’s ready. Before that you need to make the data division yourself.

if you predict accuracy given colour, then rope is ok
if you predict accuracy given colour and position which are independently randomized and there is no interaction, then rope is ok
if you predict accuracy given colour and position which are independently randomized but there is interaction, then it’s more complicated

If you use certainty also as a covariate, I would assume that there is dependency, and you can check it by looking at the joint posterior. It is possible to do joint rope, but it’s not common.

I think the colinear example is most similar to what you might have. I have a plan to make also an example showing couple examples where rope is reasonable and better than cross-validation.

I hope my answers above made it clear that it may or may not be reasonable.


#6

Thank you for your response. However, I have never used k-fold-cv with a model from rnstanarm before. If I understand you correctly, the part of dividing the data should be straightforward, but I am confused as to how to compare it to the held out data.

I looked at your paper but it only helped me so far. My best current guess is something like this.

# Divide data into data and holdout
parts <- sort(unique(results$participant)) # Extract participants

# Create variables for capturing mean deviation
devm0 <- matrix(NA,length(parts),2)
devm1 <- matrix(NA,length(parts),2)

for(i in 1:length(parts)){
data <- subset(results, participant!=parts[i])
holdout <- subset(results, participant==parts[i])


m0data <- stan_glmer(Corr~+(1|participant),
                 data=data, family=binomial,
                 prior_intercept = normal(location = 0.7, scale = 0.5))

m1data <- update(m0data,.~.+condition,
              prior = cauchy(location =  0, scale = 0.4),
              prior_intercept = normal(location = 0.7, scale = 0.5))

### Compare prediction of two models to holdout, but how? (I am almost sure that this is not correct)
devm0[i,1] <- with(subset(holdout, condition == 1), mean(Corr)) - summary(m0data)[1]
devm0[i,2] <- with(subset(holdout, condition == 2), mean(Corr)) - summary(m0data)[1]
devm1[i,1] <- with(subset(holdout, condition == 1), mean(Corr)) - summary(m1data)[1]
devm1[i,2] <- with(subset(holdout, condition == 2), mean(Corr)) - (summary(m1data)[1]+summary(m1data)[2])
}

I believe the first part to be ok, but I could surely use some input on the second part. Again thank you for taking time to help me.


#7

I’ve so far used k-fold-cv only with rstan and stan code I wrote myself. I need to check what is the easiest way to do it in rstanarm before we have the updated kfold-function.


#8

I don’t think there are many relationship or differences that are ever precisely zero in the population. The issue is the practical/real-world impact – that is, how far from zero must the impact of word colour be to actually matter for recall? It strikes me that there is heavy substantive component to the argument.