# Pareto K for outlier detection

Hey Everyone,

I am fitting a mixed effects model (random intercept for account and random slope for test occasion) on the number of correct problems in addition tests gathered 3 times remotely from children. Because they’re gathered remotely and from six-year-olds, there is quite a bit of noise (e.g. T1 = 22, T2 = 27, T3 = 2).

I wanted to start very simple with a random intercept only model, with a fixed effect of time and wide priors (I’m aware my priors aren’t that sensible right now, e.g. time should have a positive effect and a smaller sd).

$$
\begin{aligned}
\mu_{i} &= \alpha + \alpha_{SUBJ[i]} + \beta_{T}\,\text{Time}_{i} \\
\alpha_{SUBJ} &\sim \text{Normal}(0, \sigma_{SUBJ}) \\
\alpha &\sim \text{Normal}(0, 10) \\
\beta_{T} &\sim \text{Normal}(0, 10) \\
\sigma_{SUBJ} &\sim \text{HalfCauchy}(0, 1) \\
\sigma &\sim \text{HalfCauchy}(0, 1)
\end{aligned}
$$

And the following brms code:


```r
library(brms)

# Random-intercept model with a fixed effect of (centered) time
time_randinci_250 <-
  brm(data = addition_250, family = gaussian,
      value.c ~ 1 + time.c + (1 | account_id),
      prior = c(prior(normal(0, 10), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(cauchy(0, 2), class = sd),
                prior(cauchy(0, 2), class = sigma)),
      sample_prior = "only",  # draws from the priors only, ignoring the data
      iter = 5000, warmup = 2000, chains = 4, cores = 2,
      seed = 13)
```


To speed it up I am only using 250 subjects.

All of my MCMC diagnostics look good, and all of my model diagnostics seem okay. Yet when I try to compute the waic I get a warning saying to try loo instead, and then when I use loo I get a warning about influential cases with k > 0.7. I read somewhere that this could be a sign of a misspecified model. I am using a normal distribution and wide priors, yet I think it’s a data issue.

This affects ~2% of the data. When I take subjects with high Pareto K observations and plot the time course, it’s clear that it’s noise. When I compare them to subjects with low K values, there’s usually one test that is clearly missed (which corresponds to the high K value).

My question is: is it kosher to exclude observations (make them missing) based on Pareto K values? If it’s not, can I just delete the entire subject list-wise? How would this affect my model building? Eventually I want to add a random slope, covariates (e.g. grade etc…) and my treatment plan.

Last bonus question: Is it okay to compare two models with reloo = T where the observations excluded are different?

Thanks a million for your help,
Nick


Can you post the full loo output? Seeing the full output, we can infer a bit more based on the total number of parameters and estimated effective number of parameters p_loo as discussed in http://mc-stan.org/loo/reference/loo-glossary.html

How many observations do you have per account_id?

Hey Aki,

Only 3 tests per account_id. I know that’s not ideal, but that’s all I’ve got.
I attached the output from 250 subjects (750 observations) and 4690 subjects (14070 observations).

nick

The small number of observations per group makes each observation influential, which may lead to high khat values even if the model is well specified.

With 250 subjects, p_loo is about 198, which is relatively high compared to the number of subjects, indicating that the estimated population prior is quite wide and there is not much borrowing of information from one subject to another. With 4690 subjects, the ratio between the number of subjects (the number of random effect parameters) and p_loo is similar, and the same conclusion can be drawn.
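For anyone wanting to pull p_loo and the khat values out programmatically, here is a minimal sketch using the `loo` package. The thread's data are not available, so it uses the example log-likelihood array that ships with `loo` as a stand-in (an assumption for illustration only); with a brms fit one would start from `loo(fit)` instead. The 0.7 threshold is the one from the warning mentioned above.

```r
library(loo)

# Stand-in input: example log-likelihood array bundled with loo
# (draws x chains x observations). This is NOT the addition-test data.
ll <- example_loglik_array()
r_eff <- relative_eff(exp(ll))
loo_res <- loo(ll, r_eff = r_eff)

print(loo_res)  # full output, including the Pareto k diagnostics

# Effective number of parameters, to compare against the number of
# observations (and, in a real model, the number of parameters)
p_loo <- loo_res$estimates["p_loo", "Estimate"]
n_obs <- dim(ll)[3]
p_loo / n_obs

# Indices and values of observations with Pareto k above 0.7
high_k <- pareto_k_ids(loo_res, threshold = 0.7)
k_vals <- pareto_k_values(loo_res)
k_vals[high_k]
```

`high_k` can then be used to look up which subjects the flagged observations belong to.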

Assuming that by excluded you refer to cross-validation then, but if you refer to your earlier question of excluding observations from the inference based on khat values, I don’t understand the question.

You should not remove observations. You can use reloo to get a more reliable loo estimate (and there is an open pull request for iterative moment matching loo, which is a faster approach than the current reloo), but that doesn’t solve the problem that it is difficult to predict left-out observations. You should think about how you can improve your model.
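A minimal sketch of the reloo route in brms, for reference. The function names `loo_with_refits` and `compare_with_refits` are hypothetical helpers, and `fit`, `fit_a` and `fit_b` are assumed to be brmsfit objects such as `time_randinci_250` from the first post. Each refit drops one flagged observation, so this can be slow when many khat values are high.

```r
library(brms)

# Hypothetical helper: `fit` is assumed to be a brmsfit object.
loo_with_refits <- function(fit, k_threshold = 0.7) {
  # For every observation whose Pareto k exceeds k_threshold, brms refits
  # the model without that observation and computes its elpd directly.
  loo(fit, reloo = TRUE, k_threshold = k_threshold)
}

# Hypothetical helper: compare two fitted models on their loo estimates.
compare_with_refits <- function(fit_a, fit_b) {
  loo_compare(loo_with_refits(fit_a), loo_with_refits(fit_b))
}
```

Note that the refits only change how the estimate is computed for the flagged points; no observation is removed from the data or from the resulting elpd sum.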

It seems your model might also need improvement?


> You should not remove observations.

Even in the illogical cases, e.g. Test1 = 30 correct problems, Test2 = 35, Test3 = 1?

What would be the best prior to focus on to fix the wide population prior?

$$\sigma_{SUBJ} \sim \text{HalfCauchy}(0, 1)$$

I was making this prior (and others) very tight to little avail, before I wanted to remove the high K-value observations (which are the subjects with the most variance in tests).

Sorry, I don’t understand what you are trying to say here.

You have many groups, so you have a lot of information on the population distribution, and it is unlikely that you can fix it by changing the prior. Currently your data is saying that subjects are quite different from each other. You did say that some observations are surprising, so maybe you would need a thicker-tailed distribution for the observations or for the population distribution.
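As a rough sketch of what those two options could look like in brms, reusing the variable names from the first post: a Student-t observation model via `family = student()`, and Student-t random intercepts via `gr(..., dist = "student")`. The helper `fit_thick_tails` and its sampler settings are illustrative assumptions, not a recommendation for this data.

```r
library(brms)

# Same linear predictor as the original model; thick tails for the
# observations come from the family passed to brm() below.
f_student_obs <- bf(value.c ~ 1 + time.c + (1 | account_id))

# Thick tails for the population distribution: Student-t random intercepts.
f_student_pop <- bf(value.c ~ 1 + time.c + (1 | gr(account_id, dist = "student")))

# Hypothetical helper: `data` is assumed to look like addition_250.
fit_thick_tails <- function(data, formula, family = student()) {
  brm(formula, data = data, family = family,
      chains = 4, cores = 2, seed = 13)
}
```

For example, `fit_thick_tails(addition_250, f_student_obs)` would give t-distributed residuals, while `fit_thick_tails(addition_250, f_student_pop, family = gaussian())` would keep Gaussian residuals with t-distributed random intercepts.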

The first part ties into the second: there are some observations that aren’t just surprising, they are impossible.

For example an account_id that scores 25 on the first observation, 27 on the second observation and scores 2 on the third.

It’s impossible that the child’s true addition ability decreased by that much between the 2nd and 3rd measurement; it’s obviously noise from the test. Accounts like these are the ones that have a high K value in one measurement (e.g. the 3rd in this case).

> it is unlikely that you can fix it by changing the prior

Thanks that’s very good to know, I thought I was going crazy because my extremely tight priors weren’t doing much.

> more thick tailed distribution

Any recommendations on super thick tailed distributions?

Thanks,
Nick


Ok, now I get it. How are these scores obtained? Are these errors in typing the results? You don’t need Pareto k’s to detect these impossible observations. The challenge is that you would need to know more about the error process. Without knowing the error process, let’s assume that sometimes the score entry is missing one digit, i.e., the true score is 29 but the data has 2. This would be an easy error process, as the error would never be upwards. If you assume that these errors happen independently of the actual value and of other values (corresponding to missing completely at random), you could drop these individual observations. If the error process depends on the actual value or other values, you would need to think more about the error process.
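To make the "drop individual observations" option concrete, here is a small base-R sketch that flags implausible scores with an explicit rule instead of Pareto k and sets them to missing. The toy scores echo the examples in this thread, and the rule (a score below 25% of the subject's own median) is purely a hypothetical placeholder; the real rule should come from what is known about the error process.

```r
# Toy data echoing the examples in the thread (not the real data set)
scores <- data.frame(
  account_id = rep(c("a", "b", "c"), each = 3),
  time       = rep(1:3, times = 3),
  value      = c(22, 27, 2,
                 10, 12, 15,
                 30, 35, 1)
)

# Hypothetical rule: a score is implausible if it falls below 25% of
# that subject's median score across the three tests.
subj_median <- ave(scores$value, scores$account_id, FUN = median)
scores$implausible <- scores$value < 0.25 * subj_median

# Set only the flagged observations to missing; keep the rest of the subject.
scores$value_clean <- ifelse(scores$implausible, NA_real_, scores$value)

scores[scores$implausible, ]
```

As far as I know, brms drops rows with a missing response by default, so fitting on `value_clean` would exclude just those observations and not the whole subject.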

> that you would need to know more about the error process

The error is entirely related to them being 6 years old.

> This would be an easy error process, as the error would never be upwards

I think MCAR is acceptable, and I’ll exclude observations and see if it helps the model.

Any distribution recommendations?

After removing illogical values, try examining the calibration of the predictive distributions, e.g. with ppc_loo_pit_qq in bayesplot, and whether the estimated random effects are approximately normal. From these you might get an idea of what you would need. E.g. Student’s t is supported by brms, but you should still first look at what is happening with the current model.
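A minimal sketch of those two checks, assuming `fit` is a brmsfit object with a random intercept for `account_id`. The helper name `check_calibration` is hypothetical; the LOO-PIT plot goes through brms's `pp_check()` wrapper around bayesplot.

```r
library(brms)

# Hypothetical helper: `fit` is assumed to be a brmsfit object.
check_calibration <- function(fit, group = "account_id") {
  # LOO-PIT QQ plot: points close to the diagonal suggest that the
  # leave-one-out predictive distributions are reasonably calibrated.
  p <- pp_check(fit, type = "loo_pit_qq")
  print(p)

  # Posterior-mean random intercepts, compared with a normal distribution.
  re <- ranef(fit)[[group]][, "Estimate", "Intercept"]
  qqnorm(re, main = "Estimated random intercepts")
  qqline(re)

  invisible(re)
}
```

Heavy tails in either plot would be one hint that a Student’s t distribution is worth trying for the observations or for the random effects.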