Improving pp_check results for a Poisson model

Hi,

I am trying to build a model to analyze count data (errors in sentences produced by language learners; the sentences have different lengths). Unfortunately, I am having trouble specifying the brms model correctly.

The resulting pp_check plot is the following:
[pp_check plot: Rplot01.pdf]

What am I doing wrong? What can be done to improve it?
Perhaps I should mention that I am new to this kind of analysis.

prior_mc_poi <- c(
  set_prior("normal(0, 1)", class = "b"),                              # population-level effects
  set_prior("student_t(2, 0, 1.5)", class = "sd", group = "user_id"),  # group-level SDs
  set_prior("student_t(2, 0, 1.5)", class = "sd", group = "exercise"),
  set_prior("normal(0, 100)", class = "Intercept"),                    # alternative: "student_t(2, 0, 2.5)"
  set_prior("lkj(2)", class = "cor")                                   # group-level correlations
)

model_mc_srt_poisson <- brm(count_mc ~ phase + family_status + offset(log(sent_length)) + (1 + phase|user_id) + (1 + phase|exercise), 
                    data = data_srt, 
                    family = poisson(), 
                    prior = prior_mc_poi,
                    chains = 4, 
                    iter = 5000, 
                    warmup = 1000, 
                    cores = 2, 
                    control = list(adapt_delta = 0.99),
                    seed = 123)

Thanks for any advice!

I’ve played around a bit with dummy data, trying to approximate what I see in your pp_check figure. My dummy data don’t have the bumps at discrete numbers, but I do get y_rep distributions that are wider than the data. That is not the case if I drop the offset.

Take this with a heap of salt; my impression is that having a discrete offset is part of the problem. (I'm assuming your length is the number of words in a sentence.) When I change the offset to larger numbers, the discrepancy diminishes.

I have very little to back this idea up, but I wonder what happens if you use the number of letters in a sentence as an offset.

You could also try relaxing the relationship of errors to sentence length by adding length as a regular predictor.

set.seed(33)
# dummy data: 100 sentences, Poisson mistake counts, two status groups
d = data.frame(mistakes = rpois(100, 8),
               sentence = NA,
               status = rep(letters[1:2], each = 50))

# sentence length depends on the number of mistakes; things look better with *50
d$sentence = rpois(100, d$mistakes*2)

library(brms)

model <- brm(mistakes ~ status + offset(log(sentence)), 
             data = d, 
             family = poisson(), 
             chains = 4, 
             iter = 5000, 
             warmup = 2500, 
             cores = 4, 
             control = list(adapt_delta = 0.9),
             backend = "cmdstanr",
             seed = 123)
summary(model)
pp_check(model, ndraws = 100)
plot(model)

Hi Angelos,
Thanks for your reply. I should clarify that sentence length is not discrete (a number of words) but continuous: it is measured in seconds (I am dealing with sign language data). I created a model using hurdle_poisson(), but the problem is still there. Note that there are no zeros in the data, which means that every sentence contains errors.

I also created a model where the response is the log of the error count divided by sentence length, with a Gaussian distribution. This is the only way I have found to get an almost correct pp_check (see picture). However, I am not sure whether this is a good way to handle the data.
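In brms terms, that model looks roughly like this (simplified; same predictors as the first model):

# log of the error rate (errors per second); there are no zero counts, so the log is defined
data_srt$log_rate <- log(data_srt$count_mc / data_srt$sent_length)

model_mc_srt_gaussian <- brm(log_rate ~ phase + family_status + (1 + phase|user_id) + (1 + phase|exercise),
                             data = data_srt,
                             family = gaussian())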

As you suggested, I also tried a model without the offset (just the log of sentence length as a predictor, with a hurdle_poisson distribution), and the pp_check result is like this second image. However, I suspect that this approach can lead to unfair comparisons, since the number of errors is not “normalized” by sentence length. Is my guess wrong?
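For reference, that no-offset model (simplified):

model_mc_srt_noOff <- brm(count_mc ~ phase + family_status + log(sent_length) + (1 + phase|user_id) + (1 + phase|exercise),
                          data = data_srt,
                          family = hurdle_poisson())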

Thank you again!

I see; I was way off. The only thing from what I said earlier that I stand by is relaxing the relationship with duration. I can try to explain my thinking using written language as an example; you be the judge of how much of this transfers to sign language.

I am thinking that, more than sentence length, the potential for mistakes in a sentence depends on things like the difficulty of the grammar or the complexity of the subject. Present tense sentences are easier than past tense ones, and hypotheticals (“We could’ve gone swimming if it hadn’t rained”) are harder still. Likewise, talking about your weekend is easier than talking about the economy.

Conditional on these other sources of difficulty, perhaps the number of mistakes is proportional to sentence length. But if your model does not account for them, the proportionality assumption implied by the offset (or by dividing the response by time) is off: the number of mistakes would be influenced by length without being strictly proportional to it. That is why I think that, in the absence of any terms accounting for other sources of difficulty, adding duration as a predictor makes more sense than using it as an offset.
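To make that concrete: on the log scale, an offset is just a predictor whose coefficient is fixed at 1, so the two options differ only in whether that coefficient is estimated. A sketch (count, duration, and d are placeholder names, not your actual ones):

# offset: log(mu) = eta + 1 * log(duration), i.e. strict proportionality
m_offset    <- brm(count ~ phase + offset(log(duration)), data = d, family = poisson())

# predictor: log(mu) = eta + b * log(duration), with b estimated from the data
m_predictor <- brm(count ~ phase + log(duration), data = d, family = poisson())

If the estimated coefficient on log(duration) comes out near 1, the data support the proportionality assumption; if not, the looser model is doing real work.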

How does a simple Poisson model (i.e. without the hurdle) with duration as a predictor compare to the above?

EDIT: Sorry, you probably already know this, but are we on the same page that the last pp_check plot you showed looks pretty good?

Your explanation is very useful, thank you!
I confirm that the last pp_check in the previous post looks good, but I was not sure how to justify accounting for differences in sentence length without the proportionality assumption of the offset.

This plot is the result of the “simple” Poisson model with duration as a predictor. It does not look very nice…
[pp_check plot: plot_noOff_noLog_poisson]
This is the loo comparison between the two models:

                             elpd_diff se_diff
model_nmc_srt_noOff_hurdle    0.0       0.0 
model_nmc_srt_noOff_poi    -167.3      16.5 
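(The table was produced with something like:

loo_compare(loo(model_nmc_srt_noOff_hurdle), loo(model_nmc_srt_noOff_poi))

)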

However, in my data I have two types of error counts for each sentence, because errors can be manual (say, error count A) or non-manual (error count B). It makes sense to analyze the two types together, but it also makes sense to analyze them separately, because the hypothesis is that at the beginning (phase 1) there are more A errors and fewer B errors, since learners are not yet able to use elements of type B (this is a very simplified explanation!).

When I repeat the analysis with count A, even the solution without the offset but with log duration (the best solution for count B) leads to a terrible result.
[pp_check plot: plot_mc_hupoi_log]

It seems that for count A the model simulates roughly normally distributed data, so I tried a model with a Gaussian distribution and log(duration), and this is the result:
[pp_check plot: plot_mc_log_gaussian]

How wrong is it to analyze count data in this way?

Thanks in advance for any thoughts and suggestions!

I am not the right person to answer this. I would personally avoid it, but I have to agree that it looks like an improvement.

If you are interested in differences between the error types across phases, would it not make sense to fit something like

count ~ phase*type + log(duration) + (phase*type|user_id)
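This assumes the two counts are stacked into long format, one row per sentence and error type. A sketch, guessing at your column names (count_mc for type A, count_nmc for type B):

library(tidyr)

# one row per sentence x error type
d_long <- pivot_longer(data_srt, cols = c(count_mc, count_nmc),
                       names_to = "type", values_to = "count")

model_joint <- brm(count ~ phase * type + log(sent_length) + (phase * type | user_id),
                   data = d_long, family = poisson())

The phase:type interaction then directly captures whether the balance of A and B errors shifts across phases.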

In a written sentence, it is probably not possible to enumerate all the potential errors one could make (too many possibilities, which makes the Poisson a natural choice). Is this also true for your exercises? From your description it sounds like there are discrete elements to get right or wrong. Could you treat this as a successes-over-trials response (i.e. binomial)?
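If so, it could look like this (sketch; n_elements is a hypothetical column holding the number of scoreable elements per sentence):

# errors out of n_elements opportunities per sentence
model_binom <- brm(count_mc | trials(n_elements) ~ phase + family_status + (1 + phase | user_id),
                   data = data_srt, family = binomial())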