To take this lower bound into account, I was thinking of using a zero-inflated model after shifting the values. Ideally, my choice would be @saudiwin’s ordbetareg package (based on brms), which I’ve used before to model data with both a lower and an upper bound. But I’m wondering to what extent it’s possible to apply ordered beta regression to data with a lower bound but no upper bound.
Does it make sense to use ordbetareg on such data after rescaling it to [0,1]?
To avoid modeling the values corresponding to the upper bound as a separate component, I adjusted the rescaling so that the rescaled data contain no exact ones (normalizing by a value 10% higher than the observed maximum).
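For reference, here is roughly what the shifting, rescaling and model fit look like (a minimal sketch; dat, duration, condition and subject are placeholder names for my actual data and predictors):

```r
library(ordbetareg)

# shift so the minimum duration corresponds to 0, then divide by a value 10% above
# the observed maximum so the rescaled data contain no exact 1s
dat$duration_shifted <- dat$duration - min(dat$duration)
dat$duration_01 <- dat$duration_shifted / (1.1 * max(dat$duration_shifted))

# zeros stay at the lower bound; everything else falls strictly inside (0, 1)
fit_ordbeta <- ordbetareg(
  formula = duration_01 ~ condition + (1 | subject),  # placeholder predictors
  data = dat
)
```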
I get a pretty good fit, as illustrated by the posterior predictive plots below (obtained with the pp_check_ordbeta function in the ordbetareg package).
Do you think this is an appropriate way to model this data?
I have also been thinking of adapting the parameter corresponding to the upper bound in the definition of the Dirichlet prior, but this resulted in errors during model fitting. And after reading this thread, I’m not sure it’s relevant to do so.
The other options I can think of would be:
Rewrite some of the package code, keeping the same general idea but modeling only the degenerate lower-bound values (zeros) and the continuous values, without the upper bound. However, my understanding of the package code is too limited for me to make this modification myself.
Use another distribution for this data. Since the durations are discretized with a resolution of 10 milliseconds and can therefore be treated as counts of 10 ms frames, I was thinking of a Poisson distribution (sketched below). However, my first attempt with a Poisson model did not fit as well as ordbetareg.
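For completeness, the Poisson attempt looked roughly like this (a sketch with placeholder names; the counts are the durations expressed in 10 ms frames):

```r
library(brms)

# convert durations in seconds into counts of 10 ms frames
dat$n_frames <- round(dat$duration / 0.01)

fit_pois <- brm(
  n_frames ~ condition + (1 | subject),  # placeholder predictors
  data = dat,
  family = poisson()
)
```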
Thanks in advance for any other suggestions on how to model such data.
As @amynang wrote, there are other models that would seem to be more to the point. However, both Gamma and log-normal will require a zero-inflated (hurdle) component, as neither accepts true 0s. In principle, that would marry a discrete 0 with an unbounded positive response.
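As a rough sketch of what that would look like in brms (assuming the response has already been shifted so the minimum sits at 0; all variable names are placeholders):

```r
library(brms)

fit_hurdle <- brm(
  duration_shifted ~ condition + (1 | subject),
  data = dat,
  family = hurdle_gamma()   # or hurdle_lognormal()
)
```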
That being said, yes, you can use ordbetareg for modeling this type of outcome, and a lot of it comes down to whether it makes sense to think of your treatment effect as a percentile effect. The primary advantage is that it enables scale-free comparison. For example, suppose you collect multiple samples and the upper bound is somewhat arbitrary, but what you really want to see is the percent difference between treatment and control (not the amount on the response scale).
In that case, what you should do is pass a vector c(0, upb) to the true_bounds option, where upb is some value above which you don’t think durations are important or interesting (such as a time-out or some other upper limit). Then ordbetareg will scale the treatment effect against this upper bound, and that is a perfectly acceptable use of ordbetareg if percentile effects are useful for your research question.
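In code, that would look roughly like this (upb, dat and the predictors are placeholders for your own values; duration is the raw response in seconds):

```r
library(ordbetareg)

upb <- 5  # e.g. a 5-second time-out, or whatever upper limit makes sense for your design

fit_pct <- ordbetareg(
  formula = duration ~ condition + (1 | subject),
  data = dat,
  true_bounds = c(0, upb)   # ordbetareg rescales the response against these bounds
)
```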
I discuss this issue in more depth in this blog post about the econometrics “log of 0” controversy, which is a related topic:
Many thanks to both of you for your answers, which have helped me to see things more clearly, both for modeling this data and for other applications in my field.
Indeed, my data do not include 0 values, since the minimum duration is by definition 0.03 seconds. In some similar datasets, observations at this minimum duration may represent a larger proportion of the whole.
Before considering ordbetareg, I had thought of using a lognormal (or skew-normal) distribution, but these seem less appropriate for taking this lower bound into account. I gave the gamma distribution a try, which I hadn’t thought of before (thanks @amynang for the suggestion). It fits my data rather well, but not as well as ordbetareg at the lower end of the scale, as can be seen in the pp_check plot zoomed in on the [0, 0.2] range below.
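The zoomed plot was produced simply by restricting the axis of the pp_check output, which is a ggplot object (fit_gamma is a placeholder name for the gamma fit):

```r
library(brms)
library(ggplot2)

pp_check(fit_gamma, ndraws = 50) +
  coord_cartesian(xlim = c(0, 0.2))   # zoom in on the lower end of the scale
```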
I’m hesitating between a hurdle_gamma model on data shifted by 0.03 seconds, i.e. relative to the minimum duration, which would then correspond to 0 (this does indeed seem the more obvious choice), and keeping ordbetareg on data normalized as percentiles following the approach proposed by @saudiwin. I think the latter approach is relevant to my research question, since I’m more interested in comparing the magnitude of the effects of different factors on duration than in the values in seconds. The values in seconds will only be useful to give an order of magnitude of typical durations for the different levels of my predictors, and I suppose I can apply the inverse transformation to express the predicted values on the original scale.
In my case, it doesn’t make much sense to define a fixed maximum duration in milliseconds for normalizing the data as percentiles, but I can define an upper bound based on the distribution of the data.
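Something like the following is what I have in mind (a sketch only; the 99th percentile and the names are illustrative, fit_pct stands for the ordbetareg fit with true_bounds = c(0, upb) as sketched above, and I’m assuming its fitted values come back on the normalized [0, 1] scale):

```r
# define the upper bound from the data rather than from a fixed time-out
upb <- quantile(dat$duration, 0.99)

# back-transform model predictions to approximate durations in seconds
pred_seconds <- fitted(fit_pct)[, "Estimate"] * upb
```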
The other reason I’m more inclined to use ordbetareg is that, from the few tests I’ve run so far, it seems both more computationally efficient and more directly interpretable than the hurdle models (though, as I have very little experience with these models, I may be missing something obvious).
There is also "shifted_lognormal()", which might help with the mismatch. I am not sure that the ordbeta fit is obviously better.
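Something like this (a minimal sketch with placeholder names):

```r
library(brms)

fit_sln <- brm(
  duration ~ condition + (1 | subject),
  data = dat,
  family = shifted_lognormal()   # estimates a shift below the smallest observed duration
)
```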
I do not get the distinction you are making later on. If you’re interested in multiplicative effects, why not just log-transform your response and call it a day?
You’re right, in this particular case it’s probably a better choice, thanks. I was fixated on the idea of modeling untransformed data and hadn’t thought of such a simple option.
As long as duration can’t go to 0, I would probably model this as log-normal given that this model is a) relatively easy to interpret and b) very easy to fit.
Gamma is some extra hassle on both those dimensions.
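A minimal log-normal version for comparison (placeholder names; the exponentiated coefficients then read as multiplicative effects on the typical duration):

```r
library(brms)

fit_ln <- brm(
  duration ~ condition + (1 | subject),
  data = dat,
  family = lognormal()
)

exp(fixef(fit_ln)[, "Estimate"])  # multiplicative effects on the median duration
```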