I have simulated breakdown stress data for some components.
I tested whether a Weibull or lognormal distribution best fit the data. Please see Figures below. The figures provide median and 95\% prediction intervals in black. The green lines are the non-parametric fit (observations).
It looks like the Weibull distribution fits poorly at the lower tail compared to the lognormal distribution. That is, the Weibull fit overestimates the probability of failing at stress lower than (approximately) 45.
I think, from a visual check alone, that the lognormal fit is best.
However, I am interested in the lower tail of the breakdown stress distribution, and I believe the lognormal distribution is too optimistic. For example, a 95\% prediction interval for the 0.0001\% quantile under the lognormal model is [26.0,28.2], compared to [10.4,13.7] under the Weibull model.
I have also read (no references at hand, sorry) that the lognormal distribution can provide estimates that are too optimistic (in this case predicting that the 0.0001\% quantile is much higher than it actually is).
Write the 95\% prediction interval for the 0.0001\% quantile as [\tau_l,\tau_u]. The biggest error in my decision problem would be a component failing at stress s^* < \tau_l. For this reason, I would like a conservative estimate of the 0.0001\% quantile.
Generally, the Weibull model is preferred over the lognormal model when small quantiles are of interest since the Weibull model provides a more conservative estimate. However, for my data, I think these estimates are too conservative (and wrong) because the model fits poorly in the region I am most interested in.
I then tested whether a generalized limited failure population (GLFP) model would work. This model is a mixture of Weibull distributions. My hope was that a mixture of Weibull distributions would fit better in the lower tail and would also provide more conservative estimates than the lognormal fit. The cdf of the GLFP distribution is written as
F(t; \pi,\alpha_1, \beta_1, \alpha_2, \beta_2) = 1-(1-\pi F_1(t;\alpha_1, \beta_1))(1-F_2(t;\alpha_2, \beta_2)),
where F_1(t;\alpha_1, \beta_1) and F_2(t;\alpha_2, \beta_2) are Weibull cdf’s, where \alpha_1, \alpha_2 are scale parameters and \beta_1, \beta_2 are shape parameters.
I had trouble estimating the parameters of this mixture distribution. I played around with this distribution and found a decent fit to the data when (\alpha_1,\beta_1,\alpha_2,\beta_2) = (42.5,100, 44.7,14.3).
I tried refitting the GLFP model in Stan with priors centred around the above values that I found worked well (this is probably not how one should construct prior distributions but I think the model parameters were having identifiability issues). I used the following prior distributions:
\alpha_1 \sim N(40,1) \\ \alpha_2 \sim N(44,1) \\ \beta_1 \sim N(100,1) \\ \beta_2 \sim N(14,1)
I have included the GLFP fit below. It fits much better than the Weibull fit in the lower tail and provides more conservative estimates than the lognormal fit. It looks a little strange to me, since there is a kink around 40, but other than that a decent fit all round. I only care about the lower tail anyway. A 95\% prediction interval for the 0.0001\% quantile under the GLFP model is [13.1,16.3].
My concerns are that:
(1) \alpha_1 \approx \alpha_2. If I fit the model with \alpha_1 = \alpha_2 and use \alpha \sim N(40,5) as the prior, the model runs fine (no divergent transitions, etc), but the fit is poor in the lower tail (the important part). Therefore, I need to estimate both parameters, which are similar in size but different.
(2) I tried using slightly more diffuse priors for \beta_1 and \beta_2, but this caused the fit to be poor. More specifically, I tried \beta_1 \sim N(100,2) and \beta_2 \sim N(14,2). The model ran with no divergent transitions, etc, but did not fit the data well.
(3) I tried using more diffuse priors on \alpha_1 and \alpha_2, but this caused the model to fit poorly (very poor). More specifically, I tried \alpha_1 \sim N(40,4) and \alpha_2 \sim N(44,4).
Please see final three figures below.
The above three points show that the mixture model is very sensitive. I had to use trial and error to find parameter values that looked reasonable. These values were used to construct very strict priors, and any deviation from these priors causes the model to fit poorly.
Moreover, I have many simulated datasets. The GLFP model that worked well (with the initial strict priors) often diverges (or fits poorly) for these alternative datasets (since the priors are so strict).
I don’t think I should construct priors for each dataset. I have too many datasets, but even if I only had a few I think this would be frowned upon.
I am also worried about the stability of the GLFP model.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A summary:
I have some simulated data. I am interested in modelling the 0.0001\% quantile of this data. I tried a Weibull model and a lognormal model. The lognormal model fits the data much better in the lower tail, but I am worried the 0.0001\% quantile is too optimistic (I think it should be lower). I have read that the lognormal distribution can be too optimistic. I would prefer to use the Weibull distribution to model such a small quantile (as it is much more conservative), however, the Weibull fits poorly in the lower tail. I then tried a GLFP model (Weibull mixture). This model fits okay if I use very strict priors. I am worried about the stability of this model since any small change in any of the prior distributions causes the model to fit poorly.
(1) Does anyone have any thoughts about the mixture model I am using and what I can do about it?
(2) Is there an alternative to the mixture approach that would be more stable? I.e. perhaps telling Stan that I would like to sacrifice some model accuracy in the higher tails to gain more accuracy in the lower tails?
Note: I have read Mixture Model Documentation. I already use constraints to insist \alpha_2 \geq \alpha_1.





