Fitting a mixture of log-logistic distributions to interval data

Hi.

I’m trying to fit a mixture of two log-logistic distributions to
a small dataset. Each record consists of a minimum and maximum
possible value for the time period, so for any given set of
parameters there is a probability of the actual time falling
within that interval.
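That is, each record i contributes
theta * (F1(x_upper[i]) - F1(x_lower[i])) + (1 - theta) * (F2(x_upper[i]) - F2(x_lower[i]))
to the likelihood, where F1 and F2 are the CDFs of the two log-logistic
components and theta is the mixture weight.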

My R code, Stan code and data are included below.

Example outputs from a run:

Inference for Stan model: mixll.
10 chains, each with iter=50000; warmup=25000; thin=50;
post-warmup draws per chain=500, total post-warmup draws=5000.

                   mean se_mean    sd    2.5%   97.5% n_eff Rhat
dist_shape_o[1]    0.34    0.00  0.05    0.25    0.45  4410    1
dist_shape_o[2]    0.56    0.00  0.21    0.16    0.74  3318    1
dist_scale_o[1]    0.97    0.00  0.29    0.73    1.71  3711    1
dist_scale_o[2]    3.67    0.00  0.11    3.29    3.73  1759    1
dist_shape[1]      3.09    0.04  1.84    2.22    4.02  1914    1
dist_shape[2]      2.26    0.02  1.41    1.35    6.20  3713    1
dist_scale[1]     19.40    0.11  6.92   14.51   38.80  3808    1
dist_scale[2]    275.60    0.58 24.68  187.62  292.28  1797    1
theta              0.36    0.00  0.10    0.18    0.58  3105    1
lp__            -109.57    0.06  3.48 -116.52 -103.32  3220    1

I use a long warmup period and keep only every 50th sample, but
the effective sample sizes still seem relatively low to me.

There is a (possibly unnecessary) level of complexity in there
as I use the mean and variance of a set of parameters from
another published analysis as priors. In this previous analysis
the parameterisation of the log-logistic was different and the
data were stored as weeks, rather than days. I transform the
parameters to account for this. Not being completely confident
about the applicability of these priors, I also used the
t distribution rather than the Normal, giving more weight to
the tails.
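As a rough sanity check on these transformed priors:

exp(3.7) * 7  # about 283 days, in the same neighbourhood as the posterior mean of dist_scale[2] (275.6)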

Does anyone have any suggestions for improving my analysis,
for example better priors or an alternative parameterisation that
might reduce the correlations between the parameters?

Or is the behaviour purely a result of my inadequate dataset?

Sorry for the vague query; I wasn’t sure where to focus it.

Thanks,
Ron.

Code:
library(rstan)

stan_code <- '
/*
    Mixture of two log-logistic distributions fitted
    to the available interval data
 */
functions {
  real llogistic_lcdf( real x, real scale, real shape ){
   return -log1p_exp(lmultiply(-shape,x/scale));
  }
}
data {
  int<lower=1> N;  // number of data points
  real<lower=1> x_lower[N]; // interval observations: lower
  real x_upper[N]; //                        & upper
}
parameters {
  real<lower=0,upper=1> theta;
  /* Parameters on scale of previous analysis */
  vector<lower=0>[2] dist_scale_o;
  vector<lower=0>[2] dist_shape_o;
}
transformed parameters {
 vector[2] dist_scale;
 vector[2] dist_shape;
 vector[2] lp_parts[N];
 vector[2] log_theta;
 /* Convert from FAdist::pllog parameterisation with data in weeks
    to current parameterisation and data in days */
 dist_scale = exp(dist_scale_o)*7;
 dist_shape = 1.0 ./ dist_shape_o;
 log_theta[1] = log(theta);
 log_theta[2] = log1m(theta);
 for( i in 1:N ){
  for( j in 1:2 ){
   lp_parts[i,j] =  log_theta[j] +
     log_diff_exp( llogistic_lcdf( x_upper[i] | dist_scale[j], dist_shape[j] ),
                   llogistic_lcdf( x_lower[i] | dist_scale[j], dist_shape[j] ) );
  }
 }
}
model {
 theta ~ beta(3,3);
 /*  Mean and s.d. of results of old analyses as priors,
     student_t used to make it fat-tailed to make them less strong  */
 dist_scale_o[1] ~ student_t(1,0.78,0.0175); // df,mu,sigma
 dist_scale_o[2] ~ student_t(1,3.7,0.0065);
 dist_shape_o[1] ~ student_t(1,0.33,0.0157);
 dist_shape_o[2] ~ student_t(1,0.73,0.0026);
 for( i in 1:N) target += log_sum_exp(lp_parts[i]);
}

'

sm <- stan_model(model_code=stan_code,model_name='mixll')

n_warm_up <- 25000
n_samples <- 5000
n_thin    <- 50
n_chains  <- 10
n_samples_per_chain <- as.integer(n_samples/n_chains)

dl <- structure(list(N = 43L, x_lower = c(1, 1, 2, 1, 1, 1, 4, 6, 5,
14, 12, 16, 21, 24, 20, 7, 25, 37, 52, 57, 66, 59, 74, 102, 28,
91, 186, 197, 197, 202, 232, 243, 239, 245, 263, 273, 280, 301,
308, 312, 345, 482, 517), x_upper = c(134, 41, 170, 1121, 164,
18, 39, 48, 61, 28, 26, 34, 35, 32, 48, 511, 60, 289, 108, 141,
94, 171, 116, 158, 393, 231, 326, 365, 225, 286, 260, 271, 575,
329, 291, 393, 308, 315, 476, 340, 359, 762, 531)), .Names = c("N",
"x_lower", "x_upper"))

il <- list( theta=0.5, dist_scale_o=c(0.78,3.7), dist_shape_o=c(0.33,0.73) )
iln <- rep(list(il),n_chains)

smo <- sampling( sm, data=dl, init=iln, chains=n_chains,
                 pars=c("dist_shape_o","dist_scale_o",
                        "dist_shape","dist_scale","theta"),
                 warmup=n_warm_up, iter=n_warm_up+(n_samples_per_chain*n_thin),
                 thin=n_thin, cores=n_chains,
                 control = list(max_treedepth = 15) )

Ok. I’ve made an improvement to the Stan code (improved in terms of its speed, which I take as
an indicator that it is finding its way around the parameter space better, and in terms of higher
effective sample sizes).

I declared dist_scale as ordered[2] dist_scale;, forcing the median of the second distribution to always lie to the right of the first. I also restricted dist_shape_o to lie between 0 and 1, vector<lower=0,upper=1>[2] dist_shape_o;, so that dist_shape is greater than 1, which gives each of the two distributions a single non-zero mode.
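
Roughly, the changed declarations look like this (the rest of the model is as before):

parameters {
  real<lower=0,upper=1> theta;
  /* Parameters on scale of previous analysis */
  vector<lower=0>[2] dist_scale_o;
  vector<lower=0,upper=1>[2] dist_shape_o; // dist_shape_o < 1 => dist_shape > 1
}
transformed parameters {
  ordered[2] dist_scale;  // median of component 2 forced above component 1
  vector[2] dist_shape;
  // ... transformations and lp_parts exactly as in the code above ...
}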

This seems better, and I’ve combined it with a wide thinning interval. The results look reasonable.

Still open to suggestions for improvements.

Your lowest efficiency (n_eff / post-warmup samples) is 35.1%. Hardly anything to brag about, but not bad at all. I’m curious to know why you think your n_effs are not satisfactory. If the posterior is highly correlated or displays common pathologies such as funnel-like behaviour, any sampler (NUTS and variants included) will struggle to produce independent samples. One way to gain a bit more insight into your model is to look at a pairs() plot, which also has the advantage of showing you where divergences (if any) have occurred.
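
For your model that would be something along the lines of (parameter names taken from your code; divergent transitions, if there are any, are highlighted in the panels):

pairs(smo, pars = c("dist_scale_o", "dist_shape_o", "theta"))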

Thanks for the comment.

My concern over the effective numbers was purely on a ‘feels a bit worrying’ basis, rather than anything scientific. It was prompted by not being 100% confident in my modelling or choice of priors, with my uncertainty at doing something new compounded by having such a small data set to do it with.

I had looked at the pairs plot and the densities; that was partly what prompted me to make the scale parameter ordered, which made an improvement. The improvements in speed made it reasonable to increase the thinning interval and, along with 10 parallel chains, generate a good-sized run (10000 samples) in an hour or so, with effective sample sizes of over 9000 for all parameters. So I am pretty happy with it at present. It needs a closer look and some more plotting on Monday.

Thinning is a bad idea unless you’re out of memory—it just loses information.

You usually don’t need 9000 effective sample size. That’s going to give you MCMC standard errors of sd / sqrt(9000), or around sd / 100. For most purposes, you don’t need so many.

35% mixing rate is great for most models. Only very simple models mix better than that.

The way to validate is to simulate data from the model (first generating the parameters from the priors), then fit it. You should see approximately nominal interval coverage, because for data generated that way the model is guaranteed to be well specified.
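
Something along these lines, for example; I don’t know how your intervals actually arise, so treat the bracketing step at the end as a placeholder, and the truncations of the prior draws are crude stand-ins for the constraints in your model:

# simulate one fake data set from the priors, using the model's own parameterisation
rllog <- function(n, scale, shape) {         # log-logistic draws via the inverse CDF
  u <- runif(n)
  scale * (u / (1 - u))^(1 / shape)
}
N       <- 43
theta   <- rbeta(1, 3, 3)
scale_o <- abs(c(0.78 + 0.0175 * rt(1, 1), 3.7 + 0.0065 * rt(1, 1)))  # crude positivity
shape_o <- pmin(pmax(c(0.33 + 0.0157 * rt(1, 1), 0.73 + 0.0026 * rt(1, 1)), 0.01), 0.99)
scale   <- exp(scale_o) * 7
shape   <- 1 / shape_o
z       <- 2 - rbinom(N, 1, theta)           # component 1 with probability theta
t_true  <- rllog(N, scale[z], shape[z])
x_lower <- pmax(1, t_true * runif(N, 0.3, 1))          # placeholder bracketing
x_upper <- pmax(x_lower + 1, t_true * runif(N, 1, 3))
dl_sim  <- list(N = N, x_lower = x_lower, x_upper = x_upper)
# refit with sampling(sm, data = dl_sim, ...) and check whether the posterior
# intervals cover theta, dist_scale and dist_shape drawn above; repeat over
# many simulated data sets for a proper coverage check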

You might be able to reduce that log_diff_exp given how simple the logistic CDFs are. That’d go a long way for efficiency.
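
For example, with a = (x / scale)^shape the log-logistic CDF is a / (1 + a), so the interval probability collapses to (a_hi - a_lo) / ((1 + a_hi) * (1 + a_lo)). An untested sketch of a function along those lines (the name is made up):

functions {
  // log P(lo < X <= hi) for X ~ log-logistic(scale, shape), assuming 0 < lo < hi
  real llogistic_interval_lprob(real lo, real hi, real scale, real shape) {
    real la = shape * log(hi / scale);       // log a_hi
    real lb = shape * log(lo / scale);       // log a_lo
    return la + log1m_exp(lb - la)           // log(a_hi - a_lo)
           - log1p_exp(la) - log1p_exp(lb);  // - log(1 + a_hi) - log(1 + a_lo)
  }
}
// then in transformed parameters:
//   lp_parts[i, j] = log_theta[j]
//     + llogistic_interval_lprob(x_lower[i], x_upper[i], dist_scale[j], dist_shape[j]);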

Thanks, Bob.

Those comments are really useful.

Despite having read the paper (I can’t think of it now) that talks about the redundancy of thinning, for some reason I’ve clung to it. It’s a habit I really should break, as dropping it would clearly make my life a lot easier.

I’ve used the effective sample size (as a proportion of the number of draws generated) and, to some degree, how quickly the algorithm runs (and whether the run time varies much between chains) as vague indicators of potential problems with my parameterisation. I’ll stop worrying so much about the effective sample rate; is the run time (either total or per chain) useful in this regard?

Thanks again.

A good parameterization has a high n_eff per iteration and a low treedepth per iteration. That means it’s moving well across the distribution without a lot of leapfrog steps.

Sometimes, you can get parameterizations that mix slowly, in which case you just need more iterations to get n_eff up to usable levels (we usually recommend at least a couple dozen minimum, but most people aren’t happy with fewer than 100).
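
In rstan you can check both from the fitted object, something along these lines (using the object names from your script):

# n_eff as a fraction of the retained post-warmup draws, per parameter
fit_summary <- summary(smo)$summary
fit_summary[, "n_eff"] / n_samples

# mean treedepth per post-warmup iteration, by chain
sp <- get_sampler_params(smo, inc_warmup = FALSE)
sapply(sp, function(ch) mean(ch[, "treedepth__"]))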

Again, thanks for the valuable information.

Ron.