How to fit a distribution given an empirical CDF?

I have data on the distribution of particle sizes. There are several treatments (in the example data I provide just one treatment variable, dose, with two levels for simplicity). Within each treatment there are three replicates, rep. Within each replicate, I have the cumulative proportion cumul_p of particles, by volume, at a range of particle diameters bin_max. For instance, in the data provided, for rep 1 in the high-dose treatment, 0.0009 of the total volume was composed of particles with diameter 37 or less, 0.0021 of the total volume was composed of particles with diameter 43 or less, and so on until we reach 100% of the volume.

So basically, what I have is an empirical CDF for each rep. I would like to fit some kind of distribution to these data (probably a gamma distribution, though I am not sure), with a fixed effect of treatment on the distributional parameters so that we can make inferences about how treatment affects them, as well as a multilevel structure that groups together the observations from the same rep nested within treatment. I am not sure how to go about doing that. Is such a thing possible in brms? If not, I would also be happy to fit the model in Stan.

Example data (a small subset of the full dataset I have) is provided as an R dump here, as well as a plot of the empirical CDF for each treatment and replicate.

Plot of example data

Note there is very little variation among replicates (and, at least in this example dataset, not much difference between the treatments). The first plot has a log scale on the x axis (diameter), and the second plot has an arithmetic x axis, in case it is helpful to visualize both ways.

[Plots: empirical CDF by treatment and replicate, shown with log-scale and arithmetic x axes]

Data

example_data <- structure(list(dose = c("high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low"), rep = c("Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03"
), bin_max = c(9, 11, 13, 15, 18, 22, 26, 31, 37, 43, 50, 60, 
75, 90, 105, 120, 150, 180, 210, 250, 300, 360, 430, 510, 610, 
730, 870, 1030, 1230, 1470, 1750, 9, 11, 13, 15, 18, 22, 26, 
31, 37, 43, 50, 60, 75, 90, 105, 120, 150, 180, 210, 250, 300, 
360, 430, 510, 610, 730, 870, 1030, 1230, 1470, 1750, 9, 11, 
13, 15, 18, 22, 26, 31, 37, 43, 50, 60, 75, 90, 105, 120, 150, 
180, 210, 250, 300, 360, 430, 510, 610, 730, 870, 1030, 1230, 
1470, 1750, 9, 11, 13, 15, 18, 22, 26, 31, 37, 43, 50, 60, 75, 
90, 105, 120, 150, 180, 210, 250, 300, 360, 430, 510, 610, 730, 
870, 1030, 1230, 1470, 1750, 9, 11, 13, 15, 18, 22, 26, 31, 37, 
43, 50, 60, 75, 90, 105, 120, 150, 180, 210, 250, 300, 360, 430, 
510, 610, 730, 870, 1030, 1230, 1470, 1750, 9, 11, 13, 15, 18, 
22, 26, 31, 37, 43, 50, 60, 75, 90, 105, 120, 150, 180, 210, 
250, 300, 360, 430, 510, 610, 730, 870, 1030, 1230, 1470, 1750
), cumul_p = c(0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 0.0021, 0.0036, 
0.0063, 0.0113, 0.0176, 0.0251, 0.0371, 0.0552, 0.0806, 0.1091, 
0.1513, 0.2114, 0.2954, 0.406, 0.5378, 0.6897, 0.8233, 0.9111, 
0.9578, 0.9835, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0.0022, 
0.0038, 0.0066, 0.0118, 0.0183, 0.026, 0.0383, 0.0569, 0.083, 
0.1124, 0.1559, 0.2176, 0.3035, 0.4165, 0.5516, 0.709, 0.8488, 
0.9369, 0.9786, 0.9947, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 
0.0021, 0.0036, 0.0063, 0.0112, 0.0174, 0.025, 0.0371, 0.0554, 
0.0811, 0.1098, 0.1523, 0.2126, 0.297, 0.4087, 0.5414, 0.6937, 
0.8267, 0.9133, 0.9582, 0.9824, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
9e-04, 0.002, 0.0034, 0.0059, 0.0105, 0.0163, 0.0233, 0.0348, 
0.0526, 0.0787, 0.1097, 0.1585, 0.2329, 0.34, 0.4779, 0.6343, 
0.7964, 0.9156, 0.9779, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 
0.0021, 0.0036, 0.00609999999999999, 0.0108, 0.0166, 0.0238, 
0.0355, 0.0537, 0.0804, 0.1118, 0.1609, 0.2355, 0.3427, 0.4804, 
0.636, 0.7966, 0.9156, 0.9778, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 7e-04, 0.0018, 0.003, 0.0045, 0.00709999999999999, 0.0119, 
0.0178, 0.0251, 0.037, 0.0553, 0.0818, 0.1131, 0.1628, 0.2386, 
0.3474, 0.4872, 0.6453, 0.8076, 0.9248, 0.9814, 1, 1, 1, 1)), row.names = c(NA, 
-186L), class = "data.frame")

If I’m understanding correctly, you want a CDF with some shape/rate parameters, where those parameters are determined by the data in each treatment group. In that case you would define a CDF based on a prior over each treatment’s shape/rate parameters, giving a vector for each parameter with a number of elements equal to the number of treatments, n. You would then have the priors for each of these treatments parameterized by a hyperprior. So something like:

data {

  int<lower=1> n;                      // number of treatment groups
  int<lower=1> N;                      // number of observations
  array[N] real y;                     // observed data
  array[N] int<lower=1, upper=n> grp;  // treatment group of each observation

}

parameters {

  // per-treatment shape/rate parameters
  vector<lower=0>[n] a;
  vector<lower=0>[n] b;

  // hyperparameters
  real<lower=0> mu_a;
  real<lower=0> sigma_a;
  real<lower=0> mu_b;
  real<lower=0> sigma_b;

}

model {

  // hyperpriors
  mu_a ~ normal(0, 1);
  sigma_a ~ lognormal(0, 1);
  mu_b ~ normal(0, 1);
  sigma_b ~ lognormal(0, 1);

  // priors
  a ~ gamma(mu_a, sigma_a);
  b ~ gamma(mu_b, sigma_b);

  // likelihood: log_likelihood() is a placeholder for whatever distribution
  // is assumed to have generated y, e.g. y[i] ~ gamma(a[grp[i]], b[grp[i]])
  y ~ log_likelihood(a[grp], b[grp]);

}

Here y is your empirical data, and a and b are your parameters of interest. If I understood what you want to do correctly, this is the skeleton of a Bayesian hierarchical model, where log_likelihood() stands for the log likelihood of the data under whatever distribution is assumed to describe the data. In reality, log_likelihood() can be any likelihood function that depends on the data and the parameters of interest in the model. The sampling statements, like a ~ gamma(mu_a, sigma_a), take care of the priors.

Thanks for the answer, but I am unsure whether this addresses my issue. I do not have a collection of data points for each treatment group that I want to model as drawn from a distribution. All I have are cumulative proportion values for a range of x values (diameters). That is what is output by the instrument that generated the data. Isn’t the Stan code you provided more suitable for the case where we do have the raw data from each treatment group?

If you have some distribution that you assume describes the data, as read out by your instruments, you can fit the data with that distribution. The distribution will have some parameters that determine its shape, and their posterior samples should hopefully converge around values reflecting the shape of the underlying data distribution and the prior assumptions made about it.

These parameters can then be compared across conditions, assuming you have some model where they serve as the latent parameters of the process that generated the data, with an interpretable meaning supplied by the model’s scientific justification. The code itself is pretty general for a Bayesian hierarchical model, barring some arbitrary distribution choices for demonstration purposes on my part. There are no assumptions baked in about the form the data takes, however.

Fitting a distribution to the ECDF will inherit the uncertainty of the ECDF so will not improve precision. You might as well stick with the ECDF.


@harrelfe if there is some theoretical expectation for the parametric form of the distribution (or even just an expectation that the distribution is smooth), then that expectation can be leveraged to refine and smooth the (first derivative of the) ECDF. For example, if the true distribution is known to be gamma, you can find the gamma distribution from which sampling is most likely to yield the histogram associated with the observed ECDF.

@qdread, you can parameterize a gamma distribution, then in the transformed parameters block find the probability masses associated with the bins of your ECDF, and then treat each particle as a categorical sample from those probabilities.

But under the hood, if you are fitting to the ECDF you get all the imprecision of the ECDF. And if you fit a multi-parameter distribution function, its precision will approximately equal the precision of the ECDF once the fitted distribution has 4 or more parameters (sometimes 3). All of this is why I almost always use semiparametric models that encapsulate the (link function-transformed) ECDF as their intercepts.