How to fit a distribution given an empirical CDF?

I have data on the distribution of particle sizes. There are several treatments (in the example data I provide just one treatment variable, dose, with two levels for simplicity). Within each treatment there are three replicates, rep. Within each replicate, I have the cumulative proportion cumul_p of particles, by volume, at a range of particle diameters bin_max. For instance, in the data provided, for rep 1 in the high-dose treatment, 0.0009 of the total volume was composed of particles with diameter 37 or less, 0.0021 of the total volume was composed of particles with diameter 43 or less, and so on until we reach 100% of the volume.

So basically, what I have is an empirical CDF for each rep. I would like to fit some kind of distribution to these data (probably a gamma distribution, though I am not sure), with a fixed effect of treatment on the distributional parameters so that we can make inferences about how treatment affects them, as well as a multilevel structure that groups together the observations from the same rep nested within treatment. I am not sure how to go about doing that. Is such a thing possible in brms? If not, I would also be happy to fit the model in Stan.

Example data (a small subset of the full dataset I have) is provided as an R dump here, as well as a plot of the empirical CDF for each treatment and replicate.

Plot of example data

Note there is very little variation among replicates (and, at least in this example dataset, not much difference between the treatments). The first plot has a log scale on the x axis (diameter), and the second plot has an arithmetic x axis, in case it is helpful to visualize both ways.

[Plots: empirical CDF by treatment and replicate, shown with log-scale and arithmetic x axes]

Data

example_data <- structure(list(dose = c("high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"high", "high", "high", "high", "high", "high", "high", "high", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low", "low", "low", "low", "low", "low", "low", 
"low", "low", "low"), rep = c("Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", "Rep_01", 
"Rep_01", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_02", 
"Rep_02", "Rep_02", "Rep_02", "Rep_02", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", 
"Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03", "Rep_03"
), bin_max = c(9, 11, 13, 15, 18, 22, 26, 31, 37, 43, 50, 60, 
75, 90, 105, 120, 150, 180, 210, 250, 300, 360, 430, 510, 610, 
730, 870, 1030, 1230, 1470, 1750, 9, 11, 13, 15, 18, 22, 26, 
31, 37, 43, 50, 60, 75, 90, 105, 120, 150, 180, 210, 250, 300, 
360, 430, 510, 610, 730, 870, 1030, 1230, 1470, 1750, 9, 11, 
13, 15, 18, 22, 26, 31, 37, 43, 50, 60, 75, 90, 105, 120, 150, 
180, 210, 250, 300, 360, 430, 510, 610, 730, 870, 1030, 1230, 
1470, 1750, 9, 11, 13, 15, 18, 22, 26, 31, 37, 43, 50, 60, 75, 
90, 105, 120, 150, 180, 210, 250, 300, 360, 430, 510, 610, 730, 
870, 1030, 1230, 1470, 1750, 9, 11, 13, 15, 18, 22, 26, 31, 37, 
43, 50, 60, 75, 90, 105, 120, 150, 180, 210, 250, 300, 360, 430, 
510, 610, 730, 870, 1030, 1230, 1470, 1750, 9, 11, 13, 15, 18, 
22, 26, 31, 37, 43, 50, 60, 75, 90, 105, 120, 150, 180, 210, 
250, 300, 360, 430, 510, 610, 730, 870, 1030, 1230, 1470, 1750
), cumul_p = c(0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 0.0021, 0.0036, 
0.0063, 0.0113, 0.0176, 0.0251, 0.0371, 0.0552, 0.0806, 0.1091, 
0.1513, 0.2114, 0.2954, 0.406, 0.5378, 0.6897, 0.8233, 0.9111, 
0.9578, 0.9835, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0.0022, 
0.0038, 0.0066, 0.0118, 0.0183, 0.026, 0.0383, 0.0569, 0.083, 
0.1124, 0.1559, 0.2176, 0.3035, 0.4165, 0.5516, 0.709, 0.8488, 
0.9369, 0.9786, 0.9947, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 9e-04, 
0.0021, 0.0036, 0.0063, 0.0112, 0.0174, 0.025, 0.0371, 0.0554, 
0.0811, 0.1098, 0.1523, 0.2126, 0.297, 0.4087, 0.5414, 0.6937, 
0.8267, 0.9133, 0.9582, 0.9824, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
9e-04, 0.002, 0.0034, 0.0059, 0.0105, 0.0163, 0.0233, 0.0348, 
0.0526, 0.0787, 0.1097, 0.1585, 0.2329, 0.34, 0.4779, 0.6343, 
0.7964, 0.9156, 0.9779, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 
0.0021, 0.0036, 0.00609999999999999, 0.0108, 0.0166, 0.0238, 
0.0355, 0.0537, 0.0804, 0.1118, 0.1609, 0.2355, 0.3427, 0.4804, 
0.636, 0.7966, 0.9156, 0.9778, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 7e-04, 0.0018, 0.003, 0.0045, 0.00709999999999999, 0.0119, 
0.0178, 0.0251, 0.037, 0.0553, 0.0818, 0.1131, 0.1628, 0.2386, 
0.3474, 0.4872, 0.6453, 0.8076, 0.9248, 0.9814, 1, 1, 1, 1)), row.names = c(NA, 
-186L), class = "data.frame")

If I’m understanding correctly, you want a CDF with some shape/rate parameters, where those parameters are determined by the data in each treatment group. In that case you would define a CDF based on a prior over each treatment’s shape/rate parameters, giving a vector for each parameter with a number of elements equal to the number of treatments, n. You would then have the priors for each of these treatments parameterized by a hyperprior. So something like:

data {

  int<lower=1> n;                      // number of treatment groups
  int<lower=1> N;                      // number of observations
  array[N] real y;                     // observed data
  array[N] int<lower=1, upper=n> grp;  // treatment group of each observation

}

parameters {

  // per-treatment shape/rate parameters
  vector<lower=0>[n] a;
  vector<lower=0>[n] b;

  // hyperparameters
  real<lower=0> mu_a;
  real<lower=0> sigma_a;
  real<lower=0> mu_b;
  real<lower=0> sigma_b;

}

model {

  // hyperpriors
  mu_a ~ normal(0, 1);
  sigma_a ~ lognormal(0, 1);
  mu_b ~ normal(0, 1);
  sigma_b ~ lognormal(0, 1);

  // priors
  a ~ gamma(mu_a, sigma_a);
  b ~ gamma(mu_b, sigma_b);

  // likelihood: log_likelihood() is a placeholder for whatever distribution
  // is assumed to have generated y, e.g. y[i] ~ gamma(a[grp[i]], b[grp[i]])
  y ~ log_likelihood(a[grp], b[grp]);

}

Here y is your empirical data, and a and b are your parameters of interest. If I understood what you want to do correctly, this is the skeleton of a Bayesian hierarchical model, where log_likelihood() stands for the log likelihood of the data under whatever distribution is assumed to describe the data. In reality, log_likelihood() can be any likelihood function that depends on the data and the parameters of interest in the model. The sampling statements, like a ~ gamma(mu_a, sigma_a), take care of the priors.

Thanks for the answer, but I am unsure whether this addresses my issue. I do not have a collection of data points for each treatment group that I want to model as drawn from a distribution. All I have are cumulative proportion values for a range of x values (diameters). That is what is output by the instrument that generated the data. Isn’t the Stan code you provided more suitable for the case where we do have the raw data from each treatment group?

If you have some distribution that you assume describes the data, as read out by your instruments, you can fit the data with that distribution. The distribution will have some parameters that determine its shape, and their posterior samples should hopefully converge around values reflecting the shape of the underlying data distribution and the prior assumptions made about it.

These parameters can then be compared across conditions, assuming you have some model where they serve as the latent parameters of the process that generated the data, with an interpretable meaning supplied by the model’s scientific justification. The code itself is pretty general for a Bayesian hierarchical model, barring some arbitrary distribution choices for demonstration purposes on my part. There are no assumptions baked in about the form the data takes, however.

Fitting a distribution to the ECDF will inherit the uncertainty of the ECDF so will not improve precision. You might as well stick with the ECDF.


@harrelfe if there is some theoretical expectation for the parametric form of the distribution (or even just an expectation that the distribution is smooth), then that expectation can be leveraged to refine and smooth the (first derivative of the) ECDF. For example, if the true distribution is known to be gamma, you can find the gamma distribution from which sampling is most likely to yield the histogram associated with the observed ECDF.

@qdread, you can parameterize a gamma distribution, then in the transformed parameters block find the probability masses associated with the bins of your ECDF, and then treat each particle as a categorical sample from those probabilities.

But under the hood, if you are fitting to the ECDF you get all the imprecision of the ECDF. And if you fit a multi-parameter distribution function, its precision will approximately equal the precision of the ECDF once the fitted distribution has 4 or more parameters (sometimes 3). All of this is why I almost always use semiparametric models that encapsulate the (link function-transformed) ECDF as their intercepts.