Hello,
I was approaching the problem of inferring regression for a simplex.
After a week of work I came out with a package that works possibly better than public R pachages such as DirichletReg.
Here what it does better (fully explained on Github):
https://github.com/stemangiola/dirReg
dirReg
This is an R package that infers trend of change in proportions with the condition that they have to sum to 1. The Challenging part is represented by this example:
The problem
There are 4 species of birds, and we count their proportions sampling a picture of the sky.
The totality of the birds at time 0 is:
black brids: 350
red birds: 350
blue birds: 200
green birds: 100
At time 1, the black bird population has increased a lot while all others did not change at all:
black brids: 2850
red birds: 350
blue birds: 200
green birds: 100
Because under some conditions we cannot know the absolute number of total birds, we necessarily have to think in terms of proportions. For example we would get this picture.
We can have the (statistical) impression see that all four specie populations changed in size, and that all changes can be significant; while the reality is that three population are not decreasing independently, but just as result of the normalization to 1.
The following probabilistic model tries to address this challenge. Understanding what is an independent change and what is not.
R code example
if (!require(devtools)) {
install.packages("devtools")
library(devtools)
}
options(buildtools.check = function(action) TRUE)
install_github("stemangiola/dirReg", args = "--preclean", build_vignettes = FALSE)
set.seed(123)
library("dirReg")
n_samples = 300
my_design = model.matrix(~cov , data = data.frame(cov=runif(n_samples)))
obj = simulate_proportions(
30 , ## noise of the data
c(0.35, 0.35, 0.2, 0.1 ) , ## proportions at the intercept
c(2, 0, 0, 0), ## rate of change (positive just for population 1)
my_design, ## design matrix
n_samples = n_samples ## number of samples
)
fit = dirReg(obj$prop, my_design)
------------------------------------------------------------------
Beta-Coefficients for variable no. 1:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.36 0 -29.94 0 ***
cov 0.53 0 -29.12 0 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 2:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.34 0 -26.05 0 ***
cov 0.16 0 -0.6 0.55
------------------------------------------------------------------
Beta-Coefficients for variable no. 3:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2 0 -5.55 1.21e-13 ***
cov 0.16 0 -0.01 0.99
------------------------------------------------------------------
Beta-Coefficients for variable no. 4:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.11 0 7.47 0 ***
cov 0.15 0 0.37 0.71
------------------------------------------------------------------
Significance codes: 0 β***β 0.001 β**β 0.01 β*β 0.05 β.β 0.1 β β 1
Number of Observations: 300
Link: linear-softmax
This model has successfully identified which group had trends in proportions that were independend (variable no. 1) and which were dependent from the softmax (variable no. 2, 3, 4).
This is different from the package DirichletReg, where all changes of the four groups are seen to be significant.
if (!require(DirichletReg)) {
install.packages("DirichletReg")
library(DirichletReg)
}
dd = data.frame(obj$prop, cov = my_design[,2])
AL <- DR_data(dd[, 1:4])
summary(DirichReg(AL ~ cov, dd))
------------------------------------------------------------------
Beta-Coefficients for variable no. 1: X1
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.50592 0.09509 26.353 < 2e-16 ***
cov 0.88813 0.16188 5.486 4.1e-08 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 2: X2
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.50290 0.09643 25.954 < 2e-16 ***
cov -0.93510 0.16617 -5.627 1.83e-08 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 3: X3
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.86560 0.09455 19.730 < 2e-16 ***
cov -0.84977 0.16008 -5.308 1.11e-07 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 4: X4
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.17471 0.09963 11.791 < 2e-16 ***
cov -0.59494 0.17108 -3.477 0.000506 ***
------------------------------------------------------------------
Significance codes: 0 β***β 0.001 β**β 0.01 β*β 0.05 β.β 0.1 β β 1
Log-likelihood: 1425 on 8 df (98 BFGS + 1 NR Iterations)
AIC: -2835, BIC: -2805
Number of Observations: 300
Link: Log
Parametrization: common