R Package dirReg (beta): an attempt to use Stan for improving "softmax regression" inference. Please test if you wish; feedback is welcome

stemangiola · December 22, 2017, 6:30am

Hello,

I have been working on the problem of fitting a regression to simplex data.

After a week of work I came up with a package that possibly works better than public R packages such as DirichletReg.

Here is what it does better (fully explained on GitHub):

dirReg

This is an R package that infers trends of change in proportions, under the constraint that the proportions sum to 1. The challenging part is illustrated by this example:

The problem
There are 4 species of birds, and we estimate their proportions by sampling a picture of the sky.

The bird counts at time 0 are:

black birds: 350
red birds: 350
blue birds: 200
green birds: 100

At time 1, the black bird population has increased a lot while all others did not change at all:

black birds: 2850
red birds: 350
blue birds: 200
green birds: 100

Because in some settings we cannot know the absolute total number of birds, we have to reason in terms of proportions. For example, we would get this picture.

We can get the (statistical) impression that all four species populations changed in size, and that all changes are significant. In reality, three populations are not decreasing independently; their proportions shrink only as a result of the normalization to 1.

The following probabilistic model tries to address this challenge: understanding which changes are independent and which are not.
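
As a quick check of the arithmetic in the bird example, the following base R sketch converts the counts above into proportions. The variable names are only illustrative and are not part of dirReg.

```r
# Bird counts from the example above
counts_t0 <- c(black = 350,  red = 350, blue = 200, green = 100)
counts_t1 <- c(black = 2850, red = 350, blue = 200, green = 100)

# Normalize each time point so that the proportions sum to 1
prop_t0 <- counts_t0 / sum(counts_t0)
prop_t1 <- counts_t1 / sum(counts_t1)

round(rbind(time0 = prop_t0, time1 = prop_t1), 3)
# time0: black 0.350, red 0.350, blue 0.200, green 0.100
# time1: black 0.814, red 0.100, blue 0.057, green 0.029
```

The red, blue and green counts are identical at both time points, yet their proportions all drop because the total grew from 1000 to 3500.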

R code example

if (!require(devtools)) {
  install.packages("devtools")
  library(devtools)
}
options(buildtools.check = function(action) TRUE)
install_github("stemangiola/dirReg", args = "--preclean", build_vignettes = FALSE)
set.seed(123)
library("dirReg")

n_samples = 300
my_design = model.matrix(~cov , data = data.frame(cov=runif(n_samples)))

obj = simulate_proportions(
  30,                         ## noise of the data
  c(0.35, 0.35, 0.2, 0.1),    ## proportions at the intercept
  c(2, 0, 0, 0),              ## rate of change (positive just for population 1)
  my_design,                  ## design matrix
  n_samples = n_samples       ## number of samples
)

fit = dirReg(obj$prop, my_design)

------------------------------------------------------------------
Beta-Coefficients for variable no. 1:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.36          0  -29.94        0 ***
cov             0.53          0  -29.12        0 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 2:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.34          0  -26.05        0 ***
cov             0.16          0   -0.6      0.55
------------------------------------------------------------------
Beta-Coefficients for variable no. 3:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.2           0   -5.55 1.21e-13 ***
cov             0.16          0   -0.01     0.99
------------------------------------------------------------------
Beta-Coefficients for variable no. 4:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.11          0    7.47        0 ***
cov             0.15          0    0.37     0.71
------------------------------------------------------------------

Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Number of Observations: 300
Link: linear-softmax

This model has successfully identified which group had an independent trend in proportions (variable no. 1) and which trends were merely induced by the softmax normalization (variables no. 2, 3, 4).

This is different from the package DirichletReg, where all changes of the four groups are seen to be significant.

if (!require(DirichletReg)) {
	install.packages("DirichletReg")
	library(DirichletReg)
}

dd = data.frame(obj$prop, cov = my_design[,2])
AL <- DR_data(dd[, 1:4])
summary(DirichReg(AL ~ cov, dd))
------------------------------------------------------------------
Beta-Coefficients for variable no. 1: X1
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.50592    0.09509  26.353  < 2e-16 ***
cov          0.88813    0.16188   5.486  4.1e-08 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 2: X2
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.50290    0.09643  25.954  < 2e-16 ***
cov         -0.93510    0.16617  -5.627 1.83e-08 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 3: X3
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.86560    0.09455  19.730  < 2e-16 ***
cov         -0.84977    0.16008  -5.308 1.11e-07 ***
------------------------------------------------------------------
Beta-Coefficients for variable no. 4: X4
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.17471    0.09963  11.791  < 2e-16 ***
cov         -0.59494    0.17108  -3.477 0.000506 ***
------------------------------------------------------------------
Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-likelihood: 1425 on 8 df (98 BFGS + 1 NR Iterations)
AIC: -2835, BIC: -2805
Number of Observations: 300
Link: Log
Parametrization: common

bgoodri · December 23, 2017, 1:03am

Before this goes onto CRAN, note that there is an oversight in the rstan_package.skeleton function and the rstanarm build process: it only works with GNU make. It is possible to insert

SystemRequirements: GNU make

into DESCRIPTION, but I am trying to come up with a better fix.

andre.pfeuffer · December 23, 2017, 6:30am

I have some trouble understanding:

https://github.com/stemangiola/dirReg/blob/master/src/stan_files/dirReg.stan

Precisely

for(s in 1:S) beta_hat_hat[s] = softmax(to_vector(beta_hat[s])) * phi + 1;

I don’t understand the + 1.

stemangiola · December 23, 2017, 10:25pm

Thanks Ben. However, the model is not mature or fully tested yet; I put it out there mainly for feedback. But I will do that if I eventually put it on CRAN.

stemangiola · December 23, 2017, 10:31pm

The “+ 1” makes sure that every concentration parameter of the Dirichlet is bigger than one, under the assumption that the Dirichlet distribution is unimodal. This is an assumption of my model, but I think it could have general validity.

What do you think?

By the way, if I allow the Dirichlet to be bimodal, convergence is more difficult.
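
For readers following along in R, here is a rough sketch of what the quoted Stan line computes for a single sample. The softmax helper and the values of beta_hat and phi are hypothetical and are not taken from the package. A Dirichlet density whose concentration parameters all exceed 1 has a single interior mode, which is the property the “+ 1” is meant to guarantee.

```r
# Numerically stable softmax (helper written for this sketch)
softmax <- function(x) {
  z <- exp(x - max(x))  # subtracting the max avoids overflow
  z / sum(z)
}

beta_hat <- c(0.5, -1.0, 0.2, 0.0)  # hypothetical linear predictors, one sample
phi <- 30                           # hypothetical precision parameter

# R analogue of: softmax(to_vector(beta_hat[s])) * phi + 1
alpha <- softmax(beta_hat) * phi + 1

all(alpha > 1)                      # TRUE: every concentration exceeds 1
sum(alpha) - (phi + length(alpha))  # ~0: the concentrations sum to phi + K
```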

andre.pfeuffer · December 24, 2017, 5:23am

In what way does the +1 make the Dirichlet distribution unimodal?

Say we have a softmax output, e.g.

p <- c(0.2, 0.15, 0.65)

and now we choose phi to be 4. Then we get:

(p * 4 + 1) / sum(p * 4 + 1)
[1] 0.2571429 0.2285714 0.5142857

This introduces a functional dependency between phi and p (or phi and the softmax).
I don’t know if Stan can handle this well. Can somebody comment?

stemangiola · December 24, 2017, 10:53am

Thanks for your comment

Unless I am missing something, the * phi + 1 is outside the softmax.

Given that the softmax outputs lie between 0 and 1, the + 1 should make sure that the result is > 1.

For example

> p <- c(0.2, 0.15, 0.65)
> (p/sum(p)) * 4 + 1

[1] 1.8 1.6 3.6

I hope it makes sense

andre.pfeuffer · December 24, 2017, 11:35am

I understand that the result is > 1, and I think of the softmax as defining the proportions and phi as the scaling.

> p <- c(1.8, 1.6, 3.6)
> p / sum(p)
[1] 0.2571429 0.2285714 0.5142857

Thus these are your proportions, which differ from the softmax. And if phi (the 4) changes, the proportions change again:

> p <- c(0.2, 0.15, 0.65)
> x <- (p / sum(p)) * 6 + 1
> x / sum(x)
[1] 0.2444444 0.2111111 0.5444444

Btw sum(p) == 1

But perhaps somebody else can comment.
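
One way to read the numbers in this exchange is through the standard Dirichlet formulas: the mean is alpha / sum(alpha), and, when every alpha exceeds 1, the mode is (alpha - 1) / (sum(alpha) - K). The sketch below applies both to alpha = p * phi + 1. The third phi value is an arbitrary extra, and nothing here is claimed about which quantity dirReg actually reports.

```r
p <- c(0.2, 0.15, 0.65)  # softmax output used in the posts above
K <- length(p)

for (phi in c(4, 6, 100)) {
  alpha <- p * phi + 1
  dir_mean <- alpha / sum(alpha)              # changes with phi
  dir_mode <- (alpha - 1) / (sum(alpha) - K)  # equals p for any phi > 0
  cat("phi =", phi,
      "| mean:", round(dir_mean, 4),
      "| mode:", round(dir_mode, 4), "\n")
}
```

Under these formulas the mean depends on phi, which is the shift computed above, while the mode stays at the softmax output p.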

denne · December 28, 2017, 2:45pm

Good to know, might be relevant for me. Is there some issue # so that I can follow this up?

bgoodri · December 28, 2017, 5:22pm

It doesn’t affect existing packages, only new ones that are generated with rstan_package.skeleton in rstantools 2.17.x and only for last week and next.

stemangiola · January 2, 2018, 12:01am

Thanks guys,

A few questions:

So after next week I can drop the “SystemRequirements: GNU make” requirement?
What would be the most appropriate name for this model?

Dirichlet regression
Simplex regression (I would prefer)
Softmax regression

Thanks a lot.

bgoodri · January 2, 2018, 6:19pm

Ideally, we can make the build process not depend on GNU make. But we still haven’t figured it out completely and you will probably have to run rstan_package.skeleton again.
