Automatic differentiation with stan math


#1

Hello, my problem is this:
I need to calculate the gradient of a logistic regression, which is as follows:

∇wNLL(y|X,w) =Σxi(µi −yi) = X^T(µ−y)

Were: µ is wx^T
T : trasnpose

reading the paper The Stan Math Library: Reverse-Mode Automatic Differentiation I found the following example:

#include <vector>
int main() {
 double y = 1.3; stan::math::var mu = 0.5, sigma = 1.2;
stan::math::var lp = 0;
 lp -= 0.5 * log(2 * stan::math::pi());
 lp -= log(sigma); lp -= 0.5 * pow((y - mu) / sigma, 2);
std::vector<stan::math::var> theta; theta.push_back(mu); theta.push_back(sigma); 
std::vector<double> g;
 lp.grad(theta, g);
 std::cout << " d.f / d.mu = " << g[0] << " d.f / d.sigma = " << g[1] << std::endl;

 return 0;

}

my problem is how I make the difference with respect to w if in the final expression of the gradient that value becomes µ which is passed through the sigmoid function. I would appreciate your help or guidance


#2

To use the autodiff, step 1 is just writing out a formula for your log probability (lp). Make the variables you want to autodiff with respect to stan::math::vars.

Once you call lp.grad();, then you can call the .adj() function of any of the var variables to get the derivative of lp with respect to that variable. This might seem a little magical, but the details of how it works are in the math paper, so you’ve found the right resource.

If you’re repeating this multiple times, you’ll need to zero out the old adjoints before you call grad again.

But step 1 is just writing out that lp. If you’re having trouble, just post the exact code you want to autodiff and point out the variable you want the derivatives with respect to, and it should be easy to modify.

There’s actually an easy interface in Rstan to get at some of these gradients for a model written in Stan. Check out log_prob and grad_log_prob in the Rstan manual.


#3

the code that I need to differentiate is the following:

VectorXd CPU_LogisticRegression::computeGradient(){
VectorXd E_d=this->phi-*this->Y_train;
VectorXd E_w=this->weights/(this->lambda);
VectorXd grad=VectorXd::Zero(this->dim);
#pragma omp parallel for schedule(static)
for(int d=0;ddim;d++){
grad[d]=this->X_train->col(d).cwiseProduct(E_d).sum()+E_w[d];

}
if(this->with_bias) this->grad_bias= (E_d.sum()+this->bias/this->lambda);
return grad;

}

where phi is obtained from the following function:

void CPU_LogisticRegression::preCompute(){
this->eta = (*this->X_train * this->weights);
if(this->with_bias) this->eta.noalias()=(this->eta.array()+this->bias).matrix();
this->phi = this->sigmoid(this->eta);
}

and then this value is passed through the sigmoid function before reaching the gradient

my objective is to differentiate with respect to the variable weights, but since it is not explicitly defined in the gradient, I can not reach it and differentiate it. I am new to stan any help I thank you in advance


#4

Oh, are you trying to get at second derivatives of your logistic regression here?

(by the way, if you add triple backticks around code (``` on the left and right), you can format it nicely)

{
  VectorXd CPU_LogisticRegression::computeGradient(){
  VectorXd E_d=this->phi-*this->Y_train;
  VectorXd E_w=this->weights/(this->lambda);
  VectorXd grad=VectorXd::Zero(this->dim);
  for(int d=0;ddim;d++){
    grad[d]=this->X_train->col(d).cwiseProduct(E_d).sum()+E_w[d];
  }
  return grad;
}

Reverse mode autodiff efficiently computes the derivative of one output with respect to a large number of inputs (forward mode does many outputs for one input efficiently).

Since your function returns a vector, the derivative of your output is going to be a Jacobian (https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant).

If this is your full code, you should just compute your derivatives manually. The Jacobian of this is a diagonal matrix with 1 / lambdas on the diagonal.

But! If you want to autodiff, here goes:

We can compute one row of the Jacobian per reverse mode autodiff call.

1st replace the things you want to get the derivatives with respect to with vars.

{
  VectorXd CPU_LogisticRegression::computeGradient(){
  VectorXd E_d=this->phi-*this->Y_train;
  // Use vars here. Also I'm moving the lambda
  Matrix<stan::math::var, Dynamic, 1> E_w = this->weights;
  // This is now vars too. In types, a function of a var and a double returns a var (the types promote)
  Matrix<stan::math::var, Dynamic, 1> grad(this->dim);
  for(int d=0;d < dim;d++){
    grad[d]=this->X_train->col(d).cwiseProduct(E_d).sum()+E_w[d] / (this->lambda);
  }
  MatrixXd grad_vals(this->dim);
  MatrixXd jac(this->dim, E_w.size());
  for(int d=0;d < dim;d++) {
    // Clear out the autodiff stack before doing anything. Technically not needed the 1st time
    set_zero_all_adjoints();
    grad[d].grad();
    // .val() can be used to get the regular function values from the forward pass
    grad_vals[d] = grad[d].val();
    for(int i = 0; i < E_w.size(); i++) {
      jac(d, i) = E_w(i).adj();
    }
  }
  ...
}

So now grad_vals has the regular output of the function, and jac has the Jacobian of the function grad = function(this->weights).


#5

You don’t want to use autodiff on that—it has a very simple analytic gradient w.r.t. the coefficients. Here’s a general derivation, but there’s one in just about every machine learning book that talks about stochastic gradient:


#6

Hello … what I’m trying to build is the gradient with autodiff, since I implemented a gradient “handmade” in the following way:

{
VectorXd CPU_LogisticRegression::computeGradient(){
VectorXd E_d=this->phi-*this->Y_train;
VectorXd E_w=this->weights/(this->lambda);
VectorXd grad=VectorXd::Zero(this->dim);
#pragma omp parallel for schedule(static)
for(int d=0;ddim;d++){
grad[d]=this->X_train->col(d).cwiseProduct(E_d).sum()+E_w[d];

}
if(this->with_bias) this->grad_bias= (E_d.sum()+this->bias/this->lambda);
return grad;
}

where phi is obtained from the following function:

{
void CPU_LogisticRegression::preCompute(){
this->eta = (*this->X_train * this->weights);
if(this->with_bias) this->eta.noalias()=(this->eta.array()+this->bias).matrix();
this->phi = this->sigmoid(this->eta);
}

and then this value is passed through the sigmoid function before reaching the gradient

my objective is to differentiate with respect to the variable weights.The problem that I have is related to how to differentiate with respect to said variable since “it is lost” when passing it as a phi to the gradient. since my intention is to optimize the calculation of the gradient with stan math … greetings

[edit: escaped code]


#7

Sitll not sure what you’re trying to do. If you’re just trying to calculate derivatives for logistic regression, they’re very easy to do analytically.

In general, if you want to understand how autodiff works in Stan’s math lib, there’s an arXiv paper by me and some others. The easiest way to use it is to create a functional for what you want to evaluate (store data as member variables and take all autodiff parameters in a vector argument) and then use our gradient functional. The function will have to be sufficiently templated to be instantiated with stan::math::var types for all the real-values scalars.

Your expressions are very hard to read with no spaces! And I’m not familiar with whatever library you’re using that has a namespace CPU_LogisticRegression::. It’s not part of Stan.


#8

I understand … in that case I will attach my code to improve comprehension, my intention is to apply to the function computeGradient (which is a gradient made by hand) of the following expression:

∇wNLL(y|X,w) =Σxi(µi −yi) = X^T(µ−y)

Were: µ is wx^T
T : trasnpose

a gradient with autodiff reverse-mode in order to optimize it since the one that programs is a bit slow.

I read the paragraph of the functional gradient of the paper but I can not understand how to use it with my code

I would appreciate any help, thank you very much

my code (CPU_logistic_regressionhpp.txt (424 Bytes)
logistic_regression.txt (3.5 KB)
logistic_regressionhpp.txt (1.4 KB)
I add the other files)

VectorXd CPU_LogisticRegression::train(int n_iter,double alpha,double tol){
VectorXd log_likelihood=VectorXd::Zero(n_iter);
for(int i=0;i<n_iter;i++){
tools.printProgBar(i, n_iter);
this->preCompute();
log_likelihood(i)=-this->logPosterior();
VectorXd gradient=this->computeGradient();
this->weights-=alphagradient;
if(this->with_bias) this->bias-=alpha
this->grad_bias;
}
cout << endl;
return log_likelihood;
}

void CPU_LogisticRegression::preCompute(){
this->eta = (*this->X_train * this->weights);
if(this->with_bias) this->eta.noalias()=(this->eta.array()+this->bias).matrix();
this->phi = this->sigmoid(this->eta);
}

////////this part I need to perform the autodif

VectorXd CPU_LogisticRegression::computeGradient(){
VectorXd E_d=this->phi-*this->Y_train;
VectorXd E_w=this->weights/(this->lambda);
VectorXd grad=VectorXd::Zero(this->dim);
for(int d=0;ddim;d++){
grad[d]=this->X_train->col(d).cwiseProduct(E_d).sum()+E_w[d];

}
if(this->with_bias) this->grad_bias= (E_d.sum()+this->bias/this->lambda);
return grad;

}

VectorXd CPU_LogisticRegression::predict(MatrixXd &_X_test,bool prob, bool data_processing){
if (data_processing){
if (this->normalization) tools.testNormalization(_X_test,this->featureMax,this->featureMin);
if(this->standardization) tools.testStandardization(_X_test,this->featureMean,this->featureStd);
}
VectorXd eta_test = (_X_test)*this->weights;
if(this->with_bias) eta_test.noalias()=(eta_test.array()+this->bias).matrix();
VectorXd phi_test=this->sigmoid(eta_test);
if(!prob){
phi_test.noalias() = phi_test.unaryExpr([](double elem){
return (elem > 0.5) ? 1.0 : 0.0;
});
}
return phi_test;
}


#9

You should just calculate the derivative analytically if this is the only case you care about.

Otherwise, the best place to start on how to use Stan’s autodiff is the arXiv paper:

https://arxiv.org/abs/1509.07164

You can also look at the unit tests in C++ or how the gradient() functional is implemented in order to figure out how it’s implemented at the lower level if you prefer working code.


#10

no, do not pretend, calculate, seconds, derivatives, what you are trying to do, rewrite this stochastic gradient to autodiff-reverse mode. I tried to study the functional gradient but I did not understand it very well, I hope you can help me

In summary, what I want to convert to autodiff is this line, which generates the bottleneck

for(int d=0;d < dim;d++){
grad[d]=this->X_train->col(d).cwiseProduct(E_d).sum()+E_w[d] / (this->lambda);
}

‘’'
VectorXd CPU_LogisticRegression::computeGradient(){
VectorXd E_d=this->phi-*this->Y_train;
VectorXd E_w=this->weights/(this->lambda);
VectorXd grad=VectorXd::Zero(this->dim);
#pragma omp parallel for schedule(static)
for(int d=0;ddim;d++){
grad[d]=this->X_train->col(d).cwiseProduct(E_d).sum()+E_w[d];

}
if(this->with_bias) this->grad_bias= (E_d.sum()+this->bias/this->lambda);
return grad;
}
’’’


#11

Sorry for the delay getting back.

What Bob says is correct. If you want derivatives of a function with a form like that, it’ll be worth your time to work it out by hand. To be clear, autodiff is not faster than computing this stuff manually. It’s more about making things convenient and the process of building models way less error prone. So if you’re dealing with a bottleneck, you’ll want to do things manually.

If you need autodiff, the code I wrote above should get you going in the right direction though.


#15

thank you for answering, according to the code you showed above, I tried to do the following (please see the attached image) the image represents the objective function first and then the gradient with respect to w

What I have done for now is this:

VectorXd CPU_LogisticRegression::computeGradient( ){
VectorXd E_w=this->weights/(this->lambda);
VectorXd grad_vals=VectorXd::Zero(this->dim);
VectorXd a=VectorXd::Zero(this->dim);
Matrix<stan::math::var, Dynamic, 1> lp(this->dim);
a = *this->Y_train;
for(int d=0;ddim;d++){
lp[d]= -((a[d]*log(this->phi[d])+(1 - a[d])*log(1 - this->phi[d])));

}
for(int i = 0;i dim;i++) {
stan::math::set_zero_all_adjoints;
lp[i].grad();
grad_vals[i] = lp[i].val()+E_w[i];
}

return grad_vals;
}

this would be fine to get the gradient regarding w ?? thanks for your time


#16

I didn’t check the code super closely, but there are two things missing:

  1. You should be looking for the gradient in output with respect to one of the inputs. That input needs to be declared as a stan::math::var. That’s what this bit of code in the previous code I wrote was:

Matrix<stan::math::var, Dynamic, 1> E_w = this->weights;

  1. Once you call lp[i].grad, you should go collect the adjoints from each of those input variables. That’s what this bit of code in the previous code was:
for(int i = 0; i < E_w.size(); i++) {
  jac(d, i) = E_w(i).adj();
}

But again you should just do these calculations by hand. Nothing magic about autodiff.

There was another thread recently where a dude does some autodiff stuff you might like to mess around with: Gradient after transformation (math library)