Automatic differentiation reverse-mode loss function

avatar · May 31, 2018, 3:58am

Hi everyone, I’m new to automatic differentiation, and I’m trying to get the gradient of this loss function with respect to the weights. The problem is that I am working with the map data type, which has made it difficult for me to translate the function into Stan notation. I’ve tried, but I’m not getting anywhere. If someone could help me, I would be very glad.

vector<map<int,double>> data;
map<int,double> weights;
vector<int> index(data.size());
iota(index.begin(),index.end(),0);


double lossFunction(map<int,double>& features,map<int,double>& weights){

    Matrix<stan::math::var, Dynamic, 1> loss(1);
    double logit = 0.0;
    static double overflow = 20.0;
    vector<int> aux(features.size()); // index vector, same length as the data
    iota(aux.begin(),aux.end(),0);
    int label = features[aux[0]];

    for(auto it = features.begin(); it != features.end(); it++){
        if(it->first != 0){
            logit += it->second * weights[it->first];
        }
    }
    if (logit > overflow) logit = overflow;
    if (logit < -overflow) logit = -overflow;
    double predicted = 1.0/(1.0 + exp(-logit));
    loss = label*log(predicted)+(1-label)*log(1-predicted);
    loss = loss *-1;
    loss.grad();
    double grad_val = loss.val();

    return grad_val;
}


double result = lossFunction(data[index[i]],weights);

bbbales2 · June 1, 2018, 4:44pm

The map data type shouldn’t cause any trouble.

If you want to get the gradients with respect to weights, you’ll want the weights to be vars and the output to be a var as well. As a C++ snippet, you want:

map<int, var> weights_as_vars;
// Convert your weights to vars
for(auto it = weights.begin(); it != weights.end(); it++) {
  weights_as_vars[it->first] = it->second;
}

var loss = lossFunction(features, weights_as_vars); // You'll need to make logit a var, and probably some other stuff inside lossFunction

// Do the reverse mode autodiff (if you need to call this multiple times you'll want to clear out the autodiff stack -- you probably need this so ask about it and I'll explain or find an example)
loss.grad();

// Print out the gradients of the loss -- they're stored in the weights vars (which is why we needed those to be vars)
for(auto it = weights_as_vars.begin(); it != weights_as_vars.end(); it++) {
  std::cout << "dloss_dweight" << it->first << " = " << (it->second).adj() << std::endl;
}

Does that make sense? You’ll need to make changes to the loss function to get this to work, but I figured that showing you what the interface looks like might make it clearer what you need to do. Did you find the Stan math paper: https://arxiv.org/abs/1509.07164 ? Have you tried coding up some of the examples in it and seeing if you can get them to work?
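One way to sanity-check the adjoints you get back is to compare them with the hand-derived gradient. For the logistic log loss, the derivative with respect to weight k is (p - y) * x_k, where p is the predicted probability. Below is a minimal sketch in plain C++ with no Stan dependency. The function name analytic_gradient is made up for this example, and it assumes the thread's convention that key 0 of the feature map holds the label. It ignores the clamping at ±20 in the posted code, so the two should agree only when the linear predictor is inside that range.

```cpp
#include <cmath>
#include <map>

// Hypothetical helper (not part of Stan Math): hand-derived gradient of the
// binary log loss for one example. Key 0 of `features` is assumed to hold the
// 0/1 label; every other key is a feature index.
std::map<int, double> analytic_gradient(const std::map<int, double>& features,
                                        const std::map<int, double>& weights) {
  double z = 0.0;  // linear predictor (called "logit" in the thread)
  for (const auto& kv : features) {
    if (kv.first == 0) continue;           // skip the label entry
    auto w = weights.find(kv.first);       // missing weights count as 0
    if (w != weights.end()) z += kv.second * w->second;
  }
  const double label = features.count(0) ? features.at(0) : 0.0;
  const double p = 1.0 / (1.0 + std::exp(-z));  // predicted probability

  std::map<int, double> grad;
  for (const auto& kv : features) {
    if (kv.first != 0) grad[kv.first] = (p - label) * kv.second;
  }
  return grad;
}
```

If the values from adj() differ from these by more than floating-point noise, the usual suspect is a double somewhere in the middle of lossFunction that cuts the autodiff chain.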

avatar · June 4, 2018, 5:17pm

Yes, I found that paper. I wasn’t sure whether the map was causing problems with the Stan data types, so this will be a great help in guiding me. Thank you very much. I’ll report back on how it goes.

Bob_Carpenter · June 5, 2018, 6:55am

Why make the loss a one-element vector rather than a scalar?

Binary log loss is built in, so this can be simplified to:

var predicted = inv_logit(logit);
loss = binary_log_loss(label, predicted);

What’s more, the combination is just the negative of the Bernoulli log density on the log odds scale, which is built in as:

loss = -bernoulli_logit_lpdf(label, logit);

I’d rename the variable logit to something that isn’t conflated with a scale and a function.
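To illustrate why working directly on the log odds scale helps: the loss can be written so that exp never sees a large positive argument, which removes the need for the ±20 clamp. The sketch below uses the standard identity loss(y, z) = max(z, 0) - y*z + log(1 + exp(-|z|)). It is plain C++ for illustration only, the name log_loss_from_logit is made up, and I'm not claiming this is how Stan implements bernoulli_logit_lpdf internally.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: binary log loss computed from the linear predictor z
// (log odds) rather than from the probability. y is the 0/1 label.
// exp() is only ever called on a non-positive argument, so it cannot overflow.
double log_loss_from_logit(int y, double z) {
  return std::max(z, 0.0) - y * z + std::log1p(std::exp(-std::fabs(z)));
}
```

For moderate z this matches -(y*log(p) + (1-y)*log(1-p)) with p = 1/(1+exp(-z)), and for extreme z it stays finite where the naive formula would hit log(0).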

avatar · June 5, 2018, 2:59pm

Thank you very much for responding. Could it then be simplified as follows? What I intend is to implement gradient descent in the following way:

stan::math::var lossFunction(map<int,double>& features,map<int,stan::math::var>& weights){
    stan::math::var logit = 0.0;
    static double overflow = 20.0;
    vector<int> aux(features.size()); // index vector, same length as the data
    iota(aux.begin(),aux.end(),0);
    int label = features[aux[0]]; // key 0 holds the label

    for(auto it = features.begin(); it != features.end(); it++){
        if(it->first != 0){
            stan::math::var w = weights[it->first];
            logit += it->second * w;
        }
    }
    // assigning a constant here also zeroes the gradient for this example
    if (logit > overflow) logit = overflow;
    if (logit < -overflow) logit = -overflow;
    // predicted has to be a var, otherwise the gradient never reaches the weights
    stan::math::var predicted = stan::math::inv_logit(logit);
    stan::math::var loss = stan::math::binary_log_loss(label, predicted);

    return loss;
}

int main(){

vector<map<int,double>> data;
map<int,double> weights;

// the data is read and stored in both weights and data

// build the index after reading, so that it covers every example
vector<int> index(data.size());
iota(index.begin(),index.end(),0);

map<int,stan::math::var> weights_as_vars;
// Convert your weights to vars
for(auto it = weights.begin(); it != weights.end(); it++) {
    weights_as_vars[it->first] = it->second;
}

// norm, eps, shuf, g, mu, l1, alpha and total_l1 are assumed to be declared elsewhere
cout << "# stochastic gradient descent" << endl;
while(norm > eps){

    map<int,double> old_weights(weights);
    if(shuf) shuffle(index.begin(),index.end(),g);

    for (unsigned int i = 0; i < data.size(); i++){
        mu += (l1*alpha);
        stan::math::var result = lossFunction(data[index[i]],weights_as_vars);
        stan::math::set_zero_all_adjoints();
        result.grad();
        double gradient = 0.0;
        for(auto it = data[index[i]].begin(); it != data[index[i]].end(); it++){
            if(it->first != 0){
                // adjoints are keyed by feature index (it->first), not by feature value
                gradient = (weights_as_vars[it->first]).adj();
                // step against the gradient, since result is a loss to be minimized
                weights[it->first] -= alpha * gradient;

                if(l1){
                    double z = weights[it->first];
                    if(weights[it->first] > 0.0){
                        weights[it->first] = max(0.0,(double)(weights[it->first] - (mu + total_l1[it->first])));
                    }else if(weights[it->first] < 0.0){
                        weights[it->first] = min(0.0,(double)(weights[it->first] + (mu - total_l1[it->first])));
                    }
                    total_l1[it->first] += (weights[it->first] - z);
                }
                // keep the var copy in sync so the next gradient uses the updated weight
                weights_as_vars[it->first] = weights[it->first];
            }
        }
    }
    // norm is presumably recomputed here from old_weights and weights (not shown)
}

}

Bob_Carpenter · June 5, 2018, 7:10pm

We have an L-BFGS optimizer that not only uses the gradient, it uses a limited-memory approximation to the Hessian, so it tends to converge much faster and with more accuracy than gradient-based optimizers.

Gradient-based optimizers are useful if you want to do minibatches because you have a ton of data.
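To make the minibatch idea concrete, here is a rough sketch of a single step: average the per-example gradients over a small slice of the data, then move the weights once. It is plain C++ and uses the hand-derived logistic gradient (p - y) * x_k rather than autodiff, purely to keep the example self-contained. The names Example and minibatch_step are made up, and key 0 is assumed to hold the label, as in the code earlier in the thread.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

using Example = std::map<int, double>;  // key 0 = label, other keys = features

// Hypothetical helper: one gradient-descent step on the average gradient of
// data[begin], ..., data[end - 1]. Weights that are missing count as 0.
void minibatch_step(const std::vector<Example>& data, std::size_t begin,
                    std::size_t end, std::map<int, double>& weights,
                    double alpha) {
  end = std::min(end, data.size());
  if (begin >= end) return;  // empty batch: nothing to do

  std::map<int, double> grad_sum;
  for (std::size_t n = begin; n < end; ++n) {
    const Example& x = data[n];
    double z = 0.0;  // linear predictor for this example
    for (const auto& kv : x) {
      if (kv.first == 0) continue;
      auto w = weights.find(kv.first);
      if (w != weights.end()) z += kv.second * w->second;
    }
    const double label = x.count(0) ? x.at(0) : 0.0;
    const double p = 1.0 / (1.0 + std::exp(-z));
    for (const auto& kv : x) {
      if (kv.first != 0) grad_sum[kv.first] += (p - label) * kv.second;
    }
  }

  // Average over the batch and step against the gradient.
  const double scale = alpha / static_cast<double>(end - begin);
  for (const auto& kv : grad_sum) weights[kv.first] -= scale * kv.second;
}
```

Calling this over consecutive slices of a shuffled dataset gives one epoch. With a batch size of 1 it reduces to the per-example update in the loop posted above, minus the L1 part.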
