Automatic differentiation reverse-mode loss function


#1

Hi everyone, I’m new to the world of automatic differentiation, and I’m trying to get the gradient of this loss function with respect to the weights. The problem is that I’m working with the map data type, which has made it difficult for me to translate the function into Stan notation. I’ve tried to do it myself, but I’m not making good progress. If someone could help me, I would be very glad.

vector<map<int,double>> data;
map<int,double> weights;
vector<int> index(data.size());
iota(index.begin(),index.end(),0);


double lossFunction(map<int,double>& features,map<int,double>& weights){
	
	Matrix<stan::math::var, Dynamic, 1> loss(1);
	double logit = 0.0;
	static double overflow = 20.0;
	vector<int> aux(features.size()); // index vector with the same length as data
	iota(aux.begin(),aux.end(),0);
	int label = features[aux[0]];

    for(auto it = features.begin(); it != features.end(); it++){
        if(it->first != 0){
            logit += it->second * weights[it->first];
        }
    }
	if (logit > overflow) logit = overflow;
    if (logit < -overflow) logit = -overflow;
	double predicted = 1.0/(1.0 + exp(-logit));
	loss = label*log(predicted)+(1-label)*log(1-predicted);
    loss = loss *-1;
    loss.grad()
    double grad_val = loss.val();


return grad_val

}


double result = lossFunction(data[index[i]],weights);

#2

The map data type shouldn’t cause any trouble.

If you want to get the gradients with respect to weights, you’ll want the weights to be vars and the output to be a var as well. As a C++ snippet, you want:

map<int, var> weights_as_vars;
// Convert your weights to vars
for(auto it = weights.begin(); it != weights.end(); it++) {
  weights_as_vars[it->first] = it->second;
}

var loss = lossFunction(features, weights_as_vars); // You'll need to make logit a var, and probably some other stuff inside lossFunction

// Do the reverse mode autodiff (if you need to call this multiple times you'll want to clear out the autodiff stack -- you probably need this so ask about it and I'll explain or find an example)
loss.grad();

// Print out the gradients of the loss -- they're stored in the weights vars (which is why we needed those to be vars)
for(auto it = weights_as_vars.begin(); it != weights_as_vars.end(); it++) {
  std::cout << "dloss_dweight" << it->first << " = " << (it->second).adj() << std::endl;
}

Does that make sense? You’ll need to make changes to the loss function to get this to work, but I figured showing you what the interface looks like might make it clearer what you need to do. Did you find the Stan math paper: https://arxiv.org/abs/1509.07164 ? Have you tried coding up some of the examples in it to see if you can get them to work?
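
One way to check the autodiff output: for this loss the gradient has a closed form, d loss / d w_j = (predicted - label) * x_j, so the adj() values printed above should agree with it. Below is a minimal plain C++ sketch of that check with no Stan dependency. The helper names (sigmoid, linear_predictor, log_loss, log_loss_gradient, finite_difference) are made up for illustration, and it assumes, as in the original post, that key 0 of the features map holds the label.

```cpp
#include <cmath>
#include <map>

using SparseVec = std::map<int, double>;

// Hypothetical helper, not a Stan Math function.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Sum of feature * weight over all keys except 0 (assumed to be the label).
double linear_predictor(const SparseVec& features, const SparseVec& weights) {
  double z = 0.0;
  for (const auto& kv : features) {
    if (kv.first == 0) continue;
    auto w = weights.find(kv.first);  // find() avoids inserting missing keys
    if (w != weights.end()) z += kv.second * w->second;
  }
  return z;
}

// Binary log loss for a single example.
double log_loss(const SparseVec& features, const SparseVec& weights) {
  const double y = features.at(0);
  const double p = sigmoid(linear_predictor(features, weights));
  return -(y * std::log(p) + (1.0 - y) * std::log(1.0 - p));
}

// Closed-form gradient: d loss / d w_j = (p - y) * x_j.
SparseVec log_loss_gradient(const SparseVec& features,
                            const SparseVec& weights) {
  const double y = features.at(0);
  const double p = sigmoid(linear_predictor(features, weights));
  SparseVec grad;
  for (const auto& kv : features) {
    if (kv.first != 0) grad[kv.first] = (p - y) * kv.second;
  }
  return grad;
}

// Central finite difference in one weight, as an independent cross-check.
double finite_difference(const SparseVec& features, SparseVec weights,
                         int key, double h = 1e-6) {
  weights[key] += h;
  const double up = log_loss(features, weights);
  weights[key] -= 2.0 * h;
  const double down = log_loss(features, weights);
  return (up - down) / (2.0 * h);
}
```

Comparing log_loss_gradient(features, weights)[k] with finite_difference(features, weights, k), and then with the adj() of the matching weight var, is a cheap way to confirm that the var version of lossFunction is wired up correctly.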


#3

Yes, I found that document. I had my doubts about whether the map was causing problems with the Stan data types, so this will be a great help in guiding me. Thank you very much for your help. I will write back later to report how it goes.


#4

Why make the loss a one-element vector rather than a scalar?

Binary log loss is built in, so this can be simplified to:

double predicted = inv_logit(logit);
loss = binary_log_loss(label, predicted);

What’s more, the combination is just the negative Bernoulli log probability on the log odds scale, which is built in as:

loss = -bernoulli_logit_lpmf(label, logit);

I’d rename the variable logit to something that isn’t conflated with a scale and a function.
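
To see why the fused form is worth using, here is a small plain C++ sketch (not Stan's implementation) comparing the two-step computation with the negative Bernoulli log probability evaluated directly on the log odds. The function names (softplus, loss_two_step, loss_fused) are hypothetical. Under these assumptions the two agree for moderate inputs, while the fused form stays finite for large inputs, which is what the manual overflow clamp in the original post was working around.

```cpp
#include <algorithm>
#include <cmath>

// log(1 + exp(z)), arranged so exp() never overflows.
double softplus(double z) {
  return std::max(z, 0.0) + std::log1p(std::exp(-std::fabs(z)));
}

// Two-step form: squash to a probability, then take the log loss.
double loss_two_step(int label, double log_odds) {
  const double p = 1.0 / (1.0 + std::exp(-log_odds));
  return label == 1 ? -std::log(p) : -std::log(1.0 - p);
}

// Fused form: negative Bernoulli log probability on the log odds scale.
//   label == 1: -log(sigmoid(z))     = log(1 + exp(-z))
//   label == 0: -log(1 - sigmoid(z)) = log(1 + exp(z))
double loss_fused(int label, double log_odds) {
  return label == 1 ? softplus(-log_odds) : softplus(log_odds);
}
```

For example, with label 0 and a log odds of 800, the two-step form returns infinity because the probability rounds to exactly 1, whereas the fused form returns 800.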


#5

Thank you very much for responding. So the loss could be simplified that way as well. What I intend to do is implement gradient descent in the following way:

stan::math::var lossFunction(map<int,double>& features,map<int,stan::math::var>& weights){
			stan::math::var logit = 0.0;
			static double overflow = 20.0;
			vector<int> aux(features.size()); // index vector with the same length as data
			iota(aux.begin(),aux.end(),0);
			int label = features[aux[0]];


		    for(auto it = features.begin(); it != features.end(); it++){
		        if(it->first != 0){
		        	stan::math::var w = weights[it->first];
		        	stan::math::var f = it->second;
		            logit += f * w;
		        } 
		    }
			if (logit > overflow) logit = overflow;
		    if (logit < -overflow) logit = -overflow;
			// predicted must stay a var: a double would cut the gradient path back to the weights
			stan::math::var predicted = inv_logit(logit);
			stan::math::var loss = binary_log_loss(label, predicted);
		  

return loss;
}

int main(){

vector<map<int,double>> data;
map<int,double> weights;
vector<int> index(data.size());
iota(index.begin(),index.end(),0);

// the data is read and stored in both weights and data

map<int,stan::math::var> weights_as_vars;
        // Convert your weights to vars
        for(auto it = weights.begin(); it != weights.end(); it++) {
             weights_as_vars[it->first] = it->second;
        }

cout << "# stochastic gradient descent" << endl;
        while(norm > eps){

            map<int,double> old_weights(weights);
            if(shuf) shuffle(index.begin(),index.end(),g);

            for (unsigned int i = 0; i < data.size(); i++){
                mu += (l1*alpha);
                stan::math::var result = lossFunction(data[index[i]],weights_as_vars);
                stan::math::set_zero_all_adjoints();
                result.grad();
                double gradient= 0.0;
                for(auto it = data[index[i]].begin(); it != data[index[i]].end(); it++){
                    if(it->first != 0){
                    	
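                    	// look up the adjoint by feature index (the map key), not by feature value
                    	gradient = (weights_as_vars[it->first]).adj();
                        // step against the gradient of the loss to decrease it
                        weights[it->first] -= alpha * gradient;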
                        

                        if(l1){
                           
                            double z = weights[it->first];
                            if(weights[it->first] > 0.0){
                                weights[it->first] = max(0.0,(double)(weights[it->first] - (mu + total_l1[it->first])));
                            }else if(weights[it->first] < 0.0){
                                weights[it->first] = min(0.0,(double)(weights[it->first] + (mu - total_l1[it->first])));
                            }
                            total_l1[it->first] += (weights[it->first] - z);
                        }    
                    }
                }
            }


}

#6

We have an L-BFGS optimizer that uses not only the gradient but also a limited-memory approximation to the Hessian, so it tends to converge much faster and with more accuracy than first-order gradient descent.

Stochastic gradient optimizers are mainly useful when you have so much data that you need to work in minibatches.
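
To make the minibatch idea concrete, here is a minimal plain C++ sketch of minibatch stochastic gradient descent for the logistic loss discussed in this thread. It uses the closed-form gradient (predicted - label) * x_j rather than Stan's autodiff, and it is not Stan's L-BFGS optimizer. The names (Example, minibatch_sgd, mean_log_loss) and the convention that key 0 holds the label are assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <random>
#include <vector>

using Example = std::map<int, double>;  // key 0 is assumed to hold the label

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Sum of feature * weight over all keys except the label key.
double linear_predictor(const Example& x, const std::map<int, double>& w) {
  double z = 0.0;
  for (const auto& kv : x) {
    if (kv.first == 0) continue;
    auto it = w.find(kv.first);
    if (it != w.end()) z += kv.second * it->second;
  }
  return z;
}

// Average log loss over the data, using the overflow-safe log(1 + exp(.)).
double mean_log_loss(const std::vector<Example>& data,
                     const std::map<int, double>& w) {
  double total = 0.0;
  for (const auto& x : data) {
    const double z = linear_predictor(x, w);
    const double s = x.at(0) == 1.0 ? -z : z;
    total += std::max(s, 0.0) + std::log1p(std::exp(-std::fabs(s)));
  }
  return total / static_cast<double>(data.size());
}

// Each epoch shuffles the examples and takes one step per minibatch.
void minibatch_sgd(const std::vector<Example>& data,
                   std::map<int, double>& w, double alpha,
                   std::size_t batch_size, int epochs, unsigned seed) {
  std::vector<std::size_t> index(data.size());
  std::iota(index.begin(), index.end(), 0);
  std::mt19937 rng(seed);
  for (int e = 0; e < epochs; ++e) {
    std::shuffle(index.begin(), index.end(), rng);
    for (std::size_t start = 0; start < index.size(); start += batch_size) {
      const std::size_t stop = std::min(start + batch_size, index.size());
      std::map<int, double> grad;  // gradient summed over this minibatch
      for (std::size_t i = start; i < stop; ++i) {
        const Example& x = data[index[i]];
        const double residual = sigmoid(linear_predictor(x, w)) - x.at(0);
        for (const auto& kv : x) {
          if (kv.first != 0) grad[kv.first] += residual * kv.second;
        }
      }
      const double n = static_cast<double>(stop - start);
      for (const auto& kv : grad) w[kv.first] -= alpha * kv.second / n;
    }
  }
}
```

Each step touches only batch_size examples, so the cost per update does not grow with the size of the data set. That is the trade-off against L-BFGS, which needs the gradient over all of the data for every update.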