Loss Functions in Deep Learning Models
A loss function measures how well your model fits the dataset. If the model's predictions are close to the targets, the loss is low; otherwise, the loss is high. Any machine learning or deep learning model is trained to reduce this loss by adjusting its weights.
Optimization algorithms minimize the loss by varying the weights and biases (the trainable parameters).
The cost or loss function has an important job in that it must faithfully distill all aspects of the model down into a single number in such a way that improvements in that number are a sign of a better model.
This can be a challenging problem as the function must capture the properties of the problem and be motivated by concerns that are important to the project and stakeholders.
The picture gives a 3-dimensional representation of a simple loss function with 2 weights. The loss is at its minimum at the lowest point of the bowl-shaped surface (a paraboloid). The goal of the optimization algorithm is to minimize the loss by varying the weights w1 and w2 to reach that lowest point.
Gradient Descent
Gradient descent is one of the most important optimization algorithms used in deep learning.
Imagine you are standing at the top of a hill, or somewhere on its slope, and your goal is to reach the lowest point in the valley.
You take small steps downhill until you reach the lowest point of the valley. This is the intuition behind gradient descent.
Derivative in Gradient Descent
A gradient is a vector that generalizes the derivative (dy/dx, the instantaneous rate of change of y with respect to x) to functions of more than one variable. It is calculated using partial derivatives, one per variable. Another difference from an ordinary derivative is that the gradient of a function forms a vector field: it assigns a vector to every point in the input space.
The gradient points in the direction of steepest increase of the loss, so moving against it (along the negative gradient) is the direction that reduces the loss.
Consider the picture above, which shows the variation of the cost function J(w). At the initial point the magnitude of the gradient dJ/dw is large, which indicates that you still have to move down to reach the minimum. The gradient at a point is the slope of the tangent drawn to the error function at that point. At the minimum the gradient is 0. Once you cross to the other side of the curve the gradient changes sign, which indicates that you have overshot and must move back in the opposite direction.
w=w−η⋅∇J(w)
Here w is the weight, η is the learning rate, and ∇J(w) is the gradient of the cost (or loss) function J(w) with respect to w.
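To make the update rule concrete, here is a minimal sketch in Python. The loss J(w) = (w - 3)^2, the starting point, and the learning rate are illustrative assumptions, not values from this article.

```python
def grad_J(w):
    # Gradient of the illustrative loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def gradient_descent(w0, eta=0.1, steps=100):
    # Repeatedly apply the update w = w - eta * grad J(w)
    w = w0
    for _ in range(steps):
        w = w - eta * grad_J(w)
    return w

# Start far from the minimum; the iterates move toward w = 3
w_final = gradient_descent(w0=10.0)
```

With a learning rate that is too large the same loop would overshoot and diverge, which is why η is treated as a hyperparameter to tune.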
Optimization Challenges in Deep Learning
The loss function may not always be as simple as a quadratic. It may have many minima, and you may end up in a local minimum. There may also be a saddle point or plateau, where the curve remains nearly flat over a wide range of weights before going down again.
Consider this curve, which has 2 minima, B and A. B is a local minimum, while A is the actual (global) minimum. When you reach point B, you cannot tell whether you must go further to find the actual minimum, because the gradient changes sign after crossing B just as it would at the global minimum. This is one of the challenges in optimization.
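The local-minimum problem can be reproduced with a small experiment. The function below, J(w) = w^4 - 3w^2 + w, is a hypothetical example chosen only because it has one local and one global minimum; it is not the curve from the figure.

```python
def loss(w):
    # Illustrative non-convex loss with two minima
    return w**4 - 3.0 * w**2 + w

def grad(w):
    # Derivative of the loss above
    return 4.0 * w**3 - 6.0 * w + 1.0

def descend(w0, eta=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Same algorithm, different starting points
w_right = descend(w0=2.0)    # settles in the local minimum (w > 0)
w_left = descend(w0=-2.0)    # settles in the global minimum (w < 0)
```

Both runs stop where the gradient is close to zero, but only one of them finds the lower loss value. Plain gradient descent has no way of knowing which case it is in.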
Another challenge in optimization is the saddle point. When you reach point A in the loss function below, it may look as if you have reached a minimum because the gradient is almost zero, but you still need to go further to get to the actual minimum.
Another issue in optimization is the vanishing gradient.
In this function the gradient is almost zero at points A and B, so you may take either of them for the minimum, but it may not be the lowest point of the cost function.
Maximum Likelihood
There are many functions that could be used to estimate the error of a set of weights in a neural network.
We prefer a function where the space of candidate solutions maps onto a smooth (but high-dimensional) landscape that the optimization algorithm can reasonably navigate via iterative updates to the model weights.
Maximum likelihood estimation, or MLE, is a framework for inference for finding the best statistical estimates of parameters from historical training data. Maximum likelihood seeks to find the optimum values for the parameters by maximizing a likelihood function derived from the training data.
We have a training dataset with one or more input variables, and we need to estimate the model weights that best map examples of the inputs to the output or target variable.
Given input, the model is trying to make predictions that match the data distribution of the target variable. Under maximum likelihood, a loss function estimates how closely the distribution of predictions made by a model matches the distribution of target variables in the training data.
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions.
A benefit of using maximum likelihood as a framework for estimating the model parameters (weights) for neural networks and in machine learning in general is that as the number of examples in the training dataset is increased, the estimate of the model parameters improves. This is called the property of “consistency.”
Under appropriate conditions, the maximum likelihood estimator has the property of consistency, meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. Now that we are familiar with the general approach of maximum likelihood, we can look at the error function.
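As a small illustration of the link between maximum likelihood and cross-entropy, the sketch below fits a single Bernoulli parameter p to binary data by minimizing the average negative log-likelihood, which is the same quantity as the binary cross-entropy. The data and the grid search are assumptions made for the example.

```python
import numpy as np

# Hypothetical binary observations: 7 ones out of 10
y = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], dtype=float)

def avg_nll(p, y):
    # Average negative log-likelihood of Bernoulli(p), i.e. the cross-entropy
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Evaluate the loss on a grid of candidate parameters
grid = np.linspace(0.01, 0.99, 99)
losses = np.array([avg_nll(p, y) for p in grid])

# The minimizer of the loss is the maximum likelihood estimate
p_hat = grid[np.argmin(losses)]
```

The estimate lands on the fraction of ones in the data, which is the known closed-form maximum likelihood solution for a Bernoulli variable.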
Loss functions for Linear Regression
The objective of linear regression is to find an equation (a line or curve) that fits the data points. There are various functions for calculating the loss in linear regression; two common ones are mean squared error and mean absolute error.
Mean Squared Error, L2 Loss
Mean squared error (MSE) is the most common loss function for regression problems. It is the mean of the squared differences between the predicted values and the actual values. It is also called L2 loss.
Mean Absolute Error, L1 Loss
Mean Absolute Error (MAE) is another loss function used for regression models. MAE is the mean of the absolute differences between the target and predicted values. So it measures the average magnitude of the errors in a set of predictions, without considering their direction.
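Both regression losses take only a line of NumPy each. This is a minimal sketch; the function names and the sample values are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared errors (L2 loss)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean of absolute errors (L1 loss)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical targets and predictions; the errors are 1, 0 and -2
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

mse_value = mse(y_true, y_pred)   # (1 + 0 + 4) / 3
mae_value = mae(y_true, y_pred)   # (1 + 0 + 2) / 3
```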
MSE vs MAE
MSE is easier to optimize, but mean absolute error is more resilient to outliers. Let me explain with an example. Consider the following samples for training.
Let’s use RMSE (root mean squared error) so that the error is expressed in the same units as MAE.
Here the squared error is dominated by the outliers, so a model fitted with MSE is sensitive to them.
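The effect of an outlier on the two metrics can be checked numerically. The error values below are hypothetical and stand in for the article's sample data; only the comparison between the two metrics matters.

```python
import numpy as np

def rmse(errors):
    # Root mean squared error, in the same units as the errors
    return np.sqrt(np.mean(errors ** 2))

def mae(errors):
    # Mean absolute error
    return np.mean(np.abs(errors))

# Hypothetical prediction errors without and with one outlier
clean = np.array([1.0, -1.0, 2.0, -2.0, 1.0])
with_outlier = np.append(clean, 15.0)

# How much each metric grows when the outlier is added
rmse_growth = rmse(with_outlier) / rmse(clean)
mae_growth = mae(with_outlier) / mae(clean)
```

Because the outlier is squared before averaging, RMSE grows noticeably more than MAE for the same extra sample.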
There is one problem with MAE. The magnitude of its derivative is constant, so it does not converge easily with a fixed learning rate. Even when the error is very small the gradient is just as large, which makes it hard to tell whether we are close to the actual value. MSE converges well because its gradient shrinks as the error shrinks.
L1 loss is robust to outliers, but its derivative is not continuous (it jumps at zero error), which makes it harder to arrive at the solution. L2 loss is very sensitive to outliers but gives a stable, closed-form solution, and its derivative approaches 0 near the solution.
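The difference in gradient behavior is easy to see per sample. In this sketch the gradients are taken with respect to the prediction, and the error sizes are assumed values chosen to span large, medium, and tiny errors.

```python
import numpy as np

def mse_grad(y_true, y_pred):
    # Derivative of (y_pred - y_true)^2 with respect to y_pred
    return 2.0 * (y_pred - y_true)

def mae_grad(y_true, y_pred):
    # Derivative of |y_pred - y_true| with respect to y_pred
    # (np.sign returns 0 at exactly zero error, where it is undefined)
    return np.sign(y_pred - y_true)

y_true = np.zeros(3)
y_pred = np.array([10.0, 1.0, 0.01])   # large, medium, tiny error

g_mse = mse_grad(y_true, y_pred)   # shrinks as the error shrinks
g_mae = mae_grad(y_true, y_pred)   # stays at magnitude 1
```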
Loss functions for Classification
The objective of a classification problem is to predict the probability that a sample belongs to a class. The output is a probability value f(x), which is called the score of the input x. Generally, the magnitude of the score represents the confidence of the prediction. In the binary case, the target variable y is either 1 (the sample belongs to the class) or 0 (it does not). Binary cross entropy and categorical cross entropy are common loss functions for classification problems.
Binary Cross Entropy
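The binary cross entropy loss is used for yes/no decisions, when there are 2 classes. The loss tells you how wrong your model’s predictions are.
L = −(1/N) Σ [ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]
where y is the true label (0 or 1), ŷ is the predicted value, and N is the number of samples.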
Binary crossentropy measures how far away from the true value (which is either 0 or 1) the prediction is for each of the classes and then averages these class-wise errors to obtain the final loss. Cross-entropy loss increases as the predicted probability diverges from the actual label.
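A minimal NumPy sketch of binary cross entropy follows. The clipping constant eps is an assumption added to avoid log(0), and the sample probabilities are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Clip probabilities so that log(0) never occurs
    p = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([1.0, 0.0])

# Confident and correct predictions give a small loss
loss_right = binary_cross_entropy(y_true, np.array([0.9, 0.1]))

# Confident but wrong predictions give a large loss
loss_wrong = binary_cross_entropy(y_true, np.array([0.1, 0.9]))
```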
Categorical Cross Entropy
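Categorical cross entropy is used for multiclass classification problems.
L = −Σ y_c·log(ŷ_c), summed over the classes c
where y_c is 1 for the true class and 0 otherwise, and ŷ_c is the predicted value (probability) for class c.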
Categorical crossentropy will compare the distribution of the predictions (the activation in the output layer, one for each class) with the true distribution, where the probability of the true class is set to 1 and 0 for the other classes. To put it in a different way, the true class is represented as a one-hot encoded vector, and the closer the model’s outputs are to that vector, the lower the loss.
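The same idea extends to several classes. This sketch assumes one-hot targets and predicted probabilities that already sum to 1 per row; the values are illustrative.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    # Only the probability assigned to the true class contributes
    p = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

# Two hypothetical samples, three classes
y_onehot = np.array([[0.0, 1.0, 0.0],
                     [1.0, 0.0, 0.0]])
y_prob = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1]])

loss = categorical_cross_entropy(y_onehot, y_prob)
```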
Loss function in Knowledge Distillation Model
Knowledge distillation is a model compression technique described in a 2015 paper by Geoffrey Hinton, Oriol Vinyals and Jeff Dean. Model compression reduces the computation and the size of the target model, which makes it usable across many devices, including mobile phones, where storage is limited. The main idea behind knowledge distillation is to transfer the knowledge of a complex, cumbersome, or computation-intensive model to a simple model, so that the simple model mimics the behavior of the cumbersome one. The picture below gives the details of the loss function in a knowledge distillation model. The cumbersome model is called the teacher model, and the target or simple model is called the student model.
The softmax function is given by the expression below.
σ(z)i = exp(zi) / Σj exp(zj)
It takes the output of the last layer of the model (the logits zi) and outputs a probability for each class, with the predicted class receiving the highest probability. On its own, this output carries little information beyond the ground truth provided in the dataset. For example, if the final prediction is ‘1’, that class label gets the maximum probability, and the output says little about how close the input is to other digits such as ‘7’ or ‘6’. To expose this information in the output layer, the softmax function with temperature is used.
The softmax function with temperature is given by the expression below. It is the same as the softmax function above except that it divides the input logits zi by the temperature T.
σ(z, T)i = exp(zi / T) / Σj exp(zj / T)
The parameter T makes the softmax output softer, so that it carries more information about the other classes that are close to the predicted class. It gives information not only on the predicted class but also on which classes it resembles. For example, if the predicted class label is ‘1’, it not only gives the highest probability to the ‘1’ class but also gives the next highest probabilities to digits that look similar to ‘1’, like ‘7’ and ‘6’. Hinton refers to this as ‘Dark Knowledge’ in his work on knowledge distillation.
The outputs of the softmax function with temperature are called soft targets, and the student model is trained to reproduce the soft targets of the teacher model. So the loss function is the cross entropy between the soft targets of the teacher model and the soft predictions of the student model. The soft predictions of the student model are obtained with the softmax function with temperature, using the same T value that was used to produce the teacher's soft targets. This loss is called the distillation loss.
However, to make the student model perform better, Hinton's paper recommends also training it with a second loss, called the student loss. This is the cross entropy between the student model's predicted class probabilities, computed with the softmax function at T=1, and the actual target labels in the dataset.
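The softening effect of the temperature is easy to check numerically. In this sketch the logits and the value T=4 are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Divide the logits by the temperature before normalizing
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([6.0, 3.0, 1.0])   # hypothetical teacher logits

hard = softmax(logits, T=1.0)   # almost all mass on the top class
soft = softmax(logits, T=4.0)   # mass spread over the other classes
```

The ranking of the classes is unchanged, but at the higher temperature the runner-up classes receive visibly more probability, which is the extra information the student learns from.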
Hence the overall loss function (L) for the student model combines the distillation loss (L1) and the student loss (L2). In its simplest form,
L = L1 + L2 where,
L1 = Cross Entropy(σ(ZT, T=t), σ(ZS, T=t)) where,
σ(ZT, T=t) is the softmax output of the teacher model with T=t and
σ(ZS, T=t) is the softmax output of the student model with T=t
L2 = Cross Entropy(y, σ(ZS, T=1)) where,
σ(ZS, T=1) is the softmax output of the student model with T=1
In practice the two terms are weighted, so the final loss is L = (1 − α)L1 + αL2, where α is a hyperparameter chosen during model training.
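Putting the pieces together, the weighted student loss might be sketched as follows. The logits, T=4 and α=0.5 are illustrative assumptions, and the sketch follows the weighting used in this article without any additional scaling.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    # Cross entropy between a target distribution and a predicted one
    return -np.mean(np.sum(target_probs * np.log(pred_probs + eps), axis=-1))

def student_total_loss(teacher_logits, student_logits, y_onehot, T=4.0, alpha=0.5):
    # L1: distillation loss between soft targets and soft predictions
    l1 = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    # L2: student loss against the true labels at T=1
    l2 = cross_entropy(y_onehot, softmax(student_logits, 1.0))
    return (1.0 - alpha) * l1 + alpha * l2, l1, l2

# One hypothetical sample with three classes
teacher_logits = np.array([[8.0, 2.0, 1.0]])
student_logits = np.array([[5.0, 3.0, 0.5]])
y_onehot = np.array([[1.0, 0.0, 0.0]])

total, l1, l2 = student_total_loss(teacher_logits, student_logits, y_onehot)
```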
Single Shot Multi Box Detection
Single shot multibox detection (SSD) is one of the object detection models used in deep learning. It was introduced by Wei Liu and colleagues, and it is built on top of a base model. The objective of the model is to detect the location of an object in the image and classify it.
Suppose you have a cat in the image. The SSD model detects the location of the cat in the image, draws a bounding box with coordinates x, y, width, and height, and classifies the object with the label cat.
This model is built on a base model such as VGG16 or MobileNet.
Default anchor boxes are created with different scales and aspect ratios. Prediction layers are added on top of the base model for object detection at different scales. A set of filters is applied to the convolutional feature maps to predict detections for the different aspect ratios and class categories. These predictions are compared with the ground truth boxes.
The loss for this kind of model is a combination of classification and regression: classification for detecting the class, and regression for localizing the bounding box.
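The loss is expressed with the equation below.
L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))
Here N is the number of matched default boxes, L_conf is the confidence (classification) loss, L_loc is the localization loss between the predicted boxes l and the ground truth boxes g, and α is a weight between the two terms.
Hence the overall loss of the SSD model is the combination of the confidence loss and the localization loss.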
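A simplified sketch of such a combined loss is shown below. It assumes a softmax cross entropy for the confidence term and a Smooth L1 penalty for the localization term, as in the SSD paper, but it leaves out default-box matching and hard negative mining, so it should be read as an illustration rather than a full implementation. All array values are made up.

```python
import numpy as np

def smooth_l1(x):
    # Quadratic for small errors, linear for large ones
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def softmax_cross_entropy(logits, labels):
    # Per-box classification loss from raw class scores
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def ssd_loss(class_logits, class_labels, box_pred, box_target, alpha=1.0):
    # Assumption: label 0 is background; other labels are matched boxes
    positive = class_labels > 0
    n = max(int(positive.sum()), 1)
    l_conf = softmax_cross_entropy(class_logits, class_labels).sum()
    # Localization loss is computed only for matched (positive) boxes
    l_loc = smooth_l1(box_pred[positive] - box_target[positive]).sum()
    return (l_conf + alpha * l_loc) / n

# Two hypothetical default boxes: one matched to class 1, one background
class_logits = np.array([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]])
class_labels = np.array([1, 0])
box_pred = np.array([[0.1, 0.2, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]])
box_target = np.zeros((2, 4))

loss = ssd_loss(class_logits, class_labels, box_pred, box_target)
```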
Generative Adversarial Network (GAN)
Ian Goodfellow and his colleagues from the University of Montreal introduced the GAN in 2014. It learns the variations in a dataset and generates new data from random samples. It is made up of two networks which play a game against each other. The Generator generates fake samples of data and tries to fool the Discriminator. The aim of the Discriminator is to distinguish between generated and real input. The Generator is trained to maximize the probability of the Discriminator making a mistake when telling real input from fake input. The Discriminator is trained to maximize the probability of assigning the correct label to both real and fake images. The sample architecture of a GAN model is given below.
The overall objective of the GAN model is a minimax game: the Discriminator maximizes, and the Generator minimizes, the same value function.
L1 is called the discriminator loss and L2 is called the generator loss.
L1 = maximize (log D(x) + log(1 − D(G(z)))) where,
G = Generator
D = Discriminator
D(x) is the discriminator’s estimate of the probability that real data instance x is real
G(z) is the generator’s output for the random input z
D(G(z)) is the discriminator’s estimate of the probability that a fake instance is real, so 1 − D(G(z)) is its estimate of the probability that the fake instance is fake.
L2 = minimize log(1 − D(G(z)))
So the generator loss minimizes the log-probability that the discriminator recognizes a fake instance as fake.
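The two objectives can be written directly from discriminator outputs. This sketch uses the standard minimax form, log D(x) + log(1 − D(G(z))), and hypothetical discriminator probabilities in place of real networks.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    # Value the discriminator tries to maximize
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_minimax_loss(d_fake):
    # Value the generator tries to minimize
    return np.mean(np.log(1.0 - d_fake))

# Hypothetical discriminator outputs (probability that the input is real)
d_real = np.array([0.9, 0.8])
d_fake_poor = np.array([0.1, 0.2])   # fakes are easy to spot
d_fake_good = np.array([0.6, 0.7])   # fakes fool the discriminator

g_loss_poor = generator_minimax_loss(d_fake_poor)
g_loss_good = generator_minimax_loss(d_fake_good)
d_value_poor = discriminator_objective(d_real, d_fake_poor)
d_value_good = discriminator_objective(d_real, d_fake_good)
```

Better fakes lower the generator loss and, at the same time, lower the value the discriminator is trying to maximize, which is the adversarial tension between the two networks.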
The minimax loss function can cause the GAN to get stuck in the early stages of training, because it may not provide sufficient gradient for G to learn well. During the early stages of learning the generator may perform poorly, and D can reject its samples with high confidence because they are too different from the real data.
To overcome this problem, a modified generator loss is used:
L2 = maximize (log D(G(z)))
With this loss, the generator tries to maximize the probability that the discriminator classifies generated samples as real. It improves training in the early stages by providing stronger gradients for updating the weights of the generator model.
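The gradient argument can be verified with a short calculation. The sketch assumes the discriminator ends in a sigmoid, D = sigmoid(a), and compares the gradient of each generator loss with respect to the logit a when the discriminator confidently rejects a fake sample. The value a = -5 is an illustrative assumption.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_minimax(a):
    # d/da of log(1 - sigmoid(a)) equals -sigmoid(a)
    return -sigmoid(a)

def grad_modified(a):
    # d/da of -log(sigmoid(a)) equals -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# Early in training: D assigns the fake a probability of about 0.007
a = -5.0
g_minimax = grad_minimax(a)     # close to 0, so G barely learns
g_modified = grad_modified(a)   # close to -1, a much stronger signal
```

The generator's weights receive this gradient through the chain rule, so a near-zero value here means near-zero updates for the whole generator.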
Conclusion
The choice of loss function for any model depends on the output expected for the problem at hand. Choosing the right loss function is essential to reaching an optimized solution for any deep learning problem.
Authors
B N Chandrashekar, chandru4ni@gmail.com
Srinivas Chakravarthy, srinivas.yeeda@gmail.com