Introduction
This post continues my blog series on Neural Networks and Deep Learning with (an) R (twist), which is motivated by my current enrollment in Andrew Ng’s Deep Learning Specialisation on Coursera. One of the very first things I picked up in this course is that the familiar logistic regression classifier can be seen as a neural network. In fact, it turns out that the logistic regression classifier is a good example to illustrate and motivate the basics of neural networks. I start by motivating the binary classification problem.
The Binary Classification Problem
The binary classification problem is one in which, given a set of inputs $x$ (called features), we want to output a binary prediction $y \in \{0, 1\}$. A fascinating example of this type of problem I saw recently is: given a set of (some Nike and Adidas) shoe pictures, can we learn a binary classifier to tell whether a shoe was made by Nike or by Adidas?
In this setting, the possible outputs (Nike/Adidas) of $y$ are denoted with $1$ and $0$, and the logistic regression classifier is typically used on this type of problem because it is capable of producing an output (prediction) $\hat{y}$ which is basically the probability that $y = 1$ given the input features $x$:

$$\hat{y} = P(y = 1 \mid x), \qquad x \in \mathbb{R}^{n_x}$$
Why The Logistic Regression Classifier?
The logistic regression classifier can easily be motivated from the linear regression model given by

$$\hat{y} = w^T x + b$$
where $n_x$ is the number of features (or predictors or columns) of $x$, $w \in \mathbb{R}^{n_x}$, and $b \in \mathbb{R}$. However, we want $0 \leq \hat{y} \leq 1$, but the current linear regression model does not satisfy that. Consequently, we pass $w^T x + b$ through the sigmoid function $\sigma$, given by:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Note that if $z$ is very large, $\sigma(z)$ is close to $1$, and if $z$ is very small (a large negative number), $\sigma(z)$ is close to zero. If we let $z = w^T x + b$, the output of the logistic regression classifier can then be written as:

$$\hat{y} = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$
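To make this concrete, here is a minimal R sketch of the sigmoid function and the resulting logistic regression prediction. The values of `w`, `b`, and `x` below are made up purely for illustration.

```r
# Sigmoid (logistic) function: squashes any real number into (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

# Made-up parameters and a single feature vector (n_x = 3 features)
w <- c(0.5, -1.2, 0.8)   # weights, one per feature
b <- 0.1                 # bias (intercept)
x <- c(1.0, 0.3, -0.5)   # one input example

# Logistic regression prediction: y_hat = sigmoid(w'x + b)
z <- sum(w * x) + b
y_hat <- sigmoid(z)
y_hat   # interpreted as P(y = 1 | x)
```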
Learning The Parameters of The Logistic Regression Classifier
This section details the steps needed to train the logistic regression classifier. First we set up the problem with the appropriate notation.
Setup
Given the set of $m$ training examples

$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\},$$

where any $x^{(i)} \in \mathbb{R}^{n_x}$ and $y^{(i)} \in \{0, 1\}$, with the superscript indices referring to training examples, we want, for any single training example $(x^{(i)}, y^{(i)})$, a prediction $\hat{y}^{(i)}$ that is as close to the actual value $y^{(i)}$ as possible, i.e. we want

$$\hat{y}^{(i)} \approx y^{(i)}, \quad \text{where} \quad \hat{y}^{(i)} = \sigma\left(w^T x^{(i)} + b\right).$$
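To make the notation concrete, the sketch below stores the $m$ training examples column-wise in a feature matrix (one example $x^{(i)}$ per column). The data are randomly generated placeholders, and it reuses `sigmoid()`, `w`, and `b` from the earlier sketch.

```r
set.seed(42)
n_x <- 3   # number of features
m   <- 5   # number of training examples

# X holds one training example x^(i) per column (an n_x by m matrix)
X <- matrix(rnorm(n_x * m), nrow = n_x, ncol = m)

# y holds the corresponding binary labels y^(i)
y <- sample(c(0, 1), size = m, replace = TRUE)

# Prediction y_hat^(i) for the i-th training example
i <- 1
y_hat_i <- sigmoid(sum(w * X[, i]) + b)
```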
Next, we proceed by first defining the logistic regression loss function.
The Logistic Regression Loss (or Error) Function
To assess how good the current values of the parameters $w$ and $b$ are, we need to define a metric to compare how close a single prediction $\hat{y}^{(i)}$ (on the training example $x^{(i)}$, given $w$ and $b$) is to the actual value $y^{(i)}$. This metric is called the loss (or error) on a training example, and it basically measures the difference between $\hat{y}^{(i)}$ and $y^{(i)}$. For logistic regression, the preferred loss function is

$$\mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) = -\left[y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right)\right]$$

Note that if $y^{(i)} = 1$, then $\mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) = -\log \hat{y}^{(i)}$, and since we want to minimise the loss (we want $\hat{y}^{(i)}$ to be as close to $1$ as possible), $\log \hat{y}^{(i)}$ must be large, which implies that $\hat{y}^{(i)}$ must be large, which consequently means that $\hat{y}^{(i)}$ will be close to $1$ (since the sigmoid function ensures that $\hat{y}^{(i)} \leq 1$).
Likewise, if $y^{(i)} = 0$, then $\mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) = -\log\left(1 - \hat{y}^{(i)}\right)$, and to minimise $\mathcal{L}$, $\log\left(1 - \hat{y}^{(i)}\right)$ must be large, which implies that $1 - \hat{y}^{(i)}$ must be large, which in turn means that $\hat{y}^{(i)}$ must be small and, consequent upon the constraint of the sigmoid function, close to $0$ (because the sigmoid function ensures that $\hat{y}^{(i)} \geq 0$).
Therefore, minimising $\mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)$ corresponds to getting a prediction $\hat{y}^{(i)}$ which is as close to the actual value $y^{(i)}$ as possible.
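In code the loss is a one-liner; here is a minimal R sketch (the helper name `loss` is my own), with two quick sanity checks of the behaviour described above.

```r
# Cross-entropy loss on a single training example
loss <- function(y_hat, y) {
  -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

# A confident, correct prediction incurs a small loss ...
loss(0.99, 1)   # ~0.01
# ... while a confident, wrong prediction incurs a large loss
loss(0.01, 1)   # ~4.61
```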
The Logistic Regression Cost Function
The loss function defined above measures the error between the prediction and the actual value ($\hat{y}^{(i)}$ and $y^{(i)}$ respectively) on a single training example $(x^{(i)}, y^{(i)})$. To assess the parameters on the entire training data, we need to define the cost function, which averages the loss function over all $m$ training examples. Consequently, the cost function given $w$ and $b$ is defined as

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) = -\frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right)\right]$$

Therefore, to get the values of $w$ and $b$ which guarantee that $\hat{y}^{(i)}$ is as close to $y^{(i)}$ as possible for all $i$, we need to minimise $J(w, b)$. That is, we want to find the $w$ and $b$ which minimise $J(w, b)$!
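To round things off, here is a sketch of the cost function in R, vectorised over all $m$ examples; the helper name `cost` and the column-wise layout of `X` follow the illustrative conventions used above. Minimising this function with respect to `w` and `b` (for example with gradient descent) is how the classifier is actually trained.

```r
# Cost: average cross-entropy loss over all m training examples.
# X is an n_x by m matrix (one example per column); y is a length-m 0/1 vector.
cost <- function(w, b, X, y) {
  m     <- ncol(X)
  y_hat <- 1 / (1 + exp(-(t(w) %*% X + b)))   # vectorised predictions
  -(1 / m) * sum(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

cost(w, b, X, y)   # using the w, b, X, y from the earlier sketches
```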