I first heard about artificial neural networks around 2017. Since then I have tried to understand their behavior and explain them in a simple way.

Neural networks compared to traditional ML techniques

Neural networks are very adaptable systems. In practice this means that they can learn complex patterns without explicit definition. It has even been said that feature engineering can be skipped entirely.

Many neural network architectures can be trained efficiently to make predictions for unseen scenarios.

Both of these arguments apply to the common definition of machine learning; neural networks just take these aspects further. As a downside, the internal behavior of a neural network is not as easy to explain as that of traditional ML models.

Data for neural network example

Example case: predict loan risk from applicant age. The training data could be something like this. Neural networks work better with normalized values, so the age is scaled between 0 and 1. One row is one person.

| age_years | age_scaled | risk |
|---|---|---|
| 20 | 0.0 | 0.8 |
| 40 | 0.5 | 0.1 |
| 60 | 1.0 | 0.3 |

The aim is to predict credit risk from age_scaled.
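As a quick sketch, the min-max scaling of the age column could be done like this in Python (the variable names are just for illustration; the scaling bounds are the minimum and maximum of this tiny dataset):

```python
# Min-max scale the ages to [0, 1] using the dataset's own minimum and maximum.
ages = [20, 40, 60]

age_min, age_max = min(ages), max(ages)
ages_scaled = [(a - age_min) / (age_max - age_min) for a in ages]

print(ages_scaled)  # [0.0, 0.5, 1.0]
```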

Neural network with 1 input and 1 output

The simplest possible neural network architecture looks like this:

O - O

It has an input node O, one edge -, and an output node O.

Each observation in the training set runs through this “network”.

The first observation in the loan data had a scaled age of 0.0 (the 20-year-old). That is the input. Simple math is applied at the output node to get a prediction for this person:

ActivationFunction(0.0*[Weight]+[Bias]) = [Predicted Risk]

This simple formula is the foundation of neural networks. But the activation function, weight and bias require a bit more explanation.
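As a minimal sketch in Python, assuming a made-up weight and bias and a linear (identity) activation, the calculation for the first observation looks like this:

```python
def activation(x):
    # Linear (identity) activation: keep the value as it is.
    return x

weight = 0.5      # in a real network this would be initialized randomly
bias = 0.1        # node-specific constant

age_scaled = 0.0  # the 20-year-old from the training data
predicted_risk = activation(age_scaled * weight + bias)
print(predicted_risk)  # 0.1
```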

Neural network with multiple inputs and 1 output

Let’s take a step further. This could be a neural network that predicts loan risk by applicant age, income and shoe size. Three inputs and one output. The smart guys would call it multivariate regression. Or just linear regression with three variables…

O \
O - O
O /

Math is not yet too complex after the first iteration:

[Predicted Risk] = ActivationFunction([Age]*[Weight 1] + [Income]*[Weight 2] + [Shoe Size]*[Weight 3] + [Bias])

Weight and bias in neural networks

Each edge in a neural network has a weight. The weights are initialized randomly.

And each node has a bias, a node-specific constant value added to the calculation.

The generic formula for a node's value is:

ActivationFunction([Input 1]*[Weight 1] + [Input 2]*[Weight 2] + ... + [Input n]*[Weight n] + [Bias])
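The same generic formula as a small Python function; the example values below are made up, not taken from the loan data:

```python
def node_value(inputs, weights, bias, activation):
    # Weighted sum of all inputs plus the node's bias, passed through the activation.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)

# Three inputs, three weights, one bias, linear activation:
print(node_value([0.5, 2.0, 1.0], [0.5, 0.25, -0.5], 0.25, lambda x: x))  # 0.5
```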

Activation function

The term activation function sounds scary, but the math is rudimentary.

The simplest activation function is the linear one. It keeps the value the same as it was.

Maybe a more common one is ReLU. Despite the weird name, it just converts negative values to zero. Without it, the sums at the core of a neural network would let large positive and large negative values cancel each other out to nearly zero.

Here is a summary table of common activation functions:

| Neural network activation function | How it works | When to use |
|---|---|---|
| Linear | No conversion. | |
| ReLU | Converts negative values to zero, otherwise the value remains. | Deep networks. |
| Leaky ReLU | Negative values become just slightly negative. | To avoid vanishing gradients. |
| Sigmoid | Probabilities [0, 1] for two-class classification. The name refers to its S shape. | Output layer. Logistic regression. |
| Softmax | Probabilities [0, 1] for multi-class classification. | Output layer. |
| TanH | Like sigmoid but can output values in [-1, 1]. | RNN and NLP. |

Sigmoid and TanH should not be used with deep networks due to vanishing gradients.
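For reference, here are minimal Python versions of the activation functions in the table (the 0.01 slope for Leaky ReLU is one common choice, not a fixed standard):

```python
import math

def linear(x):
    return x                           # no conversion

def relu(x):
    return max(0.0, x)                 # negatives become zero

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x   # negatives stay slightly negative

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # squashes to (0, 1)

def tanh(x):
    return math.tanh(x)                # squashes to (-1, 1)

def softmax(values):
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]   # probabilities summing to 1
```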

Neural network loss

Each observation runs through the network. Whatever weights the edges have are applied in the activation functions. Well, in the simple example there was only one activation, and it happened in the output node.

Loss is the prediction error in the output.

Let’s say that we used a linear activation function (no conversion) and the value in the output node was 0.32. This is our prediction. The actual value was 0.8. If our loss metric is mean absolute error, the loss would simply be:

0.8 - 0.32 = 0.48

After each iteration the loss metric is calculated again. By plotting these values we can create a loss curve to visualize progress. The loss curve should decrease exponentially.

Here are some common neural network loss functions:

| Neural network loss function | Explanation |
|---|---|
| Mean Absolute Error | Simple. |
| Mean Squared Error | Emphasizes big errors. |
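A minimal sketch of both loss functions, applied to the single prediction from the example above:

```python
def mean_absolute_error(actuals, predictions):
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)

def mean_squared_error(actuals, predictions):
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)

# The single prediction from the example above:
print(mean_absolute_error([0.8], [0.32]))  # ~0.48, as calculated by hand
print(mean_squared_error([0.8], [0.32]))   # ~0.23, squaring emphasizes big errors
```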

Updating weights by backpropagation and gradient descent

A neural network is just a bunch of simple mathematical calculations. Having enough of them makes the model complex.

Gradient descent is the algorithm that calculates the new edge weights after each observation has run through the network.

As you can see, this setup could be identical to a univariate linear or logistic regression, depending on the activation function.

There are two fundamental parameters for gradient descent: step size and step direction. The step size is known as the learning rate of the model.
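A rough sketch of one gradient descent step for the single-weight example, assuming a linear activation and a squared error loss (the derivatives below belong to that specific setup, and the starting values are made up):

```python
learning_rate = 0.1  # the step size

def gradient_step(w, b, x, y):
    # Linear activation and squared error: loss = (y - (w*x + b)) ** 2
    prediction = w * x + b
    error = y - prediction
    grad_w = -2 * error * x  # derivative of the loss with respect to the weight
    grad_b = -2 * error      # derivative of the loss with respect to the bias
    # Step against the gradient, scaled by the learning rate.
    return w - learning_rate * grad_w, b - learning_rate * grad_b

w, b = 0.5, 0.1                           # made-up starting values
w, b = gradient_step(w, b, x=0.5, y=0.1)  # the 40-year-old observation
print(w, b)                               # both move slightly toward a lower loss
```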

Here are the trade-offs for learning rate:

| Gradient descent step size | Problems |
|---|---|
| Too small | Training takes a long time. Might get stuck in a local optimum. |
| Too big | The loss curve starts oscillating and does not converge. |

Neural network optimizers

Optimizers are variants of the gradient descent algorithm. The aim is to optimize the step size or direction so that an optimal model is trained as quickly as possible.

| Neural network optimizer | Full name | When to use |
|---|---|---|
| SGD | Stochastic Gradient Descent | Common default. |
| Adam | Adaptive Moment Estimation | Large dataset, lots of parameters. |
| Adagrad | Adaptive Gradient Optimizer | Adjusts the step size per feature. |
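In a library such as Keras, switching between these is usually just a matter of passing a different optimizer when compiling the model. A hedged sketch (the learning rates shown are simply common defaults):

```python
from tensorflow import keras

sgd = keras.optimizers.SGD(learning_rate=0.01)
adam = keras.optimizers.Adam(learning_rate=0.001)
adagrad = keras.optimizers.Adagrad(learning_rate=0.001)

# A compiled model would then use the chosen optimizer, for example:
# model.compile(optimizer=adam, loss="mse")
```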

Neural network mini-batches

Running gradient descent in mini-batches reduces the variation in the loss graph and makes it more readable. This means that the loss calculation and weight update are done only after multiple observations.

Some sources recommend 10-1000 observations per mini-batch, some 40-100. If there are too many, the iteration does not fit into computer memory. Also, the program crashing after a long period of time would be more costly.

Mini-batches also allow the program to run similar operations in parallel, which is a huge performance boost.

Larger batch sizes require smaller learning rates.
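As a sketch, a mini-batch loop in plain Python could look roughly like this; update_weights is a hypothetical helper that performs one gradient descent step on a batch:

```python
batch_size = 32  # some sources recommend 10-1000 observations, others 40-100

def train_one_pass(observations, weights, update_weights):
    # Weights are updated once per mini-batch, not once per observation.
    for start in range(0, len(observations), batch_size):
        batch = observations[start:start + batch_size]
        weights = update_weights(weights, batch)
    return weights
```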

Epochs in neural network training

An epoch means running the whole dataset through the neural network once. A typical training run consists of multiple epochs.

Virtual epochs can be used for large and varying amounts of data. With this approach, an epoch can be any fixed number of observations, not necessarily the whole dataset.

Monitoring training issues

Common problematic behaviors that can be detected from the loss curve:

| Loss curve behavior | Explanation |
|---|---|
| Oscillating | Learning rate too big. |
| Converges slowly | Learning rate too small. |
| Diverges (goes up) | Overfitting; try a simpler model. |
| Not converging | Underfitting; try a more complex model. |
| Increases sharply | Anomalous values that cause NaN traps or exploding gradients. |
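To spot these patterns, it is enough to collect the loss values during training and plot them; a small matplotlib sketch with made-up loss values:

```python
import matplotlib.pyplot as plt

loss_history = [0.48, 0.31, 0.22, 0.18, 0.16, 0.15]  # made-up per-epoch losses

plt.plot(loss_history)
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss curve")
plt.show()
```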

Number of layers in neural network

More hidden layers can learn more complex non-linear patterns.

A layer can be thought of as a step in feature engineering. The model is doing some of that work automatically!

Neural network with multiple inputs and multiple outputs

In this example the input features are the loan applicant's age, income and shoe size. On top of the loan risk level, the model also predicts loan repayment time and customer lifetime value.

Multiple multivariate regressions at once. This would correspond to having three linear regressions that are somewhat tied together.

This graph tries to show that connections go from each input to each output, 9 edges in total.

O <- O
O <- O
O <- O

Neural network with multiple inputs, multiple outputs and a hidden layer

The inputs and outputs could again be the same as above, but we also have a hidden layer with 3 nodes. Now there are 18 edges in total.

The hidden layer is not controlled. Its nodes do not have an explanation that directly connects to the real world; it is just a vector of three items.

But you can imagine that the middle step describes the applicant by three attributes: the population of the city they live in, the reliability of their workplace and their cost awareness.

Or maybe this would correspond to running principal component analysis (PCA) before training, where synthetic features are created to reduce dimensionality.

O <- O <- O
O <- O <- O
O <- O <- O

A neural network with multiple hidden layers can be called deep. Compared to linear models, deep neural networks perform well with highly correlated features.
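As a rough Keras sketch of this 3-input, 3-output network with one hidden layer of three nodes (the activation choices and the optimizer are just plausible defaults, not part of the example above):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(3,)),                     # age, income, shoe size
    keras.layers.Dense(3, activation="relu"),    # hidden layer with 3 nodes
    keras.layers.Dense(3, activation="linear"),  # risk, repayment time, lifetime value
])
model.compile(optimizer="adam", loss="mse")
model.summary()  # 24 parameters: the 18 edge weights plus 6 node biases
```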

Linear vs non-linear network

According to discussion on the topic, stacking multiple linear layers in a neural network will not make it non-linear.
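This can be checked numerically: composing two purely linear layers (weights plus bias, no activation) collapses into a single linear layer with combined weights and bias. A small numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)  # first linear layer
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)  # second linear layer

x = rng.normal(size=3)  # an arbitrary input

two_layers = (x @ W1 + b1) @ W2 + b2        # two linear layers in a row
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)  # one equivalent linear layer

print(np.allclose(two_layers, one_layer))   # True
```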

To achieve model non-linearity, non-linear activation functions such as ReLU, sigmoid or tanH must be used.

Google materials state that non-linearity speeds up training and improves accuracy without loss of important information.

Neural network architectures

Here are some neural network models you might read about on the internet:

| Neural network architecture | Use case |
|---|---|
| ANN | Artificial Neural Network. The umbrella term. |
| DNN | Deep Neural Network. A network with hidden layers. |
| GAN | Generative Adversarial Network. Tries to reproduce samples (images). |
| CNN | Convolutional Neural Network. Image classification, where pixel relationships matter a lot. |
| RNN | Recurrent Neural Network. A network that has memory of the previous step. Suitable for NLP and time series. |
| LSTM | Long Short-Term Memory. Time series forecasting. |
| GRU | Gated Recurrent Unit. LSTM with a forget gate, fewer parameters and no output gate. |
| ResNet | Deep Residual Learning for image recognition, to tackle vanishing gradients. |
| Autoencoder and self-encoder | Creates embeddings in a hidden layer by learning to reconstruct its input. |
| TabNet | Deep network for tabular data with sequential attention, using the best features on each step. |
| AdaNet | Adaptive neural network that tries various subnetworks. |
| DCN | Deep & Cross Network for recommenders. |
| R-CNN | Regional CNN. Identifies multiple objects in an image by splitting it into about 2000 regions. |