Some notes about image recognition, written while preparing for the Google Cloud MLE certification.

Why Convolutional Neural Networks are great for images

In images, the relationships between neighboring pixels matter a lot. That’s why Convolutional Neural Networks (CNNs) work well for them.

A basic fully connected Deep Neural Network (DNN) could work for some problems, but it treats the image as a flat list of values and so cannot exploit the spatial arrangement of the pixels efficiently.

CNN terminology

Convolution and max pooling are at the core of most CNNs.

In a CNN, the small sliding window is known as a convolution kernel. Some kernels might detect horizontal edges, others bright spots. Without padding, convolution reduces the image size slightly, because the kernel cannot be centered on the pixels at the very edge. If a layer has multiple kernels, the output of each kernel is known as a channel. The number of pixels the kernel moves at a time is known as the stride.
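
To make the sliding window concrete, here is a minimal sketch in plain Python. The function name `conv2d`, the toy image and the edge kernel are my own illustrative choices, not taken from any library. Like most deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows), without padding."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for top in range(0, ih - kh + 1, stride):
        row = []
        for left in range(0, iw - kw + 1, stride):
            # multiply the window element-wise with the kernel and sum up
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# 5x5 toy image: dark (0) at the top, bright (1) at the bottom
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

# responds where the image gets brighter from top to bottom
horizontal_edge = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]

feature_map = conv2d(image, horizontal_edge)
print(feature_map)  # [[3, 3, 3], [3, 3, 3], [0, 0, 0]]
print(conv2d(image, horizontal_edge, stride=2))  # [[3, 3], [0, 0]]
```

The 5x5 input shrinks to 3x3 with stride 1 and to 2x2 with stride 2. The output is large where the window covers the dark-to-bright edge and zero in the uniformly bright area.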

The image can be padded, for example with zeros, to avoid the downsizing caused by the kernel.

A convolutional layer requires far fewer weights than a dense layer, because the same small kernel is reused at every position of the image. This is a big computational advantage.
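
The effect of kernel size, stride and padding on the output size follows the usual formula floor((n + 2p - k) / s) + 1. Below is a small sketch of that formula together with zero padding; the helper names `conv_output_size` and `zero_pad` are hypothetical, chosen only for this example.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output width (or height) for input size n and kernel size k."""
    return (n + 2 * padding - k) // stride + 1


def zero_pad(image, p):
    """Surround `image` (a list of rows) with p rows/columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded += [[0] * width for _ in range(p)]
    return padded


print(conv_output_size(28, 3))                       # 26: shrinks without padding
print(conv_output_size(28, 3, padding=1))            # 28: size preserved
print(conv_output_size(28, 3, stride=2, padding=1))  # 14: stride 2 halves it
print(zero_pad([[1, 2], [3, 4]], 1))
```

For a 3x3 kernel with stride 1, one pixel of padding on each side is enough to keep the output the same size as the input.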
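
A rough back-of-the-envelope count shows the difference. The layer sizes (a 28x28 grayscale input, 32 units, 32 kernels of 3x3) are assumptions picked for illustration, and the two layers do not produce the same output shape, so this compares the number of weights only.

```python
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair, plus one bias per unit
    return n_inputs * n_units + n_units


def conv_params(kernel_h, kernel_w, in_channels, n_kernels):
    # each kernel has kernel_h * kernel_w * in_channels weights, plus one bias
    return kernel_h * kernel_w * in_channels * n_kernels + n_kernels


dense = dense_params(28 * 28, 32)  # flattened 28x28 image into 32 units
conv = conv_params(3, 3, 1, 32)    # 32 kernels of 3x3 on a 1-channel image

print(dense)  # 25120
print(conv)   # 320
```

The convolutional layer's count does not depend on the image size at all, because the same kernel weights are shared across all positions.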

Max and average pooling are common techniques to condense the information from a range of pixels (the pooling window) into a single value. Pooling layers have no learnable parameters, as the operation is fixed: take the maximum or the mean of the window.
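
As a sketch of what pooling does, the following plain Python function applies non-overlapping 2x2 windows to a small feature map. The name `pool2d` and the input values are made up for the example.

```python
def pool2d(image, size=2, mode="max"):
    """Condense each non-overlapping size x size window into one value."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda values: sum(values) / len(values)  # average pooling
    out = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            window = [image[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

print(pool2d(feature_map, mode="max"))  # [[4, 2], [2, 8]]
print(pool2d(feature_map, mode="avg"))  # [[2.5, 1.0], [0.75, 6.5]]
```

Nothing in the function is trained: the only settings are the window size and the choice between maximum and mean.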

Well-known image recognition architectures

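AlexNet is a well-known early CNN. Its first layers use large kernels (11x11 and 5x5), whereas more recent architectures mostly use small kernels of around 3x3.

ResNet (Deep Residual Learning for Image Recognition). A very deep network that still trains well. It mitigates the vanishing gradient problem with skip connections that add the input of a block to its output, so each block can keep the result “at least the same” as in the previous layer. For example, ResNet-50 refers to 50 layers.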
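
The skip connection idea can be sketched in a few lines. This is a simplification under my own assumptions: a real ResNet block uses convolutions and batch normalization, whereas here `layer` is a hypothetical stand-in for whatever the block has learned, operating on a plain list of numbers.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, layer):
    """Return relu(x + layer(x)): the input is added back to the output."""
    fx = layer(x)
    return relu([a + b for a, b in zip(x, fx)])


x = [1.0, 2.0, 3.0]

# a block that has learned nothing (outputs zeros) passes its input through
unchanged = residual_block(x, lambda v: [0.0] * len(v))
print(unchanged)  # [1.0, 2.0, 3.0]

# a block that has learned a small correction only adjusts the input
adjusted = residual_block(x, lambda v: [0.5, -0.5, 0.0])
print(adjusted)  # [1.5, 1.5, 3.0]
```

Because the input is carried forward by the addition, an extra block only has to learn a correction on top of it, which is one way to read the “at least the same” intuition.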

EfficientNet. Aims to scale the width, depth, and input image resolution of the network together rather than one at a time. The result is a smaller and thus more efficient model.
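
The joint scaling can be illustrated with a single coefficient that drives all three dimensions. The default values below (1.2, 1.1, 1.15) are the ones I recall from the EfficientNet paper, so treat them as an assumption; the function name `compound_scale` is hypothetical.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Multipliers for depth, width and resolution at scaling step phi."""
    return {
        "depth": alpha ** phi,       # more layers
        "width": beta ** phi,        # more channels per layer
        "resolution": gamma ** phi,  # larger input images
    }


for phi in range(3):
    scale = compound_scale(phi)
    print(phi, {name: round(value, 3) for name, value in scale.items()})
```

With phi = 0 every multiplier is 1, which corresponds to the baseline network. Larger values of phi grow depth, width and resolution at the same time.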

MobileNets are lightweight CNNs for mobile and embedded applications.

Getting started with CNNs

Personally, I have tried handwritten digit recognition with the famous MNIST dataset. It is a great task to start with.

The internet is full of blog tutorials that walk you through the code.