Components of a CNN and Common CNN Architectures

Naomiehalioua
Nov 2, 2021
A CNN sequence to classify handwritten digits (source: towarddatascience.com)

I/ Components of a CNN

There are four types of layers in a convolutional neural network: the Convolution Layer, the Pooling Layer, the ReLU Correction Layer and the Fully-Connected Layer. In this article, I will explain how these different layers work.

Convolutional Layer

The convolution layer is the key component of convolutional neural networks, and it always constitutes at least their first layer.

Its goal is to identify the presence of a set of features in the images received as input. To do this, we perform convolution filtering: the principle is to “drag” a window representing the feature across the image and to compute the convolution product between the feature and each portion of the scanned image. A feature is then seen as a filter: the two terms are equivalent in this context.

Example of a convolutional layer with a 3x3 kernel

The convolution layer receives several images as input, and computes the convolution of each of them with each filter. The filters correspond exactly to the features we want to find in the images.

We obtain for each pair (image, filter) an activation map, or feature map, which tells us where the features are located in the image: the higher the value, the more the corresponding place in the image looks like the feature.

The features are not pre-defined according to a particular formalism, but learned by the network during the training phase! The kernels of the filters are the weights of the convolution layer. They are initialized and then updated by backpropagation of the gradient.

This is the strength of convolutional neural networks: they are able to determine by themselves the discriminating elements of an image.
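To make this concrete, here is a minimal PyTorch sketch; the single input channel, the eight 3x3 filters and the 28x28 image size are illustrative assumptions, not values from the article. One convolution layer turns a batch of images into one feature map per filter.

```python
import torch
import torch.nn as nn

# A minimal sketch: one convolution layer applied to a batch of
# grayscale 28x28 images (e.g. handwritten digits). The 8 filters
# (kernels) are the trainable weights of the layer; each one produces
# its own feature map.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

images = torch.randn(4, 1, 28, 28)   # batch of 4 single-channel images
feature_maps = conv(images)

print(feature_maps.shape)  # torch.Size([4, 8, 28, 28]): one 28x28 map per filter
```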

Pooling Layer

This type of layer is often placed between two convolution layers: it receives several feature maps as input and applies the pooling operation to each of them.

The pooling operation consists of reducing the size of the images while preserving their important characteristics. This means that we get the same number of feature maps at the output as at the input, but each one is much smaller. The idea is to improve the efficiency of the network and to avoid overfitting. The most widely used pooling operations are average pooling and max pooling. For each block, or “pool”, the operation simply involves computing the max (or average) value, like this:

MaxPooling operations
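A small PyTorch sketch of 2x2 max pooling, assuming feature maps shaped like the output of the previous example: the number of maps is unchanged, only their spatial size shrinks.

```python
import torch
import torch.nn as nn

# 2x2 max pooling: each feature map keeps its depth, but its spatial
# size is halved, keeping only the strongest response in each 2x2 block.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

feature_maps = torch.randn(4, 8, 28, 28)   # output of a previous conv layer
pooled = pool(feature_maps)

print(pooled.shape)  # torch.Size([4, 8, 14, 14]): same number of maps, smaller size
```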

ReLU correction layer

ReLU (Rectified Linear Units) refers to the real nonlinear function defined by ReLU(x)=max(0,x).

ReLU function

The ReLU correction layer therefore replaces all negative values received as inputs with zeros. It plays the role of activation function.
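A tiny check of this behaviour with PyTorch's built-in ReLU:

```python
import torch

# ReLU(x) = max(0, x): negative activations become 0, positives pass through.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])
```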

Fully-Connected Layer

The fully-connected layer is always the last layer of a neural network, convolutional or not.

The last fully-connected layer is used to classify the input image of the network: it returns a vector of size N, where N is the number of classes in our image classification problem. Each element of the vector indicates the probability of the input image belonging to a class.

To compute the probabilities, the fully-connected layer multiplies each input element by a weight, sums them up, then applies an activation function (logistic if N=2, softmax if N>2). This treatment amounts to multiplying the input vector by the matrix containing the weights. The fact that each input value is connected to all output values explains the term fully-connected.

The convolutional neural network learns the values of the weights in the same way that it learns the filters of the convolution layer: during the training phase, by backpropagation of the gradient.
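A sketch of this last step, assuming an N=10 class problem (e.g. digit classification); the feature-map dimensions are illustrative assumptions carried over from the pooling example above.

```python
import torch
import torch.nn as nn

# The pooled feature maps are flattened into a vector, multiplied by the
# weight matrix of the fully-connected layer, and softmax turns the
# resulting scores into class probabilities.
N = 10
fc = nn.Linear(in_features=8 * 14 * 14, out_features=N)

features = torch.randn(4, 8, 14, 14)          # output of the last pooling layer
logits = fc(features.flatten(start_dim=1))    # (4, 10) scores, one per class
probs = torch.softmax(logits, dim=1)          # each row sums to 1

print(probs.shape)       # torch.Size([4, 10])
print(probs[0].sum())    # tensor(1.)
```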

II/ Famous Architectures

There are several CNN architectures that have played a key role in building the algorithms that power AI. Some of these are listed and explained below:

LeNet-5 by Yann LeCun (1998)

LeNet architecture

This architecture has become the standard template: stacking convolution layers with an activation function and pooling layers, and finishing the network with one or more fully-connected layers.
LeNet-5 is one of the simplest architectures and has about 60,000 parameters. It has two convolutional layers and three fully-connected layers (hence the number “5”). The average pooling layer, as we know it today, was called a subsampling layer and had trainable weights, which is not current CNN design practice.
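As a rough sketch, here is a LeNet-5-style network in PyTorch for 32x32 grayscale inputs; ReLU and max pooling are substituted for the original tanh activations and trainable subsampling layers, so this is an approximation rather than the historical model.

```python
import torch
import torch.nn as nn

# LeNet-5-style stack: 2 convolutional layers followed by 3 fully-connected
# layers, for a 1x32x32 input (modern ReLU/max-pooling substituted for the
# original tanh and trainable subsampling).
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

print(lenet5(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```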

AlexNet-8 (2012)

AlexNet architecture

The network has an architecture very similar to LeNet by Yann LeCun et al., but is deeper, with 60M parameters. AlexNet has 8 layers: 5 convolutional and 3 fully-connected. Its authors were the first to implement Rectified Linear Units (ReLUs) as activation functions after every convolutional and fully-connected layer. They also introduced dropout layers, used to prevent the neural network from overfitting.
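For illustration, here is a fragment in the spirit of AlexNet's classifier head, not the full network; the dimensions follow the common torchvision implementation and are an assumption here. It shows the pattern of dropout before the dense layers and a ReLU after each of them.

```python
import torch.nn as nn

# Illustrative fragment of an AlexNet-style classifier head:
# dropout to reduce overfitting, and a ReLU after each dense layer.
classifier_head = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),   # 1000 ImageNet classes
)
```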

VGGNet -16 (2014)

The architecture of a VGGNet CNN

The building components of VGGNet are the same as in LeNet and AlexNet; the difference is an even deeper network with more convolutional, pooling, and dense layers. Other than that, no new components are introduced here. VGGNet, also known as VGG-16, consists of 16 weight layers: 13 convolutional layers and 3 fully-connected layers. Its uniform architecture makes it very appealing in the deep learning community because it is very easy to understand. However, VGGNet has 138 million parameters, which is a bit challenging to handle.
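If you use torchvision (an assumption on my part, not something this article relies on), VGG-16 is available off the shelf, and counting its parameters shows the ~138M figure:

```python
import torchvision.models as models

# Recent torchvision versions use the weights= argument
# (older versions use pretrained=False instead).
vgg16 = models.vgg16(weights=None)   # random weights; no download needed
n_params = sum(p.numel() for p in vgg16.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 138M
```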

ResNet-26 (2015)

ResNet architecture

ResNet is one of the early adopters of batch normalisation and has 26M parameters. It popularised skip connections, although it was not the first architecture to use them. This design allows even deeper CNNs (up to 152 layers) to be trained without compromising the model’s generalisation power.
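A minimal sketch of a residual block with a skip connection and batch normalisation; the channel count and the absence of a projection shortcut are simplifying assumptions.

```python
import torch
import torch.nn as nn

# A residual block: the input is added back to the output of two
# convolution + batch-norm stages, so the block only has to learn a
# residual correction on top of the identity. Channel count is illustrative.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: add the input back

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```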
