A comprehensive guide to CNNs: architecture, forward pass, backpropagation, and practical implementations
Convolutional Neural Networks (CNNs) represent a revolutionary approach in deep learning that has transformed computer vision and image processing. Inspired by the organization of the animal visual cortex, CNNs have become the cornerstone of many modern applications including image recognition, object detection, facial recognition, and medical image analysis.
The power of CNNs lies in their ability to automatically learn hierarchical patterns in data, starting from simple features like edges and textures to more complex structures like shapes and objects. Unlike traditional neural networks, CNNs maintain spatial relationships in data, making them particularly effective for processing grid-like structures such as images.
Convolutional Neural Networks (CNNs) are a class of deep learning models specifically designed for processing grid-like data, such as images. At their core, CNNs are stacks of convolutional operations interspersed with non-linear activation functions and pooling layers. These layers work together to extract hierarchical features from the input data, enabling the network to learn spatial patterns efficiently.
Before we explain the key components of a CNN, I believe it's important to understand a few core concepts that are often overlooked in courses and other blogs.
Key Concepts
Instead of learning a unique weight for each input pixel (as in an MLP), CNNs reuse the same set of weights (a kernel or filter) across the entire image. This means that a small kernel slides over the image and applies the same weights at each position.
The same feature detector (kernel) is applied everywhere, meaning the same set of parameters detects the same feature anywhere in the image.
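To see why weight sharing matters, here is a rough back-of-the-envelope comparison (the layer sizes below are hypothetical, chosen purely for illustration):

```python
from torch import nn

# A single 3x3 kernel over an RGB image: 3*3*3 weights + 1 bias = 28 parameters,
# reused at every spatial position
conv = nn.Conv2d(3, 1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 28

# A fully connected layer mapping a 224x224 RGB image to a 224x224 output
# would need a separate weight for every input-output pair (too big to build):
in_features, out_features = 224 * 224 * 3, 224 * 224
print(in_features * out_features + out_features)  # ~7.5 billion parameters
```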
We will get a better understanding of these operations in the next section, when we talk about convolutions.
The convolution operation is the backbone of CNNs. It involves sliding a small filter (kernel) over the input image and computing the dot product between the filter and the input region.
Let's see how this works with an example!
The output matrix is obtained by sliding the kernel over the input one pixel at a time (stride = 1) and computing the dot product between the filter and the corresponding input region. Let's see how this is done.
In the first step, the input region is [[1, 6], [5, 3]] and the kernel is [[1, 2], [-1, 0]], so the dot product is 1 * 1 + 6 * 2 + 5 * (-1) + 3 * 0 = 8. This process repeats for each region of the input image.

Key Parameters:

Kernel size: the dimensions of the filter (e.g., 2x2, 3x3).
Stride: how many pixels the kernel moves at each step.
Padding: extra pixels (usually zeros) added around the input border to control the output size.
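Going back to the worked example, here is a minimal PyTorch sketch that reproduces the first step (note that `F.conv2d` computes exactly this sliding dot product):

```python
import torch
import torch.nn.functional as F

# The 2x2 input region and kernel from the example, shaped as
# (batch, channels, height, width)
region = torch.tensor([[1., 6.], [5., 3.]]).reshape(1, 1, 2, 2)
kernel = torch.tensor([[1., 2.], [-1., 0.]]).reshape(1, 1, 2, 2)

print(F.conv2d(region, kernel))  # tensor([[[[8.]]]])
```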
After the convolution operation, a non-linear activation function is applied to introduce non-linearity into the model. The most common choice is ReLU (Rectified Linear Unit), defined as ReLU(x) = max(0, x). Non-linearities create non-linear decision boundaries: they ensure the output cannot be written as a linear combination of the inputs. Without them, a deep CNN would be equivalent to a single convolutional layer, since a composition of linear operations is itself linear, and it would not perform nearly as well. As I learned while reading the AlexNet paper, ReLU also trains faster than saturating non-linearities such as tanh or sigmoid.
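A quick sanity check of ReLU's behavior on a few sample values:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])
```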
Pooling layers reduce the spatial dimensions of the feature maps, making the network more computationally efficient and invariant to small translations. They operate on small regions of the feature map (e.g., 2x2 or 3x3 windows) and apply a pooling operation to summarize the information in that region.
Types of Pooling:
Here the pooling region is 2x2 and we apply the operation with a stride of 2: max pooling takes the maximum of each region, while average pooling takes the average.
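Here is a small sketch of both operations on a hypothetical 4x4 feature map (values chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

fmap = torch.tensor([[1., 3., 2., 4.],
                     [5., 6., 1., 2.],
                     [7., 2., 9., 0.],
                     [1., 4., 3., 8.]]).reshape(1, 1, 4, 4)

# 2x2 window, stride 2: each output value summarizes one non-overlapping region
print(F.max_pool2d(fmap, 2, stride=2))  # [[6., 4.], [7., 9.]]
print(F.avg_pool2d(fmap, 2, stride=2))  # [[3.75, 2.25], [3.5, 5.0]]
```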
After several convolutional and pooling layers, the output is flattened and passed through one or more fully connected layers. These layers combine the extracted features to produce the final output (e.g., class probabilities or a continuous value).
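As a minimal sketch of this hand-off (the 256x6x6 feature map below matches AlexNet's last pooling output; the 10-class head is a hypothetical example):

```python
import torch
from torch import nn

features = torch.randn(1, 256, 6, 6)   # batch of one feature map

flat = nn.Flatten()(features)          # -> shape (1, 9216)
logits = nn.Linear(9216, 10)(flat)     # -> shape (1, 10)
print(flat.shape, logits.shape)
```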
Let’s walk through an example to understand how the dimensions of an input image change as it passes through a CNN. Consider AlexNet, a classic CNN architecture:
AlexNet Architecture:
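For any convolution or pooling layer, the output size follows floor((I - K + 2P) / S) + 1, where I is the input size, K the kernel size, P the padding, and S the stride. The short script below traces a 227x227 RGB image through AlexNet's feature extractor, layer by layer (the layers match the PyTorch implementation shown later in this post; ReLU is omitted since it does not change the shape):

```python
import torch
from torch import nn

layers = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4),     # -> (96, 55, 55)
    nn.MaxPool2d(3, stride=2),          # -> (96, 27, 27)
    nn.Conv2d(96, 256, 5, padding=2),   # -> (256, 27, 27)
    nn.MaxPool2d(3, stride=2),          # -> (256, 13, 13)
    nn.Conv2d(256, 384, 3, padding=1),  # -> (384, 13, 13)
    nn.Conv2d(384, 384, 3, padding=1),  # -> (384, 13, 13)
    nn.Conv2d(384, 256, 3, padding=1),  # -> (256, 13, 13)
    nn.MaxPool2d(3, stride=2),          # -> (256, 6, 6) = 9216 values
)

x = torch.randn(1, 3, 227, 227)
for layer in layers:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape[1:]))
```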
Now let's train a small CNN by hand to make sure we understand everything properly.
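To keep the mechanics in view, here is what a single training step looks like in code: forward pass, loss, backpropagation, weight update. This is a minimal sketch with toy data and hypothetical layer sizes; by hand, you compute these same gradients yourself.

```python
import torch
from torch import nn

# A tiny CNN for 8x8 single-channel inputs (hypothetical toy setup)
model = nn.Sequential(
    nn.Conv2d(1, 4, 3, padding=1),  # -> (4, 8, 8)
    nn.ReLU(),
    nn.MaxPool2d(2),                # -> (4, 4, 4)
    nn.Flatten(),
    nn.Linear(4 * 4 * 4, 2),        # two output classes
)

x = torch.randn(16, 1, 8, 8)            # 16 random "images"
y = torch.randint(0, 2, (16,))          # random binary labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = nn.CrossEntropyLoss()(model(x), y)  # forward pass + loss
loss.backward()                            # backpropagation: compute gradients
optimizer.step()                           # gradient descent: update weights
print(loss.item())
```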
So far we have talked about the theory of CNNs and trained one by hand; the next step is to code one. For that we will use PyTorch. The architecture we will implement is AlexNet (already described in the previous sections).
We will stop at the architecture level. To see all the code required for training, you can check out my repo.
```python
import torch
from torch import nn

class AlexNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor: five conv layers with ReLU,
        # interleaved with three overlapping max-pooling layers
        self.feature_extraction = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4),
            nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, 5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 384, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 256, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        # Classifier head: flatten the 256x6x6 feature map (9216 values),
        # then two fully connected layers with dropout, ending in 200 classes
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(9216, 4096)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(4096, 200)

    def forward(self, x):
        x = self.feature_extraction(x)
        x = self.flatten(x)
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)
```
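To check that all the shapes line up, we can run a dummy forward pass:

```python
model = AlexNet()
out = model(torch.randn(1, 3, 227, 227))  # one 227x227 RGB image
print(out.shape)  # torch.Size([1, 200])
```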
I won’t explain the code in detail because I believe it’s pretty straightforward. Each element is named according to its purpose, making it easy to understand.
That's all from me! I hope you enjoyed the blog and learned something new. If you have any questions, feel free to reach out to me on Twitter / X