The Art of Image Recognition in Computer Vision


Machine learning has become a driving force behind many of today's technological advancements, and image recognition is one of its most prominent applications. Image recognition is the analysis of objects within images, most commonly using a class of learning algorithms called Convolutional Neural Networks.

Did you know that image recognition plays a decisive role in the advancement of self-driving cars? Self-driving cars use image recognition to detect vehicles and pedestrians. In an autonomous vehicle this is a two-part process: image classification and image localization. Image classification determines what objects are in the image, such as a vehicle or a person. Image localization assigns a location to each of those objects.

What happens under the hood of image recognition is the subject of this discussion. Image recognition is enabled in models using deep neural networks, and the Convolutional Neural Network (CNN) forms the basis of the technique. These models are trained on a set of images labeled using image annotation techniques. This helps the model recognize unfamiliar visual scenarios. The convolutional neural network performs operations on images in order to classify them. But the real question is: what is a CNN, and how does it work?

What is a CNN?

A Convolutional Neural Network (CNN) is a type of deep neural network used in visual analysis applications. A CNN can be seen as a specialized variant of the multilayer perceptron, the basic building block of deep neural networks. Instead of connecting every neuron to the whole image, each neuron in a CNN looks at only a small region of it, and these regions overlap so that together they cover the entire visual area. Knowing what a CNN is still leaves open the question of how it works. Before moving on to the next section, it is worth knowing how a multilayer perceptron works.

How does a CNN enable image recognition?

Image resolution

A convolutional neural network can have hundreds of layers. Each layer learns to detect different features of the image, such as edges or brightness. The image is convolved with filters at each layer, and the output of each layer is used as the input to the next. CNNs can help you identify and classify images, text, sound, and video.

Like other neural networks, a CNN consists of multiple layers: an input layer, an output layer, and many hidden layers in between.

Image classification with a CNN takes an input image and processes it to identify and classify what it shows. A computer sees an image as an array of pixel values whose size depends on the image resolution, in the form h x w x d (h = height, w = width, d = depth, that is, the number of color channels). For example, an RGB image has d = 3, while a grayscale image has d = 1.
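To make the h x w x d layout concrete, here is a minimal sketch in plain Python. The tiny 2 x 3 image and its pixel values are made up for illustration, and image_shape is a hypothetical helper, not part of any library.

```python
# A tiny RGB image stored as nested lists: rows -> pixels -> channel values.
# All pixel values here are invented purely for illustration.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
]


def image_shape(img):
    """Return (height, width, depth) of a nested-list image."""
    height = len(img)        # number of rows
    width = len(img[0])      # number of pixels per row
    depth = len(img[0][0])   # number of channels per pixel
    return height, width, depth


print(image_shape(image))  # (2, 3, 3)
```

In practice an array library would hold this data, but the nested lists show the same three axes.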

Convolution Layer:

The convolution layer is the first layer. Its purpose is to extract the features embedded in an image using a set of filters.

As stated above, each picture consists of pixels, and each pixel has a numerical value. These values are represented as a matrix (the grey matrix in the figure). Here it is a 5 x 5 matrix whose pixel values are 0s and 1s, the only possible values in a binary (black and white) image. In a grayscale image, values range from 0 to 255. The 3 x 3 matrix (the green matrix in the figure) is the filter matrix. The filter matrix is slid across the image matrix, and at each position the two are combined to give one entry of the output matrix.

The process of combining the 5 x 5 image matrix with the 3 x 3 filter matrix is called convolution. At each position of the filter, the overlapping values of the two matrices are multiplied element-wise, and the products are summed to give a single number. Collecting these numbers over all positions forms a single matrix known as the Convolved Feature, or Feature Map.
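The multiply-and-sum step can be sketched in a few lines of plain Python. The 5 x 5 image and 3 x 3 filter values below are assumptions chosen for illustration (the original figure's values are not reproduced here), and convolve2d is a hypothetical helper rather than a library function.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    At each position, multiply overlapping values element-wise
    and sum the products to get one entry of the feature map.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Assumed binary image and filter, for illustration only.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

for row in convolve2d(image, kernel):
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Note that this sketch does not flip the filter before sliding it, which is how most deep learning libraries implement the operation they call convolution.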

Figure: Convolution in image recognition, producing the feature map

Convolution with different filters performs different operations, for example edge detection, blurring, and sharpening. The example below shows the filters used for several such operations.

Figure: Filter matrices for various features
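As a rough illustration of how the choice of filter changes the result, the sketch below applies three commonly used 3 x 3 kernels to the same image. The kernel values are standard textbook choices assumed here, not values taken from the figure, and the test image is invented.

```python
def convolve2d(image, kernel):
    """Stride-1 convolution without padding (hypothetical helper)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


# Assumed kernels, chosen for illustration.
EDGE = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]   # edge detection
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]    # sharpen
BLUR = [[1 / 9] * 3 for _ in range(3)]             # box blur (mean of window)

# A 5 x 5 image with a vertical edge: dark (0) on the left, bright (10) on the right.
image = [[0, 0, 10, 10, 10] for _ in range(5)]

edges = convolve2d(image, EDGE)
sharpened = convolve2d(image, SHARPEN)
blurred = convolve2d(image, BLUR)

print(edges[0])      # [-30, 30, 0]: non-zero only around the edge
print(sharpened[0])  # [-10, 20, 10]: contrast across the edge is exaggerated
print([round(v, 2) for v in blurred[0]])  # the step is smoothed out
```

The edge kernel responds only where neighbouring pixels differ, which is why the flat bright region on the right produces 0.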

The size of the Feature Map (Convolved Feature) depends on three parameters: stride, depth, and padding.


Stride is the number of pixels by which the filter matrix slides over the input matrix at each step. If the stride is 1, the filter moves one pixel at a time (as in the feature map example above). If the stride is 2, the filter moves two pixels at a time. A larger stride produces smaller feature maps.


Stride of 2 pixels
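The effect of stride on the feature map size can be checked with a small sketch. The 7 x 7 input and 3 x 3 filter of ones are assumptions for illustration, and convolve2d with a stride parameter is a hypothetical helper.

```python
def convolve2d(image, kernel, stride=1):
    """Convolution without padding, moving the filter `stride` pixels per step."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(0, ih - kh + 1, stride):
        row = []
        for c in range(0, iw - kw + 1, stride):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            ))
        feature_map.append(row)
    return feature_map


# Assumed 7 x 7 input and 3 x 3 filter, both filled with ones.
image = [[1] * 7 for _ in range(7)]
kernel = [[1] * 3 for _ in range(3)]

for stride in (1, 2):
    fmap = convolve2d(image, kernel, stride)
    print(f"stride {stride}: {len(fmap)} x {len(fmap[0])} feature map")
# stride 1: 5 x 5 feature map
# stride 2: 3 x 3 feature map
```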


Depth is the number of filters used in the convolution. Each filter detects a different feature and produces its own feature map, so the number of filters is also the depth of the resulting stack of feature maps.


Sometimes the filter does not fit the input image exactly. In such cases, we can pad the image with zeros, known as zero padding, or drop the part of the image where the filter does not fit. If values in the convolved matrix are negative, we can apply a non-linear activation function such as ReLU, which replaces negative values with zero.
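Zero padding, ReLU, and the resulting feature map size can be sketched as follows. The helper names zero_pad, relu, and output_size are hypothetical, and the size formula assumes a square input of side n, a square filter of side f, and the same padding on every border.

```python
def zero_pad(image, pad):
    """Surround the image with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]          # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))      # bottom border
    return padded


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def output_size(n, f, stride=1, pad=0):
    """Side length of the feature map: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1


print(zero_pad([[1, 2], [3, 4]], 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
print(relu([[-3, 2], [0, -1]]))       # [[0, 2], [0, 0]]
print(output_size(5, 3))              # 3: no padding shrinks the map
print(output_size(5, 3, pad=1))       # 5: one ring of zeros keeps the size
```

With a 3 x 3 filter and stride 1, a single ring of zeros is enough for the feature map to keep the same size as the input.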

Pooling Layer:

Pooling is the layer after the convolution layer. It is common practice to pass the output of the convolution layer into a pooling layer, which reduces the size of the data. Pooling can be done in different ways:

  • Max Pooling: The largest element in each window of the feature map is taken.
  • Average Pooling: The average value of each window is calculated.
  • Sum Pooling: All elements in each window are added together.
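The three pooling variants differ only in how each window is reduced to one number, as this sketch shows. It assumes non-overlapping 2 x 2 windows, a common but not universal choice; pool2d and the feature map values are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size x size window to a single value."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size) for j in range(size)
            ]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


# Assumed 4 x 4 feature map, for illustration only.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(pool2d(feature_map, mode="max"))      # [[6, 8], [3, 4]]
print(pool2d(feature_map, mode="average"))  # [[3.25, 5.25], [2.0, 2.0]]
print(pool2d(feature_map, mode="sum"))      # [[13, 21], [8, 8]]
```

In each case the 4 x 4 map shrinks to 2 x 2, which is the size reduction pooling is meant to provide.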

Fully connected Layer:

The fully connected layer comes after the pooling layer. Its main purpose is to combine the extracted features into a final prediction. Here the data is flattened into a single vector and connected as in an ordinary neural network. Take a face as an example: the earlier layers recognize different parts such as the mouth, eyes, and nose, but they have no idea how these parts should be arranged relative to one another. The fully connected layer combines the features from all the previous layers and learns which combinations belong together. This gives the network its predictive power and helps it classify images much as a human would. To produce the final classification, we use functions like sigmoid or softmax.
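The flatten, connect, and classify steps can be sketched with the standard library alone. Everything numeric here is an assumption: the pooled values, the weights, and the biases are invented (a real network would learn the weights during training), and the two class labels simply reuse the vehicle and person example from earlier.

```python
import math


def flatten(feature_map):
    """Turn a 2D feature map into a flat list of values."""
    return [v for row in feature_map for v in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum (plus bias) per output class."""
    return [
        sum(w * x for w, x in zip(class_weights, inputs)) + b
        for class_weights, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Invented values for illustration; real weights come from training.
pooled = [[6, 8], [3, 4]]
weights = [[0.1, 0.2, -0.1, 0.0], [-0.2, 0.1, 0.3, 0.1]]
biases = [0.0, 0.1]
labels = ["vehicle", "person"]

features = flatten(pooled)                 # [6, 8, 3, 4]
scores = dense(features, weights, biases)  # one raw score per class
probs = softmax(scores)
prediction = labels[probs.index(max(probs))]
print(prediction)  # vehicle
```

Softmax suits the case where exactly one class applies; sigmoid is the usual alternative when a single yes or no score is enough.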

Artificial neural networks are computational models inspired by the human brain. They underlie many recent advancements such as voice recognition, image recognition, and robotics. Importantly, object recognition is a key feature of image classification, and its commercial implications are vast. CNNs can be a confusing topic, and the concepts and calculations are hard to grasp on the first go. But with a deeper understanding, you will be able to produce satisfying results.

Drisya Prakash

Drisya Prakash is the public relations executive at Infolks Pvt. Ltd. A writer by day and reader by night. Passionate about social media marketing. Crafts content for blogs and social media.
