Object detection is a technique for identifying the objects present in an image. It has developed rapidly and brought major changes to the field of computer vision. Because it combines object classification with object localization, it is one of the most challenging topics in the domain. In simple words, the goal of object detection is to determine where objects are located in a given image (object localization) and which category each object belongs to (object classification).
There are many techniques for detecting the objects inside an image.
Some popular techniques are listed below.
Faster R-CNN
Faster R-CNN is an object detection algorithm that builds on R-CNN and Fast R-CNN. It uses a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, which makes region proposals much cheaper to compute than in R-CNN and Fast R-CNN. A Region Proposal Network is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. It is trained end-to-end to generate high-quality region proposals, which are then used by Fast R-CNN for detection.
YOLO (You Only Look Once)
YOLO is one of the best-known and most popular object detection algorithms. Pre-trained YOLO models are commonly trained on the COCO dataset, which contains 80 object classes, so we can detect objects of those classes in images just by using a pre-trained model. The base YOLO model processes images in real time at 45 frames per second, while the smaller version of the network, Fast YOLO, processes 155 frames per second while still achieving double the mAP of other real-time detectors.
SSD (Single Shot Detector):
This tutorial explains SSD. Single Shot Detector (SSD) is a method for detecting objects in images using a single deep neural network. The SSD approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. The network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.
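To make the idea of default boxes more concrete, here is a minimal sketch in plain Python that tiles one square feature map with default boxes. The function name default_boxes, the scale value, and the list of aspect ratios are illustrative assumptions for this sketch, not the settings of any particular SSD implementation.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Return default boxes as (cx, cy, w, h), relative to the image size (0..1).

    fmap_size, scale and aspect_ratios are illustrative parameters
    chosen for this sketch.
    """
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Centre of the current feature map cell in relative coordinates.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Width grows and height shrinks with the aspect ratio,
                # so every box keeps the same area (scale ** 2).
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

boxes = default_boxes(fmap_size=3, scale=0.5, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 3 * 3 cells * 3 aspect ratios = 27
```

Every cell of the feature map gets the same small set of box shapes, so the total number of default boxes is fixed by the feature map size and the number of aspect ratios.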
Advantages of SSD:
SSD completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network.
SSD is easy to train and straightforward to integrate into systems that require a detection component.
SSD's accuracy is competitive with methods that use an additional object proposal step, and it is much faster, while providing a unified framework for both training and inference.
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high-quality image classification (truncated before any classification layers), which we will call the base network. We then add auxiliary structure to the network to produce detections with the following key features.
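The non-maximum suppression step mentioned above can be sketched as follows. This is a simple greedy version in plain Python, assuming boxes are given as (x1, y1, x2, y2) corners. The function names iou and nms and the 0.5 threshold are choices made for this sketch, and real implementations usually run the step per class and in vectorized form.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the boxes that are kept."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps heavily with the first and has a lower score, so it is suppressed. The third box does not overlap with anything and is kept.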
Suppose we have to detect two classes: persons and cars.
Key challenge: the number of outputs is unknown, because each image contains a variable number of objects.
The network provides a fixed number of bounding boxes and classifications.
Each bounding box is classified as an object or not an object.
By keeping only the boxes classified as objects, we obtain a variable number of boxes and classifications.
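The steps above can be sketched in a few lines of Python. The scores below are made-up values for illustration, and the function name select_detections, the background class at index 0, and the 0.5 threshold are assumptions of this sketch rather than part of any specific SSD implementation.

```python
def select_detections(predictions, score_threshold=0.5):
    """Keep only the boxes whose best class is a confident non-background class.

    predictions: list of (box, class_scores); class index 0 means background.
    """
    detections = []
    for box, class_scores in predictions:
        best = max(range(len(class_scores)), key=lambda c: class_scores[c])
        if best != 0 and class_scores[best] >= score_threshold:
            detections.append((box, best, class_scores[best]))
    return detections

CLASSES = ["background", "person", "car"]

# A fixed number of predictions (here 4), as produced by the network.
predictions = [
    ((0.1, 0.1, 0.3, 0.6), [0.05, 0.90, 0.05]),  # confident person
    ((0.5, 0.5, 0.9, 0.8), [0.10, 0.10, 0.80]),  # confident car
    ((0.0, 0.0, 0.2, 0.2), [0.95, 0.03, 0.02]),  # background
    ((0.4, 0.1, 0.6, 0.3), [0.45, 0.30, 0.25]),  # background scores highest
]

detections = select_detections(predictions)
for box, cls, score in detections:
    print(CLASSES[cls], score, box)
```

Four fixed outputs go in, and only the two boxes that contain an object come out, which is how a fixed-size output can represent a variable number of objects.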
SSD's architecture builds on the VGG-16 architecture but discards the fully connected layers.
VGG-16 was used as the base network because of its:
strong performance in high-quality image classification tasks
popularity for problems where transfer learning helps improve results, including on small datasets
Instead of the original VGG fully connected layers, a set of auxiliary convolutional layers was added. These layers enable feature extraction at multiple scales and progressively decrease the size of the input to each subsequent layer.
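The effect of predicting from several feature maps can be illustrated by counting the default boxes. The feature map sizes and the number of boxes per location below are the values reported for the 300x300 input configuration (SSD300) in the SSD paper. The variable and function names are illustrative.

```python
# (layer name, feature map side length, default boxes per location)
SSD300_FEATURE_MAPS = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]

def count_default_boxes(feature_maps):
    """Total number of default boxes over all feature maps."""
    return sum(size * size * per_loc for _, size, per_loc in feature_maps)

for name, size, per_loc in SSD300_FEATURE_MAPS:
    print(f"{name:9s} {size:2d}x{size:<2d} -> {size * size * per_loc} boxes")

print("total:", count_default_boxes(SSD300_FEATURE_MAPS))  # total: 8732
```

The large early feature maps contribute most of the boxes and cover small objects, while the small late feature maps contribute only a few boxes that cover large objects.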
In the vehicle detection problem above, each algorithm was tested and its performance was recorded.
Here SSD drew bounding boxes around more objects than the other algorithms and also gave accurate results.
The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO and for the region proposal stage of Faster R-CNN and MultiBox. Once this assignment is determined, the loss function and backpropagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies.
Matching strategy
During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box, we are selecting from default boxes that vary over the location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best Jaccard overlap (as in MultiBox). Unlike MultiBox, we then match default boxes to any ground truth with Jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.
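The matching strategy can be sketched in plain Python as below. This is a simplified illustration, assuming boxes in (x1, y1, x2, y2) form. The function names jaccard and match_default_boxes are hypothetical, and the order in which the two matching rules are applied is a choice made for this sketch so that the best match of each ground truth box is never overwritten.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_default_boxes(gt_boxes, default_boxes, threshold=0.5):
    """For each default box, return the index of its ground truth box or None."""
    matches = [None] * len(default_boxes)
    if not gt_boxes:
        return matches
    overlaps = [[jaccard(g, d) for d in default_boxes] for g in gt_boxes]
    # Rule 2: match a default box to the ground truth it overlaps most,
    # if that overlap is higher than the threshold.
    for d in range(len(default_boxes)):
        best_g = max(range(len(gt_boxes)), key=lambda g: overlaps[g][d])
        if overlaps[best_g][d] > threshold:
            matches[d] = best_g
    # Rule 1: every ground truth box gets its best default box,
    # even if the overlap is below the threshold.
    for g in range(len(gt_boxes)):
        best_d = max(range(len(default_boxes)), key=lambda d: overlaps[g][d])
        matches[best_d] = g
    return matches

gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
default_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (18, 18, 26, 26), (50, 50, 60, 60)]
print(match_default_boxes(gt_boxes, default_boxes))  # [0, 0, 1, None]
```

Two default boxes are matched to the first ground truth box because both overlap it by more than 0.5. The second ground truth box only has a weak overlap (about 0.28) with the third default box, but it is still matched because that is its best default box. Unmatched default boxes are treated as negatives during training.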
SSD can be viewed as an improvement on the original YOLO: in the SSD paper's comparison it runs faster and also achieves better accuracy.
YOLO predicts boxes from a fixed grid of cells, whereas SSD uses multiple default boxes with different aspect ratios at each location for better accuracy.
SSD adds extra convolutional layers after the base VGG-16 network for object detection. These layers produce feature maps at different scales, so SSD can better detect objects of different sizes.