{"id":1613,"date":"2019-04-03T02:16:54","date_gmt":"2019-04-03T02:16:54","guid":{"rendered":"http:\/\/kusuaks7\/?p=1218"},"modified":"2023-06-29T09:56:08","modified_gmt":"2023-06-29T09:56:08","slug":"deep-learning-for-object-detection-a-comprehensive-review","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/deep-learning-for-object-detection-a-comprehensive-review\/","title":{"rendered":"Deep Learning for Object Detection: A Comprehensive Review"},"content":{"rendered":"<p id=\"93d8\">With the rise of autonomous vehicles, smart video surveillance, facial detection and various people counting applications, fast and accurate object detection systems are rising in demand. These systems involve not only recognizing and classifying every object in an image but\u00a0<em>localizing<\/em>\u00a0each one by drawing the appropriate bounding box around it. This makes object detection a significantly harder task than its traditional computer vision predecessor, image classification.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 700px; height: 355px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1200\/1*ftTEVgsx0jfvUSFB6X5mQg.jpeg\" \/><\/p>\n<p id=\"4e21\">Fortunately, however, the most successful approaches to object detection are currently extensions of image classification models. A few months ago, Google released a\u00a0<a href=\"https:\/\/research.googleblog.com\/2017\/06\/supercharge-your-computer-vision-models.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/research.googleblog.com\/2017\/06\/supercharge-your-computer-vision-models.html\" data->new object detection API<\/a>\u00a0for Tensorflow. 
With this release came the pre-built architectures and weights for a\u00a0few specific models:<\/p>\n<ul>\n<li id=\"1664\"><a href=\"https:\/\/arxiv.org\/abs\/1512.02325\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/abs\/1512.02325\" data->Single Shot Multibox Detector<\/a>\u00a0(SSD) with\u00a0<a href=\"http:\/\/research.googleblog.com\/2017\/06\/mobilenets-open-source-models-for.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/research.googleblog.com\/2017\/06\/mobilenets-open-source-models-for.html\" data->MobileNets<\/a><\/li>\n<li id=\"f41f\">SSD with\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1512.00567\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/abs\/1512.00567\" data->Inception V2<\/a><\/li>\n<li id=\"97dc\"><a href=\"https:\/\/arxiv.org\/abs\/1605.06409\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/abs\/1605.06409\" data->Region-Based Fully Convolutional Networks<\/a>\u00a0(R-FCN) with\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1512.03385\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/abs\/1512.03385\" data->Resnet 101<\/a><\/li>\n<li id=\"4751\"><a href=\"https:\/\/arxiv.org\/abs\/1506.01497\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/abs\/1506.01497\" data->Faster RCNN<\/a>\u00a0with Resnet 101<\/li>\n<li id=\"243f\"><a href=\"https:\/\/arxiv.org\/abs\/1506.01497\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/abs\/1506.01497\" data->Faster RCNN<\/a>\u00a0with\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1602.07261\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/abs\/1602.07261\" data->Inception Resnet v2<\/a><\/li>\n<\/ul>\n<p id=\"6842\">In my\u00a0<a href=\"https:\/\/medium.com\/towards-data-science\/an-intuitive-guide-to-deep-network-architectures-65fdc477db41\" target=\"_blank\" rel=\"noopener 
noreferrer\" data-href=\"https:\/\/medium.com\/towards-data-science\/an-intuitive-guide-to-deep-network-architectures-65fdc477db41\" data->last blog post<\/a>, I covered the intuition behind the three base network architectures listed above: MobileNets, Inception, and ResNet. This time around, I want to do the same for Tensorflow\u2019s object detection models: Faster R-CNN, R-FCN, and SSD. By the end of this post, we will hopefully have gained an understanding of how deep learning is applied to object detection, and how these object detection models both inspire and diverge from one another.<\/p>\n<h3 id=\"59c3\">Faster R-CNN<\/h3>\n<p id=\"02ac\">Faster R-CNN is now a canonical model for deep learning-based object detection. It helped inspire many detection and segmentation models that came after it, including the two others we\u2019re going to examine today. Unfortunately, we can\u2019t really begin to understand Faster R-CNN without understanding its own predecessors, R-CNN and Fast R-CNN, so let\u2019s take a quick dive into its ancestry.<\/p>\n<h4 id=\"ec2d\">R-CNN<\/h4>\n<p id=\"9cac\">R-CNN is the grand-daddy of Faster R-CNN. 
In other words, R-CNN\u00a0<em>really<\/em> kicked things off.<\/p>\n<p id=\"1cfe\">R-CNN, or\u00a0<strong>R<\/strong>egion-based\u00a0<strong>C<\/strong>onvolutional\u00a0<strong>N<\/strong>eural\u00a0<strong>N<\/strong>etwork, consisted of 3 simple steps:<\/p>\n<ol>\n<li id=\"c99c\">Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000\u00a0<strong>region proposals<\/strong><\/li>\n<li id=\"4d31\">Run a convolutional neural net (<strong>CNN<\/strong>) on top of each of these region proposals<\/li>\n<li id=\"2834\">Take the output of each\u00a0<strong>CNN<\/strong>\u00a0and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.<\/li>\n<\/ol>\n<p id=\"b181\">These 3 steps are illustrated in the image below:<\/p>\n<figure id=\"1d0d\"><canvas width=\"75\" height=\"47\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 445px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*D2sFqL329qKKx4Tvl31IhQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*D2sFqL329qKKx4Tvl31IhQ.png\" \/><\/figure>\n<p id=\"7072\">In other words, we first propose regions, then extract features, and then classify those regions based on their features. In essence, we have turned object detection into an image classification problem. R-CNN was very intuitive, but very slow.<\/p>\n<h4 id=\"c083\">Fast R-CNN<\/h4>\n<p id=\"6b5b\">R-CNN\u2019s immediate descendant was Fast R-CNN. 
Fast R-CNN resembled the original in many ways, but improved on its detection speed through two main augmentations:<\/p>\n<ol>\n<li id=\"2426\">Performing feature extraction over the image\u00a0<strong>before<\/strong>\u00a0proposing regions, thus only running one CNN over the entire image instead of 2000 CNN\u2019s over 2000 overlapping regions<\/li>\n<li id=\"b7cd\">Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model<\/li>\n<\/ol>\n<p id=\"708d\">The new model looked something like this:<\/p>\n<figure id=\"1453\"><canvas width=\"75\" height=\"47\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 446px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*iWyUwIPO-5kA2ECAfaaPSg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*iWyUwIPO-5kA2ECAfaaPSg.png\" \/><\/figure>\n<p id=\"8f77\">As we can see from the image, we are now generating region proposals based on the last feature map of the network, not from the original image itself. As a result, we can train just\u00a0<strong>one<\/strong>\u00a0CNN for the entire image.<\/p>\n<p id=\"8958\">In addition, instead of training many different SVM\u2019s to classify each object class, there is a single softmax layer that outputs the class probabilities directly. Now we only have one neural net to train, as opposed to one neural net and many SVM\u2019s.<\/p>\n<p id=\"4167\">Fast R-CNN performed much better in terms of speed. There was just one big bottleneck remaining: the selective search algorithm for generating region proposals.<\/p>\n<h4 id=\"afa8\">Faster R-CNN<\/h4>\n<p id=\"f50f\">At this point, we\u2019re back to our original target: Faster R-CNN. The main insight of Faster R-CNN was to replace the slow selective search algorithm with a fast neural net. 
Specifically, it introduced the\u00a0<strong>region proposal network<\/strong>\u00a0(RPN).<\/p>\n<p id=\"5ec8\">Here\u2019s how the RPN worked:<\/p>\n<ul>\n<li id=\"249a\">At the last layer of an initial CNN, a 3&#215;3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)<\/li>\n<li id=\"34c8\">For each sliding-window location, it generates\u00a0<em>multiple<\/em>\u00a0possible regions based on\u00a0<em>k<\/em>\u00a0fixed-ratio\u00a0<strong>anchor boxes\u00a0<\/strong>(default bounding boxes)<\/li>\n<li id=\"e041\">Each region proposal consists of a) an \u201cobjectness\u201d score for that region and b) 4 coordinates representing the bounding box of the region<\/li>\n<\/ul>\n<p id=\"c66c\">In other words, we look at each location in our last feature map and consider\u00a0<em>k<\/em>\u00a0different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:<\/p>\n<figure id=\"13af\"><canvas width=\"75\" height=\"42\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 413px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*7heX-no7cdqllky-GwGBfQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*7heX-no7cdqllky-GwGBfQ.png\" \/><\/figure>\n<p id=\"3b78\">The 2<em>k<\/em>\u00a0scores are softmax probabilities, two per anchor box, of each of the\u00a0<em>k<\/em>\u00a0bounding boxes being an \u201cobject\u201d or not. Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. 
If an anchor box has an \u201cobjectness\u201d score above a certain threshold, that box\u2019s coordinates get passed forward as a region proposal.<\/p>\n<p id=\"ae31\">Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense,\u00a0<strong>Faster R-CNN = RPN + Fast R-CNN.<\/strong><\/p>\n<figure id=\"c2d9\"><canvas width=\"75\" height=\"65\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 628px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*LHk_CCzzfP9mzw280kG70w.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*LHk_CCzzfP9mzw280kG70w.png\" \/><\/figure>\n<p id=\"4dc4\">Altogether, Faster R-CNN achieved much better speeds and a state-of-the-art accuracy. It is worth noting that although future models did a lot to increase detection speeds, few models managed to outperform Faster R-CNN by a significant margin. In other words, Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. Case in point, Tensorflow\u2019s Faster R-CNN with Inception ResNet is their\u00a0<a href=\"https:\/\/github.com\/tensorflow\/models\/blob\/master\/object_detection\/g3doc\/detection_model_zoo.md\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/tensorflow\/models\/blob\/master\/object_detection\/g3doc\/detection_model_zoo.md\" data->slowest but most accurate model<\/a>.<\/p>\n<p id=\"85e4\">At the end of the day, Faster R-CNN may look complicated, but its core design is the same as the original R-CNN:\u00a0<strong>hypothesize object regions and then classify them<\/strong>. 
This is now the predominant pipeline for many object detection models, including our next one.<\/p>\n<h3 id=\"4876\">R-FCN<\/h3>\n<p id=\"976d\">Remember how Fast R-CNN improved on the original\u2019s detection speed by sharing a single CNN computation across all region proposals? That kind of thinking was also the motivation behind R-FCN:\u00a0<em>increase speed by maximizing shared computation.<\/em><\/p>\n<p id=\"51a5\">R-FCN, or\u00a0<strong>R<\/strong>egion-based\u00a0<strong>F<\/strong>ully\u00a0<strong>C<\/strong>onvolutional\u00a0<strong>N<\/strong>et, shares 100% of the computations across every single output. Being\u00a0<a href=\"https:\/\/leonardoaraujosantos.gitbooks.io\/artificial-inteligence\/content\/image_segmentation.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/leonardoaraujosantos.gitbooks.io\/artificial-inteligence\/content\/image_segmentation.html\" data->fully convolutional<\/a>, it ran into a unique problem in model design.<\/p>\n<p id=\"c742\">On the one hand, when performing classification of an object, we want to learn\u00a0<em>location invariance<\/em>\u00a0in a model: regardless of where the cat appears in the image, we want to classify it as a cat. On the other hand, when performing detection of the object, we want to learn\u00a0<em>location variance<\/em>: if the cat is in the top left-hand corner, we want to draw a box in the top left-hand corner. So if we\u2019re trying to share convolutional computations across 100% of the net, how do we compromise between location invariance and location variance?<\/p>\n<p id=\"3521\">R-FCN\u2019s solution:\u00a0<strong>position-sensitive score maps<\/strong>.<\/p>\n<p id=\"dd73\">Each position-sensitive score map represents\u00a0<em>one relative position<\/em>\u00a0of\u00a0<em>one object class<\/em>. For example, one score map might activate wherever it detects the\u00a0<em>top-right<\/em>\u00a0of a\u00a0<em>cat<\/em>. 
Another score map might activate where it sees the\u00a0<em>bottom-left<\/em> of a\u00a0<em>car<\/em>. You get the point. Essentially, these score maps are\u00a0<strong>convolutional feature maps that have been trained to recognize certain parts of each object<\/strong>.<\/p>\n<p id=\"11aa\">Now, R-FCN works as follows:<\/p>\n<ol>\n<li id=\"f97c\">Run a CNN (in this case, ResNet) over the input image<\/li>\n<li id=\"e07f\">Add a fully convolutional layer to generate a\u00a0<strong>score bank<\/strong>\u00a0of the aforementioned \u201cposition-sensitive score maps.\u201d There should be k\u00b2(C+1) score maps, with k\u00b2 representing the number of relative positions into which an object is divided (e.g. 3\u00b2 for a 3 by 3 grid) and C+1 representing the number of classes plus the background.<\/li>\n<li id=\"07ee\">Run a fully convolutional region proposal network (RPN) to generate regions of interest (RoI\u2019s)<\/li>\n<li id=\"af34\">For each RoI, divide it into the same k\u00b2 \u201cbins\u201d or subregions as the score maps<\/li>\n<li id=\"5216\">For each bin, check the score bank to see if that bin matches the corresponding position of some object. For example, if I\u2019m on the \u201cupper-left\u201d bin, I will grab the score maps that correspond to the \u201cupper-left\u201d corner of an object and average those values in the RoI region. 
This process is repeated for each class.<\/li>\n<li id=\"3891\">Once each of the k\u00b2 bins has an \u201cobject match\u201d value for each class, average the bins to get a single score per class.<\/li>\n<li id=\"c4e4\">Classify the RoI with a softmax over the resulting (C+1)-dimensional vector<\/li>\n<\/ol>\n<p id=\"aed3\">Altogether, R-FCN looks something like this, with an RPN generating the RoI\u2019s:<\/p>\n<figure id=\"628b\"><canvas width=\"75\" height=\"32\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 309px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*cHEvY3E2HW65AF-mPeMwOg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*cHEvY3E2HW65AF-mPeMwOg.png\" \/><\/figure>\n<p id=\"2a28\">Even with the explanation and the image, you might still be a little confused about how this model works. Honestly, R-FCN is much easier to understand when you can visualize what it\u2019s doing. Here is one such example of an R-FCN in practice, detecting a baby:<\/p>\n<figure id=\"8a91\"><canvas width=\"75\" height=\"72\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 690px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Q20DdanzQbvBjg4DLvJkGg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*Q20DdanzQbvBjg4DLvJkGg.png\" \/><\/figure>\n<p id=\"a380\">Simply put, R-FCN considers each region proposal, divides it up into sub-regions, and iterates over the sub-regions asking: \u201cdoes this look like the top-left of a baby?\u201d, \u201cdoes this look like the top-center of a baby?\u201d, \u201cdoes this look like the top-right of a baby?\u201d, etc. It repeats this for all possible classes. 
If enough of the sub-regions say \u201cyes, I match up with that part of a baby!\u201d, the RoI gets classified as a baby after a softmax over all the classes.<\/p>\n<p id=\"e10a\">With this setup, R-FCN is able to simultaneously address\u00a0<em>location variance\u00a0<\/em>by proposing different object regions, and\u00a0<em>location invariance<\/em>\u00a0by having each region proposal refer back to the same bank of score maps. These score maps should learn to classify a cat as a cat, regardless of where the cat appears. Best of all, it is fully convolutional, meaning all of the computation is shared throughout the network.<\/p>\n<p id=\"1174\">As a result, R-FCN is several times faster than Faster R-CNN, and achieves comparable accuracy.<\/p>\n<h3 id=\"0576\">SSD<\/h3>\n<p id=\"94e4\">Our final model is SSD, which stands for\u00a0<strong>S<\/strong>ingle-<strong>S<\/strong>hot\u00a0<strong>D<\/strong>etector. Like R-FCN, it provides enormous speed gains over Faster R-CNN, but does so in a markedly different manner.<\/p>\n<p id=\"34a7\">Our first two models performed region proposals and region classifications in two separate steps. First, they used a region proposal network to generate regions of interest; next, they used either fully-connected layers or position-sensitive convolutional layers to classify those regions. SSD does the two in a \u201csingle shot,\u201d simultaneously predicting the bounding box and the class as it processes the image.<\/p>\n<p id=\"b24c\">Concretely, given an input image and a set of ground truth labels, SSD does the following:<\/p>\n<ol>\n<li id=\"7fd2\">Pass the image through a series of convolutional layers, yielding several sets of feature maps at different scales (e.g. 10&#215;10, then 6&#215;6, then 3&#215;3, etc.)<\/li>\n<li id=\"e375\">For each location in\u00a0<em>each<\/em>\u00a0of these feature maps, use a 3&#215;3 convolutional filter to evaluate a small set of default bounding boxes. 
These default bounding boxes are essentially equivalent to Faster R-CNN\u2019s anchor boxes.<\/li>\n<li id=\"48d2\">For each box, simultaneously predict a) the bounding box offset and b) the class probabilities<\/li>\n<li id=\"a77c\">During training, match the ground truth box with these predicted boxes based on\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Jaccard_index\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/en.wikipedia.org\/wiki\/Jaccard_index\" data->IoU<\/a>. The best predicted box will be labeled a \u201cpositive,\u201d along with all other boxes that have an IoU with the ground truth greater than 0.5.<\/li>\n<\/ol>\n<p id=\"d9b5\">SSD sounds straightforward, but training it has a unique challenge. With the previous two models, the region proposal network ensured that everything we tried to classify had some minimum probability of being an \u201cobject.\u201d With SSD, however, we skip that filtering step. We classify and draw bounding boxes from\u00a0<em>every single position in the image<\/em>, using\u00a0<em>multiple different shapes<\/em>, at\u00a0<em>several different scales<\/em>. As a result, we generate a much greater number of bounding boxes than the other models, and nearly all of them are negative examples.<\/p>\n<p id=\"fab5\">To fix this imbalance, SSD does two things. Firstly, it uses\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/cognitive-toolkit\/Object-Detection-using-Fast-R-CNN#algorithm-details\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/docs.microsoft.com\/en-us\/cognitive-toolkit\/Object-Detection-using-Fast-R-CNN#algorithm-details\" data->non-maximum suppression<\/a>\u00a0to group together highly-overlapping boxes into a single box. In other words, if four boxes of similar shapes, sizes, etc. contain the same dog, NMS would keep the one with the highest confidence and discard the rest. 
Secondly, the model uses a technique called\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1608.02236.pdf\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/arxiv.org\/pdf\/1608.02236.pdf\" data->hard negative mining<\/a>\u00a0to balance classes during training. In hard negative mining, only a subset of the negative examples with the highest training loss (i.e. false positives) are used at each iteration of training. SSD keeps a 3:1 ratio of negatives to positives.<\/p>\n<p id=\"a39d\">Its architecture looks like this:<\/p>\n<figure id=\"4a13\" data-scroll=\"native\"><canvas width=\"75\" height=\"20\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 202px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1200\/1*p-lSawysBsiBzlcWZ9_UMw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1200\/1*p-lSawysBsiBzlcWZ9_UMw.png\" \/><\/figure>\n<p id=\"a16a\">As I mentioned above, there are \u201cextra feature layers\u201d at the end that scale down in size. These varying-size feature maps help capture objects of different sizes. For example, here is SSD in action:<\/p>\n<figure id=\"17e8\"><canvas width=\"75\" height=\"27\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 262px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*JuhjYUWXgfxMMoa4SIKLkA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*JuhjYUWXgfxMMoa4SIKLkA.png\" \/><\/figure>\n<p id=\"e605\">In smaller feature maps (e.g. 4&#215;4), each cell covers a larger region of the image, enabling them to detect larger objects. Region proposal and classification are performed simultaneously: given\u00a0<em>p<\/em>\u00a0object classes, each bounding box is associated with a (4+<em>p<\/em>)-dimensional vector that outputs 4 box offset coordinates and\u00a0<em>p<\/em>\u00a0class probabilities. In the last step, softmax is again used to classify the object.<\/p>\n<p id=\"4bb5\">Ultimately, SSD is not so different from the first two models. 
It simply skips the \u201cregion proposal\u201d step, instead considering every single bounding box in every location of the image simultaneously with its classification. Because SSD does everything in one shot, it is the\u00a0<a href=\"https:\/\/github.com\/tensorflow\/models\/blob\/master\/object_detection\/g3doc\/detection_model_zoo.md\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/tensorflow\/models\/blob\/master\/object_detection\/g3doc\/detection_model_zoo.md\" data->fastest of the three models<\/a>, and still performs quite comparably.<\/p>\n<h3 id=\"b85b\">Conclusion<\/h3>\n<p id=\"ac21\">Faster R-CNN, R-FCN, and SSD are three of the best and most widely used object detection models out there right now. Other popular models tend to be fairly similar to these three, all relying on deep CNN\u2019s (read: ResNet, Inception, etc.) to do the initial heavy lifting and largely following the same proposal\/classification pipeline.<\/p>\n<p id=\"8dec\">At this point, putting these models to use just requires knowing Tensorflow\u2019s API. Tensorflow has a starter tutorial on using these models\u00a0here. Give it a try!<\/p>\n<p>Original appeared at <a href=\"https:\/\/towardsdatascience.com\/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9\" rel=\"noopener\">Towards Data Science<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Faster R-CNN, R-FCN, and SSD are three of the best and most widely used object detection models out there right now. Other popular models tend to be fairly similar to these three, all relying on deep CNN&rsquo;s (read: ResNet, Inception, etc.) to do the initial heavy lifting and largely following the same proposal\/classification pipeline.&nbsp; At this point, putting these models to use just requires knowing Tensorflow&rsquo;s API. Tensorflow has a starter tutorial on using these models. 
Give it a try!<\/p>\n","protected":false},"author":523,"featured_media":4276,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3167],"class_list":["post-1613","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3167,"user_id":523,"is_guest":0,"slug":"joyce-xu","display_name":"Joyce Xu","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Xu","first_name":"Joyce","job_title":"","description":"Joyce Xu&nbsp;is an AI\/ML engineer at DeepMind. She has a special interest in reinforcement learning, natural language processing, and distributed computing. Her writings on deep learning have been featured on Hacker News, KDnuggets, AITopics (a publication of the Association for the Advancement of Artificial Intelligence), the Startup Grind, and Towards Data Science. 
She was invited to speak at a few conferences for data science and ML."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1613","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/523"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1613"}],"version-history":[{"count":3,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1613\/revisions"}],"predecessor-version":[{"id":28249,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1613\/revisions\/28249"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4276"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1613"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1613"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1613"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1613"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}