The field of digital image processing is increasingly shifting from deterministic mathematical calculations and statistical methods to machine learning (ML)-based approaches, which can often provide more accurate results. This, in turn, is helping to drive new methods and use cases for computer vision in the pursuit of extracting information from images.
The use of ML for image processing and computer vision is a growing part of ML research. While it's a hot topic in and of itself, it's driving some of the hottest topics in the tech industry including robotics, self-driving cars, and facial recognition.
In this blog, we explore five of the top ways in which ML is being used for image processing and computer vision, and look into why PerceptiLabs is so well suited for enabling them:
- Image Classification
- Object Detection and Tracking
- Object and Instance Segmentation
- Image Enhancement and Reconstruction
- Generative Imagery
Computer Vision Problem Types
1. Image Classification
Image classification, sometimes referred to as image recognition, seeks to identify what an entire image represents, and to then classify that representation (i.e., associate it with a label).
ML practitioners often use the well-known MNIST database of handwritten digits as their "hello world" project to learn about neural networks and supervised learning. It consists of 28x28-pixel grayscale images of handwritten digits and the corresponding labels for the digits (ranging from 0 through 9).
Using these images, ML practitioners learn how to build an ML model where parts of each image are used as input to the neural network, and how to build up multiple network layers (i.e., deep neural networks), based on increasingly complex features (e.g., edges, lines, etc.) to derive a probability that a given image represents a certain digit.
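To make the "probability that a given image represents a certain digit" idea concrete, here is a minimal sketch of how a network's final raw scores could be turned into probabilities. It assumes a softmax output, which is one common choice rather than something MNIST requires, and the `logits` values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the digits 0-9 from a network's last layer.
logits = [0.1, 0.3, 0.2, 4.0, 0.1, 0.5, 0.2, 0.9, 0.3, 0.4]
probs = softmax(logits)

# The predicted label is simply the digit with the highest probability.
predicted_digit = max(range(10), key=lambda d: probs[d])
print(predicted_digit, round(probs[predicted_digit], 3))
```

During supervised training, these probabilities would be compared against the true label so that the network's weights can be adjusted.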
Of course, the use of image classification goes far beyond training new ML practitioners, and incorporates a wide variety of ML tactics. One of the key structures used for image classification, and indeed for all of the other categories we'll discuss below, is a Convolutional Neural Network (CNN or ConvNet). There is an excellent article about CNNs here, but in a nutshell, CNNs roughly mimic the visual cortex, where certain neurons respond to certain regions of a visual field. A Convolutional Layer in this network takes a tensor of a certain shape as input (e.g., a two-dimensional array of pixels) and convolves it into a higher-level feature map of a different shape. Similarly, a Pooling Layer can be used in the network to reduce dimensionality, and thus the processing power required to analyze features.
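As a rough illustration of what these two layer types do, the sketch below applies a single hand-written 3x3 filter to a tiny image and then pools the result. The function names, the image, and the filter values are ours for illustration; a real CNN learns its filters during training and would use an optimized library rather than nested Python lists.

```python
def conv2d(image, kernel):
    """Stride-1 convolution without padding (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 6x6 image that is dark on the left and bright on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# A hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

features = conv2d(image, vertical_edge)  # 6x6 input -> 4x4 feature map
pooled = max_pool(features, size=2)      # 4x4 feature map -> 2x2
print(features)
print(pooled)
```

Note how the feature map is largest where the edge sits, and how pooling keeps that response while quartering the number of values to process.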
However, when such networks become overly large they can suffer from the vanishing gradient problem during back propagation in which smaller and smaller gradients of the loss function cause updates to the weights to tend towards zero. This can increase training time or stall it altogether, and in some cases, even decrease the model's ability to classify images accurately.
One way around this is to use a Residual Neural Network (ResNet), which employs skip connections. These link features from earlier in the network to newly-computed features after the Convolutional layers, providing multiple, shorter paths for gradients to flow during back propagation. These elements are often referred to as residual blocks.
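A residual block can be sketched in a few lines. The example below is a deliberate simplification: small fully connected layers stand in for the Convolutional layers, all names and weights are hypothetical, and the gradient comparison at the end uses scalar layers with a fixed local derivative purely to show why the skip path helps.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(values, weights):
    """Tiny fully connected layer standing in for a convolution."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two small layers and '+ x' is the skip connection."""
    fx = dense(relu(dense(x, w1)), w2)
    return relu([f + skip for f, skip in zip(fx, x)])

# With all-zero weights, F(x) is zero and the block passes a non-negative input through unchanged.
zeros = [[0.0, 0.0], [0.0, 0.0]]
assert residual_block([1.0, 2.0], zeros, zeros) == [1.0, 2.0]

# Simplified gradient scale through 20 stacked scalar layers, each with local derivative 0.1.
depth, local_grad = 20, 0.1
plain = local_grad ** depth              # product of small terms shrinks towards zero
with_skip = (1.0 + local_grad) ** depth  # d(F(x) + x)/dx = F'(x) + 1, so each term stays near 1
print(plain, with_skip)
```

Because each block's derivative includes the "+ 1" contributed by the skip path, the gradient reaching early layers no longer depends solely on a long product of small numbers.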
Use cases for image recognition: A range of industries from IoT and social media to retail and robotics make use of image recognition. Some of the most popular uses today include facial recognition, medical or industrial IoT image analysis to flag abnormalities, ML-aided CAD, and locating certain types of objects in satellite imagery.
2. Object Detection and Tracking
Object detection is similar to image classification, but handles the additional challenge of finding one or more objects within an image, a process known as localization. Object tracking takes this one step further and seeks to identify and follow objects across multiple image frames (e.g., across recorded frames of video).
ML practitioners often use object detection to display bounding boxes around the identified objects while also determining and rendering each object's classification. Object tracking may extend this to also render vectors depicting the known and predicted paths of objects.
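When working with bounding boxes, a common way to measure how well a predicted box matches a labelled one is Intersection over Union (IoU). IoU is not discussed elsewhere in this blog, so treat the sketch below as background; it assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so non-overlapping boxes yield an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(iou(predicted, ground_truth))
```

An IoU of 1.0 means the boxes coincide exactly, while 0.0 means they do not overlap at all.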
Region-Based Convolutional Neural Networks (R-CNNs) and variations thereof are a class of commonly-used multi-stage algorithms for object detection. In the first stage, an R-CNN attempts to find and propose general regions of pixels under the assumption that similar pixels usually belong to the same object. The process then refines these proposals and begins the second stage of building more fine-grained recognition.
Another popular class of methods is known as You Only Look Once (YOLO), which is based on a regression rather than a classification approach. YOLO splits an image into an S x S grid, where each cell regresses the bounding box location. Each cell is also assigned a confidence score indicating the likelihood that the cell is located at the center of an object, and the score is compared against a ground truth label during training.
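The grid idea can be sketched as follows: given a labelled box, find the cell that contains the box's center, since that is the cell expected to predict it. The function name, the 448x448 image size, and S = 7 are illustrative assumptions, and the sketch covers only the cell assignment, not YOLO's full training targets.

```python
def responsible_cell(box, image_w, image_h, s):
    """Return (row, col, x_offset, y_offset) for the S x S cell containing the box center.

    box is (x_min, y_min, x_max, y_max) in pixels; offsets are the center's
    position within its cell, each in the range [0, 1).
    """
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2

    # Express the center in units of grid cells.
    grid_x = center_x * s / image_w
    grid_y = center_y * s / image_h

    # Clamp so a center exactly on the right/bottom border stays inside the grid.
    col = min(int(grid_x), s - 1)
    row = min(int(grid_y), s - 1)
    return row, col, grid_x - col, grid_y - row

# Hypothetical labelled box in a 448x448 image split into a 7x7 grid.
row, col, x_offset, y_offset = responsible_cell((200, 80, 300, 180), 448, 448, 7)
print(row, col, x_offset, y_offset)
```

During training, only the cell at `(row, col)` would be asked to produce a high confidence score for this object.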
Use cases for object detection: Many industries make use of object detection including IoT, social media, healthcare, and robotics. Some of today's popular uses include highlighting and labelling anomalies in medical images, tracking cars and people for autonomous driving, and monitoring suspects in security footage.
For more information about object detection methods, check out this excellent article.
3. Object and Instance Segmentation
Object segmentation seeks to partition groups of pixels of a digital image into segments which can then be labelled, located, and even tracked as objects in one or more images. It's similar to object detection, but works at the pixel level. Often called semantic segmentation or standard semantic segmentation, it seeks to identify, classify, and often highlight each pixel of an object, rather than finding features (e.g., curves and lines) to provide the general localization (e.g., bounding box) for an object.
Segmentation can be a powerful tool for building complex image classification and detection strategies because it allows us to understand what object each pixel in an image belongs to and to classify objects detected in the image.
Instance segmentation, or instance-aware semantic segmentation, takes things a step further by not only classifying groups of pixels as certain types of objects, but also distinguishing them as instances of an object type (e.g., multiple cars in a scene). For example, it can distinguish between people in a scene as instances of one type of object (people) and the background as a different type of object. This, in turn, allows us to build systems which effectively provide more context about an image (e.g., it can produce a high-level description like "there are five cars waiting at the crosswalk while three pedestrians cross the road").
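The difference between the two outputs can be illustrated with a toy example. Below, a semantic mask marks every "car" pixel with the same class id, and a simple connected-component pass then gives each separate blob its own instance id. This is only meant to show what the outputs look like: the function and mask are hypothetical, real instance segmentation models predict instances directly, and this naive approach would merge cars that touch.

```python
from collections import deque

def label_instances(class_mask, target_class):
    """Give each 4-connected region of target_class pixels a distinct instance id (1, 2, ...)."""
    h, w = len(class_mask), len(class_mask[0])
    instance_ids = [[0] * w for _ in range(h)]
    next_id = 0
    for r in range(h):
        for c in range(w):
            if class_mask[r][c] != target_class or instance_ids[r][c]:
                continue
            # Found an unlabelled pixel: flood-fill its whole region with a new id.
            next_id += 1
            instance_ids[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and class_mask[ny][nx] == target_class
                            and not instance_ids[ny][nx]):
                        instance_ids[ny][nx] = next_id
                        queue.append((ny, nx))
    return instance_ids, next_id

CAR = 1  # hypothetical class id; 0 is background
semantic_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
instance_mask, car_count = label_instances(semantic_mask, CAR)
print(car_count)
print(instance_mask)
```

The semantic mask can only say "these pixels are car", whereas the instance mask supports statements such as "there are two cars in the scene".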
A special type of CNN called a U-Net, which outputs a label for every pixel, is often used for segmentation. It consists of a contracting path that repeatedly convolves and downsamples the image into lower resolutions, which increases the receptive field and captures the context of the image, and an expansive path that upsamples the feature maps back towards the original resolution. Skip connections are used to feed feature information from the convolution layers of the contracting path to the concatenation layers in the expansive path.
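To see how the two paths and the skip connections fit together, it can help to trace just the tensor shapes. The sketch below does only that bookkeeping for a simplified U-Net; it assumes padded convolutions (so only pooling and upsampling change the resolution), channel counts that double at each level, and an input size divisible by 2 to the power of `depth`. All names and numbers are illustrative.

```python
def unet_shapes(size, base_channels, depth):
    """Trace (stage, height/width, channels) through a simplified U-Net."""
    trace = []
    skips = []
    channels = base_channels

    # Contracting path: keep each level's features for later, then halve the resolution.
    for _ in range(depth):
        trace.append(("down", size, channels))
        skips.append((size, channels))
        size //= 2
        channels *= 2

    trace.append(("bottleneck", size, channels))

    # Expansive path: double the resolution, then concatenate the matching skip features.
    for _ in range(depth):
        size *= 2
        channels //= 2
        skip_size, skip_channels = skips.pop()
        assert skip_size == size, "input size must be divisible by 2**depth"
        # Channel count right after concatenation; later convolutions reduce it to `channels`.
        trace.append(("up", size, channels + skip_channels))

    return trace

for stage in unet_shapes(size=128, base_channels=16, depth=3):
    print(stage)
```

The final "up" stage is back at the input resolution, which is what allows the network to emit one label per pixel.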
Use cases for semantic segmentation: Semantic segmentation plays an important role where a very precise understanding of an image is required. Industries that make use of semantic segmentation include military, robotics, IoT, city planners, and others who need very detailed analysis of scenes.
A typical example of semantic segmentation is to identify different features of land (e.g., urban, agriculture, water, etc.) captured in satellite images for use in city planning. For cases where safety is involved such as in robotics and autonomous vehicles, semantic segmentation can be used to identify elements like the amount of free space on a road or important markers. Similarly, segmentation can also be used for various healthcare purposes.
For additional information about semantic segmentation, check out this great article.
4. Image Enhancement and Reconstruction
Image enhancement techniques have benefited greatly from the application of ML and are evolving rapidly. Some of the more well-known techniques include smoothing, sharpening, upscaling (aka super resolution) or downscaling, adjusting contrast, colorizing or grayscaling, and transformations (e.g., resizing, shearing, etc.). While traditional approaches like interpolation, statistical techniques, etc. continue to work well, ML techniques are often able to provide even better results.
Again, CNNs reign supreme due to their ability to reduce images into forms which are easier to process without losing critical information and their ability to extract features. By leveraging such aspects of CNNs, developers have used them to solve a number of image enhancement problems. Many CNN-based architectures are in use today for image enhancement, including Generative Adversarial Networks (GANs) (described in the next section) and U-Nets (e.g., to enhance dark photos).
Use cases for image enhancement: Many industries make use of image enhancement including entertainment (e.g., movies and video games), art production and graphic design, photography, medical, and security. Uses for image enhancement and reconstruction include improving or extracting features from tomographic images, cleaning up noisy (e.g., grainy) images, filling in missing content, increasing brightness, etc. A great example of ML, and more specifically GANs for image enhancement, is described in Disney Research's A Fully Progressive Approach to Single-Image Super-Resolution.
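For reference, here is what one of those traditional approaches looks like: a minimal sketch of bilinear interpolation for upscaling a grayscale image. It assumes corner-aligned sampling (the corner pixels of the input and output coincide), which is one of several conventions in use. ML-based super resolution aims to recover detail that this kind of fixed formula cannot.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a 2D grayscale image with corner-aligned bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    # Scale factors mapping output indices to (fractional) source coordinates.
    y_scale = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    x_scale = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0

    result = []
    for i in range(out_h):
        y = i * y_scale
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0  # vertical blend weight towards row y1
        row = []
        for j in range(out_w):
            x = j * x_scale
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0  # horizontal blend weight towards column x1
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        result.append(row)
    return result

small = [[0, 10],
         [20, 30]]
for row in bilinear_resize(small, 3, 3):
    print(row)
```

Each new pixel is just a weighted average of its four nearest source pixels, which is why interpolated images tend to look smooth but soft.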
5. Generative Imagery
The ability to generate imagery to varying degrees is probably one of the most exciting, and at times, controversial applications of ML for image processing.
Generative imagery can range from the creation of deep fakes (i.e., the creation of images or even videos of objects or persons that look like their real-world counterparts), to image synthesis where objects may be added to a scene or modified.
Another use is Neural Style Transfer, in which an image is re-styled or painted in the style of another image.
In addition to CNNs, GANs are commonly used in generative imagery, notably in the creation of deep fakes where they learn to both generate images (often starting with just random noise) and discern between real and generated images.
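The "generate" and "discern" roles can be made concrete by looking at the losses the two halves of a GAN are typically trained with. The sketch below assumes the standard binary cross-entropy formulation with the commonly used non-saturating generator loss; the function names are ours, and `d_real` and `d_fake` stand for the discriminator's estimated probability that a real or generated image is real.

```python
import math

def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for a single probability, clipped to avoid log(0)."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """Low when real images are scored near 1 and generated images near 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    """Low when the discriminator scores generated images near 1 (i.e., it is fooled)."""
    return bce(d_fake, 1.0)

# A discriminator that cannot tell real from generated outputs 0.5 for everything.
print(discriminator_loss(0.5, 0.5))
# A confident, correct discriminator has a much lower loss...
print(discriminator_loss(0.9, 0.1))
# ...which in turn means a high loss for the generator.
print(generator_loss(0.1))
```

Training alternates between lowering these two losses, so each network improves by competing against the other.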
Use cases for generative imagery: Numerous industries use generative imagery including social media, artists, marketing and advertising, etc., and the use cases for generative imagery are growing as this area advances. Uses include developing systems which learn to recognize fake imagery (e.g., for security), facial modifications in social media games, creation of art, creation of alternative examples for sharing (e.g., when the original data is to remain private), etc. However, given the powerful capabilities of generative imagery, ML practitioners should always seek to use them for positive purposes.
For more information about GANs, check out this article on StyleGAN, originally developed by Nvidia.
PerceptiLabs at the Center of Your Image Processing
ML is a powerful tool to solve problems of a visual nature because you can literally see the results as you build deep ML models. That's why when we began building PerceptiLabs we focused on providing a GUI and visual API (components) for solving image processing problems, an approach that aligns with our philosophy of transparency and explainability of ML models.
Out of the box, PerceptiLabs' visual API includes a number of powerful components for solving image-related problems including Convolution and Deconvolution, Classification, Regression, GAN, and Object Detection. Users are also free to modify these components as required or even use the Custom component to write one from scratch.
The visual, drag-and-drop nature of PerceptiLabs' workflow is a powerful element in and of itself. Through the UI, users can easily author complex models, see how each component is modifying the data, and construct very complex networks such as those involving skip connections. For example, the figure above allows the user to immediately see how input tensor data is reshaped into an image, followed by a simple convolution and classification during training.
For additional information on using PerceptiLabs to bring ML into your image processing and computer vision projects, be sure to check out the use cases and tutorials in our documentation.
Also be sure to check out the following image-related datasets which you can use to experiment in PerceptiLabs: