What Is Computer Vision and How Does It Work?
Artificial intelligence (AI) has risen from a fringe concept in science fiction to one of the most influential technologies ever conceived.
Building systems that can understand visual information has been a cornerstone of AI research and development. This allows machines to ‘see’ and respond to the world around them.
The domain of AI related to vision and visual data is called computer vision (CV). Computer vision equips computers with the ability to process, interpret, analyze, and understand visual data.
Computer vision has many practical applications, ranging from building autonomous vehicles (AVs) to medical diagnostics. Apps equipped with computer vision are ubiquitous, such as Google Lens, which extracts features from images taken with your phone to understand what they are and look them up on the internet.
In this article, we will explore computer vision, how it works, and some of its most exciting applications.
Computer vision equips digital systems with the ability to understand and interpret visual data.
The fundamental goal of computer vision is to create machines that can “see” and interpret the visual world.
Visual data includes both visible light, which humans can see, and light that is invisible to us, such as ultraviolet and infrared.
We live in a world illuminated by the electromagnetic spectrum.
Light emitted by the sun and other light sources strikes reflective materials and enters our eyes. Humans can see but a small portion of the electromagnetic spectrum, called “visible light”.
Our retina (light-sensitive tissue in the eye) contains photoreceptors that turn light into electrical signals, which are then passed to our brain.
In many ways, this is the easy bit. The eye is the lens – it fulfils a mechanical function, similar to a camera lens. The first cameras were invented in the early 19th century, and the first video cameras followed around 1890, roughly a century before competent computer vision technology.
In other words, physically capturing an image is simpler than understanding it.
Computer vision links vision technology, e.g., cameras – the “eyes” – to a system of understanding – the “brain.”
With computer vision, visual imaging devices like cameras can be combined with computers to derive meaning from visual data.
The computer is analogous to the brain, and the camera to the eye.
The first digital scan was created in 1957 by Russell Kirsch, birthing the concept of the ‘pixel.’ Soon after, the first digital image scanners were built, which could turn visual images into grids of numbers.
In the early 1960s, researchers at MIT launched research into computer vision and believed they could attach a camera to a computer and have it “describe what it saw.” This appears somewhat simple on paper, but the realities of accomplishing it quickly dawned on the computer science community.
MIT’s early investigations into CV triggered a series of international projects culminating in the first functional computer vision technologies.
A major milestone was reached in the late 1970s, when Japanese researcher Kunihiko Fukushima built the Neocognitron, a neural network inspired by the human brain’s primary visual cortex.
The pieces were coming together – these early CV systems linked a system of vision, a camera, with a system of understanding, a computer.
Fukushima’s work eventually culminated in the development of modern convolutional neural networks.
Computer vision has a short history, progressing from the first digital scans of the 1950s to today’s deep learning-powered systems in only a few decades.
Computer vision is a complex process that involves many stages, including image acquisition, pre-processing, data labeling, feature extraction, and classification.
Generally speaking, modern computer vision works via a combination of image processing techniques, algorithmic processing, and deep neural networking.
The process starts with data ingestion, using an image or video feed captured by a camera or some other visual sensor to capture information. Next, images are pre-processed into a digital format that the system can understand.
Initially, the model uses various image analysis techniques, such as edge detection, to identify key features in the image. In the case of a still image, a convolutional neural network (CNN) helps the model “look” by analyzing pixels and performing convolutions, a mathematical operation in which a small filter is slid across the image. Recurrent neural networks (RNNs) are often used for sequential video data.
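For intuition, here is a minimal sketch of the kind of convolution an edge detector performs, using a hand-designed Sobel kernel with NumPy and SciPy (the image path is a placeholder):

```python
import numpy as np
from scipy.signal import convolve2d
from PIL import Image

# Load an image and convert it to a grayscale array
# ("photo.png" is a placeholder path).
image = np.asarray(Image.open("photo.png").convert("L"), dtype=float)

# A Sobel kernel responds strongly to vertical edges:
# bright-to-dark transitions along the horizontal axis.
sobel_x = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)

# Slide the kernel across the image. A CNN performs the same operation,
# but learns the kernel values during training instead of having them
# hand-designed.
edges = convolve2d(image, sobel_x, mode="same", boundary="symm")

print(edges.shape)  # same height and width as the input image
```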
Computer vision with deep learning has revolutionized the capability of models to understand complex visual data by passing data through layers of nodes that perform iterative calculations.
This process is similar to how humans understand visual data. We tend to see edges, corners, and other stand-out features first. We then work to determine the remainder of the scene, which involves a fair amount of prediction. This is partly how optical illusions work – our brain predicts the probable characteristics of visual data similar to CV algorithms.
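To make the idea of layered processing concrete, here is a minimal, illustrative CNN in PyTorch. The layer sizes, input resolution, and class count are arbitrary choices for the sketch, not a recommended architecture:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately small CNN: two convolutional blocks followed by a classifier."""

    def __init__(self, num_classes: int = 10):  # class count is arbitrary here
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learnable 3x3 kernels
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample by 2x
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)      # each layer refines the previous layer's features
        x = torch.flatten(x, 1)
        return self.classifier(x)

# One forward pass over a batch of four 224x224 RGB images.
logits = TinyCNN()(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 10])
```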
Image classification, object detection, and image segmentation are the three main types of computer vision tasks.
There are several others, and while the list is not exhaustive, the most common tasks include:
1: Image Classification: The task of categorizing images into distinct classes or labels based on their content. For example, classifying images of cats, dogs, and birds.
2: Object Detection: The process of identifying and locating specific objects within an image, usually by drawing bounding boxes around them and associating them with class labels. For example, detecting cars, pedestrians, and traffic signs in a street scene.
3: Image Segmentation: The task of partitioning an image into multiple segments or regions, often based on the objects or distinct features present. There are two main types of image segmentation:
A) Semantic Segmentation: Assigning a class label to each pixel in the image, resulting in a dense classification map where each pixel is associated with a specific class.
B) Instance Segmentation: Extending semantic segmentation to distinguish and separate instances of the same object class, such as differentiating between multiple cars in an image.
4: Object Tracking: The process of locating and following the movement of specific objects over time in a sequence of images or video frames. This is important in applications like video surveillance, autonomous vehicles, and sports analytics.
5: Optical Character Recognition (OCR): The process of converting printed or handwritten text in images into machine-readable and editable text. OCR is commonly used in document scanning, license plate recognition, and text extraction from images.
Understanding the difference between object detection and image segmentation is often particularly tricky.
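One way to see the distinction is to compare what each task outputs. The structures below are purely illustrative (they are not taken from any particular library): classification returns one label per image, detection returns a list of labeled boxes, and segmentation returns a per-pixel class map:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

# Hypothetical output structures, just to contrast the three tasks.

@dataclass
class ClassificationResult:
    label: str             # one label for the whole image, e.g. "cat"
    confidence: float

@dataclass
class DetectionResult:
    label: str
    confidence: float
    box: tuple             # (x_min, y_min, x_max, y_max) bounding box in pixels

@dataclass
class SegmentationResult:
    class_map: np.ndarray  # H x W array, one class index per pixel

# Classification: one label per image.
classification = ClassificationResult(label="cat", confidence=0.97)

# Detection: a list of labelled boxes, one per object found.
detections: List[DetectionResult] = [
    DetectionResult("car", 0.91, (34, 80, 210, 160)),
    DetectionResult("pedestrian", 0.88, (250, 60, 300, 180)),
]

# Segmentation: a dense per-pixel class map the same size as the image.
segmentation = SegmentationResult(class_map=np.zeros((480, 640), dtype=np.int64))
```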
Computer vision algorithms and applications can be either supervised or unsupervised.
In the case of supervised machine learning, algorithms are trained on a large dataset of labeled images.
During training, the machine learning algorithms learn to recognize patterns in the training dataset in order to classify new images based on similar features.
For example, if you want a model to understand road signs, you first need to teach it what the road signs are.
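As a rough illustration, the sketch below shows what a supervised training step looks like in PyTorch. The “road sign” dataset here is random stand-in data with made-up class indices, and the single linear layer stands in for a real image classifier:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in labelled dataset: each image tensor is paired with a class index,
# e.g. 0 = "stop sign", 1 = "speed limit", 2 = "yield" (labels are illustrative).
images = torch.randn(64, 3, 64, 64)
labels = torch.randint(0, 3, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

# Any image classifier would do here; a single linear layer keeps the sketch short.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Supervised learning: the model's predictions are compared against the
# human-provided labels, and its weights are nudged to reduce the error.
for batch_images, batch_labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch_images), batch_labels)
    loss.backward()
    optimizer.step()
```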
Here’s a brief rundown of data labeling for machine learning-based computer vision:
Read our ultimate guide to data labeling for more information.
The process of data labeling takes place before any data is fed into the model.
There is a famous phrase: “rubbish in, rubbish out!” To build accurate, efficient models that generalize well, preparing high-quality datasets is essential.
While the model training, tuning, and optimization process is also crucial for building a quality model, the foundations need to be solid, which means producing high-quality datasets.
Get in touch for more information.
Depending on the intent of the model and the architecture you’re using, image and video data can be labeled in many ways.
For example, popular object detection models like YOLO require bounding box or polygon-labeled data.
Image segmentation models such as Mask R-CNN, SegNet, and U-Net require pixel-level segmentation labels.
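To make those formats concrete, here are illustrative annotation snippets. The numbers are made up, and in practice they would be produced by an annotation tool rather than written by hand:

```python
import numpy as np

# YOLO-style bounding box label: one line per object in a .txt file that
# shares the image's filename. Format: class_id x_center y_center width height,
# with all coordinates normalised to the 0-1 range.
yolo_label = "2 0.515 0.430 0.120 0.260"

# COCO-style polygon annotation: vertices listed as [x1, y1, x2, y2, ...]
# in pixel coordinates, typically stored in a shared JSON file.
coco_annotation = {
    "image_id": 42,
    "category_id": 3,
    "segmentation": [[120.0, 80.0, 180.0, 82.0, 176.0, 150.0, 118.0, 148.0]],
    "bbox": [118.0, 80.0, 62.0, 70.0],  # x, y, width, height
}

# Pixel-level segmentation label: an integer mask the same size as the image,
# where each value is a class index (0 = background here).
mask = np.zeros((480, 640), dtype=np.uint8)
mask[200:260, 300:400] = 3  # pixels belonging to class 3
```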
The core types of data annotation include bounding boxes, polygons, and pixel-level segmentation masks. Each annotation type has its own characteristics, advantages, and typical use cases, and the right choice depends on the model and the task at hand.
Computer vision has many practical applications across virtually every sector and industry, from autonomous vehicles and medical diagnostics to consumer apps like Google Lens.
While the field of computer vision has made significant progress in recent years, there are still many challenges and limitations to overcome.
Complex models are extremely data-hungry. The more complex the model, the more data it needs.
Creating accurate training data is resource-intensive, and training models on large datasets is computationally expensive.
Obtaining labeled data can be time-consuming and expensive, particularly for specialized applications. For instance, human teams must work with domain specialists when labeling data for high-risk applications such as medical diagnostics.
Automated and semi-automated data labeling is helping alleviate this bottleneck, but replacing skilled human annotation teams is proving difficult.
Another challenge is the ever-changing complexity of the visual world, which poses a problem for models trained only on the data available today.
For example, autonomous vehicles are trained on datasets containing street features that were current at the time of collection, but these features change: there are now far more eScooters on the roads than there were five years ago.
To respond to changes in the visual world, complex CV models like those used in self-driving cars must combine supervised models with unsupervised models to extract new features from the environment.
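As a rough sketch of that idea, unlabeled object embeddings can be clustered so that recurring but unrecognized objects (such as eScooters) surface for human review and relabeling. The embeddings below are random stand-ins, and the cluster count is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume each detected-but-unrecognised object has already been converted into
# a feature embedding by the vision backbone (512-dim vectors here; the values
# are random stand-ins).
embeddings = np.random.rand(200, 512)

# Unsupervised clustering groups visually similar objects together without
# labels; a recurring cluster can then be flagged for human review, labelled,
# and folded back into the supervised training set.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)

print(np.bincount(clusters))  # how many objects fell into each cluster
```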
Latency is one of the most pressing issues facing real-time AI models. We take our reaction times for granted, but they result from millions of years of complex evolution.
Visual data enters the eye before traveling down the optic nerve to the brain. Our brain then starts processing data and makes us conscious of it.
To build robots that respond to stimuli in real time like biological systems do, we need to build extremely low-latency models. Read more about this here.
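As a simple illustration, per-frame inference latency can be measured directly. The model below is a stand-in, and a real-time pipeline targeting 30 frames per second has a budget of roughly 33 ms per frame end to end:

```python
import time
import torch
import torch.nn as nn

# A stand-in model; in practice this would be the full perception network.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

frame = torch.randn(1, 3, 224, 224)  # one camera frame

# Time a single forward pass to estimate per-frame inference latency.
with torch.no_grad():
    start = time.perf_counter()
    model(frame)
    elapsed_ms = (time.perf_counter() - start) * 1000

print(f"inference latency: {elapsed_ms:.1f} ms")
```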
Computer vision is a fascinating field of AI that has many practical applications, many of which have already changed lives.
Despite the field being only around 60 years old, we’ve built advanced technologies that can “see” and understand visual data in a similar fashion to humans, often at comparable speeds.
While computer vision has made massive progress in recent years, there are still many barriers to overcome for it to fully realize its potential. However, at current rates of development, it’s only a matter of time.
The future of computer vision relies on high-quality training data, which specialist data providers like Aya Data provide. Contact us to discuss your next computer vision project.