Computer vision can’t do it all yet
Jitendra Malik, a researcher in computer vision for three decades, does not own a Tesla, but he has advice for people who do.
“Knowing what I know about computer vision, I wouldn’t take my hands off the steering wheel,” he said.
Malik, a professor at the University of California, Berkeley, was referring to a fatal crash in May of a Tesla electric car equipped with the company's Autopilot driver-assistance system. An Ohio man was killed when his Model S, driving in Autopilot mode, crashed into a tractor-trailer.
Federal regulators are still investigating the accident. But it appears likely that the man placed too much confidence in Tesla’s self-driving system. The same may be true of a fatal Tesla accident in China that was reported last month. Other automakers such as Ford, which last month announced its plan to produce driverless cars by 2021, are taking a go-slow approach, saying the technology for even occasional hands-free driving is not ready for many traffic situations.
Tesla has said that Autopilot is not meant to take over completely for a human driver. And last month, the company implicitly acknowledged that its owners should heed Malik’s advice, announcing that it was modifying Autopilot so that the system will issue drivers more frequent warnings to put their hands on the steering wheel. Tesla is also fine-tuning its radar sensors to more accurately detect road hazards and rely less on computer vision.
The Tesla accident in May, researchers say, was not a failure of computer vision. But it underscored the limitations of the science in applications such as driverless cars despite remarkable progress in recent years, fuelled by digital data, computer firepower and software inspired by the human brain.
Today, computerised sight can quickly and accurately recognise millions of individual faces, identify the makes and models of thousands of cars and distinguish cats and dogs of every breed in a way no human being could.
Yet the recent advances, while impressive, have been mainly in image recognition. The next frontier, researchers agree, is general visual knowledge — the development of algorithms that can understand not just objects, but also actions and behaviours.
Computing intelligence often seems to mimic human intelligence, so computer science understandably invites analogy. In computer vision, researchers offer two analogies to describe the promising paths ahead: a child and the brain.
The model borrowed from childhood, many researchers say, involves developing algorithms that learn as a child does, with some supervision but mostly on its own, without relying on vast amounts of hand-labelled training data, which is the current approach. “It’s early days,” Malik said, “but it’s how we get to the next level.”
In computing, the brain has served mainly as an inspirational metaphor rather than an actual road map. Aircraft do not flap their wings, artificial intelligence experts often say. Machines achieve intelligence differently from biological systems.
But Tomaso Poggio, a scientist at the McGovern Institute for Brain Research at the Massachusetts Institute of Technology, is building computational models of the brain's visual cortex, seeking to digitally emulate its structure and even the way it works and learns from experience.
If successful, the outcome could be a breakthrough in computer vision and machine learning in general, Poggio said. “To do that,” he added, “you need neuroscience not just as an inspiration, but as a strong light.”
The big gains in computer vision owe much to all the web’s raw material: countless millions of online photos used to train the software algorithms to identify images. But collecting and tagging that training data have been a formidable undertaking.
ImageNet, a collaborative effort led by researchers at Stanford and Princeton, is one of the most ambitious projects. Initially, nearly 1 billion images were downloaded. Those were sorted, labelled and winnowed to more than 14 million images in 22,000 categories. The database, for example, includes more than 62,000 images of cats.
For a computer-age creation, ImageNet has been strikingly labour intensive. At one point, the sorting and labelling involved nearly 49,000 workers on Mechanical Turk, Amazon's online crowdsourcing marketplace.
Vast image databases such as ImageNet have been employed to train software that uses neuron-like nodes, known as neural networks. The concept of computing neural networks stretches back more than three decades, but has become a powerful tool only in recent years.
“The available data and computational capability finally caught up to these ideas of the past,” said Trevor Darrell, a computer vision expert at the University of California, Berkeley.
If data is the fuel, then neural networks constitute the engine of a branch of machine learning called deep learning. It is the technology behind the swift progress not only in computer vision, but also in other forms of artificial intelligence such as language translation and speech recognition. Technology companies are investing billions of dollars in artificial intelligence research to exploit the commercial potential of deep learning.
Just how far neural networks can advance computer vision is uncertain. They emulate the brain only in general terms — the software nodes receive digital input and send output to other nodes. Layers upon layers of these nodes make up so-called convolutional neural networks, which, with sufficient training data, have become better and better at identifying images.
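The node-and-layer idea described above can be sketched in a few lines of code. The example below is an illustrative toy, not anything from the systems mentioned in the article: each output value of a convolutional layer is one "node" computing a weighted sum over a small patch of the input, followed by a non-linearity. Real networks learn their kernel weights from training data; here a single edge-detecting kernel is set by hand.

```python
def conv2d(image, kernel):
    """Slide a small kernel over the image. Each output value is one
    'node': a weighted sum over a local patch of the input."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

def relu(feature_map):
    """Non-linearity applied at every node before the next layer."""
    return [[max(v, 0.0) for v in row] for row in feature_map]

# A toy 5x5 "image" with a vertical edge down the middle.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# A hand-set 3x3 kernel that responds to vertical edges; a trained
# network would learn such weights rather than have them written in.
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(conv2d(image, edge_kernel))
print(feature_map)  # strong responses where the edge is, zeros elsewhere
```

Stacking many such layers, each feeding its feature maps to the next, is what produces the deep convolutional networks the article describes.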
Fei-Fei Li, the director of Stanford’s computer vision lab, was a leader of the ImageNet project, and her research is at the forefront of data-driven advances in computer vision. But the current approach, she said, is limited. “It relies on training data,” Li said, “and so much of what we humans possess as knowledge and context are lacking in this deep learning technology.”
Facebook recently encountered the contextual gap. Its algorithm took down the image, posted by a Norwegian author, of a naked, 9-year-old girl fleeing napalm bombs. The software code saw a violation of the social network’s policy prohibiting child pornography, not an iconic photo of the Vietnam War and human suffering. Facebook later restored the photo.
Or take a fluid scene such as a dinner party. A person carrying a platter will serve food. A woman raising a fork will stab the lettuce on her plate and put it in her mouth. A water glass teetering on the edge of the table is about to fall, spilling its contents. Predicting what happens next and understanding the physics of everyday life are inherent in human visual intelligence, but beyond the reach of current deep learning technology.
At the major annual computer vision conference this summer, there was a flurry of research representing encouraging steps, but not breakthroughs. For example, Ali Farhadi, a computer scientist at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, showed off ImSitu.org, a database of images identified in context, or situation recognition.
As he explained, image recognition provides the nouns of visual intelligence, while situation recognition supplies the verbs. Search for "What do babies do?" and the site retrieves pictures of babies engaged in actions including "sucking", "crawling", "crying" and "giggling": visual verbs.
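The nouns-and-verbs distinction can be made concrete with a toy sketch. Everything below is invented for illustration; it is not the ImSitu schema or data, which is far richer. Each image is annotated with a verb describing what is happening, and a query collects the verbs attached to a given agent.

```python
# A toy, hand-labelled index in the spirit of situation recognition.
# File names, agents and verbs are all made-up examples.
annotations = [
    {"image": "img_001.jpg", "agent": "baby", "verb": "crawling"},
    {"image": "img_002.jpg", "agent": "baby", "verb": "giggling"},
    {"image": "img_003.jpg", "agent": "dog",  "verb": "barking"},
    {"image": "img_004.jpg", "agent": "baby", "verb": "crying"},
]

def what_does(agent):
    """Answer a 'What do babies do?'-style query by collecting the
    verbs attached to images of the given agent."""
    return sorted({a["verb"] for a in annotations if a["agent"] == agent})

print(what_does("baby"))  # → ['crawling', 'crying', 'giggling']
```

The hard part, of course, is the step this sketch skips: producing the verb labels from raw pixels rather than from human annotation, which is exactly where the training-data dependence Farhadi mentions comes in.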
Recognising situations enriches computer vision, but the ImSitu project still depends on human-labelled data to train its machine learning algorithms. “And we’re still very, very far from visual intelligence, understanding scenes and actions the way humans do,” Farhadi said.
But for cars that drive themselves safely, several years of continuous improvement — not an AI breakthrough — may well be enough, scientists say. It will take not just steady advances in computer vision, they say, but also more high-definition digital mapping and gains in radar and lidar, which uses laser light to scan across a wider field of vision than radar and in greater detail.
Millions of miles of test driving in varied road and weather conditions, scientists say, should be done before self-driving cars are sold. Google has been testing its vehicles for years, and Uber is beginning a pilot programme in Pittsburgh.
Carmakers around the world are developing self-driving cars, and 2021 seems to be the consensus year for commercial introduction. The German auto company BMW recently announced plans to deliver cars by 2021, in a partnership with Intel and Mobileye, an Israeli computer vision company. The cars would allow hands-free driving first in urban centres, and everywhere a few years later. And last month, Ford announced its driverless car plan with a similar timetable.
“We’re not there yet, but the pace of improvement is getting us there,” said Gary Bradski, a computer vision scientist who has worked on self-driving vehicles. “We don’t have to wait years and years until some semblance of intelligence arrives, before we have self-driving cars that are safer than human drivers and save thousands of lives.”
–New York Times News Service