Instead, your visual functioning, your innate knowledge, and your lived experience all work together, instantly weighing and combining uncountable (and unknowable) factors to bring reality into focus and help you move safely through the world. It’s one thing to tell the difference between a cat and a cantaloupe, but think of how the variables—and the risks associated with getting even one wrong—multiply when it comes to the thousands of images bombarding the eyes of a driver behind the wheel of a moving car. And now imagine that the eye isn’t an eye, but a camera mounted on a shiny new car sold by Tesla, whose CEO, Elon Musk, has promised that full, driverless autonomy will be “complete this year.” The most ancient fossil of an animal eye is over 500 million years old; the first Model T Ford rolled off the assembly line about 100 years ago. Although we take it for granted, driving is the most dangerous thing we do on a daily basis. Is it reasonable to think that cameras and computers can competently replace us? Is it possible that somehow they might even be better than us?
“The problem of making autonomous vehicles safe is just orders of magnitude more difficult than anybody realizes,” says Bruno Olshausen, Professor of Vision Science, Optometry, and Neuroscience at UC Berkeley. “There are certainly some things you can do, like alert the driver when they’re beginning to fall asleep or weave out of their lane. And I think that’s great and we should definitely do all of those. But the idea that you’re going to solve all of the problems? It’s like playing whack-a-mole; as soon as you solve one problem another pops up.”
The main difficulty isn’t the mechanical act of driving; the technology for training robots to react predictably to well-defined stimuli is already robust. Instead, the problem is more basic: how does a self-driving car “see,” and how does it process the images it takes in? Most of these vehicles use multiple mounted cameras that feed images to a computer. Those images are processed and used to determine variables like where the road is, what traffic conditions are, or whether there are pedestrians nearby. Autonomous cars also often employ LIDAR, a laser-based sensor that measures the distances to surrounding objects. “But the really hard part,” says Olshausen, “is the part of intelligence or perception that we call common sense reasoning.” If you’re driving fast on an elevated freeway in heavy traffic and you see a ball bounce out of the back of a pickup truck, your best bet is probably to drive right over it. But that same ball, taking that same bounce, but this time originating from a playground in a residential neighborhood, is likely to be followed by a child running into the street. “It’s a very difficult problem to solve,” says Olshausen. “Because if you have an autonomous agent moving about in a three-dimensional world, you’re going to encounter all kinds of unpredictable things.”
Proponents of autonomous cars would doubtless suggest more data, more artificial-intelligence “deep learning,” better algorithms to deal with more and more situations. But Stella Yu, a member of Berkeley Optometry’s Vision Science Group and Director of the Vision Group at the International Computer Science Institute, suggests that big data has big limitations. “The long tail is actually the norm,” she says, referring to the staggering array of different visual stimuli that drivers encounter every day. Yu mentions ImageNet, a commonly used data set with millions of photos, each painstakingly labeled and annotated by hand. Once a system is trained on these images, it will do a great job of responding appropriately to what it already knows. “But if you take a camera and walk around the Berkeley campus,” she says, “the images you record won’t match the curated images in the data set.”
Every stock image has innumerable variations like tricks of light and shadow, people lost in crowds, pedestrians partially obscured by trees. And what is a pedestrian anyhow? It could be anything from a dad with a stroller waiting patiently in a crosswalk to a woman in a billowing, sequined dress leaping for a frisbee, to an old man with two canes tripping over the curb and stumbling into the road. ImageNet and YouTube will train your computer to identify a cat, but can we count on the datasets to correctly capture all of human experience? “There are all sorts of low-probability events that the world is going to throw at you every day,” says Olshausen. “And if your performance is at a 99 percent level, then that means that one out of a hundred times, the system is going to make these really stupid errors.” Multiplied across billions of cars and hundreds of billions of miles driven, minuscule probabilities compound into significant risks. Yu echoes Olshausen’s concerns: “If your system only performs on things it has already seen, then it just memorizes the answers. But the purpose of learning is so that you can generalize to new instances. The setup itself is set up for failure.”
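Olshausen’s point about error rates can be made concrete with a back-of-envelope calculation. The sketch below assumes, purely for illustration, that a car makes one independent perception decision per second; the function and its numbers are hypothetical, not measurements from any real system.

```python
# Back-of-envelope sketch: how a per-decision accuracy compounds into
# the chance of at least one mistake over many independent decisions.
# The decision rate and accuracy figures are illustrative assumptions.

def p_at_least_one_error(per_decision_accuracy: float, n_decisions: int) -> float:
    """Probability of at least one mistake across n independent decisions."""
    return 1.0 - per_decision_accuracy ** n_decisions

# At one decision per second, a 20-minute drive is ~1,200 decisions.
print(p_at_least_one_error(0.99, 1200))      # 99% accuracy: an error is near-certain
print(p_at_least_one_error(0.999999, 1200))  # "six nines": errors become rare
```

Even at 99.9999 percent accuracy per decision, the sketch shows why “billions of cars and hundreds of billions of miles” keep pushing the aggregate risk back up.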
Humans didn’t evolve to be good at driving, of course, so there’s an obvious appeal to the idea that computers and robots might be better than us at this newfangled activity. “We don’t have sensors that directly measure distance. We only have two eyes,” Yu says. “But what really is an image to a computer vision system? It’s just an array of numbers. When we look at something, we immediately process it through a complicated system, from the retina to V1, V2 [the primary centers in the cerebral cortex responsible for processing visual stimuli]. But a computer just has a camera that acquires a certain intensity of light, pixel by pixel, and then it has to make sense of the data.” Even just deciding how many pixels and what height-to-width aspect ratio to use—a pedestrian is an entirely different shape than a speed bump—is a problem that humans solve instantly but computers struggle with.
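Yu’s “just an array of numbers” can be seen directly in code. This minimal sketch (using NumPy; the toy dimensions are arbitrary) shows that what a vision system receives is nothing but intensity values arranged in a grid:

```python
# Minimal illustration of Yu's point: to a computer vision system,
# an "image" is just an array of numbers -- one light intensity per
# pixel per color channel. Dimensions here are an arbitrary toy example.
import numpy as np

height, width = 4, 6                                   # a toy 4x6 image
image = np.zeros((height, width, 3), dtype=np.uint8)   # all-black RGB image
image[1, 2] = [255, 0, 0]                              # one pure-red pixel

print(image.shape)   # rows x columns x color channels
print(image[1, 2])   # the raw numbers the system must make sense of
```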
A computer vision system will take, say, a fifty-by-fifty pixel patch, determine where the image inside that box lies on the red-green-blue spectrum, do a numerical analysis of the light intensity, and then ask itself whether what it’s currently processing resembles something it has already seen in training. Some of the patches in the stored data set will match with what the computer knows as “person” and some will match with “crosswalk” or “low-hanging tree branch.” “But,” Yu says, “the thing is that you don’t know that this pixel patch is where the important object is, so how does the computer even propose a particular object area to start the classification work?” Yu calls this a “windowed classification” system, which is quite different from the way human vision works. “We don’t do piecemeal analysis,” she says. Unlike a computer, the human visual system doesn’t analyze each part of the visual field in isolation from every other part. Instead, our ability to recognize a part depends on our ability to recognize the whole. The computerized system works well if it is looking for a pre-defined target of interest, but breaks down with the introduction of confounding variables.
“Neural networks can play computer chess to beat a human,” says Olshausen, “but if you actually gave the computer a chess board and it had to move pieces, it would make catastrophic mistakes just because you’d have the sun shining on the pieces from a weird angle, and the robot would do something like misestimate its position and jam the pieces on the wrong side of the board.” Olshausen mentions the difficulty of incorporating something as simple as rain into the worldview of a computer vision system. “Imagine if some drops get on the camera lens,” he says. “So now it’s going to create a corrupted image and have trouble perceiving the world through all that interference. Intuitively understanding the physics of water is something that we don’t ever think about, and yet our perception has evolved to deal with it effortlessly. And then you start thinking about how the world looks once the wind starts blowing…”
While Olshausen believes that truly safe and effective autonomous cars are still a long way off, he does have suggestions for how engineers and programmers can make improvements. “The people who are working on these systems have a lot to gain by studying human perception and biological vision,” he says. While trying to mimic biology can be carried to extremes—early airplane designs called for wings that flapped—nature has developed strategies that could prove crucial.
One innovation that Olshausen sees as important involves making the mounted cameras less static. “Photoreceptors in the eye are not uniformly sampled by the retinal ganglion cells,” he says. “They’re very densely sampled in the fovea, the central one degree of vision, and then they become more coarsely sampled as you go out towards the periphery.” Standard cameras, by contrast, have their photoreceptors uniformly distributed on a rectangular grid. Processing visual information is resource-intensive for humans or machines, a problem animal eyes have solved by packing the high-resolution capabilities into the center and then moving the eye itself towards the object of interest. “This strategy is much more important than people acknowledge,” Olshausen says. He suggests that engineers adopt a strategy in which the mounted cameras “move in different directions very rapidly like our eyes, and then have a sampling strategy where they focus the high resolution on the sensor and then fall off at the periphery. Together with the eye movements, this would give you a virtual high-res sensor with wide field of view.”
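The foveated sampling strategy Olshausen describes can be sketched numerically. The linear falloff function below is an illustrative assumption (real retinal sampling falloff is more complicated), chosen only to show the qualitative idea: dense sampling at the center of gaze, progressively coarser toward the periphery.

```python
# Sketch of foveated sampling: resolution is highest at the center of
# gaze (the fovea) and falls off toward the periphery. The linear
# falloff and its slope are illustrative assumptions, not retinal data.

def sample_spacing(eccentricity_deg: float,
                   foveal_spacing: float = 1.0,
                   slope: float = 0.3) -> float:
    """Spacing between samples grows with distance from the gaze center,
    so relative resolution (1 / spacing) falls off in the periphery."""
    return foveal_spacing * (1.0 + slope * eccentricity_deg)

for ecc in [0, 1, 5, 20, 45]:
    spacing = sample_spacing(ecc)
    print(f"{ecc:2d} deg from gaze center: relative resolution {1.0 / spacing:.2f}")
```

Paired with rapid camera movements toward objects of interest, this kind of falloff is what would give Olshausen’s “virtual high-res sensor with wide field of view” without the cost of sampling the whole field at full resolution.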
Both Olshausen and Yu think that truly autonomous vehicles are many years further away than the titans of industry would have consumers believe. When asked if she’d be comfortable riding in one today, Yu answered, “It depends on what kind of road I’m on and what the consequences of a mistake are. There are some situations like warehouse driving or a safari in a park that could work. If the highest speed is five miles an hour and there’s a fixed route with no crowds around and the lighting is good, then maybe yes.” Which, given that those constraints describe almost zero real-world driving situations, sounds more like a “no.”
Olshausen is more pointed in his criticism. “It would take a breakthrough to make this happen. It could happen next year or it could take twenty years. We are going to have to solve the larger problem of how to give computers general purpose reasoning, common sense reasoning about the world.” While the science doesn’t yet warrant unbridled enthusiasm about autonomous vehicles, the CEOs and salespeople continue their bullish predictions. “You have a risk of lulling people into a false sense of security where they’ll put their car on autopilot and watch a movie,” Olshausen continues. “That’s just really dangerous and somebody has to be out there saying ‘Don’t you dare, it doesn’t work that well yet.’ And that should really be these executives, who are authority figures that people trust.”
For as much as we know about how human vision works, and for as good as we already are at making robots do amazing things, combining those two knowledge bases is jaw-droppingly complex. The sheer number of calculations and interpretations that it takes to do something as simple as drive around the block is a problem of daunting size. Despite years of studying the most complicated aspects of vision, Olshausen has never lost his sense of wonderment at how it all works. “I’m not saying it’s magic,” he says. “But I think we have to admit that there are some things in science that, despite how much we’ve worked on them, they still remain basically at the level of miracles.”