Computer Vision Expert Serge Belongie Talks AI, Alexa’s Post-Voice Future and Warby Parker AR Glasses
At first glance, Serge Belongie might look like just another chill guy in a t-shirt, who might even be in a rock band, but his unpretentious appearance belies a deep and earned expertise in computer vision. Before his current academic gig as a professor of Computer Science at Cornell University and Cornell Tech, where he teaches courses in machine learning, Belongie taught at the University of California, San Diego and co-founded several startups in the computer vision space. For the last four years, Belongie has also been helping organize the LDV Vision Summit, an annual two-day gathering for technologists, investors, academics, entrepreneurs, and anyone else interested in the computer vision space. We spoke with Belongie right after the latest installment of the LDV Vision Summit to get his take on the computer vision space past, present, and future.
What was the idea behind this year’s LDV Vision Summit versus previous years?
Serge Belongie: The first time we did this in 2014, the deep learning tsunami hadn’t fully hit yet. Certainly within academia it was clear that deep learning was going to have a huge impact as early as 2012, because there was a seminal paper that came out near the end of that year. But it took a few years for deep learning to attain critical mass within the industry. So the idea of having a summit that was focused on visual technologies in particular was a bit of a risk, because it seemed like it might be too narrow. A lot of things that seem ubiquitous now just didn’t exist back then.
What kinds of things?
SB: This was before the boom in self-driving cars and also before TensorFlow, which seems like it’s been around forever and is already taken for granted, but in fact it’s still quite new. What’s emerged is this incredible commoditization of so many parts of computer vision and machine learning whose infrastructure used to require teams of Ph.D.s to develop, and now it’s possible for individual hackers or developers on small startup teams to bring that kind of functionality to any kind of product.
I think there’s a general sense that there’s an unstoppable force involving computer vision and machine learning that’s going to propel us towards wearables, very lightweight powerful devices that kind of disappear into the fabric. There’s this commoditization and a sense that it’s almost no longer necessary to talk about computer vision and machine learning for their own sake.
AI is almost an overused term today, and can mean anything from basic voice recognition, often called narrow or specific AI, to Scarlett Johansson’s character in Her, which is referred to as general AI. What is AI to you?
SB: I still think of general AI as science fiction. The AI that mostly works today and for the next five to ten years is actually just increasingly powerful forms of automation. Most people talk about AI today in the business world in terms of utility, like collision avoidance and lane-keeping assist, which is essentially cruise control on steroids.
I don’t actually use the term AI to describe what my group at Cornell Tech does, but, as it turns out, my university uses the term to describe what I do. That’s true at a lot of universities because the general public knows what AI is, but they don’t know what computer vision and machine learning are.
So where are we with AI now?
SB: I’m going to share an analogy I heard from a dean at Cornell Tech. The extent to which scientists have captured AI is at the reptilian level, and he’s referring to the reptilian brain, which enables tasks that a reptile can accomplish within 200 milliseconds, like hearing a particular sound and reacting or seeing a bug and taking an action. Reptilian cognition is not particularly complex, but it’s still something that’s very powerful visually and presumably auditorily as well.
We’re at the point where if you can get the training data, a computer vision system can beat you every single time on specific tasks like classifying mold in an apartment, grading or scoring cancer in tissue samples, classifying birds, you name it. Lay down all these potential problems, collect the training data, label it, run it through deep learning, and that thing will just beat everyone except the top experts, and sometimes even then. But again, this is all very focused in terms of “what is this thing in this image.” As long as you key up that problem, the machine is going to beat you.
That all sounds fairly practical and non-lethal. Are the doomsday warnings about AI’s future overstating the case?
SB: Yes. Although a lot of famous and successful people out there are talking about apocalyptic scenarios like Terminator and Skynet, the people in the trenches (scientists who are working in computer vision, natural language processing, and all sorts of applied AI) think, at least privately, that this kind of sci-fi talk is silly. I mean, we all take the ethics aspect of it very seriously, but to look at a computer that wins at Go, something like the AlphaGo system, and then extrapolate that humanity’s future is on the line, is just a really big leap.
How big of an issue is getting good data?
SB: That’s definitely a huge problem: How do startups survive in the shadow of Google, Facebook, and these other huge companies that have so much data, verticals, and computation power? For one, there definitely seems to be room for many niche applications where certain stakeholders don’t want their content in the cloud, like semiconductor manufacturing or pharmaceuticals, where the microscopic images are of great value. Companies might not want to give competitors the chance to see those images that give away a new drug that’s been discovered or a chip that’s been created.
And that certainly fits in with this push towards edge computing, where we are moving, again, away from all these cloud services like Amazon, Google, and Facebook. Even so, the vast bulk of the general public is just heaping data on Google, Facebook, Amazon, and so on, so it’s just incredibly difficult to compete. Now, a lot of these startups are actually aiming to get acquired, because these companies are so hungry for talent.
Next-gen interfaces like the voice-enabled Alexa, which has just added a camera, are growing in popularity. Will vision- and voice-enabled interfaces of the future replace today’s touchscreen and keyboard?
SB: I don’t think so. There’s something about touchscreens and keyboards, when lumped together as input devices, that’s hard to let go of. They’re still just so precise. Sometimes you just want a very specific thing: an emoji, let’s say. Obviously, with things like Google Lens and Amazon Alexa, there’s definitely a push toward making search visual or auditory. I definitely think there’s going to be another fully fleshed-out input medium for searching, but all these are ultimately pointing back to the idea of multi-modal or multimedia search. If you know exactly what you want, you should just type it in and not monkey around at all. Other times, you have no idea what something is called, but you can take a picture of it.
How do computer vision and natural language processing compare with their human equivalents?
SB: Audio processing and computer vision are cousin fields in some ways. They both attach to machine learning in roughly the same way, and they both work really well when in a controlled environment. Computer vision works best when dealing with a well-composed, well-lit photo, with one main object taking up the field of view. Same with Alexa’s audio recognition, which works best if there isn’t too much background noise.
But when you start to create background noise, with multiple people talking and saying um and ah, or you have low-light images, these are still challenges for computer systems, whereas human visual and auditory systems are amazingly good at dealing with them. The big advances we’re seeing on the machine end are still in relatively controlled settings, but we will gradually move up from there.
Advancing in this area is important for semi- and fully-autonomous vehicles, where it’s all an uncontrolled environment. There’s no way to predict when that pedestrian is going to pop out in front of you, whereas the typical domestic setting of a Google Home or an Amazon Alexa is still considerably more controlled.
How do you feel about AR and VR in the computer vision space?
SB: I don’t track VR very much, but I do stay on top of augmented and mixed reality, and it’s exploding right now. And this despite Google Glass’s hiccup and Magic Leap’s creating some confusion by making some big promises and putting out these dazzling concept videos in which it’s unclear what’s real and what’s fake. But putting aside these anomalies, the new HoloLens is amazing. And Meta has a good set of goggles, and there’s just a really mature technology stack for simultaneous localization and mapping. Independently of HoloLens, Microsoft has some cool new optic technology for near-field imaging.
Some cool stuff is happening with augmented reality and mixed reality, but I think it’s still in that brick-sized cell phone phase. And you need to have quite an imagination to just jump ahead and say that this AR or MR wearable will eventually be really lightweight, and the battery will last a long time, and all that. Right now, the hardware is still very clunky, but I’m convinced it’s going to be huge.
AR and MR are going to be the way that we actually experience all this computer vision and machine learning functionality. Right now, of necessity, we do it via this phone in our pockets that we have to pull out, but it’s all clearly moving to AR and MR.
So what needs to happen for that to be a reality? Google Glass contact lenses?
SB: I don’t think we have to go all the way down to contact lenses, but imagine a pair of Warby Parker glasses, or those spectacles from Snap—something that has that kind of form factor would be enough. So once we get to a point where it’s comparable in weight to a regular pair of glasses, but has the image projection capability with wide field of view, a battery that lasts all day, and it’s low cost—that’s when we’ll have arrived.
Illustrations by Cam Floyd