Last month, I talked about the coming era of responsive media: media that dynamically changes its content to fit the consumer’s attention, engagement and situation. Some of the technologies needed to make that happen are coming out now (VR goggles, emotion-sensing algorithms and multi-camera systems), but there’s one fly in the ointment: responsive media could demand so much new bandwidth that the internet’s firehose bursts.
That’s a pretty dramatic claim so let me explain where this alarm is coming from. The aim of virtual reality is to generate a digital experience at the full fidelity of human perception – to recreate every photon your eyes would see, every small vibration your ears would hear and eventually other details like touch, smell and temperature. That’s a tall order because humans can process an equivalent of nearly 5.2 gigabits per second of sound and light – 200x what the US Federal Communications Commission predicts to be the future requirement for broadband networks (25 Mbps). Allow me to explain where an enormous number like 5.2 Gbps comes from (or skip to the next section if you trust me).
The fovea of our eyes can resolve detail as fine as 0.3 arc-minutes (1/200 of a degree), meaning we can differentiate approximately 200 distinct dots per degree of our foveal field of view. Converting that to “pixels” on a screen depends on the size of the pixel and the distance between our eyes and the screen, but let’s use 200 pixels per degree as a reasonable estimate. Without moving the head, our eyes can mechanically shift across a field of view of at least 150 degrees horizontally and 120 degrees vertically within an instant (less than 100 milliseconds). That’s 30,000 horizontal by 24,000 vertical pixels, so the ultimate display would need a region of 720 million pixels for full coverage: even though foveal vision covers a narrower field of view at any moment, your eyes can saccade across that full space within an instant. Now add head and body rotation, covering 360 horizontal and 180 vertical degrees, for a total of more than 2.5 billion (giga) pixels.
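If you want to check the arithmetic, here is a minimal sketch of the pixel-count estimate, using the article’s figures (0.3 arc-minute acuity, the stated fields of view):

```python
# Back-of-envelope pixel counts for a display matching foveal acuity.
# Figures are the article's illustrative estimates, not a vision-science model.
ACUITY_ARCMIN = 0.3                      # finest detail the fovea can resolve
PX_PER_DEG = round(60 / ACUITY_ARCMIN)   # 60 arc-minutes per degree -> 200 px/deg

def pixels(h_deg, v_deg):
    """Pixels needed to cover a field of view at 200 pixels per degree."""
    return (h_deg * PX_PER_DEG) * (v_deg * PX_PER_DEG)

eye_fov = pixels(150, 120)       # eye movement only (no head rotation)
full_sphere = pixels(360, 180)   # add head and body rotation

print(f"{eye_fov / 1e6:.0f} million pixels")      # -> 720 million pixels
print(f"{full_sphere / 1e9:.2f} billion pixels")  # -> 2.59 billion pixels
```

The 720 million figure is what the eyes alone can reach in under 100 milliseconds; the 2.59 billion figure adds full head and body rotation.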
Those numbers are just for a static image, but the world does not sit still. For motion video, multiple static images are flashed in sequence, typically at 24 images per second for film and 30 for television. But the human eye does not operate like a camera: our eyes receive light continuously, not discretely, and while 24 to 30 frames per second is adequate for moderate-speed motion in movies and TV shows, the human eye can perceive much faster motion – some estimates run as high as 150 frames per second. For sports, games, science and other high-speed immersive experiences, video rates of 60 or even 120 fps are needed to avoid “motion blur” and disorientation.
Woah! Bringing this down to the bottom line: assuming no head or body rotation, the eye can receive 720 million pixels for each of 2 eyes, at 36 bits per pixel for full color and at 60 frames per second. That’s 3.1 trillion (tera) bits per second! Today’s compression standards can reduce that by a factor of 300, and even if future compression reaches a factor of 600 (the goal of future video standards), we would still need 5.2 gigabits per second of network throughput; maybe more.
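The same bottom line, worked out in code from the assumptions above:

```python
# Raw and compressed throughput for full-fidelity stereo video,
# using the article's assumptions (no head or body rotation).
PIXELS_PER_EYE = 720_000_000
EYES = 2
BITS_PER_PIXEL = 36   # full color depth
FPS = 60

raw_bps = PIXELS_PER_EYE * EYES * BITS_PER_PIXEL * FPS
print(f"raw: {raw_bps / 1e12:.2f} Tbps")   # -> raw: 3.11 Tbps

# Today's ~300:1 compression vs. the 600:1 goal of future standards
for ratio in (300, 600):
    print(f"{ratio}:1 -> {raw_bps / ratio / 1e9:.2f} Gbps")
# 300:1 -> 10.37 Gbps
# 600:1 -> 5.18 Gbps
```

Even at a future 600:1 compression ratio, that leaves roughly 5.2 Gbps of required throughput, some 200x the FCC’s 25 Mbps broadband benchmark.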
“Hold on, Chicken Little,” you are saying. “5.2 Gbps is just a theoretical upper limit. No cameras or displays today can deliver 30K resolution – we’re only seeing 8K cameras coming out this year.”
But there’s the rub. Cinematographers and even consumers are no longer using just a single camera to create experiences. I mentioned several 360-degree panorama camera systems last month – these rigs generally consist of 16 or more outward-facing cameras. At today’s 4K resolution, 30 frames per second and 24 bits per pixel, and using a 300:1 compression ratio, such a rig generates roughly 300 megabits per second of imagery. That’s more than 10x the typical requirement for a high-quality 4K movie experience.
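The rig arithmetic works out like this, assuming 16 cameras at 4K UHD (3840 x 2160), the article’s stated frame rate, color depth and compression ratio:

```python
# Aggregate bitrate of a 16-camera 4K panorama rig (article's figures).
CAMERAS = 16
WIDTH, HEIGHT = 3840, 2160   # 4K UHD per camera
BITS_PER_PIXEL = 24
FPS = 30
COMPRESSION = 300            # 300:1 compression ratio

raw_bps = CAMERAS * WIDTH * HEIGHT * BITS_PER_PIXEL * FPS
compressed_bps = raw_bps / COMPRESSION
print(f"{compressed_bps / 1e6:.0f} Mbps")  # ~320 Mbps, in line with the
                                           # article's roughly 300 Mbps
```

Against a typical 25 Mbps budget for a high-quality 4K stream, that is indeed more than a 10x increase.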
There’s more. While panorama camera rigs face outward, there’s another kind of system where the cameras face inward to capture live events. This year’s Super Bowl, for example, was covered by 70 cameras, 36 of which were devoted to a new kind of capture system called Eye Vision 360 which allows the action to be frozen while the audience pans around the center of the action to see the details of the play at the line! Did the ball cross the goal line? Was the player fouled? See for yourself by spinning the view to any angle you choose.
Previously, these kinds of effects were only possible in video games or Matrix-style movies because they require heavy computation to stitch the multiple views together. That heavy-duty post-processing has kept such effects out of live action in the past, but not for much longer. New network architectures can move the post-processing off of high-end workstations at the cameras, so that processors at the edge of the network and the client display devices themselves (VR goggles, smart TVs, tablets and phones) can perform the advanced image processing that stitches the camera feeds into dramatic effects.
“Okay, Chicken Little,” you continue, “but even when there are 16 or 36 or 70 or more cameras capturing a scene, audiences today only see one view at a time. So the bandwidth requirements would not total up to the sum of all the cameras in the rig. Plus, dynamic caching and multicast should be able to reduce the load, by delivering content to thousands from a single feed.”
Ah, but Responsive Media changes that. VR goggles and Tango-embedded tablets will let audiences dynamically select their individual point of view. That means two things: 1) the feed from all of the cameras needs to be available in an instant and 2) conventional multicast won’t be possible when each audience member selects an individualized viewpoint. So, there it is. The network will be overwhelmed. The sky is falling.
Finally, video/audio compression technologies are being developed that can achieve much higher compression ratios for these new multi-camera systems. Whereas conventional video compression gets most of its bang from the similarity between one frame and the next (called temporal redundancy), VR compression adds to that the similarity among images from different cameras, such as the same sky, trees and large buildings seen from multiple angles (called spatial redundancy), along with intelligent slicing and tiling techniques, using less bandwidth to deliver full 360-degree video experiences.
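To make the spatial-redundancy idea concrete, here is a toy sketch (hypothetical one-dimensional “frames”; real codecs work on motion and disparity-compensated blocks, not raw subtraction): encode one camera’s view as differences from a neighboring camera’s view, and where the views overlap, the deltas are mostly zeros that collapse to almost nothing.

```python
# Toy illustration of inter-view (spatial) redundancy between two cameras
# that see nearly the same scene. Not a real codec; purely for intuition.

def delta_encode(base, frame):
    """Encode `frame` as element-wise differences from `base`."""
    return [b - a for a, b in zip(base, frame)]

def run_length(values):
    """Collapse runs of equal values into [value, count] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

view_a = [10, 10, 10, 50, 50, 10, 10, 10]   # camera A's samples
view_b = [10, 10, 10, 55, 55, 10, 10, 10]   # camera B: nearly identical

deltas = delta_encode(view_a, view_b)       # [0, 0, 0, 5, 5, 0, 0, 0]
print(run_length(deltas))                   # [[0, 3], [5, 2], [0, 3]]
```

Eight samples shrink to three run-length pairs; across dozens of largely overlapping camera feeds, that kind of redundancy is where multi-view compression finds its gains.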
All of these advances may still not be enough to reach the theoretical limits of a fully immersive experience, but they should carry us through for several years to come. Ultimately, we will likely need a fundamentally new network architecture – one that can dynamically multicast and cache multiple video feeds close to consumers and then perform advanced video processing within the network to construct individualized views. Responsive media will require this kind of information-centric networking, which avoids the choke points of conventional host-centric networking and allows the network itself to optimally distribute bits only where they are needed, so we can fully enjoy the future of individualized responsive media.