Face alignment, noise reduction, and AI-powered superresolution
Nvidia has announced a new videoconferencing platform for developers named Nvidia Maxine that it claims can fix some of the most common problems in video calls.
Maxine will process calls in the cloud using Nvidia’s GPUs and boost call quality in a number of ways with the help of artificial intelligence. Using AI, Maxine can realign callers’ faces and gazes so that they’re always looking directly at their camera, reduce the bandwidth requirement for video “down to one-tenth of the requirements of the H.264 streaming video compression standard” by only transmitting “key facial points,” and upscale the resolution of videos. Other features available in Maxine include face re-lighting, real-time translation and transcription, and animated avatars.
Nvidia Maxine is a platform that provides developers with a suite of GPU-accelerated AI conferencing software to enhance video quality. The company describes Maxine as a “cloud-native” solution that makes it possible for service providers to bring AI effects, including gaze correction, super-resolution, noise cancellation, face relighting, and more, to end users.
Developers, software partners, and service providers can apply for early access to Maxine starting this week.
Videoconferencing has exploded during the pandemic, as it offers a way to communicate while minimizing infection risk. In late April, Zoom surpassed 300 million daily meeting participants, up from 200 million earlier in the month and 10 million in December. According to a report from App Annie, business conferencing apps topped 62 million downloads during the week of March 14-21.
Nvidia says Maxine “dramatically” reduces how much bandwidth is required for videoconferencing calls. Instead of streaming an entire screen of pixels, the platform analyzes the facial points of each person on a call and then algorithmically reanimates the face in the video on the other side. This ostensibly makes it possible to stream with far less data flowing back and forth across the internet. Nvidia claims developers using Maxine can reduce bandwidth to one-tenth the requirements of the H.264 standard.
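To give a rough sense of scale for the one-tenth claim, the following Python sketch works out how many bytes per frame such a budget would allow and compares that with a raw keypoint payload. Every number in it (the 1,000 kbps H.264 call, 30 frames per second, 68 keypoints, 2 bytes per coordinate) is an illustrative assumption, not a figure published by Nvidia, and the function names are hypothetical.

```python
def h264_tenth_budget_kbps(h264_kbps):
    """Bitrate budget if a stream may use one-tenth of an H.264 stream's bitrate."""
    return h264_kbps / 10.0


def per_frame_budget_bytes(budget_kbps, fps):
    """Convert a bitrate budget in kilobits per second to bytes available per frame."""
    return budget_kbps * 1000 / 8 / fps


def keypoint_payload_bytes(num_keypoints, bytes_per_coord, coords_per_point=2):
    """Uncompressed size of one frame's keypoints (x and y per point by default)."""
    return num_keypoints * coords_per_point * bytes_per_coord


# Assumed values for illustration only; none of these come from Nvidia.
H264_KBPS = 1000       # hypothetical bitrate of an ordinary H.264 video call
FPS = 30               # hypothetical frame rate
NUM_KEYPOINTS = 68     # hypothetical number of facial keypoints
BYTES_PER_COORD = 2    # hypothetical 16-bit coordinates

budget_kbps = h264_tenth_budget_kbps(H264_KBPS)
frame_budget = per_frame_budget_bytes(budget_kbps, FPS)
payload = keypoint_payload_bytes(NUM_KEYPOINTS, BYTES_PER_COORD)

print(f"one-tenth budget: {budget_kbps:.0f} kbps, {frame_budget:.1f} bytes per frame")
print(f"keypoint payload: {payload} bytes per frame")
```

Under these assumptions a frame's keypoints fit inside the per-frame budget with room to spare. A real system would also need to send at least an occasional reference image of the caller, which this sketch does not model.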
To achieve this improved compression, Nvidia says it’s employing AI models called generative adversarial networks (GANs). GANs — two-part models consisting of a generator that creates samples and a discriminator that attempts to differentiate between these samples and real-world samples — have demonstrated impressive feats of media synthesis. Top-performing GANs can create realistic portraits of people who don’t exist, for instance, or snapshots of fictional apartment buildings.
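The generator-versus-discriminator setup can be sketched through the loss each part tries to minimize. The Python below uses the standard textbook GAN losses (with the common non-saturating generator loss) on plain lists of discriminator outputs; it is a minimal illustration of the general technique, not Nvidia's implementation, and the function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Loss for the discriminator.

    d_real: discriminator outputs (probability of "real") on real samples.
    d_fake: discriminator outputs on generated samples.
    The loss is low when real samples score near 1 and generated ones near 0.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled,
    that is, when generated samples score near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# A discriminator that cannot tell the two apart outputs 0.5 for everything.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
print(generator_loss([0.5, 0.5]))

# A confident, correct discriminator has a lower loss, and the generator a higher one.
print(discriminator_loss([0.9, 0.9], [0.1, 0.1]))
print(generator_loss([0.1, 0.1]))
```

In training, the two losses are minimized in alternation, so each side's improvement raises the other's loss until generated samples become hard to distinguish from real ones.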
Maxine’s other spotlight feature is face alignment, which enables faces to be automatically adjusted so participants appear to be facing each other during a call. Gaze correction helps simulate eye contact, even if the camera isn’t aligned with the user’s screen. Auto-frame allows the video feed to follow a speaker as they move away from the screen. And developers can let call participants choose their own avatars, with animations automatically driven by their voice and tone.
Not all of these features are new, of course. Video compression and real-time transcription are common enough, and Microsoft and Apple have introduced gaze alignment in the Surface Pro X and FaceTime to ensure people keep eye contact during video calls (though Nvidia’s face-alignment feature looks like a much more extreme version of this).
But Nvidia is no doubt hoping its clout in cloud computing and its impressive AI R&D work will help it rise above its competitors. The real test, though, will be to see if any established videoconferencing companies actually adopt Nvidia’s technology. Maxine is not a consumer platform but a toolkit for third-party firms to improve their own software. So far, though, Nvidia has only announced one partnership, with communications firm Avaya, which will be using select features of Maxine. All major cloud vendors are offering Maxine as part of their Nvidia GPU cloud services.
In a conference call with reporters, Nvidia’s general manager for media and entertainment, Richard Kerris, described Maxine as a “really exciting and very timely announcement,” and highlighted its AI-powered video compression as a particularly useful feature.
“We’ve all experienced times where bandwidth has been a limitation in our conferencing we’re doing on a daily basis these days,” said Kerris. “If we apply AI to this problem we can reconstruct the difference scenes on both ends and only transmit what needs to transmit, and thereby reducing that bandwidth significantly.”
Nvidia says its compression feature uses an AI method known as generative adversarial networks or GANs to partially reconstruct callers’ faces in the cloud. This is the same technique used in many deepfakes. “Instead of streaming the entire screen of pixels, the AI software analyzes the key facial points of each person on a call and then intelligently re-animates the face in the video on the other side,” said the company in a blog post. “This makes it possible to stream video with far less data flowing back and forth across the internet.”
As ever with these early announcements, we’ll need to see more of this tech in action and wait for any partnership deals Nvidia makes before we know how much of an effect this will have on everyday video calls. But Nvidia’s announcement shows how the future of videoconferencing will be more artificial than ever before, with AI used to straighten your gaze and even reconstruct your face, all in the name of saving bandwidth.
Maxine also leverages Nvidia’s Jarvis SDK for conversational features, including AI language models for speech recognition, language understanding, and speech generation. Developers can use them to build videoconferencing assistants that take notes and answer questions in humanlike voices. Moreover, the toolsets can power translations and transcriptions to help participants understand what’s being discussed.
Avaya is an early adopter of the Maxine platform. Through the company’s Avaya Spaces videoconferencing app, customers will benefit from background noise removal, virtual green screen backgrounds, and features enabling presenters to be overlaid on top of presentation content, as well as live transcriptions that can recognize and differentiate voices.
According to Nvidia, the AI models powering Maxine’s infrastructure, audio, and visual components were developed through hundreds of thousands of training hours on Nvidia DGX systems. This robustness and Maxine’s backend, which takes advantage of microservices running in Kubernetes container clusters on GPUs, enable the platform to support up to hundreds of thousands of users even while running AI features simultaneously.
Source: The Verge and VentureBeat
Full Story: https://www.theverge.com/2020/10/5/21502003/nvidia-ai-videoconferencing-maxine-platform-face-gaze-alignment-gans-compression-resolution