Technology has improved steadily since the smartphone entered the global market. Smartphone cameras are now smart enough to let users focus on a single object among many. Soon we may also see technology that picks out individual voices in a crowd while suppressing other ambient sounds.
A new artificial intelligence (AI) system developed by Google researchers makes this possible. This matters because computers are not good at focusing on a particular speaker in noisy surroundings. They usually get confused when more than one voice is speaking at once, and they give unexpected results.
"However, automatic speech separation -- separating an audio signal into its individual speech sources -- remains a significant challenge for computers," Inbar Mosseri and Oran Lang, software engineers at Google Research, wrote in a blog post this week.
"In this work, we are able to computationally produce videos in which speech of specific people is enhanced while all other sounds are suppressed," Mosseri and Lang said.
According to a news report, Google researchers have demonstrated a deep learning audio-visual model for separating a single speech signal from a mixture of sounds.
The technique works on ordinary video recordings with a single audio track. The user only needs to select the face of the person in the video whose voice they want to hear. The technology then automatically focuses on that person's voice based on context.
Google researchers believe this could benefit a wide range of applications, from speech enhancement and recognition in videos to video conferencing and improved hearing aids, especially in situations where several people are speaking at once.
"A unique aspect of our technique is in combining both the auditory and visual signals of an input video to separate the speech," the researchers added.
"Intuitively, movements of a person's mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person," they explained.
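The intuition in that last quote can be illustrated with a toy sketch. This is not Google's model, which is a deep neural network working on a single mixed audio track. The sketch below assumes, purely for illustration, that we already have a per-frame "mouth opening" measurement for the selected face and per-frame loudness for two candidate voices, and it simply checks which voice rises and falls together with the mouth. All names and numbers here (`mouth_opening`, `track_loudness`, `voice_a`, `voice_b`) are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no timing information
    return cov / sqrt(var_x * var_y)


def match_speaker(mouth_opening, track_loudness):
    """Return the track whose loudness best follows the mouth movement,
    together with the correlation score of every track."""
    scores = {
        name: pearson(mouth_opening, loudness)
        for name, loudness in track_loudness.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical per-frame measurements (made-up values, not real data).
mouth_opening = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.8, 0.1]
track_loudness = {
    "voice_a": [0.2, 0.9, 1.0, 0.3, 0.2, 0.8, 0.9, 0.2],  # loud when the mouth is open
    "voice_b": [0.9, 0.2, 0.1, 0.8, 0.9, 0.3, 0.2, 0.9],  # loud when the mouth is closed
}

best, scores = match_speaker(mouth_opening, track_loudness)
print(best)  # voice_a
```

In this made-up example the first voice tracks the mouth movement closely and the second moves against it, so the first is picked. The real system faces a much harder problem, since the voices arrive mixed together in one audio track and have to be separated as well as matched.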