With two simultaneous speakers, accuracy exceeded 90 percent, which is sufficient for commercial applications, compared with 51 percent using conventional technology. The new technology can handle speakers in various combinations of spoken language and gender. These results were obtained under ideal recording conditions, including low ambient noise and speakers talking at roughly similar volume.

The Deep Clustering technology uses Mitsubishi Electric's proprietary deep-learning method to learn how to encode the signal components of speech recorded from multiple people, so that the components belonging to each individual speaker can be easily distinguished by their encodings. To accomplish this, the encodings are optimized so that signal components belonging to the same speaker have similar encodings, while those belonging to different speakers have dissimilar encodings. The learned encoding transformation is applied to the input speech, and the encodings belonging to each speaker are identified using a clustering algorithm, which groups data points according to their similarity. Each person's speech is then reconstructed by resynthesizing that speaker's separated signal components.
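The encode, cluster, and reconstruct steps described above can be illustrated with a minimal Python sketch. It is not Mitsubishi Electric's implementation: the learned encoder is replaced by synthetic embeddings, the clustering step is assumed to be plain k-means, and the objective in `affinity_loss` is only one possible way to express "similar encodings for the same speaker, dissimilar for different speakers". All function names, array shapes, and parameters are hypothetical.

```python
import numpy as np

def affinity_loss(embeddings, onehot):
    """Illustrative objective (an assumption, not the proprietary method):
    small when bins of the same speaker have similar embeddings and bins
    of different speakers have dissimilar ones."""
    diff = embeddings @ embeddings.T - onehot @ onehot.T
    return float((diff ** 2).sum())

def kmeans(points, k, iters=20):
    """Plain k-means with farthest-point initialization; one label per row."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        nearest = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[nearest.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Squared distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def separate(mixture, embeddings, n_speakers):
    """Cluster the per-bin embeddings and mask the mixture once per cluster.

    mixture:    (T, F) magnitudes of the mixed signal's time-frequency bins
    embeddings: (T * F, D) one encoding per bin (would come from the network)
    """
    labels = kmeans(embeddings, n_speakers).reshape(mixture.shape)
    # A real system would convert each masked spectrogram back to audio;
    # here the masked magnitudes stand in for the resynthesized speech.
    sources = [mixture * (labels == s) for s in range(n_speakers)]
    return sources, labels

# Synthetic stand-in data: two speakers, each bin dominated by one of them.
rng = np.random.default_rng(0)
T, F, D = 20, 16, 4
owner = rng.integers(0, 2, size=(T, F))           # dominant speaker per bin
mixture = rng.random((T, F)) + 0.1                # fake mixture magnitudes
onehot = np.eye(2)[owner.ravel()]                 # ideal speaker indicators
directions = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0]])
# Pretend the encoder maps each speaker's bins near its own direction.
embeddings = directions[owner.ravel()] + 0.05 * rng.standard_normal((T * F, D))

sources, labels = separate(mixture, embeddings, n_speakers=2)
print("objective on synthetic embeddings:", affinity_loss(embeddings, onehot))
```

Because each time-frequency bin is assigned to exactly one cluster, the separated outputs in this sketch add back up to the original mixture. Cluster labels are arbitrary, so matching them to specific speakers requires a comparison up to permutation.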

Mitsubishi Electric Corporation published this content on 24 May 2017 and is solely responsible for the information contained herein.

Original document: http://www.mitsubishielectric.com/news/2017/0524-e.html