T. Mariotte talks to ADASP about his research on overlapped speech detection in far-field audio scenes.

Abstract

Overlapped speech naturally occurs in multi-party scenarios when several speakers are simultaneously active. It may lead to severe performance degradation in automatic speech processing systems (e.g. speaker diarization, speech recognition…). Overlapped speech detection (OSD) aims at detecting the time segments in which several speakers are simultaneously active. Recently, deep neural networks have been extensively used to solve this task. However, detection performance tends to deteriorate in the far-field scenario. Distant speech, however, offers practical benefits, as it avoids requiring speakers to wear individual microphones. In this context, the acoustic scene can be recorded by several microphones (e.g. microphone arrays) to obtain spatially-aware data. Spatial information can be exploited to improve distant speech processing. In the OSD context, we suppose that spatially-aware systems may perform better, especially when active speakers are far from each other.
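As a toy illustration (not from the talk), frame-level OSD reference labels can be derived from per-speaker activity by flagging frames where at least two speakers are active at once; the speaker activity matrix below is invented for the example:

```python
# Hypothetical per-speaker voice-activity labels, one row per speaker,
# one column per frame (1 = speaker active in that frame).
activity = [
    [1, 1, 1, 0, 0, 0],  # speaker A
    [0, 0, 1, 1, 1, 0],  # speaker B
    [0, 0, 0, 0, 1, 1],  # speaker C
]

# Number of simultaneously active speakers per frame.
counts = [sum(col) for col in zip(*activity)]

# OSD label: 1 where two or more speakers overlap.
osd = [int(c >= 2) for c in counts]
# osd -> [0, 0, 1, 0, 1, 0]
```

Real systems predict these frame labels directly from the audio; this snippet only shows how the ground-truth overlap targets relate to individual speaker activity.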

The work presented in this seminar focuses on channel combination approaches for OSD under distant speech conditions. Recently, it has been shown that self-attention can be applied to weight channels in the Short-Time Fourier Transform (STFT) domain before combining them. Two self-attentive methods are implemented as feature extractors for OSD. The former is from the literature and the latter is its extension to the complex domain. The complex model exploits the full STFT information, i.e. both magnitude and phase. These approaches are compared to close-talk OSD and several distant OSD baselines (single distant microphone, beamforming…). Results show that self-attentive methods reach performance closer to the close-talk scenario than basic approaches (e.g. single distant microphone). Although the complex model is less efficient than the original one, we also show that it offers better interpretability of the combination weights learned for OSD.
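A minimal sketch of the general idea of self-attentive channel weighting, assuming magnitude-STFT features, random projection weights, and invented tensor shapes (the actual models presented in the talk differ and operate on the complex STFT):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multichannel magnitude STFT: (channels, freq bins, frames).
C, F, T = 4, 257, 100
X = np.abs(rng.standard_normal((C, F, T)))

# Per-channel summary feature: time-averaged spectrum, shape (C, F).
feat = X.mean(axis=2)

# Random query/key projections (placeholders for learned parameters).
d = 16
Wq = rng.standard_normal((F, d)) / np.sqrt(F)
Wk = rng.standard_normal((F, d)) / np.sqrt(F)

# Scaled dot-product self-attention across channels: (C, C) scores.
Q, K = feat @ Wq, feat @ Wk
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)  # softmax over rows

# Collapse the attention map to one weight per channel, then combine
# the channels into a single weighted STFT representation (F, T).
w = A.mean(axis=0)
Y = np.tensordot(w, X, axes=1)
```

In a trained system the projections are learned jointly with the OSD classifier, so the weights can emphasize the microphones that best capture the overlapping speakers.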

Bio

Hello, I’m Théo Mariotte, a second-year PhD candidate in machine-learning-based speech processing. I study at Le Mans Université. My work takes place in two laboratories: the acoustics lab (LAUM) and the computer science lab (LIUM). My PhD thesis aims at developing automatic speech processing methods that exploit information recorded by microphone arrays. My recent work has focused on distant Overlapped Speech Detection, a key preprocessing task for speaker diarization. Before starting the PhD, I graduated from the ENSIM engineering school in acoustics. During this program, I worked as an apprentice at the CSTB on urban audio source synthesis. I have also been involved in several short research projects on acoustic source localization and binaural audio scene rendering in rooms.

More on the speaker’s website.