A multimodal dynamical variational autoencoder for audiovisual speech representation learning
By Simon Leglaive

Simon Leglaive, tenured Assistant Professor at CentraleSupélec, will give a talk on a multimodal dynamical VAE (DVAE) for audiovisual speech representation learning:


High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE), which is equipped with both a generative model and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended in many ways, including to handle data that are multimodal or dynamical (i.e., sequential). In this talk, we will present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The probabilistic graphical model is structured to dissociate the latent dynamical factors that are shared between the modalities (e.g., the speaker’s lip movements) from those that are specific to each modality (e.g., the speaker’s head or eye movements). A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence (e.g., the speaker’s identity or global emotional state). The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two steps. In the first step, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second step consists of learning the MDVAE, whose inputs are the intermediate representations of the VQ-VAE before quantization. The disentanglement of static from dynamical information, and of modality-specific from shared information, occurs during this second training stage.
Preliminary experimental results will be presented, showing which characteristics of the audiovisual speech data are encoded within the different latent spaces, how the proposed multimodal model can be beneficial compared with a unimodal one, and how the learned representation can be leveraged to perform downstream tasks.
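
To make the two-step training more concrete, here is a minimal sketch of the quantization step at the core of a generic VQ-VAE: each encoder output vector is replaced by its nearest codebook entry. This is not the speaker's implementation; the function name `quantize`, the toy codebook, and the frame values are hypothetical and chosen only for illustration. In the abstract's terms, the pre-quantization vectors (`frames` below) are the kind of intermediate representation that the MDVAE takes as input in the second step.

```python
from typing import List, Sequence, Tuple


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each frame vector to its nearest codebook entry.

    Nearest is measured with the squared Euclidean distance, as in a
    standard VQ-VAE lookup. Returns the chosen indices and the
    corresponding quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in frames:
        # Squared distance from this frame to every codebook vector.
        dists = [
            sum((zi - ci) ** 2 for zi, ci in zip(z, code))
            for code in codebook
        ]
        # Index of the closest codebook vector.
        k = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized


# Hypothetical toy values: a codebook with 3 entries of dimension 2,
# and 3 encoder output vectors (one per time frame) before quantization.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(frames, codebook)
print(indices)
print(quantized)
```

The lookup treats every frame independently, which matches the abstract's remark that the first step involves no temporal modeling; modeling the dynamics is left to the second stage.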


Simon Leglaive is a tenured Assistant Professor at CentraleSupélec and a member of the AIMAC team of the IETR laboratory (UMR CNRS 6164) in Rennes, France. He received the Engineering degree from Télécom Paris (Paris, France) and the M.Sc. degree in acoustics, signal processing and computer science applied to music (ATIAM) from Sorbonne University (Paris, France) in 2014. He received the Ph.D. degree from Télécom Paris in the field of audio signal processing in 2017. He was then a post-doctoral researcher at Inria Grenoble Rhône-Alpes, in the Perception team. His research interests lie at the crossroads of audio signal processing, probabilistic graphical modeling and machine/deep learning.

More on the speaker’s website.