Kilian Schulze-Forster's PhD defense
By K. Schulze-Forster

Kilian Schulze-Forster will soon defend his PhD thesis, entitled “Informed Audio Source Separation with Deep Learning in Limited Data Settings”.

The defense will take place on Thursday, December 9, 2021, at 2pm at Télécom Paris.

It is also possible to attend via Zoom using the following meeting details:

Meeting ID: 954 4708 5102, Code: 992210

The jury is composed as follows:

  • Mr. Emmanuel Vincent, Directeur de recherche, Inria Nancy - Grand Est (President)

  • Mr. Xavier Serra, Professeur, Universitat Pompeu Fabra (Reviewer)

  • Mr. Laurent Girin, Professeur, Grenoble-INP, Institut Polytechnique de Grenoble (Reviewer)

  • Ms. Hélène-Camille Crayencour, Chargée de recherche, CNRS (Examiner)

  • Mr. Roland Badeau, Professeur, Télécom Paris (Thesis director)

  • Mr. Gaël Richard, Professeur, Télécom Paris (Thesis co-director)

  • Mr. Clément S. J. Doire, Senior research scientist, Sonos Inc. (Invited guest)

The presentation will be in English.


Audio source separation is the task of estimating the individual signals of several sound sources when only their mixture can be observed. State-of-the-art performance for musical mixtures is achieved by deep neural networks (DNNs) trained in a supervised way. These networks require large and diverse datasets of mixtures along with the target source signals in isolation. However, such datasets are difficult and costly to obtain because music recordings are subject to copyright restrictions and isolated instrument recordings may not always exist.

In this dissertation, we explore the use of prior knowledge in deep-learning-based source separation in order to overcome data limitations.

First, we focus on a supervised setting with only a small amount of available training data. We investigate to what extent singing voice separation can be improved when it is informed by lyrics transcripts. To this end, a novel deep learning model for informed source separation is proposed. It aligns text and audio during the separation using a novel monotonic attention mechanism. The lyrics alignment performance is competitive with state-of-the-art methods while using less training data. We find that exploiting aligned phonemes can improve singing voice separation, but precise alignments and accurate transcripts are required.

Finally, we consider a scenario where only mixtures but no isolated source signals are available for training. We propose a novel unsupervised deep learning approach to source separation that exploits information about the sources’ fundamental frequencies (F0). The method integrates domain knowledge in the form of parametric source models into the DNN. Experimental evaluation shows that the proposed method outperforms F0-informed learning-free methods based on non-negative matrix factorization as well as an F0-informed supervised deep learning baseline. Moreover, the proposed method is extremely data-efficient. It makes powerful deep-learning-based source separation usable in domains where labeled training data is expensive or non-existent.