Laure Prétet's PhD defense
By S. Zaiem

PhD candidate Laure Prétet will soon defend her PhD thesis, entitled “Metric Learning for Video to Music Recommendation”.

The defense will take place on Monday, January 24th, 2022 at 9:30 at Télécom Paris (Amphi 7, 19 Place Marguerite Perey, 91120 Palaiseau, France).

19 place Marguerite Perey, F-91120 Palaiseau. Room: Amphi 4

The jury is composed as follows:

  • Mr. Frédéric Bevilacqua, Research Director, IRCAM-CNRS-Sorbonne Université, UMR STMS (Reviewer, President)
  • Ms. Jenny Benois-Pineau, Professor, Université de Bordeaux (Reviewer)
  • Ms. Estefania Cano, Chief scientist, AudiosourceRe (Examiner)
  • Mr. Guillaume Gravier, Research Director, CNRS (Examiner)
  • Mr. Stéphane Lathuilière, Associate Professor, Télécom Paris, LTCI (Examiner)
  • Mr. Alexander Schindler, Doctor, Austrian Institute of Technology (Examiner)
  • Mr. Geoffroy Peeters, Professor, Télécom Paris, LTCI (PhD supervisor)
  • Mr. Gaël Richard, Professor, Télécom Paris, LTCI (PhD co-supervisor)

The presentation will be in English.


Music enhances moving images and makes it possible to communicate emotion or narrative tension efficiently, thanks to cultural codes shared by filmmakers and viewers. Successful communication requires not only a choice of tracks matching the video’s mood and content, but also temporal synchronization of the main audio and visual events. This is the goal of the music supervision industry, which traditionally carries out the task manually.

In this dissertation, we study the automation of tasks related to music supervision. The music supervision problem generally does not have a unique solution, as it includes external constraints such as the client’s identity or budget. It is thus relevant to proceed by recommendation. As the number of available music videos is constantly growing, it makes sense to use data-driven tools. More precisely, we use the metric learning paradigm to learn relevant projections of multimodal (video and music) data.
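As a rough illustration of what metric learning on multimodal data can look like, the sketch below projects a video descriptor and two music descriptors into a shared space and scores them with a triplet loss. This is only one common formulation, chosen here as an assumption: the announcement does not state which loss the thesis uses. The feature vectors, the projection matrices `W_video` and `W_music`, and the margin are hypothetical toy values.

```python
import math

def project(matrix, vector):
    """Apply a linear projection (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once the positive is closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical toy features: a 3-d video descriptor and 2-d music descriptors.
video = [1.0, 0.0, 2.0]
matching_music = [1.0, 2.0]
other_music = [-1.0, 0.0]

# Hypothetical projections into a shared 2-d space (these would be learned in practice).
W_video = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
W_music = [[1.0, 0.0], [0.0, 1.0]]

v = project(W_video, video)
m_pos = project(W_music, matching_music)
m_neg = project(W_music, other_music)

# The loss is small when the matching track lies closer to the video than the other track.
loss = triplet_loss(v, m_pos, m_neg)
print(loss)
```

In a real system the projections would be neural networks trained to minimize this kind of loss over many video and music pairs, so that nearby points in the shared space correspond to good matches.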

First, we address the music similarity problem; music similarity is used to broaden the results of a music search. We implement an efficient content-based imitation of a tag-based similarity metric. To do so, we present a method to train a convolutional neural network from ranked lists. Then, we focus on direct, content-based music recommendation for video. We adapt a simple self-supervised system and demonstrate a way to improve its performance by using pre-trained audio features and learning their aggregation. We then carry out a qualitative and quantitative analysis of official music videos to better understand their temporal organization. Results show that official music videos are carefully edited in order to align audio and video events, and that the level of synchronization depends on the music and video genres. With this insight, we propose the first recommendation system designed specifically for music supervision: Seg-VM-Net, which uses both content and structure to match music and video.