New internship position with ADASP: Deep Learning for Multimodal Music-Language Models
By A. Quelennec
Our group is hiring an M2 intern on the topic “Deep Learning for Multimodal Music-Language Models”.
Important information
- Dates: March to September
- Duration: 6 months
- Place of work: Palaiseau (Paris outskirts), France
- Supervisors: Aurian Quélennec, Slim Essid
- Wage: 4.35 € / hour (net)
- Contact: aurian.quelennec@telecom-paris.fr
Problem statement and context
In the domain of music information retrieval (MIR), large models trained on large corpora of data, in either a supervised or an unsupervised fashion, can solve numerous tasks with performance comparable to that of specialized models [1,2,3]. Leveraging the effectiveness of large language models (LLMs) and powerful audio encoders, new methods that combine audio with text descriptions have emerged and shown promising results [4,5,6]. Much ongoing work aims to find the right audio representation and to generate text captions for the audio in order to improve these models' results. In addition, the training task of such models is a crucial design choice that can be explored to improve performance. The goal of this internship is to use an in-house state-of-the-art music encoder and to build a new multimodal model around music recordings and text captions.
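To give a flavor of the topic, the contrastive audio-text alignment popularized by CLAP [4] can be sketched in a few lines of PyTorch. This is only a minimal illustration under assumed interfaces: the encoders, embedding dimensions, and `joint_dim` below are placeholders, not the group's actual in-house encoder or training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAudioText(nn.Module):
    """CLAP-style alignment sketch: project audio and text embeddings into a
    shared space and pull matching (audio, caption) pairs together."""

    def __init__(self, audio_encoder, text_encoder,
                 audio_dim, text_dim, joint_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder  # placeholder for a music encoder
        self.text_encoder = text_encoder    # placeholder for a text encoder
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # learnable temperature, as in CLIP/CLAP
        self.log_temp = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, audio, text):
        # both encoders are assumed to return one (batch, dim) vector per item
        a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text)), dim=-1)
        logits = self.log_temp.exp() * (a @ t.T)  # pairwise cosine similarities
        targets = torch.arange(a.size(0), device=a.device)
        # symmetric InfoNCE: the i-th audio should match the i-th caption
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))
```

Minimizing this loss over batches of paired recordings and captions yields a joint embedding space usable for retrieval or zero-shot tagging; instruction-following approaches such as [5,6] instead feed audio embeddings directly into an LLM.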
References
[1] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Efficient training of audio transformers with Patchout,” in Proc. Interspeech, Incheon, Sept. 2022, pp. 2753–2757.
[2] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. ICML, vol. 202, Honolulu, July 2023, pp. 5178–5193.
[3] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Masked Modeling Duo: Learning representations by encouraging both networks to model the input,” in Proc. ICASSP, June 2023, pp. 1–5.
[4] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “CLAP: Learning audio concepts from natural language supervision,” in Proc. ICASSP, June 2023.
[5] J. Gardner, S. Durand, D. Stoller, and R. Bittner, “LLark: A multimodal instruction-following language model for music,” in Proc. ICML, 2024.
[6] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in Proc. ICLR, 2024.
Candidate profile
- Master's degree in Computer Science, Mathematics, or Signal Processing
- Proficient Python skills (PyTorch is a must; NumPy, scikit-learn)
- Interest in research, especially in audio and music
- Good self-organization and ability to work autonomously
- Language: fluent in English, both written and spoken