New internship position with ADASP: Deep Learning for Multimodal Music-Language Models
By A. Quelennec
Our group is hiring an M2 intern on the topic “Deep Learning for Multimodal Music-Language Models”.
Important information
- Dates: March to September
- Duration: 6 months
- Place of work: Palaiseau (Paris outskirts), France
- Supervisors: Aurian Quélennec, Slim Essid
- Wage: 4.35 € / hour (net)
- Contact: aurian.quelennec@telecom-paris.fr
Problem statement and context
In the domain of music information retrieval (MIR), large models trained on large corpora of data, in either a supervised or an unsupervised fashion, can solve numerous tasks with performance comparable to that of specialized models [1,2,3]. Leveraging the effectiveness of large language models (LLMs) and powerful audio encoders, new methods that combine audio with text descriptions have emerged and shown promising results [4,5,6]. Much ongoing work aims to find the right audio representation and to generate text captions for the audio in order to improve these models' results. In addition, the training task of such models is a crucial design choice that can be explored to improve performance. The goal of this internship is to use an in-house state-of-the-art music encoder and to build a new multimodal model around music recordings and text captions.
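To give a flavor of the topic, the contrastive audio-text alignment popularized by CLAP [4] can be sketched in a few lines of PyTorch. This is only a minimal illustration under assumed interfaces: the encoders, embedding dimensions, and `joint_dim` below are placeholders, not the group's actual in-house encoder or training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAudioText(nn.Module):
    """CLAP-style alignment sketch: project audio and text embeddings into a
    shared space and pull matching (audio, caption) pairs together."""

    def __init__(self, audio_encoder, text_encoder,
                 audio_dim, text_dim, joint_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder  # placeholder for a music encoder
        self.text_encoder = text_encoder    # placeholder for a text encoder
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # learnable temperature, as in CLIP/CLAP
        self.log_temp = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, audio, text):
        # both encoders are assumed to return one (batch, dim) vector per item
        a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text)), dim=-1)
        logits = self.log_temp.exp() * (a @ t.T)  # pairwise cosine similarities
        targets = torch.arange(a.size(0), device=a.device)
        # symmetric InfoNCE: the i-th audio should match the i-th caption
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))
```

Minimizing this loss over batches of paired recordings and captions yields a joint embedding space usable for retrieval or zero-shot tagging; instruction-following approaches such as [5,6] instead feed audio embeddings directly into an LLM.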
References
[1] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Efficient training of audio transformers with Patchout,” in Proc. Interspeech, Incheon, Sept. 2022, pp. 2753–2757.
[2] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. ICML, vol. 202, Honolulu, July 2023, pp. 5178–5193.
[3] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Masked Modeling Duo: Learning representations by encouraging both networks to model the input,” in Proc. ICASSP, June 2023, pp. 1–5.
[4] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “CLAP: Learning audio concepts from natural language supervision,” in Proc. ICASSP, June 2023.
[5] J. Gardner, S. Durand, D. Stoller, and R. Bittner, “LLark: A multimodal instruction-following language model for music,” in Proc. ICML, 2024.
[6] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in Proc. ICLR, 2024.
Candidate profile
- Master's degree in Computer Science, Mathematics, or Signal Processing
- Proficient Python skills (PyTorch is a must; NumPy, scikit-learn)
- Interest in research, especially in audio and music
- Good self-organization and ability to work autonomously
- Language: fluent in English, both written and spoken