Audio-based inter-modal translation and cross-modal representation alignment
By K. Drossos

K. Drossos talks to ADASP about automated audio captioning


Everyday soundscapes contain a variety of information, providing valuable cues that help machines understand their surroundings. Recent advances in machine listening include methods that allow for contextual awareness by recognizing the acoustic scene (e.g. “office”, “urban area”, “airport”) and by detecting and classifying sound events (e.g. “dog barking”, “people speaking”). However, there is more information in what we can hear and perceive than simply “busy street” or “car passing by”. For example, we can understand concepts (“muffled sound”), physical properties of objects and environments (“the sound of a big car”, “people talking in a big room”), and high-level knowledge (“a clock rings three times”). A promising way of creating learning algorithms and computational methods that can exploit this information is to leverage multi-modal information, such as audio and text. This talk focuses on two emerging research directions in audio-based multi-modal processing that allow for learning such information. The first is automated audio captioning (AAC), a newly introduced inter-modal translation task in which a method takes a general audio signal as input and outputs a textual description of its contents, e.g. “A man moves from the basement to upstairs, moving a heavy metal object with him”. The latest advances in AAC will be presented, including the process of creating the well-curated AAC dataset Clotho and a recent AAC method that yields state-of-the-art results on this dataset. The second research direction focuses on the cross-modal alignment of learned representations between audio and text. In this setting, a method aims to reduce the distance between the feature vectors learned from the two modalities. The learned audio feature vectors can then be used in various downstream tasks, e.g. audio classification and music genre recognition. Recent methods following this second direction achieve results competitive with the state of the art while using less data.


Dr. Konstantinos Drossos was born in Thessaloniki, Greece. He holds a BEng in Sound Technology (first-class honours), a BSc in Informatics, an MSc in Sound & Vibration Research, and a PhD (first-class honours) in the field of machine listening. He is currently a senior researcher at the Audio Research Group (ARG), Finland. He has been a postdoctoral researcher at ARG and a postdoctoral fellow at the Montreal Institute for Learning Algorithms, Canada, and at the Music Technology Group, Spain. He has authored or co-authored over 50 research papers, has pioneered the field of audio captioning, has organized scientific challenges, special sessions, and workshops at international conferences, serves as a reviewer for top journals and conferences, and has served as the Chairman of the Finnish IEEE Joint Chapter of SP&CAS. His research interests include audio captioning, domain adaptation, multimodal translation, source separation, detection and classification of acoustic scenes and events, and machine listening.