Self-supervised learning: a review and new perspectives
By V. Lostanlen

Vincent Lostanlen reviews for ADASP some current findings from NYU’s Music and Audio Research Lab on self-supervised machine listening.

Abstract

Short version (72 words): In this talk, I will review some current findings from NYU’s Music and Audio Research Lab on self-supervised machine listening. I will offer ideas for potential pretext tasks, with downstream applications in beat tracking, key estimation, and tempo estimation. Lastly, I will discuss some open challenges facing the widespread adoption of self-supervision in machine learning. In particular, the formulation of a bona fide pretext task deserves careful consideration.

Long version (554 words): The past decade has witnessed a breakthrough of deep learning in machine listening, with noteworthy applications in acoustic event detection, music transcription, and structure retrieval. This progress was made possible by several concurrent factors, among them: falling costs of sensing hardware, efficient GPU routines for differentiable programming, and better gradient-based algorithms for stochastic optimization.

However, not all machine listening tasks are created equal in terms of human annotation cost. On the one hand, owing to its industrial impact on online streaming platforms, audio tagging enjoys vast amounts of user-generated content. On the other hand, more expert tasks such as beat tracking or key estimation are more annotation-intensive, and more expensive to crowdsource. Such a discrepancy in the availability of training data leads to a two-tier situation, wherein a few tasks with big annotated corpora (English speech, pop music, and so forth) quickly reach human-level performance, whereas many “niche” applications (e.g., eco-acoustics, non-Western music) make little progress, if any.

It follows that a large number of use cases in machine listening remain out of reach of the traditional deep learning paradigm; that is, training from scratch, end to end (from raw waveform to one-hot label space), under full human supervision. In this context, it is worth pointing out that past efforts to transfer knowledge (from some richly labeled, general-purpose audio tagging task to some scarcely labeled niche task) have yielded underwhelming results. Nevertheless, one emerging paradigm in pattern recognition, known as self-supervised learning, offers a promising alternative to cross-collection transfer.

The key idea behind self-supervised learning is to formulate a so-called “pretext task,” in which the deep learning system is trained to predict computer-generated, rather than human-annotated, labels. This task serves as a principled initialization mechanism before solving the task of interest (the so-called “downstream task”) on the same dataset. The main advantage of self-supervised learning lies in its scalability: because the pretext task requires no human intervention, it may be applied to massive amounts of unlabeled data, unlike the downstream task.
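To make this two-stage recipe concrete, here is a minimal sketch in PyTorch. Everything in it (the encoder architecture, the input shape, the four-way pretext label, the 24-class downstream task) is an illustrative assumption of mine, not a description of the systems discussed in the talk.

```python
import torch
import torch.nn as nn

# Shared encoder: maps an audio feature patch (e.g. a log-mel spectrogram,
# shape 1 x 128 x 128 here, purely for illustration) to a 16-d embedding.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Stage 1, pretext task: predict a computer-generated label, e.g. which of
# four random transformations was applied to the input. No human annotation.
pretext_head = nn.Linear(16, 4)
params = list(encoder.parameters()) + list(pretext_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
for _ in range(100):  # unlabeled data is cheap, so this stage can run long
    x = torch.randn(8, 1, 128, 128)  # stand-in for unlabeled audio features
    y = torch.randint(0, 4, (8,))    # label generated by the computer
    loss = nn.functional.cross_entropy(pretext_head(encoder(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2, downstream task: reuse the pretrained encoder as initialization
# and fine-tune on scarce human annotations, e.g. 24 major/minor keys.
downstream_head = nn.Linear(16, 24)
params = list(encoder.parameters()) + list(downstream_head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
for _ in range(10):  # annotated data is scarce, so this stage is short
    x = torch.randn(8, 1, 128, 128)
    y = torch.randint(0, 24, (8,))   # stand-in for human-annotated key labels
    loss = nn.functional.cross_entropy(downstream_head(encoder(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is the control flow: the encoder weights learned on computer-generated labels carry over into the second stage, where only a small annotated set is available.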

In this talk, I will present some recent applications of self-supervised learning, both in computer vision and in machine listening. Some pretext tasks that have proven useful include: unshuffling a jigsaw puzzle; regressing the time at which an environmental recording was acquired; and deciding whether a video stream matches a given soundtrack. I will review some current findings from NYU’s Music and Audio Research Lab on self-supervised machine listening. I will offer ideas for potential pretext tasks, with downstream applications in beat tracking, key estimation, and tempo estimation.
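As an illustration of how such computer-generated labels come for free, here is a sketch of the jigsaw pretext task transposed to a spectrogram. The tiling along the time axis and the permutation vocabulary are assumptions I am making for the example, not the protocol of any work presented in the talk.

```python
import itertools
import numpy as np

def make_jigsaw_example(spectrogram, n_tiles=4, rng=None):
    """Cut a spectrogram into n_tiles slices along time, shuffle them, and
    return the shuffled input with its computer-generated label: the index
    of the permutation that was applied. No human annotation is required."""
    rng = rng or np.random.default_rng()
    permutations = list(itertools.permutations(range(n_tiles)))  # 24 if n_tiles=4
    label = int(rng.integers(len(permutations)))
    tiles = np.array_split(spectrogram, n_tiles, axis=1)         # split along time
    shuffled = np.concatenate([tiles[i] for i in permutations[label]], axis=1)
    return shuffled, label

# Usage: a classifier is then trained to predict `label` from `shuffled`.
spec = np.random.rand(128, 256)  # stand-in for a log-mel spectrogram
shuffled, label = make_jigsaw_example(spec)
print(shuffled.shape, label)     # (128, 256) and an integer in [0, 24)
```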

Lastly, I will discuss some open challenges facing the widespread adoption of self-supervision in machine learning. In particular, the formulation of a bona fide pretext task deserves careful consideration. Indeed, random manipulations of natural data may coincide with geometrical invariants: for example, swapping two identical jigsaw pieces leaves the overall scene unchanged, thereby making the pretext task inherently ambiguous. I propose a methodological workaround to adapt, on a per-sample basis, the formulation of the pretext task to the occasional presence of multistable percepts. Despite this caveat, I firmly believe that the self-supervision paradigm has strong potential in machine listening. As such, it deserves further inquiry, from both theoretical and applied standpoints.
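One conceivable instantiation of such a per-sample adaptation (a sketch of my own, not necessarily the workaround proposed in the talk) is to enumerate, for each sample, the permutations that are indistinguishable from the one actually applied, and to train against that equivalence class rather than a single hard label:

```python
import itertools
import numpy as np

def equivalent_permutations(tiles, applied, tol=1e-6):
    """Return every permutation of `tiles` that yields the same shuffled
    input as the permutation `applied` on this particular sample. When two
    tiles are (near-)identical, swapping them is unobservable, so several
    pretext labels become equally valid."""
    target = np.concatenate([tiles[i] for i in applied], axis=1)
    matches = []
    for perm in itertools.permutations(range(len(tiles))):
        candidate = np.concatenate([tiles[i] for i in perm], axis=1)
        if np.allclose(candidate, target, atol=tol):
            matches.append(perm)
    return matches  # e.g. train against a uniform soft target over this set

# Example: tiles 0 and 1 are identical, so swapping them changes nothing.
tile = np.ones((128, 64))
tiles = [tile, tile.copy(), 2 * tile, 3 * tile]
print(len(equivalent_permutations(tiles, (0, 1, 2, 3))))  # prints 2
```

When the returned set has more than one element, the sample is exhibiting exactly the ambiguity described above, and a hard one-hot pretext label would penalize the model for a distinction it cannot possibly make.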