Learning Transformations and Invariances with Artificial Neural Networks
By S. Lattner

Stefan Lattner explains to ADASP the mechanisms of transformation and invariance learning for symbolic music and audio.

Abstract

In this talk, I will explain the mechanisms of transformation and invariance learning for symbolic music and audio, and I will describe different models based on this principle. Transformation Learning (TL) provides a novel way of learning musical representations: rather than learning the musical patterns themselves, we learn "rules" defining how a given pattern can be transformed into another. TL was initially proposed for image processing and had not been applied to music before. In this talk, I summarize our experiments in TL for music. The models used throughout our work are based on Gated Autoencoders (GAEs), which learn orthogonal transformations between data pairs. We show that a GAE can learn chromatic transposition, tempo change, and retrograde movement in music, but also more complex musical transformations, like diatonic transposition.

TL also provides a different view on music data and yields features complementary to other music descriptors (e.g., those obtained by autoencoder learning, or hand-crafted features). There are several possible research directions for TL in music: using the transformation features themselves, using transformation-invariant features computed from TL models, and using TL models for music generation. I will particularly focus on DrumNet, a convolutional variant of a Gated Autoencoder, and will show how TL leads to time- and tempo-invariant representations of rhythm.

Importantly, learning transformations and learning invariances are two sides of the same coin, as specific invariances are defined with respect to specific transformations. I will introduce the Complex Autoencoder, a model derived from the Gated Autoencoder, which learns both a transformation-invariant and a transformation-variant feature space. Using transposition- and time-shift-invariant features, we obtain improved performance on audio alignment tasks.
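As a rough illustration of the GAE principle described above, the following is a minimal sketch of a factored gated autoencoder in PyTorch. The layer sizes, names, and the toy circular-shift "transposition" are illustrative assumptions, not the exact architecture discussed in the talk:

```python
import torch
import torch.nn as nn

class GatedAutoencoder(nn.Module):
    """Minimal factored gated autoencoder sketch.

    The mapping code m encodes the transformation relating an
    input pair (x, y), rather than the content of x or y itself.
    All dimensions are illustrative.
    """
    def __init__(self, n_in=128, n_factors=256, n_maps=64):
        super().__init__()
        self.U = nn.Linear(n_in, n_factors, bias=False)    # factors for x
        self.V = nn.Linear(n_in, n_factors, bias=False)    # factors for y
        self.W = nn.Linear(n_factors, n_maps, bias=False)  # mapping layer

    def encode(self, x, y):
        # Multiplicative interaction: the mapping code depends on the
        # *relation* between x and y, not on their absolute content.
        return torch.sigmoid(self.W(self.U(x) * self.V(y)))

    def decode(self, x, m):
        # Apply the inferred transformation m to x to reconstruct y.
        f = (m @ self.W.weight) * self.U(x)  # gated factors
        return f @ self.V.weight             # back to input space

    def forward(self, x, y):
        return self.decode(x, self.encode(x, y))

# Toy usage: y is a circular shift of x, standing in for a transposition.
gae = GatedAutoencoder()
x = torch.randn(8, 128)
y = torch.roll(x, shifts=3, dims=1)
loss = nn.functional.mse_loss(gae(x, y), y)
loss.backward()
```

After training on many such pairs, the mapping code for two pairs related by the same shift should be similar, which is what makes it usable as a transformation feature.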
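The invariance side of the coin can be sketched in a similar spirit: in a complex-valued encoding, an orthogonal transformation such as a transposition or time shift rotates the phase of each coefficient while leaving its magnitude unchanged. Below is a minimal sketch of this magnitude/phase split; the dimensions and the tied-weight decoder are illustrative assumptions, not the published Complex Autoencoder:

```python
import torch
import torch.nn as nn

class ComplexAutoencoder(nn.Module):
    """Sketch: project the input onto learned complex-valued basis
    functions. The magnitude of each coefficient is the
    transformation-invariant part; the phase is the variant part."""
    def __init__(self, n_in=128, n_units=64):
        super().__init__()
        self.W_re = nn.Linear(n_in, n_units, bias=False)  # real part
        self.W_im = nn.Linear(n_in, n_units, bias=False)  # imaginary part

    def encode(self, x):
        re, im = self.W_re(x), self.W_im(x)
        mag = torch.sqrt(re**2 + im**2 + 1e-8)  # transformation-invariant
        phase = torch.atan2(im, re)             # transformation-variant
        return mag, phase

    def decode(self, mag, phase):
        re = mag * torch.cos(phase)
        im = mag * torch.sin(phase)
        # Tied-weight reconstruction back to the input space.
        return re @ self.W_re.weight + im @ self.W_im.weight

# Toy usage: a circular shift stands in for a time shift or transposition.
cae = ComplexAutoencoder()
x = torch.randn(4, 128)
mag, _ = cae.encode(x)
mag_shifted, _ = cae.encode(torch.roll(x, shifts=5, dims=1))
# After training, mag and mag_shifted should be approximately equal,
# while the phases differ; the magnitudes are what the alignment
# experiments mentioned above would use as invariant features.
```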