Paper
Ondřej Cífka, Alexey Ozerov, Umut Şimşekli and Gaël Richard. "Self-Supervised VQ-VAE for One-Shot Music Style Transfer." Accepted to the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
Abstract
Neural style transfer, which allows the artistic style of one image to be applied to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the ‘one-shot’ capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.
Examples
Artificial inputs
The following is a random sample of the synthetic test set with outputs of our model and the two baselines (musaicing and U+L). For U+L (Ulyanov and Lebedev), we include both the tuned version and a version with a higher style weight.
Content input | Style input | Target | VQ-VAE (ours) | Musaicing | U+L (tuned) | U+L (higher style weight)
---|---|---|---|---|---|---
Tinkle Bell | Voice Oohs | Voice Oohs | ||||
Flute | Tremolo Strings | Tremolo Strings | ||||
Church Organ | Banjo | Banjo | ||||
Electric Bass (pick) | Electric Guitar (clean) | Electric Guitar (clean) | ||||
Honky-tonk Piano | Synth Brass 1 | Synth Brass 1 | ||||
Lead 1 (square) | Electric Guitar (jazz) | Electric Guitar (jazz) | ||||
Vibraphone | Alto Sax | Alto Sax | ||||
Acoustic Bass | Harpsichord | Harpsichord | ||||
Xylophone | Banjo | Banjo | ||||
Glockenspiel | Pad 3 (polysynth) | Pad 3 (polysynth) | ||||
‘Good’ outputs
Here, we show a sample of the best outputs of our system according to the LSD metric (outputs below the 5th percentile of LSD).
Content input | Style input | Target | Output | LSD | Timbre | Pitch
---|---|---|---|---|---|---
Acoustic Guitar (nylon) | Orchestral Harp | Orchestral Harp | | 6.5472 | 0.0302 | 0.1604
Honky-tonk Piano | Distortion Guitar | Distortion Guitar | | 6.7790 | 0.0881 | 0.7856
Lead 8 (bass + lead) | Bright Acoustic Piano | Bright Acoustic Piano | | 6.6822 | 0.0746 | 0.4663
Synth Bass 1 | Clavinet | Clavinet | | 6.7523 | 0.0292 | 0.4499
Acoustic Guitar (nylon) | Tango Accordion | Tango Accordion | | 7.1450 | 0.1637 | 0.2526
‘Bad’ outputs
Similarly, here is a sample of the worst outputs (above the 95th percentile) according to the LSD metric.
Content input | Style input | Target | Output | LSD | Timbre | Pitch
---|---|---|---|---|---|---
Acoustic Guitar (steel) | Whistle | Whistle | | 22.7329 | 0.4901 | 0.5881
Pad 3 (polysynth) | Koto | Koto | | 20.8195 | 0.4434 | 0.6820
Orchestral Harp | Harmonica | Harmonica | | 22.0352 | 0.3142 | 0.5058
FX 8 (sci-fi) | Guitar Harmonics | Guitar Harmonics | | 21.1952 | 0.1221 | 0.4858
Glockenspiel | Vibraphone | Vibraphone | | 19.3475 | 0.3258 | 0.6435
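The LSD values in these tables are log-spectral distances between the output and the target. As a rough illustration, here is a minimal NumPy sketch of one common formulation (per-frame RMS difference of log-magnitude spectra, averaged over frames); the STFT parameters are assumptions and need not match those used for the paper's evaluation.

```python
import numpy as np

def log_spectral_distance(ref, est, n_fft=2048, hop=512, eps=1e-8):
    """Log-spectral distance (dB) between two signals of equal length.
    n_fft and hop are illustrative, not the paper's exact settings."""
    def mag_spec(x):
        # Naive Hann-windowed STFT magnitude via sliding frames.
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.array(frames), axis=1))
    S_ref, S_est = mag_spec(ref), mag_spec(est)
    n = min(len(S_ref), len(S_est))
    diff = 20 * np.log10(S_ref[:n] + eps) - 20 * np.log10(S_est[:n] + eps)
    # RMS over frequency bins, then mean over frames.
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))
```

Identical signals give a distance of zero, and larger spectral mismatches give larger values, matching the ‘good’/‘bad’ split above.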
Real inputs
The following are outputs on the ‘Mixing Secrets’ test set, first some cherry-picked ones and then a random sample.
Selection
Content input | Style input | VQ-VAE (ours) | Musaicing | U+L (tuned) | U+L (higher style weight)
---|---|---|---|---|---
ElecGtr2CloseMic3 | Keys Organ Active DI | ||||
Synth | PianoMics1 | ||||
RhodesDI | Acoustic Guitar Lead Ela M 251 | ||||
Bass Amp M82 | Bass bip | ||||
SynthFX1 | SlecGtr3a Close | ||||
NBATG Rhodes DI | SD KEYS DI GRACE | ||||
Dulcimer2 | Strings SectionMic Vln2 | ||||
Random sample
Content input | Style input | VQ-VAE (ours) | Musaicing | U+L (tuned) | U+L (higher style weight)
---|---|---|---|---|---
Mellotron | AC GUITAR 3 CU29 SHADOWHILL R | ||||
Fiddle2 | Violins | ||||
UPRIGHT BASS ELA M 260 Neve 33102 | Taiko | ||||
Guitar 2 | OUTRO ALTO 251E SSL6000E | ||||
BassCloseMic2 | ELE Guitars Ignater M81 | ||||
Bells | Bass Mic 647 | ||||
Additional information
This section contains details omitted from the paper for brevity.
Artificial test set
The artificial test set was created from the Lakh MIDI Dataset using a set of files held out from the training set. The audio was synthesized using the Timbres Of Heaven SoundFont, which was not used for the training set.
We randomly drew 721 content-style input pairs and generated a corresponding ground-truth target for each pair by synthesizing the content input using the instrument (MIDI program) of the style input. To avoid pairing extremely different inputs (e.g. a bass line with a piccolo duet), for which the task would make little sense, we sorted all instrument parts into 4 bins using two median splits, on the average pitch and on the average number of voices (simultaneous notes), and then formed each pair by drawing two examples from the same bin. To obtain a balanced distribution of instruments, we limited the total number of examples per MIDI program to 4.
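The pairing procedure above can be sketched as follows. The `parts` schema (`avg_pitch`, `avg_voices`, `program` keys), and the reading of the per-program cap as counting both content and style uses, are illustrative assumptions rather than the paper's actual implementation.

```python
import random
from collections import defaultdict
from statistics import median

def make_eval_pairs(parts, n_pairs=721, max_per_program=4, seed=0):
    """Bin instrument parts by two median splits, then draw content-style
    pairs from within a bin, capping examples per MIDI program."""
    rng = random.Random(seed)
    pitch_med = median(p['avg_pitch'] for p in parts)
    voices_med = median(p['avg_voices'] for p in parts)
    # Two median splits -> 4 bins of roughly comparable parts.
    bins = defaultdict(list)
    for p in parts:
        bins[(p['avg_pitch'] >= pitch_med, p['avg_voices'] >= voices_med)].append(p)
    bin_keys = list(bins)
    pairs, per_program = [], defaultdict(int)
    attempts = 0
    while len(pairs) < n_pairs and attempts < 100 * n_pairs:
        attempts += 1
        candidates = bins[rng.choice(bin_keys)]
        if len(candidates) < 2:
            continue
        content, style = rng.sample(candidates, 2)
        # Cap per-program usage to balance the instrument distribution
        # (one possible reading of the limit described above).
        uses = defaultdict(int)
        uses[content['program']] += 1
        uses[style['program']] += 1
        if any(per_program[prog] + k > max_per_program
               for prog, k in uses.items()):
            continue
        for prog, k in uses.items():
            per_program[prog] += k
        pairs.append((content, style))
    return pairs
```

Drawing both members of a pair from the same bin keeps the content and style inputs comparable in register and polyphony, so the transfer task stays well-posed.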
‘Real’ test set
The ‘real data’ test set was created from the Mixing Secrets collection. We used filename matching to exclude most drum, vocal and multi-instrument tracks and to balance the distribution of the remaining instruments (dominated by electric guitar and bass). To form the input pairs, we performed the same binning procedure as for the artificial test set, using the multi-pitch MELODIA algorithm to estimate the average pitch and number of voices.
Timbre dissimilarity metric
The metric takes a sequence of MFCC vectors (coefficients 2–13 only) as input and is trained with the triplet loss, using the code from the ISMIR 2020 metric learning tutorial. The training dataset consists of 7381 (anchor, positive, negative) triplets extracted from the Mixing Secrets data such that the anchor and the positive example come from the same file while the negative example comes from a different file. The aim is to make the metric discriminate well between different instruments while remaining largely pitch-independent.
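The triplet setup can be sketched as below. The sampling helper, the `files` schema (file id mapped to a list of MFCC segments), and the margin value are illustrative assumptions; the actual training follows the ISMIR 2020 tutorial code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on batches of embedding vectors;
    the margin value is an assumption."""
    d_ap = np.linalg.norm(anchor - positive, axis=-1)
    d_an = np.linalg.norm(anchor - negative, axis=-1)
    # Pull anchor-positive together, push anchor-negative apart.
    return np.maximum(d_ap - d_an + margin, 0.0).mean()

def sample_triplets(files, n_triplets, seed=0):
    """Draw (anchor, positive, negative) triplets so that anchor and
    positive come from the same file and negative from a different one."""
    rng = np.random.RandomState(seed)
    ids = list(files)
    triplets = []
    for _ in range(n_triplets):
        fa, fn = rng.choice(len(ids), 2, replace=False)
        a, p = rng.choice(len(files[ids[fa]]), 2, replace=False)
        n = rng.choice(len(files[ids[fn]]))
        triplets.append((files[ids[fa]][a], files[ids[fa]][p],
                         files[ids[fn]][n]))
    return triplets
```

Because positives share a file (same instrument, varying pitch content) and negatives come from other files, minimizing this loss encourages exactly the instrument-discriminative, largely pitch-independent behavior described above.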