Self-Supervised VQ-VAE for One-Shot Music Style Transfer

Paper

Ondřej Cífka, Alexey Ozerov, Umut Şimşekli and Gaël Richard. "Self-Supervised VQ-VAE for One-Shot Music Style Transfer." Accepted to the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Abstract

Neural style transfer, which allows the artistic style of one image to be applied to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the ‘one-shot’ capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.

- Examples
  - Artificial inputs
  - Real inputs
- Additional information

Examples

Artificial inputs

The following is a random sample of the synthetic test set with outputs of our model and the two baselines (musaicing and U+L). For U+L (Ulyanov and Lebedev), we include both the tuned version ($\lambda_s=10^{-2.1}\lambda_c$) and a version with a higher style weight ($\lambda_s=10\lambda_c$).

Content input	Style input	Target
Tinkle Bell	Voice Oohs	Voice Oohs
Flute	Tremolo Strings	Tremolo Strings
Church Organ	Banjo	Banjo
Electric Bass (pick)	Electric Guitar (clean)	Electric Guitar (clean)
Honky-tonk Piano	Synth Brass 1	Synth Brass 1
Lead 1 (square)	Electric Guitar (jazz)	Electric Guitar (jazz)
Vibraphone	Alto Sax	Alto Sax
Acoustic Bass	Harpsichord	Harpsichord
Xylophone	Banjo	Banjo
Glockenspiel	Pad 3 (polysynth)	Pad 3 (polysynth)

‘Good’ outputs

Here, we show a sample of the best outputs of our system according to the LSD (log-spectral distance) metric, i.e. those with an LSD below the 5th percentile.

Content input	Style input	Target	LSD_T	Timbre_T	Pitch_T
Acoustic Guitar (nylon)	Orchestral Harp	Orchestral Harp	6.5472	0.0302	0.1604
Honky-tonk Piano	Distortion Guitar	Distortion Guitar	6.7790	0.0881	0.7856
Lead 8 (bass + lead)	Bright Acoustic Piano	Bright Acoustic Piano	6.6822	0.0746	0.4663
Synth Bass 1	Clavinet	Clavinet	6.7523	0.0292	0.4499
Acoustic Guitar (nylon)	Tango Accordion	Tango Accordion	7.1450	0.1637	0.2526

‘Bad’ outputs

Similarly, here is a sample of the worst outputs (above the 95th percentile) according to the LSD metric.

Content input	Style input	Target	LSD_T	Timbre_T	Pitch_T
Acoustic Guitar (steel)	Whistle	Whistle	22.7329	0.4901	0.5881
Pad 3 (polysynth)	Koto	Koto	20.8195	0.4434	0.6820
Orchestral Harp	Harmonica	Harmonica	22.0352	0.3142	0.5058
FX 8 (sci-fi)	Guitar Harmonics	Guitar Harmonics	21.1952	0.1221	0.4858
Glockenspiel	Vibraphone	Vibraphone	19.3475	0.3258	0.6435
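
As a rough illustration of how the ‘good’ and ‘bad’ samples above could be picked, the following sketch thresholds per-example LSD values at the 5th and 95th percentiles with NumPy. The function name `split_by_percentile` and the input values are hypothetical and are not taken from our evaluation code.

```python
import numpy as np

def split_by_percentile(scores, low=5.0, high=95.0):
    """Return indices of examples below the `low` and above the `high` percentile.

    `scores` holds one distance value per example (lower is better).
    """
    scores = np.asarray(scores, dtype=float)
    low_thr, high_thr = np.percentile(scores, [low, high])
    best = np.flatnonzero(scores < low_thr)    # candidates for 'good' outputs
    worst = np.flatnonzero(scores > high_thr)  # candidates for 'bad' outputs
    return best, worst

# Made-up scores 1..100, purely for illustration.
best, worst = split_by_percentile(np.arange(1, 101))
```

The examples shown in the tables would then be a random draw from `best` and `worst`, respectively.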

Real inputs

The following are outputs on the ‘Mixing Secrets’ test set, first some cherry-picked ones and then a random sample.

Selection

Content input	Style input	VQ-VAE (ours)	Musaicing	U+L ($\lambda_s=10^{-2.1}\lambda_c$)	U+L ($\lambda_s=10\lambda_c$)
ElecGtr2CloseMic3	Keys Organ Active DI
Synth	PianoMics1
RhodesDI	Acoustic Guitar Lead Ela M 251
Bass Amp M82	Bass bip
SynthFX1	SlecGtr3a Close
NBATG Rhodes DI	SD KEYS DI GRACE
Dulcimer2	Strings SectionMic Vln2

Random sample

Content input	Style input	VQ-VAE (ours)	Musaicing	U+L ($\lambda_s=10^{-2.1}\lambda_c$)	U+L ($\lambda_s=10\lambda_c$)
Mellotron	AC GUITAR 3 CU29 SHADOWHILL R
Fiddle2	Violins
UPRIGHT BASS ELA M 260 Neve 33102	Taiko
Guitar 2	OUTRO ALTO 251E SSL6000E
BassCloseMic2	ELE Guitars Ignater M81
Bells	Bass Mic 647

Additional information

This section contains details omitted from the paper for brevity.

Artificial test set

The artificial test set was created from the Lakh MIDI Dataset using a set of files held out from the training set. The audio was synthesized using the Timbres Of Heaven SoundFont, which was not used for the training set.

We randomly drew 721 content-style input pairs and generated a corresponding ground-truth target for each pair by synthesizing the content input using the instrument (MIDI program) of the style input. To avoid pairs of extremely different inputs (e.g. bass line + piccolo duet) for which the task would make little sense, we sorted all instrument parts into 4 bins using two median splits: on the average pitch and on the average number of voices (simultaneous notes); we then formed each pair by drawing two examples from the same bin. To obtain a balanced distribution of instruments, we limited the total number of examples per MIDI program to 4.
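
The binning and pairing steps described above can be sketched as follows. This is a minimal illustration rather than the script we used: the dictionary keys `avg_pitch` and `avg_voices`, the function names, and the choice to pick a bin uniformly at random are all assumptions, and the per-program cap is omitted.

```python
import random
from collections import defaultdict
from statistics import median

def assign_bins(parts):
    """Assign each part to one of 4 bins using two median splits."""
    pitch_med = median(p["avg_pitch"] for p in parts)
    voices_med = median(p["avg_voices"] for p in parts)
    # Bin index 0..3: high/low pitch combined with many/few voices.
    return [2 * int(p["avg_pitch"] > pitch_med) + int(p["avg_voices"] > voices_med)
            for p in parts]

def draw_pairs(parts, n_pairs, seed=0):
    """Draw (content, style) pairs so that both members share a bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for part, b in zip(parts, assign_bins(parts)):
        bins[b].append(part)
    eligible = [members for members in bins.values() if len(members) >= 2]
    if not eligible:
        raise ValueError("no bin contains at least two parts")
    pairs = []
    for _ in range(n_pairs):
        members = rng.choice(eligible)           # assumption: uniform over bins
        content, style = rng.sample(members, 2)  # two distinct parts, same bin
        pairs.append((content, style))
    return pairs

# Made-up statistics for 16 instrument parts.
parts = [{"avg_pitch": p, "avg_voices": v}
         for p in (40, 50, 70, 80) for v in (1.0, 1.2, 2.5, 3.0)]
pairs = draw_pairs(parts, n_pairs=10)
```

Because both members of a pair come from the same bin, a low monophonic part is never paired with a high polyphonic one.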

‘Real’ test set

The ‘real data’ test set was created from the Mixing Secrets collection. We used filename matching to exclude most drum, vocal and multi-instrument tracks and to balance the distribution of the remaining instruments (dominated by electric guitar and bass). To form the input pairs, we performed the same binning procedure as for the artificial test set, using the multi-pitch MELODIA algorithm to estimate the average pitch and number of voices.
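
The filename matching step could look roughly like the sketch below. The keyword list and the example file names `LeadVox1` and `Kick In` are hypothetical; the patterns we actually used are not listed here.

```python
import re

# Hypothetical keyword list for tracks to exclude (drums and vocals only).
EXCLUDE_PATTERN = re.compile(r"drum|kick|snare|vox|vocal", re.IGNORECASE)

def keep_track(filename):
    """Return True if the file name matches none of the excluded keywords."""
    return EXCLUDE_PATTERN.search(filename) is None

names = ["ElecGtr2CloseMic3", "LeadVox1", "Kick In", "RhodesDI"]
kept = [n for n in names if keep_track(n)]
```

Matching on file names is only a heuristic, which is why it removes most, not all, of the unwanted tracks.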

Timbre dissimilarity metric

The metric takes a sequence of MFCC vectors (only coefficients 2–13) as input and is trained with the triplet loss, based on the code from the ISMIR 2020 metric learning tutorial. The training dataset consists of 7381 triplets (anchor, positive, negative) extracted from the Mixing Secrets data so that the anchor and the positive example are from the same file and the negative example is from a different file. The aim is to make the metric good at discriminating between different instruments, but largely pitch-independent.
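
To make the triplet construction and the loss concrete, here is a minimal NumPy sketch. It is not the tutorial code: the function names, the Euclidean distance, the margin value and the uniform sampling are assumptions, and the embedding network that maps MFCC sequences to vectors is not shown.

```python
import random
import numpy as np

def sample_triplet(segments_by_file, rng):
    """Draw (anchor, positive, negative) from a dict: file name -> list of segments.

    Anchor and positive come from the same file, the negative from another one.
    Requires at least two files, one of them with two or more segments.
    """
    files = [f for f, segs in segments_by_file.items() if len(segs) >= 2]
    anchor_file = rng.choice(files)
    negative_file = rng.choice([f for f in segments_by_file if f != anchor_file])
    anchor, positive = rng.sample(segments_by_file[anchor_file], 2)
    negative = rng.choice(segments_by_file[negative_file])
    return anchor, positive, negative

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss pushing d(anchor, positive) below d(anchor, negative) by `margin`.

    Each argument is an array of embeddings with shape (batch, dim).
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Made-up segment identifiers standing in for excerpts of two audio files.
segments = {"a.wav": ["a0", "a1", "a2"], "b.wav": ["b0", "b1"]}
anchor, positive, negative = sample_triplet(segments, random.Random(0))
```

Since the anchor and the positive come from the same recording but generally contain different notes, minimizing this loss encourages embeddings that separate instruments while ignoring pitch.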