Ondřej Cífka, Alexey Ozerov, Umut Şimşekli and Gaël Richard. "Self-Supervised VQ-VAE for One-Shot Music Style Transfer." Accepted to the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

Neural style transfer, which allows applying the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the ‘one-shot’ capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.
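For readers unfamiliar with VQ-VAEs, the core operation is vector quantization: each encoder output vector is replaced by its nearest entry in a learned codebook. The sketch below shows this generic quantization step only (it is not the proposed method, and training details such as the straight-through gradient estimator and commitment loss are omitted):

```python
import numpy as np

def quantize(z, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z:        (T, D) array of encoder outputs (one D-dim vector per frame)
    codebook: (K, D) array of learned code vectors
    Returns the quantized vectors and the chosen code indices.
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)  # nearest code per frame: (T,)
    return codebook[idx], idx

# Toy example: 4 frames, 2-dim latents, 3 codes
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 2))
codebook = rng.normal(size=(3, 2))
zq, idx = quantize(z, codebook)
```

The discrete indices `idx` form the compressed representation; the decoder only ever sees the quantized vectors `zq`.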



Artificial inputs

The following is a random sample of the synthetic test set with outputs of our model and the two baselines (musaicing and U+L). For U+L (Ulyanov and Lebedev), we include both the tuned version (λ_s = 10^(-2.1) λ_c) and a version with a higher style weight (λ_s = 10 λ_c).

| Content input | Style input | Target |
|---|---|---|
| Tinkle Bell | Voice Oohs | Voice Oohs |
| Flute | Tremolo Strings | Tremolo Strings |
| Church Organ | Banjo | Banjo |
| Electric Bass (pick) | Electric Guitar (clean) | Electric Guitar (clean) |
| Honky-tonk Piano | Synth Brass 1 | Synth Brass 1 |
| Lead 1 (square) | Electric Guitar (jazz) | Electric Guitar (jazz) |
| Vibraphone | Alto Sax | Alto Sax |
| Acoustic Bass | Harpsichord | Harpsichord |
| Xylophone | Banjo | Banjo |
| Glockenspiel | Pad 3 (polysynth) | Pad 3 (polysynth) |

‘Good’ outputs

Here, we show a sample of the best outputs of our system (below the 5th percentile) according to the LSD metric.

| Content input | Style input | Target | LSD | Timbre | Pitch |
|---|---|---|---|---|---|
| Acoustic Guitar (nylon) | Orchestral Harp | Orchestral Harp | 6.5472 | 0.0302 | 0.1604 |
| Honky-tonk Piano | Distortion Guitar | Distortion Guitar | 6.7790 | 0.0881 | 0.7856 |
| Lead 8 (bass + lead) | Bright Acoustic Piano | Bright Acoustic Piano | 6.6822 | 0.0746 | 0.4663 |
| Synth Bass 1 | Clavinet | Clavinet | 6.7523 | 0.0292 | 0.4499 |
| Acoustic Guitar (nylon) | Tango Accordion | Tango Accordion | 7.1450 | 0.1637 | 0.2526 |

‘Bad’ outputs

Similarly, here is a sample of the worst outputs (above the 95th percentile) according to the LSD metric.

| Content input | Style input | Target | LSD | Timbre | Pitch |
|---|---|---|---|---|---|
| Acoustic Guitar (steel) | Whistle | Whistle | 22.7329 | 0.4901 | 0.5881 |
| Pad 3 (polysynth) | Koto | Koto | 20.8195 | 0.4434 | 0.6820 |
| Orchestral Harp | Harmonica | Harmonica | 22.0352 | 0.3142 | 0.5058 |
| FX 8 (sci-fi) | Guitar Harmonics | Guitar Harmonics | 21.1952 | 0.1221 | 0.4858 |
| Glockenspiel | Vibraphone | Vibraphone | 19.3475 | 0.3258 | 0.6435 |
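Both samples above are ranked by LSD (log-spectral distance) between the output and the ground-truth target. As a reference point, here is a common textbook formulation of LSD; the exact variant used for the reported numbers (units, scaling, frame weighting) may differ:

```python
import numpy as np

def log_spectral_distance(S_ref, S_est, eps=1e-8):
    """A common formulation of log-spectral distance (LSD): the RMS difference
    of the log power spectra over frequency bins, averaged over time frames.

    S_ref, S_est: (frames, bins) magnitude spectrograms of the same shape.
    """
    log_ref = np.log10(S_ref ** 2 + eps)
    log_est = np.log10(S_est ** 2 + eps)
    per_frame = np.sqrt(np.mean((log_ref - log_est) ** 2, axis=1))  # per-frame RMS
    return float(np.mean(per_frame))  # average over frames
```

A perfect reconstruction gives an LSD of 0; larger values indicate a larger spectral mismatch, which is why the ‘bad’ outputs sit above the 95th percentile.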

Real inputs

The following are outputs on the ‘Mixing Secrets’ test set, first some cherry-picked ones and then a random sample.


Content input Style input VQ-VAE (ours) Musaicing U+L (λ_s = 10^(-2.1) λ_c) U+L (λ_s = 10 λ_c)
ElecGtr2CloseMic3 Keys Organ Active DI
Synth PianoMics1
RhodesDI Acoustic Guitar Lead Ela M 251
Bass Amp M82 Bass bip
SynthFX1 SlecGtr3a Close
Dulcimer2 Strings SectionMic Vln2

Random sample

Content input Style input VQ-VAE (ours) Musaicing U+L (λ_s = 10^(-2.1) λ_c) U+L (λ_s = 10 λ_c)
Fiddle2 Violins
UPRIGHT BASS ELA M 260 Neve 33102 Taiko
Guitar 2 OUTRO ALTO 251E SSL6000E
BassCloseMic2 ELE Guitars Ignater M81
Bells Bass Mic 647

Additional information

This section contains details omitted from the paper for brevity.

Artificial test set

The artificial test set was created from the Lakh MIDI Dataset using a set of files held out from the training set. The audio was synthesized using the Timbres Of Heaven SoundFont, which was not used for the training set.

We randomly drew 721 content-style input pairs and generated a corresponding ground-truth target for each pair by synthesizing the content input using the instrument (MIDI program) of the style input. To avoid pairs of extremely different inputs (e.g. bass line + piccolo duet) for which the task would make little sense, we sorted all instrument parts into 4 bins using two median splits: on the average pitch and on the average number of voices (simultaneous notes); we then formed each pair by drawing two examples from the same bin. To obtain a balanced distribution of instruments, we limited the total number of examples per MIDI program to 4.
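The binning and pairing procedure above can be sketched as follows. This is an illustrative reimplementation, not the actual test-set generation code; the function names are ours, and the per-program cap of 4 examples is left out for brevity:

```python
import numpy as np

def assign_bins(avg_pitch, avg_voices):
    """Sort instrument parts into 4 bins via two median splits:
    one on the average pitch, one on the average number of voices
    (simultaneous notes). Returns a bin index in {0, 1, 2, 3} per part.
    """
    pitch_hi = (avg_pitch >= np.median(avg_pitch)).astype(int)
    voices_hi = (avg_voices >= np.median(avg_voices)).astype(int)
    return 2 * pitch_hi + voices_hi

def draw_pair(bins, rng):
    """Draw a content-style input pair from within a single bin, so that
    the two inputs have roughly comparable pitch range and polyphony."""
    labels, counts = np.unique(bins, return_counts=True)
    b = rng.choice(labels[counts >= 2])          # pick a bin with >= 2 parts
    candidates = np.flatnonzero(bins == b)
    return tuple(rng.choice(candidates, size=2, replace=False))
```

Pairing within a bin is what avoids mismatches like a bass line paired with a piccolo duet, for which timbre transfer would make little sense.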

‘Real’ test set

The ‘real data’ test set was created from the Mixing Secrets collection. We used filename matching to exclude most drum, vocal and multi-instrument tracks and to balance the distribution of the remaining instruments (dominated by electric guitar and bass). To form the input pairs, we performed the same binning procedure as for the artificial test set, using the multi-pitch MELODIA algorithm to estimate the average pitch and number of voices.

Timbre dissimilarity metric

The metric uses a sequence of MFCC vectors (only coefficients 2–13) as input and is trained using the triplet loss (using the code from the ISMIR 2020 metric learning tutorial). The training dataset consists of 7381 triplets (anchor, positive, negative) extracted from the Mixing Secrets data so that the anchor and the positive example are from the same file and the negative example is from a different file. The aim is to make the metric good at discriminating between different instruments, but largely pitch-independent.
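The training setup can be sketched as follows. The loss is a generic triplet margin loss (not the tutorial's exact code), and `mfcc_input` is a hypothetical helper illustrating the coefficient selection; the real metric operates on sequences of MFCC vectors via a learned embedding network, which is omitted here:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors: push the
    anchor-positive distance below the anchor-negative distance by `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def mfcc_input(mfcc):
    """Keep only MFCC coefficients 2-13 (dropping coefficient 1, which
    mostly tracks overall energy). `mfcc` is (n_coeffs, frames), with
    coefficients numbered from 1 as in the text above."""
    return mfcc[1:13]
```

Because anchor and positive come from the same file (same instrument) while the negative comes from a different one, minimizing this loss encourages embeddings that separate instruments while remaining largely insensitive to pitch.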