Starting 2025 [37 publications]

2026

Conference Articles

S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline Quantization
Zineb Lahrichi, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2026.
```
@inproceedings{lahrichi:hal-05492477,
  address = {Barcelona, Spain},
  author = {Lahrichi, Zineb and Hadjeres, Ga{\"e}tan and Richard, Ga{\"e}l and Peeters, Geoffroy},
  booktitle = {{International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-05492477},
  hal_version = {v3},
  keywords = {Low Bitrates ; Diffusion Autoencoders ; Audio Codecs},
  month = may,
  pdf = {https://hal.science/hal-05492477v3/file/camera_ready_hal.pdf},
  title = {{S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline Quantization}},
  url = {https://hal.science/hal-05492477},
  year = {2026}
}
```
Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1Hz (750x compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.
Physics-Informed Learning of Neural Scattering Fields Towards Measurement-Free Mesh-to-HRTF Estimation
Tancrède Martinez, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026), Barcelona, Spain, May 2026.
```
@inproceedings{martinez:hal-05476015,
  address = {Barcelona, Spain},
  author = {Martinez, Tancr{\`e}de and Carlo, Diego Di and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi},
  booktitle = {{Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)}},
  hal_id = {hal-05476015},
  hal_version = {v1},
  keywords = {Scattering field ; physics-informed neural network ; perfectly matched layer ; low-rank adaptation ; head related transfer function},
  month = may,
  organization = {{IEEE Signal Processing Society}},
  pdf = {https://hal.science/hal-05476015v1/file/2026_ICASSP_Tancrede_scattering_PINN.pdf},
  series = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)},
  title = {{PHYSICS-INFORMED LEARNING OF NEURAL SCATTERING FIELDS TOWARDS MEASUREMENT-FREE MESH-TO-HRTF ESTIMATION}},
  url = {https://hal.science/hal-05476015},
  year = {2026}
}
```
This paper describes neural simulation of the scattered pressure field from a plane wave around a scattering object in both continuous 2D and 3D domains. This task has typically been treated as a regression problem that aims to train a physics-informed neural network (PINN) using pressure measurements at discrete positions. This approach, however, needs to train the whole network for each incident wave direction. To address this, we propose a measurement-free simulator based on a PINN purely driven by the Helmholtz equation with the Robin boundary condition and the Sommerfeld radiation condition with the aid of the perfectly matched layer (PML) framework. More specifically, we design a physics-informed scattering hypernetwork (PHISK) that can generalize to incident waves from any direction via low-rank adaptation (LoRA) of a PINN trained for a specific configuration. The experiment shows that the proposed method accurately simulated sound scattering around various objects, adapting to unseen incident wave directions with minimal performance loss, and realized reasonable simulation of head-related transfer functions (HRTFs) from complex mesh data of a human head.
SIRUP: A Diffusion-Based Virtual Upmixer of Steering Vectors for Highly-Directive Spatialization with First-Order Ambisonics
Emilio Picard, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii
ICASSP, Barcelona, Spain, May 2026.
```
@inproceedings{picard:hal-05516730,
  address = {Barcelona, Spain},
  author = {Picard, Emilio and Carlo, Diego Di and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi},
  booktitle = {{ICASSP}},
  hal_id = {hal-05516730},
  hal_version = {v1},
  keywords = {Steering vectors ; virtual upmixing ; latent diffusion model ; sound source localization ; beamforming},
  month = may,
  pdf = {https://hal.science/hal-05516730v1/file/2026_ICASSP_Emilio.pdf},
  title = {{SIRUP: A DIFFUSION-BASED VIRTUAL UPMIXER OF STEERING VECTORS FOR HIGHLY-DIRECTIVE SPATIALIZATION WITH FIRST-ORDER AMBISONICS}},
  url = {https://hal.science/hal-05516730},
  year = {2026}
}
```
This paper presents virtual upmixing of steering vectors captured by a fewer-channel spherical microphone array. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering the higher-order ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of FOA ambisonics data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space and a diffusion model is then trained to generate the HOA embeddings, conditioned by the FOA data. Experimental results showed that SIRUP achieved a significant improvement compared to FOA systems for steering vector upmixing, source localization, and speech denoising.

Journal Articles

Statistical wave field theory: Anisotropic wave fields under Neumann’s boundary condition
Roland Badeau
Journal of the Acoustical Society of America, March 2026.
```
@article{badeau:hal-05549135,
  author = {Badeau, Roland},
  doi = {10.1121/10.0042450},
  hal_id = {hal-05549135},
  hal_version = {v1},
  journal = {{Journal of the Acoustical Society of America}},
  keywords = {wave equation ; Helmholtz equation ; quantum billiards ; Statistical physics},
  month = mar,
  number = {3},
  pages = {2265-2280},
  pdf = {https://telecom-paris.hal.science/hal-05549135v1/file/Badeau-JASA-2026-preprint1.pdf},
  publisher = {{Acoustical Society of America}},
  title = {{Statistical wave field theory: Anisotropic wave fields under Neumann's boundary condition}},
  url = {https://telecom-paris.hal.science/hal-05549135},
  volume = {159},
  year = {2026}
}
```
The statistical wave field theory mathematically establishes the statistical laws of the solutions to the wave equation in a bounded domain. It provides the closed-form expressions of the power distribution and the correlations of the wave field jointly over time, frequency, and space, which hold at high frequency and after many reflections, in terms of the geometry and the specific admittance of the boundary surface. This theory was originally developed in the particular case of mixing rooms, which are characterized by a diffuse wave field, based on the theory of dynamical billiards and on Weyl-like asymptotic laws. Then it was extended to the finite family of special polyhedra, where the wave field is anisotropic, based on a simpler geometric approach related to mathematical crystallography. In this paper, we introduce a unified version of the theory dedicated to a class of semi-mixing billiards. In the case of Neumann’s boundary condition, we show that the wave field is stationary, but it is generally anisotropic. In particular, the correlation between two spatial positions at a given frequency is different from the well-known cardinal sine formula that characterizes diffuse acoustic fields.
The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis
Bernardo Torres, Geoffroy Peeters, Gaël Richard
IEEE Transactions on Audio, Speech and Language Processing, 2026.
```
@article{torres:hal-05056592,
  author = {Torres, Bernardo and Peeters, Geoffroy and Richard, Ga{\"e}l},
  doi = {10.1109/TASLPRO.2025.3629286},
  hal_id = {hal-05056592},
  hal_version = {v2},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  keywords = {Audio synthesis ; Audio signal processing ; Music source separation ; Source separation},
  pages = {84-95},
  pdf = {https://hal.science/hal-05056592v2/file/main.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis}},
  url = {https://hal.science/hal-05056592},
  volume = {34},
  year = {2026}
}
```
We present the Inverse Drum Machine, a novel approach to Drum Source Separation that leverages an analysis-by-synthesis framework combined with deep learning. Unlike recent supervised methods that require isolated stem recordings for training, our approach is trained on drum mixtures with only transcription annotations. IDM integrates Automatic Drum Transcription and One-shot Drum Sample Synthesis, jointly optimizing these tasks in an end-to-end manner. By convolving synthesized one-shot samples with estimated onsets, akin to a drum machine, we reconstruct the individual drum stems and train a Deep Neural Network on the reconstruction of the mixture. Experiments on the StemGMD dataset demonstrate that IDM achieves separation quality comparable to state-of-the-art supervised methods that require isolated stems data.
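The drum-machine analogy in the abstract above (one-shot samples convolved with onset activations, then summed into a mixture) can be illustrated with a minimal sketch. This is not the authors' implementation: the onset patterns, the one-shot samples, and the function name `render_stem` are hypothetical values chosen for illustration, and in the paper both quantities are estimated by a neural network rather than fixed by hand.

```python
def render_stem(onsets, one_shot):
    """Convolve an onset activation train with a one-shot sample.

    Each non-zero onset triggers a scaled copy of the sample, as a
    drum machine would. The output is truncated to the onset length.
    """
    out = [0.0] * (len(onsets) + len(one_shot) - 1)
    for n, amplitude in enumerate(onsets):
        if amplitude == 0.0:
            continue
        for k, value in enumerate(one_shot):
            out[n + k] += amplitude * value
    return out[: len(onsets)]


# Hypothetical onset activations (one value per time step).
kick_onsets = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
snare_onsets = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# Hypothetical one-shot samples (short decaying envelopes).
kick_sample = [1.0, 0.5, 0.25]
snare_sample = [0.75, 0.5]

kick_stem = render_stem(kick_onsets, kick_sample)
snare_stem = render_stem(snare_onsets, snare_sample)

# The reconstructed mixture is the sum of the individual stems.
mixture = [k + s for k, s in zip(kick_stem, snare_stem)]
```

In a training setup of the kind the abstract describes, a reconstruction loss would compare `mixture` with the observed recording, so the stems are obtained without isolated stem recordings.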

2025

Conference Articles

A Maximum Length Sequence-Based Method for Robust Round-Trip Latency Estimation in online Digital Audio Workstations
J M Gil Panal, Gaël Richard, Aurélien David
Web Audio Conference 2025, Paris, France, November 2025.
```
@inproceedings{gilpanal:hal-05154354,
  address = {Paris, France},
  author = {Gil Panal, J M and Richard, Ga{\"e}l and David, Aur{\'e}lien},
  booktitle = {{Web Audio Conference 2025}},
  hal_id = {hal-05154354},
  hal_version = {v2},
  keywords = {Acoustics ; audio track ; MLS ; DAW ; Web audio ; roundtrip latency},
  month = nov,
  pdf = {https://hal.science/hal-05154354v2/file/Gil_Richard_David_MLS_Based_Roundtrip_Latency_Reviewed.pdf},
  title = {{A Maximum Length Sequence-Based Method for Robust Round-Trip Latency Estimation in online Digital Audio Workstations}},
  url = {https://hal.science/hal-05154354},
  year = {2025}
}
```
Accurate estimation of latency when working with digital audio equipment is critical for the precise operation of certain applications. This is particularly true for Digital Audio Workstations (DAWs) and other tools used in the creation and editing of audio, especially music. These systems require exact synchronization or alignment of tracks, which is essential for the mixing process. Latency is an inherent phenomenon in audio capture and restitution. Although it may sometimes be minimal, it is always variable depending on the device, operating system, and audio configuration. The undesired effect introduced by latency—specifically referred to in this context as round-trip latency—manifests as a delay between the audio input and the corresponding output. The most effective way to address this issue is through prior measurement to enable proper compensation. Various methods exist for performing this measurement, generally based on the playback and recording of acoustic signals. This article presents an existing method applied in a novel way within the domain of audio and web browsers, based on the use of a Maximum Length Sequence (MLS) signal. This signal is commonly used in room impulse response characterization. To validate its effectiveness and identify the limitations of the proposed approach, multiple tests and experiments were conducted on different devices. Results were compared across various browsers and operating systems, and the proposed solution was benchmarked against the methods employed by existing online DAWs. The implementation of the proposed method is available as part of the Hi-Audio online platform—an open-source, browser-based DAW—providing a practical demonstration of its applicability and integration in real-world web audio environments.
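The measurement principle described above (play a known MLS probe, record it, and locate it in the recording) can be sketched as follows. This is a simplified illustration, not the Hi-Audio implementation: the register length, the feedback taps, the 48 kHz sample rate, and the simulated 37-sample delay are all assumptions, and the "recording" is a noise-free, attenuated copy of the probe.

```python
def mls_sequence(nbits=7, taps=(7, 6)):
    """One period (2**nbits - 1 samples) of a +/-1 MLS from a Fibonacci LFSR."""
    state = [1] * nbits  # any non-zero seed works
    sequence = []
    for _ in range(2 ** nbits - 1):
        sequence.append(1.0 if state[-1] else -1.0)
        feedback = 0
        for tap in taps:
            feedback ^= state[tap - 1]
        state = [feedback] + state[:-1]
    return sequence


def estimate_delay(recorded, probe):
    """Return the lag (in samples) that maximizes the cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(recorded) - len(probe) + 1):
        score = sum(p * recorded[lag + i] for i, p in enumerate(probe))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


SAMPLE_RATE_HZ = 48000  # assumed sample rate
TRUE_DELAY = 37         # simulated round-trip delay in samples

probe = mls_sequence()
# Simulated recording: silence, then the probe at half amplitude, then silence.
recorded = [0.0] * TRUE_DELAY + [0.5 * s for s in probe] + [0.0] * 50

estimated_delay = estimate_delay(recorded, probe)
latency_ms = 1000.0 * estimated_delay / SAMPLE_RATE_HZ
```

The sharp autocorrelation peak of an MLS is what makes the argmax unambiguous. A real measurement would also have to handle noise, clipping, and the browser's audio buffering, which this sketch ignores.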
iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models
Michel Olvera, Changhong Wang, Paraskevas Stamatiadis, Gaël Richard, Slim Essid
The 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, November 2025.
```
@inproceedings{olvera:hal-05288458,
  address = {Suzhou, China},
  author = {Olvera, Michel and Wang, Changhong and Stamatiadis, Paraskevas and Richard, Ga{\"e}l and Essid, Slim},
  booktitle = {{The 2025 Conference on Empirical Methods in Natural Language Processing}},
  hal_id = {hal-05288458},
  hal_version = {v1},
  keywords = {Knowledge graph ; Zero-shot audio classification ; Audio-language models},
  month = nov,
  pdf = {https://hal.science/hal-05288458v1/file/main.pdf},
  title = {{iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models}},
  url = {https://hal.science/hal-05288458},
  year = {2025}
}
```
Contrastive Language–Audio Pretraining (CLAP) models learn by aligning audio and text in a shared embedding space, enabling powerful zero-shot recognition. However, their performance is highly sensitive to prompt formulation and language nuances, and they often inherit semantic ambiguities and spurious correlations from noisy pretraining data. While prior work has explored prompt engineering, adapters, and prefix tuning to address these limitations, the use of structured prior knowledge remains largely unexplored. We present iKnow-audio, a framework that integrates knowledge graphs with audio-language models to provide robust semantic grounding. iKnow-audio builds on the Audio-centric Knowledge Graph (AKG), which encodes ontological relations comprising semantic, causal, and taxonomic connections reflective of everyday sound scenes and events. By training knowledge graph embedding models on the AKG and refining CLAP predictions through this structured knowledge, iKnow-audio improves disambiguation of acoustically similar sounds and reduces reliance on prompt engineering. Comprehensive zero-shot evaluations across six benchmark datasets demonstrate consistent gains over baseline CLAP, supported by embedding-space analyses that highlight improved relational grounding. Resources are publicly available at https://github.com/michelolzam/iknow-audio.
IS³: Generic Impulsive–Stationary Sound Separation in Acoustic Scenes using Deep Filtering
Clémentine Berger, Paraskevas Stamatiadis, Roland Badeau, Slim Essid
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2025), Tahoe City, CA, United States, October 2025.
```
@inproceedings{berger:hal-05228563,
  address = {Tahoe City, CA, United States},
  author = {Berger, Cl{\'e}mentine and Stamatiadis, Paraskevas and Badeau, Roland and Essid, Slim},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2025)}},
  hal_id = {hal-05228563},
  hal_version = {v2},
  keywords = {Deep filtering ; Sound separation},
  month = oct,
  organization = {{IEEE}},
  pdf = {https://telecom-paris.hal.science/hal-05228563v2/file/WASPAA2025_paper_template.pdf},
  title = {{IS${}^3$ : Generic Impulsive--Stationary Sound Separation in Acoustic Scenes using Deep Filtering}},
  url = {https://telecom-paris.hal.science/hal-05228563},
  year = {2025}
}
```
We are interested in audio systems capable of processing stationary backgrounds and isolated acoustic events within an acoustic scene differently, whether for applying specific processing methods to each part or for focusing solely on one while ignoring the other. Such systems have applications in real-world scenarios, including robust adaptive audio rendering systems (e.g., EQ or compression), plosive attenuation in voice mixing, noise suppression or reduction, robust acoustic event classification or even bioacoustics. To this end, we introduce IS³, a neural network designed for Impulsive–Stationary Sound Separation that isolates impulsive acoustic events from the stationary background using a deep filtering approach and can act as a pre-processing stage for the above-mentioned tasks. To ensure optimal training, we propose a sophisticated data generation pipeline that curates and adapts existing datasets for this task. We demonstrate that a learning-based approach, built on a relatively lightweight neural architecture and trained with well-designed and varied data, is successful in this previously unaddressed task, outperforming the Harmonic–Percussive Sound Separation masking method, adapted from music signal processing research, and wavelet filtering on objective separation metrics.
How to Improve Anomaly Detection for Electric Powertrains in Production?
Anton Emelchenkov, Mathieu Fontaine, Hervé Mahé, François Roueff
International Conference on Acoustics and Audio Engineering for Electric & Hybrid Vehicles (SIA NVH Congress), Le Mans, France, October 2025.
```
@inproceedings{emelchenkov:hal-05460744,
  address = {Le Mans, France},
  author = {Emelchenkov, Anton and Fontaine, Mathieu and Mah{\'e}, Herv{\'e} and Roueff, Fran{\c c}ois},
  booktitle = {{International Conference on Acoustics and Audio Engineering for Electric \& Hybrid Vehicles (SIA NVH Congress)}},
  hal_id = {hal-05460744},
  hal_version = {v1},
  month = oct,
  organization = {{SIA: Soci{\'e}t{\'e} des Ing{\'e}nieurs de l'Automobile}},
  pdf = {https://hal.science/hal-05460744v1/file/SIA_NVH_2025__v2.pdf},
  title = {{How to Improve Anomaly Detection for Electric Powertrains in Production?}},
  url = {https://hal.science/hal-05460744},
  year = {2025}
}
```
Despite the low noise level of an electric powertrain, its tonality concentrated around a few frequencies can make it painful for the end user. The End Of Line Tester (EOLT) for electric powertrains plays a critical role in ensuring NVH quality standards. Today’s industry-standard solutions predominantly rely on order tracking and amplitude estimation to detect potential defects and check compliance with requirements. These techniques often depend on expert intervention and precise hyperparameter tuning, which undermines their robustness and scalability, especially when faced with rapidly evolving non-stationary signals. To reinforce precision and speed, two key innovations are proposed: (1) a high-resolution method for multi-frequency amplitude estimation in highly oscillatory regimes, equipped with automatic hyperparameter tuning to enhance the accuracy and stability of order tracking; and (2) a neural network-based anomaly detection framework that learns directly from raw signal spectrograms, removing the need for handcrafted signal processing. To support this, we introduce and release the first dataset of non-stationary vibration signals collected from an EOLT, specifically designed for anomaly detection. Our approach sets a new benchmark for automated, data-driven diagnostics in electric powertrain manufacturing.
Physically Informed Spatial Regularization for Sound Event Localization and Detection
Haocheng Liu, Diego Di Carlo, Aditya Arie Nugraha, Kazuyoshi Yoshii, Gaël Richard, Mathieu Fontaine
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Tahoe City, California, United States, October 2025.
```
@inproceedings{liu:hal-05244860,
  address = {Tahoe City, California, United States},
  author = {Liu, Haocheng and Di Carlo, Diego and Nugraha, Aditya Arie and Yoshii, Kazuyoshi and Richard, Ga{\"e}l and Fontaine, Mathieu},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics}},
  hal_id = {hal-05244860},
  hal_version = {v1},
  month = oct,
  pdf = {https://hal.science/hal-05244860v1/file/2025_WASPAA_Physically_Informed_Spatial_Regularization_for_SELD.pdf},
  title = {{Physically Informed Spatial Regularization for Sound Event Localization and Detection}},
  url = {https://hal.science/hal-05244860},
  year = {2025}
}
```
Building Sound Event Localization and Detection (SELD) models that are robust to diverse acoustic environments remains one of the major challenges in multichannel signal processing, as reflections and reverberation can significantly confuse both the source direction and event detection. Introducing priors such as microphone geometry or room impulse response (RIR) into the model has proven effective in addressing this issue. Existing methods typically incorporate such priors in a deterministic way, often through data augmentation to enlarge data diversity. However, the uncertainty arising from the complex nature of audio acoustics remains largely underexplored in the SELD literature and naturally calls for incorporating a stochastic modeling of acoustic priors. In this paper, we propose regularizing deep learning based SELD models with a physically constructed spatial covariance matrix (SCM) based on the estimated direction of arrival (DOA) and sound event detection (SED).
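As a rough illustration of what a spatial covariance matrix built from an estimated DOA and event activity can look like, the sketch below forms a rank-1 SCM from a far-field steering vector. It is an assumption-laden toy, not the paper's construction: the uniform linear array, the single frequency, the free-field plane-wave model, and the names `steering_vector` and `rank1_scm` are all hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def steering_vector(mic_positions_m, doa_deg, freq_hz):
    """Far-field steering vector for microphones placed along one axis."""
    cos_theta = math.cos(math.radians(doa_deg))
    return [
        cmath.exp(-2j * math.pi * freq_hz * x * cos_theta / SPEED_OF_SOUND)
        for x in mic_positions_m
    ]


def rank1_scm(steer, activity):
    """Rank-1 SCM: activity * a a^H, where a is the steering vector."""
    return [[activity * a * b.conjugate() for b in steer] for a in steer]


# Hypothetical 4-microphone linear array with 5 cm spacing.
mics = [0.0, 0.05, 0.10, 0.15]
# Hypothetical model outputs: DOA of 60 degrees, event activity of 0.9.
scm = rank1_scm(steering_vector(mics, 60.0, 1000.0), activity=0.9)
```

A regularizer of the kind the abstract describes could then compare such a physically constructed matrix with the SCM observed in the multichannel input. The exact form of that comparison is not given in the abstract.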
SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer
Diego Di Carlo, Mathieu Fontaine, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
2025 33rd European Signal Processing Conference (EUSIPCO), Palermo, Italy, September 2025.
```
@inproceedings{dicarlo:hal-05121705,
  address = {Palermo, Italy},
  author = {Di Carlo, Diego and Fontaine, Mathieu and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi},
  booktitle = {{2025 33rd European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-05121705},
  hal_version = {v1},
  keywords = {$\alpha$-stable theory ; steering vectors ; physics-informed deep learning ; sound source localization},
  month = sep,
  pdf = {https://hal.science/hal-05121705v1/file/main.pdf},
  title = {{SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer}},
  url = {https://hal.science/hal-05121705},
  year = {2025}
}
```
This paper describes a sound source localization (SSL) technique that combines an α-stable model for the observed signal with a neural network-based approach for modeling steering vectors. Specifically, a physics-informed neural network, referred to as Neural Steerer, is used to interpolate measured steering vectors (SVs) on a fixed microphone array. This allows for a more robust estimation of the so-called α-stable spatial measure, which represents the most plausible direction of arrival (DOA) of a target signal. As an α-stable model for the non-Gaussian case (α ∈ (0, 2)) theoretically defines a unique spatial measure, we choose to leverage it to account for residual reconstruction error of the Neural Steerer in the downstream tasks. The objective scores indicate that our proposed technique outperforms state-of-the-art methods in the case of multiple sound sources.
Soft Disentanglement in Frequency Bands for Neural Audio Codecs
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard
33rd European Signal Processing Conference (EUSIPCO 2025), Palermo, Italy, September 2025.
```
@inproceedings{ginies:hal-05292884,
  address = {Palermo, Italy},
  author = {Gini{\`e}s, Beno{\^i}t and Bie, Xiaoyu and Fercoq, Olivier and Richard, Ga{\"e}l},
  booktitle = {{33rd European Signal Processing Conference (EUSIPCO 2025)}},
  hal_id = {hal-05292884},
  hal_version = {v1},
  keywords = {Disentanglement ; Frequency Decomposition ; Inpainting ; Neural Audio Codec},
  month = sep,
  pdf = {https://hal.science/hal-05292884v1/file/Soft_disentanglement__Clean_.pdf},
  title = {{Soft Disentanglement in Frequency Bands for Neural Audio Codecs}},
  url = {https://hal.science/hal-05292884},
  year = {2025}
}
```
In neural-based audio feature extraction, ensuring that representations capture disentangled information is crucial for model interpretability. However, existing disentanglement methods often rely on assumptions that are highly dependent on data characteristics or specific tasks. In this work, we introduce a generalizable approach for learning disentangled features within a neural architecture. Our method applies spectral decomposition to time-domain signals, followed by a multibranch audio codec that operates on the decomposed components. Empirical evaluations demonstrate that our approach achieves better reconstruction and perceptual performance compared to a state-of-the-art baseline while also offering potential advantages for inpainting tasks.
QINCODEC: Neural Audio Compression with Implicit Neural Codebooks
Zineb Lahrichi, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters
33rd European Signal Processing Conference (EUSIPCO 2025), Palermo, Italy, September 2025.
```
@inproceedings{lahrichi:hal-04995360,
  address = {Palermo, Italy},
  author = {Lahrichi, Zineb and Hadjeres, Ga{\"e}tan and Richard, Ga{\"e}l and Peeters, Geoffroy},
  booktitle = {{33rd European Signal Processing Conference (EUSIPCO 2025)}},
  hal_id = {hal-04995360},
  hal_version = {v1},
  keywords = {Neural quantization ; Audio codecs},
  month = sep,
  pdf = {https://hal.science/hal-04995360v1/file/Qincodec%20Neural%20Audio%20Compression%20with%20Neural%20Implicit%20Codebooks.pdf},
  title = {{QINCODEC: Neural Audio Compression with Implicit Neural Codebooks}},
  url = {https://hal.science/hal-04995360},
  year = {2025}
}
```
Neural audio codecs, neural networks that compress a waveform into discrete tokens, play a crucial role in the recent development of audio generative models. State-of-the-art codecs rely on the end-to-end training of an autoencoder and a quantization bottleneck. However, this approach restricts the choice of quantization methods, as it requires defining how gradients propagate through the quantizer and how to update the quantization parameters online. In this work, we revisit the common practice of joint training and propose to quantize the latent representations of a pre-trained autoencoder offline, followed by an optional finetuning of the decoder to mitigate degradation from quantization. This strategy allows the use of any off-the-shelf quantizer, especially state-of-the-art trainable quantizers with implicit neural codebooks such as QINCO2. We demonstrate that with the latter, our proposed codec, termed QINCODEC, is competitive with baseline codecs while being notably simpler to train. Finally, our approach provides a general framework that amortizes the cost of autoencoder pretraining and enables more flexible codec design.
Adding temporal musical controls on top of pretrained generative models
Sarah Nabi, Nils Demerlé, Geoffroy Peeters, Frédéric Bevilacqua, Philippe Esling
Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR 2025), Daejeon, South Korea, September 2025.
```
@inproceedings{nabi:hal-05495076,
  address = {Daejeon, South Korea},
  author = {Nabi, Sarah and Demerl{\'e}, Nils and Peeters, Geoffroy and Bevilacqua, Fr{\'e}d{\'e}ric and Esling, Philippe},
  booktitle = {{Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR 2025)}},
  hal_id = {hal-05495076},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.science/hal-05495076v1/file/_ISMIR25_temporal_contols_on_pretrained_audio_generative_models_camera_ready.pdf},
  title = {{Adding temporal musical controls on top of pretrained generative models}},
  url = {https://hal.science/hal-05495076},
  year = {2025}
}
```
Recent advances in deep generative modeling have enabled high-quality models for musical audio synthesis. However, these approaches remain difficult to control, confined to simple, static attributes and, most importantly, entail retraining a different computationally-heavy architecture for each new control. This is inefficient and impractical as it requires substantial computational resources. In this paper, we propose a novel approach for adding time-varying musical controls on top of any pretrained generative model with an exposed latent space (e.g. neural audio codecs), without retraining or finetuning. Our method supports both discrete and continuous attributes by adapting a rectified flow approach with a latent diffusion transformer. We learn an invertible mapping between pretrained latent variables and a new space disentangling explicit control attributes and style variables that capture the remaining factors of variation. This enables both feature extraction from an input and editing of those features to generate transformed audio samples. Finally, this also introduces the ability to perform synthesis directly from the audio descriptors. We validate our method on 4 datasets ranging from different musical instruments to full music recordings, on which we outperform state-of-the-art task-specific baselines in terms of both generation quality and accuracy of the control by inferring transferred attributes.
Audio processor parameters: estimating distributions instead of deterministic values
Côme Peladeau, Dominique Fourer, Geoffroy Peeters
Proceedings of the 28th International Conference on Digital Audio Effects (DAFx25), Ancona, Italy, September 2025.
```
@inproceedings{peladeau:hal-05253950,
  address = {Ancona, Italy},
  author = {Peladeau, C{\^o}me and Fourer, Dominique and Peeters, Geoffroy},
  booktitle = {{Proceedings of the 28th International Conference on Digital Audio Effects (DAFx25)}},
  hal_id = {hal-05253950},
  hal_version = {v1},
  keywords = {Deep Learning ; Audio effects and instruments ; Normalizing Flows ; Differentiable Digital Signal Processing},
  month = sep,
  pages = {275-282},
  pdf = {https://hal.science/hal-05253950v1/file/DAFx25_paper_19.pdf},
  title = {{Audio processor parameters: estimating distributions instead of deterministic values}},
  url = {https://hal.science/hal-05253950},
  year = {2025}
}
```
Audio effects and sound synthesizers are widely used processors in popular music. Their parameters control the quality of the output sound. Multiple combinations of parameters can lead to the same sound. While recent approaches have been proposed to estimate these parameters given only the output sound, those are deterministic, i.e. they only estimate a single solution among the many possible parameter configurations. In this work, we propose to model the parameters as probability distributions instead of deterministic values. To learn the distributions, we optimize two objectives: (1) we minimize the reconstruction error between the ground truth output sound and the one generated using the estimated parameters, as is usually done, but also (2) we maximize the parameter diversity, using entropy. We evaluate our approach through two numerical audio experiments to show its effectiveness. These results show how our approach effectively outputs multiple combinations of parameters to match one sound.
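The two objectives named in the abstract above, a reconstruction error to minimize and a parameter entropy to maximize, can be combined into a single loss. The sketch below is only a toy under stated assumptions: it uses a diagonal Gaussian over the parameters, for which the entropy has a closed form, whereas the entry's keywords point to normalizing flows. The function names and the `entropy_weight` value are hypothetical.

```python
import math


def gaussian_entropy(log_sigmas):
    """Differential entropy of a diagonal Gaussian, summed over dimensions."""
    per_dim_constant = 0.5 * math.log(2.0 * math.pi * math.e)
    return sum(per_dim_constant + log_sigma for log_sigma in log_sigmas)


def objective(target, rendered, log_sigmas, entropy_weight=0.1):
    """Mean squared reconstruction error minus a weighted entropy bonus."""
    reconstruction = sum((t - r) ** 2 for t, r in zip(target, rendered)) / len(target)
    return reconstruction - entropy_weight * gaussian_entropy(log_sigmas)


# Two hypothetical parameter distributions that reconstruct the target equally well.
target = [0.2, -0.1, 0.4]
narrow_loss = objective(target, target, [math.log(0.1), math.log(0.1)])
wide_loss = objective(target, target, [0.0, 0.0])
```

With equal reconstruction error, the wider distribution receives the lower loss, which is the behavior the entropy term is meant to encourage: covering many parameter settings that produce the same sound.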
Translation-Equivariant Self-Supervised Learning for Pitch Estimation with Optimal Transport
Bernardo Torres, Alain Riou, Gaël Richard, Geoffroy Peeters
Extended Abstracts for the Late-Breaking Demo Session of the 26th International Society for Music Information Retrieval Conference (ISMIR), Daejeon, South Korea, September 2025.
```
@inproceedings{torres:hal-05321597,
  address = {Daejeon, South Korea},
  author = {Torres, Bernardo and Riou, Alain and Richard, Ga{\"e}l and Peeters, Geoffroy},
  booktitle = {{Extended Abstracts for the Late-Breaking Demo Session of the 26th International Society for Music Information Retrieval Conference (ISMIR)}},
  hal_id = {hal-05321597},
  hal_version = {v1},
  keywords = {Equivariance ; Pitch estimation ; Machine learning},
  month = sep,
  pdf = {https://hal.science/hal-05321597v1/file/2508.01493v1.pdf},
  title = {{Translation-Equivariant Self-Supervised Learning for Pitch Estimation with Optimal Transport}},
  url = {https://hal.science/hal-05321597},
  year = {2025}
}
```
In this paper, we propose an Optimal Transport objective for learning one-dimensional translation-equivariant systems and demonstrate its applicability to single pitch estimation. Our method provides a theoretically grounded, more numerically stable, and simpler alternative for training state-of-the-art self-supervised pitch estimators.
Déréverbération non-supervisée de la parole par modèle hybride
Louis Bahrman, Mathieu Fontaine, Gaël Richard
XXXe Colloque Francophone de Traitement du Signal et des Images (GRETSI 2025), Strasbourg, France, August 2025.
```
@inproceedings{bahrman:hal-05305957,
  address = {Strasbourg, France},
  author = {Bahrman, Louis and Fontaine, Mathieu and Richard, Ga{\"e}l},
  booktitle = {{XXXe Colloque Francophone de Traitement du Signal et des Images (GRETSI 2025)}},
  hal_id = {hal-05305957},
  hal_version = {v1},
  keywords = {D{\'e}r{\'e}verb{\'e}ration ; Apprentissage profond hybride ; R{\'e}verb{\'e}ration ; Apprentissage non supervis{\'e}},
  month = aug,
  organization = {{GRETSI}},
  pages = {1-4},
  pdf = {https://hal.science/hal-05305957v1/file/gretsi.pdf},
  title = {{D{\'e}r{\'e}verb{\'e}ration non-supervis{\'e}e de la parole par mod{\`e}le hybride}},
  url = {https://hal.science/hal-05305957},
  year = {2025}
}
```
This paper introduces a new training strategy for improving speech dereverberation systems in an unsupervised manner, using only reverberant signals. Most existing algorithms require paired (dry, reverberant) signals, which are difficult to obtain. Our approach instead uses limited acoustic information, such as the reverberation time (RT60), to train a dereverberation system. Experimental results show that our method achieves more consistent performance than the state of the art across several objective metrics.
Désentrelacement fréquentiel doux pour les codecs audio neuronaux
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard
XXXe Colloque Francophone de Traitement du Signal et des Images (GRETSI 2025), Strasbourg, France, August 2025.
```
@inproceedings{ginies:hal-05293271,
  address = {Strasbourg, France},
  author = {Gini{\`e}s, Beno{\^i}t and Bie, Xiaoyu and Fercoq, Olivier and Richard, Ga{\"e}l},
  booktitle = {{XXXe Colloque Francophone de Traitement du Signal et des Images (GRETSI 2025)}},
  hal_id = {hal-05293271},
  hal_version = {v1},
  month = aug,
  pdf = {https://hal.science/hal-05293271v1/file/GRETSI_2025__Clean_.pdf},
  title = {{D{\'e}sentrelacement fr{\'e}quentiel doux pour les codecs audio neuronaux}},
  url = {https://hal.science/hal-05293271},
  year = {2025}
}
```
Although neural-network-based models have enabled significant advances in the extraction of audio representations, the interpretability of the learned representations remains a major challenge. To address this, disentanglement techniques have been integrated into discrete neural audio codecs to impose structure on the extracted tokens. However, these approaches often depend heavily on specific tasks or datasets. In this work, we propose a disentangled neural audio codec that leverages the spectral decomposition of time-domain signals to improve the interpretability of the representation. Experimental evaluations show that our method outperforms a baseline model in terms of reconstruction fidelity and perceptual quality.
Unified Variational and Physics-aware Model for Room Impulse Response Estimation
Louis Lalay, Mathieu Fontaine, Roland Badeau
Interspeech: 26th edition of the Interspeech Conference, Rotterdam (NL), Netherlands, August 2025. Accepted to ....
```
@inproceedings{lalay:hal-05100922,
  address = {Rotterdam (NL), Netherlands},
  author = {Lalay, Louis and Fontaine, Mathieu and Badeau, Roland},
  booktitle = {{Interspeech: 26th edition of the Interspeech Conference}},
  hal_id = {hal-05100922},
  hal_version = {v1},
  keywords = {Reverberation ; Room impulse response ; Variational theory},
  month = aug,
  note = {Accepted to Interspeech 2025},
  pdf = {https://hal.science/hal-05100922v1/file/Camera%20Ready%20-%20Louis%20Lalay%20Interspeech%202025.pdf},
  title = {{Unified Variational and Physics-aware Model for Room Impulse Response Estimation}},
  url = {https://hal.science/hal-05100922},
  year = {2025}
}
```
Room impulse response estimation is essential for tasks like speech dereverberation, which improves automatic speech recognition. Most existing methods rely on either statistical signal processing or deep neural networks designed to replicate signal processing principles. However, combining statistical and physical modeling for RIR estimation remains largely unexplored. This paper proposes a novel approach integrating both aspects through a theoretically grounded model. The RIR is decomposed into interpretable parameters: white Gaussian noise filtered by a frequency-dependent exponential decay (e.g. modeling wall absorption) and an autoregressive filter (e.g. modeling microphone response). A variational free-energy cost function enables practical parameter estimation. As a proof of concept, we show that given dry and reverberant speech signals, the proposed method outperforms classical deconvolution in noisy environments, as validated by objective metrics.
Modèle physique variationnel pour l’estimation de réponses impulsionnelles de salles
Louis Lalay, Mathieu Fontaine, Roland Badeau
GRETSI 2025 : XXXe Colloque Francophone de Traitement du Signal et des Images, Strasbourg (67000), France, August 2025.
```
@inproceedings{lalay:hal-05150993,
  address = {Strasbourg (67000), France},
  author = {Lalay, Louis and Fontaine, Mathieu and Badeau, Roland},
  booktitle = {{GRETSI 2025 : XXXe Colloque Francophone de Traitement du Signal et des Images}},
  hal_id = {hal-05150993},
  hal_version = {v1},
  keywords = {reverberation ; room impulse response ; variational theory ; r{\'e}verb{\'e}ration ; r{\'e}ponse impulsionnelle de salle ; m{\'e}thode variationnelle},
  month = aug,
  pdf = {https://hal.science/hal-05150993v1/file/GRETSI_2025_LALAY_FONTAINE_BADEAU_HAL.pdf},
  title = {{Mod{\`e}le physique variationnel pour l'estimation de r{\'e}ponses impulsionnelles de salles}},
  url = {https://hal.science/hal-05150993},
  year = {2025}
}
```
Estimating the impulse response of a room is essential for tasks such as dereverberation, which improves automatic speech recognition. Most existing methods rely either on statistical signal processing or on deep neural networks inspired by signal processing. However, combining statistical and physical modeling remains largely unexplored in room impulse response estimation. This paper proposes a novel approach that integrates both aspects through a physical model. The room response is decomposed into interpretable parameters: white Gaussian noise modulated by a frequency-dependent exponential decay (modeling wall absorption) and an autoregressive filter (modeling, for example, the microphone response). Optimizing a variational free-energy function enables practical parameter estimation. We show that, given the dry and reverberant signals, the proposed method outperforms classical deconvolution in noisy environments, as confirmed by objective metrics.
Winner-takes-all for Multivariate Probabilistic Time Series Forecasting
Adrien Cortés, Rémi Rehm, Victor Letzelter
ICML 2025 : The 42nd International Conference on Machine Learning, Vancouver (CA), Canada, July 2025.
```
@inproceedings{cortes:hal-05100125,
  address = {Vancouver (CA), Canada},
  author = {Cort{\'e}s, Adrien and Rehm, R{\'e}mi and Letzelter, Victor},
  booktitle = {{ICML 2025 : The 42nd International Conference on Machine Learning}},
  hal_id = {hal-05100125},
  hal_version = {v2},
  keywords = {Multiple Choice Learning ; Diversity ; Winner-takes-all ; Conditional Distribution Estimation ; Probabilistic methods ; Time-series forecasting},
  month = jul,
  pdf = {https://hal.science/hal-05100125v2/file/main_icml.pdf},
  title = {{Winner-takes-all for Multivariate Probabilistic Time Series Forecasting}},
  url = {https://hal.science/hal-05100125},
  year = {2025}
}
```
We introduce TimeMCL, a method leveraging the Multiple Choice Learning (MCL) paradigm to forecast multiple plausible time series futures. Our approach employs a neural network with multiple heads and utilizes the Winner-Takes-All (WTA) loss to promote diversity among predictions. MCL has recently gained attention due to its simplicity and ability to address ill-posed and ambiguous tasks. We propose an adaptation of this framework for time-series forecasting, presenting it as an efficient method to predict diverse futures, which we relate to its implicit quantization objective. We provide insights into our approach using synthetic data and evaluate it on real-world time series, demonstrating its promising performance at a light computational cost.
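As a rough illustration of the Winner-Takes-All idea mentioned in the abstract above, the following minimal Python sketch scores several hypotheses against one target and keeps only the loss of the best one. The function names (`squared_error`, `wta_loss`) and the squared-error criterion are assumptions made for this example; they are not taken from the TimeMCL paper or its code.

```python
from typing import List, Sequence, Tuple


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences between one hypothesis and the target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def wta_loss(hypotheses: List[Sequence[float]],
             target: Sequence[float]) -> Tuple[float, int]:
    """Return the loss of the best hypothesis and its index.

    Only the winning head contributes to the loss, so in a trainable
    model only that head would receive a gradient for this sample.
    """
    losses = [squared_error(h, target) for h in hypotheses]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return losses[winner], winner


# Three hypothetical forecasts of a two-step series.
loss, winner = wta_loss([[0, 0], [1, 1], [3, 3]], [1, 2])
print(loss, winner)
```

Because each sample updates only its closest hypothesis, different heads can specialize on different plausible futures, which is the diversity mechanism the abstract refers to.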
Annealed Winner-Takes-All for Motion Forecasting
Yihong Xu, Victor Letzelter, Mickaël Chen, Éloi Zablocki, Matthieu Cord
IEEE International Conference on Robotics & Automation (ICRA), Atlanta, United States, May 2025. 7 pages, 6 f....
```
@inproceedings{xu:hal-05079079,
  address = {Atlanta, United States},
  author = {Xu, Yihong and Letzelter, Victor and Chen, Micka{\"e}l and Zablocki, {\'E}loi and Cord, Matthieu},
  booktitle = {{IEEE International Conference on Robotics \& Automation (ICRA)}},
  hal_id = {hal-05079079},
  hal_version = {v1},
  month = may,
  note = {7 pages, 6 figures, Accepted to ICRA2025},
  pdf = {https://hal.science/hal-05079079v1/file/2409.11172v3.pdf},
  title = {{Annealed Winner-Takes-All for Motion Forecasting}},
  url = {https://hal.science/hal-05079079},
  year = {2025}
}
```
In autonomous driving, motion prediction aims at forecasting the future trajectories of nearby agents, helping the ego vehicle to anticipate behaviors and drive safely. A key challenge is generating a diverse set of future predictions, commonly addressed using data-driven models with Multiple Choice Learning (MCL) architectures and Winner-Takes-All (WTA) training objectives. However, these methods face initialization sensitivity and training instabilities. Additionally, to compensate for limited performance, some approaches rely on training with a large set of hypotheses, requiring a post-selection step during inference to significantly reduce the number of predictions. To tackle these issues, we take inspiration from annealed MCL, a recently introduced technique that improves the convergence properties of MCL methods through an annealed Winner-Takes-All loss (aWTA). In this paper, we demonstrate how the aWTA loss can be integrated with state-of-the-art motion forecasting models to enhance their performance using only a minimal set of hypotheses, eliminating the need for the cumbersome post-selection step. Our approach can be easily incorporated into any trajectory prediction model normally trained using WTA and yields significant improvements. To facilitate the application of our approach to future motion forecasting models, the code is made publicly available: https://github.com/valeoai/MF_aWTA.
F-StrIPE: Fast Structure-Informed Positional Encoding for Symbolic Music Generation
Manvi Agarwal, Changhong Wang, Gael Richard
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 2025.
```
@inproceedings{agarwal:hal-04935674,
  address = {Hyderabad, India},
  author = {Agarwal, Manvi and Wang, Changhong and Richard, Gael},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-04935674},
  hal_version = {v1},
  keywords = {music generation ; symbolic music ; transformers ; positional encoding ; kernels},
  month = apr,
  pdf = {https://hal.science/hal-04935674v1/file/IEEE-conference-template-062824.pdf},
  title = {{F-StrIPE: Fast Structure-Informed Positional Encoding for Symbolic Music Generation}},
  url = {https://hal.science/hal-04935674},
  year = {2025}
}
```
While music remains a challenging domain for generative models like Transformers, recent progress has been made by exploiting suitable musically-informed priors. One technique to leverage information about musical structure in Transformers is inserting such knowledge into the positional encoding (PE) module. However, Transformers carry a quadratic cost in sequence length. In this paper, we propose F-StrIPE, a structure-informed PE scheme that works in linear complexity. Using existing kernel approximation techniques based on random features, we show that F-StrIPE is a generalization of Stochastic Positional Encoding (SPE). We illustrate the empirical merits of F-StrIPE using melody harmonization for symbolic music.
Théorie ondulatoire statistique
Roland Badeau
CFA 2025 - 17e Congrès Français d’Acoustique, Paris, France, April 2025.
```
@inproceedings{badeau:hal-04930346,
  address = {Paris, France},
  author = {Badeau, Roland},
  booktitle = {{CFA 2025 - 17e Congr{\`e}s Fran{\c c}ais d'Acoustique}},
  hal_id = {hal-04930346},
  hal_version = {v1},
  month = apr,
  title = {{Th{\'e}orie ondulatoire statistique}},
  url = {https://telecom-paris.hal.science/hal-04930346},
  year = {2025}
}
```
Statistical wave theory formally establishes the statistical laws obeyed by the solutions of the wave equation in a bounded, connected domain of space. It thus provides the mathematical solution to a very old problem in room acoustics, one that has been much written about since the pioneering work of Wallace Clement Sabine at the end of the 19th century: the study of reverberation. In particular, it gives the analytical expression of the power distribution and of the correlations of the acoustic field with respect to time, frequency, and position in space, as a function of the room geometry and the boundary conditions. For example, in the particular case of an isotropic acoustic field, it allows us to recover and improve the reverberation time formulas originally established by Sabine and Eyring, as well as the spatial correlation formula. But it also applies to geometric shapes that can give rise to an anisotropic acoustic field. Our aim here is to present this theory in the simplest and most intuitive way possible, approaching it from a purely geometric angle. We will show that two very different mathematical paths, the first based on Weyl asymptotics and mathematical billiards, the second based on geometry and crystallography, converge to the same conclusions, which makes us extremely confident in the scientific validity of this theory. We will also provide experimental confirmation of some of the theory’s predictions, which go beyond the already known statistical properties of reverberation.
A Hybrid Model for Weakly-Supervised Speech Dereverberation
Louis Bahrman, Mathieu Fontaine, Gael Richard
ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 2025.
```
@inproceedings{bahrman:hal-04931672,
  address = {Hyderabad, India},
  author = {Bahrman, Louis and Fontaine, Mathieu and Richard, Gael},
  booktitle = {{ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-04931672},
  hal_version = {v1},
  keywords = {Speech processing ; Reverberation modeling ; Hybrid deep learning ; Speech dereverberation},
  month = apr,
  pdf = {https://hal.science/hal-04931672v1/file/camera_ready.pdf},
  title = {{A Hybrid Model for Weakly-Supervised Speech Dereverberation}},
  url = {https://hal.science/hal-04931672},
  year = {2025}
}
```
This paper introduces a new training strategy to improve speech dereverberation systems using minimal acoustic information and reverberant (wet) speech. Most existing algorithms rely on paired dry/wet data, which is difficult to obtain, or on target metrics that may not adequately capture reverberation characteristics and can lead to poor results on non-target metrics. Our approach uses limited acoustic information, like the reverberation time (RT60), to train a dereverberation system. The system’s output is resynthesized using a generated room impulse response and compared with the original reverberant speech, providing a novel reverberation matching loss replacing the standard target metrics. During inference, only the trained dereverberation model is used. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics used in speech dereverberation than the state-of-the-art.
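The resynthesis step described in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the room impulse response is modeled here as exponentially decaying Gaussian noise whose amplitude drops by 60 dB after RT60 seconds (a classical simplification), and the loss is a plain mean squared error. The names `synthetic_rir`, `convolve` and `reverberation_matching_loss` are hypothetical, and the paper's actual RIR generator and loss may differ.

```python
import math
import random
from typing import List, Sequence


def synthetic_rir(rt60: float, sample_rate: int, length: int,
                  seed: int = 0) -> List[float]:
    """Gaussian noise shaped by an exponential decay (assumed RIR model)."""
    rng = random.Random(seed)
    # A 60 dB amplitude drop is a factor of 1000, reached at t = rt60.
    decay = math.log(1000.0) / rt60
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * n / sample_rate)
            for n in range(length)]


def convolve(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Direct-form linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def reverberation_matching_loss(dry_estimate: Sequence[float],
                                wet: Sequence[float],
                                rir: Sequence[float]) -> float:
    """MSE between the re-reverberated estimate and the observed wet signal."""
    resynth = convolve(dry_estimate, rir)[: len(wet)]
    return sum((a - b) ** 2 for a, b in zip(resynth, wet)) / len(wet)


rir = synthetic_rir(rt60=0.5, sample_rate=100, length=20)
dry = [1.0, 0.5, -0.25]
wet = convolve(dry, rir)
print(reverberation_matching_loss(dry, wet, rir))
```

The point of the design is that the loss never needs the true dry signal: a candidate dry estimate is judged only by how well it explains the reverberant recording once reverberation is added back.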
Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping
Clémentine Berger, Roland Badeau, Slim Essid
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 2025.
```
@inproceedings{berger:hal-04959656,
  address = {Hyderabad, India},
  author = {Berger, Cl{\'e}mentine and Badeau, Roland and Essid, Slim},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-04959656},
  hal_version = {v1},
  keywords = {Ambient noise masking ; Deep filtering ; Psychoacoustics},
  month = apr,
  organization = {{IEEE}},
  pdf = {https://telecom-paris.hal.science/hal-04959656v1/file/conference_101719.pdf},
  title = {{Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping}},
  url = {https://telecom-paris.hal.science/hal-04959656},
  year = {2025}
}
```
People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. Indeed, a music signal can mask some of the noise’s frequency components due to the effect of simultaneous masking. In this article, we propose a neural network based on a psychoacoustic masking model, designed to enhance the music’s ability to mask ambient noise by reshaping its spectral envelope with predicted filter frequency responses. The model is trained with a perceptual loss function that balances two constraints: effectively masking the noise while preserving the original music mix and the user’s chosen listening level. We evaluate our approach on simulated data replicating a user’s experience of listening to music with headphones in a noisy environment. The results, based on defined objective metrics, demonstrate that our system improves the state of the art.
Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects
Victor Deng, Changhong Wang, Gael Richard, Brian McFee
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hyderabad, India, April 2025.
```
@inproceedings{deng:hal-04904470,
  address = {Hyderabad, India},
  author = {Deng, Victor and Wang, Changhong and Richard, Gael and McFee, Brian},
  booktitle = {{Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-04904470},
  hal_version = {v2},
  keywords = {Foundation models ; Audio embeddings ; Transfer learning ; Audio effects},
  month = apr,
  pdf = {https://hal.science/hal-04904470v2/file/main.pdf},
  title = {{Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects}},
  url = {https://hal.science/hal-04904470},
  year = {2025}
}
```
In recent years, foundation models have significantly advanced data-driven systems across various domains. Yet, their underlying properties, especially when functioning as feature extractors, remain under-explored. In this paper, we investigate the sensitivity to audio effects of audio embeddings extracted from widely-used foundation models, including OpenL3, PANNs, and CLAP. We focus on audio effects as the source of sensitivity due to their prevalent presence in large audio datasets. By applying parameterized audio effects (gain, low-pass filtering, reverberation, and bitcrushing), we analyze the correlation between the deformation trajectories and the effect strength in the embedding space. We propose to quantify the dimensionality and linearizability of the deformation trajectories induced by audio effects using canonical correlation analysis. We find that there exists a direction along which the embeddings move monotonically as the audio effect strength increases, but that the subspace containing the displacements is generally high-dimensional. This shows that pre-trained audio embeddings do not globally linearize the effects. Our empirical results on instrument classification downstream tasks confirm that projecting out the estimated deformation directions cannot generally improve the robustness of pre-trained embeddings to audio effects.
O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization
Elio Gruttadauria, Mathieu Fontaine, Jonathan Le Roux, Slim Essid
IEEE International Conference on Acoustics, Speech, and Signal Processing, Hyderabad, India, April 2025.
```
@inproceedings{gruttadauria:hal-05418832,
  address = {Hyderabad, India},
  author = {Gruttadauria, Elio and Fontaine, Mathieu and Le Roux, Jonathan and Essid, Slim},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing}},
  hal_id = {hal-05418832},
  hal_version = {v1},
  keywords = {Online inference ; Speaker diarization},
  month = apr,
  pdf = {https://hal.science/hal-05418832v1/file/main.pdf},
  title = {{O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization}},
  url = {https://hal.science/hal-05418832},
  year = {2025}
}
```
We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: a hyperparameter-free solution compared to unsupervised clustering approaches, and a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a great trade-off between DER and complexity, even when working on independent chunks with no overlap, making the system extremely efficient.
Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges
Geoffroy Peeters, Zafar Rafii, Magdalena Fuentes, Zhiyao Duan, Emmanouil Benetos, Juhan Nam, Yuki Mitsufuji
ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 2025.
```
@inproceedings{peeters:hal-04993672,
  address = {Hyderabad, India},
  author = {Peeters, Geoffroy and Rafii, Zafar and Fuentes, Magdalena and Duan, Zhiyao and Benetos, Emmanouil and Nam, Juhan and Mitsufuji, Yuki},
  booktitle = {{ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP49660.2025.10888947},
  hal_id = {hal-04993672},
  hal_version = {v1},
  keywords = {Cultural differences ; review ; MIR ; Faces ; Informatics ; Speech processing ; Music information retrieval ; Fuels ; Multiple signal classification ; Engineering profession ; Reviews ; Industries},
  month = apr,
  organization = {{IEEE}},
  pages = {1-5},
  pdf = {https://hal.science/hal-04993672v1/file/25MIR_text.pdf},
  publisher = {{IEEE}},
  title = {{Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges}},
  url = {https://hal.science/hal-04993672},
  year = {2025}
}
```
In this paper, we trace the evolution of Music Information Retrieval (MIR) over the past 25 years. While MIR gathers all kinds of research related to music informatics, a large part of it focuses on signal processing techniques for music data, fostering a close relationship with the IEEE Audio and Acoustic Signal Processing Technical Committee. In this paper, we reflect on the main research achievements of MIR along the three EDICS related to music analysis, processing and generation. We then review a set of successful practices that fuel the rapid development of MIR research. One practice is the annual research benchmark, the Music Information Retrieval Evaluation eXchange, where participants compete on a set of research tasks. Another practice is the pursuit of reproducible and open research. The active engagement with industry research and products is another key factor for achieving large societal impacts and motivating younger generations of students to join the field. Last but not least, the commitment to diversity, equity and inclusion ensures that MIR is a vibrant and open community where various ideas, methodologies, and career pathways collide. We conclude by outlining the future challenges MIR will have to face.
Multiple Choice Learning for Efficient Speech Separation with Many Speakers
David Perera, Francois Derrida, Théo Mariotte, Gael Richard, Slim Essid
ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 2025.
```
@inproceedings{perera:hal-04981264,
  address = {Hyderabad, India},
  author = {Perera, David and Derrida, Francois and Mariotte, Th{\'e}o and Richard, Gael and Essid, Slim},
  booktitle = {{ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP49660.2025.10888528},
  hal_id = {hal-04981264},
  hal_version = {v1},
  keywords = {WSJ0-mix ; Standards ; Time complexity ; Speech processing ; Complexity theory ; Acoustics ; Benchmark testing ; Signal processing ; Predictive models ; Measurement ; Training ; PIT ; Librimix ; Cocktail party ; Multiple choice learning ; Speech separation},
  month = apr,
  pdf = {https://telecom-paris.hal.science/hal-04981264v1/file/icassp_2025.pdf},
  publisher = {{IEEE}},
  series = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {{Multiple Choice Learning for Efficient Speech Separation with Many Speakers}},
  url = {https://telecom-paris.hal.science/hal-04981264},
  year = {2025}
}
```
Training speech separation models in the supervised setting raises a permutation problem: finding the best assignment between the model predictions and the ground truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice Learning (MCL) framework, which was originally introduced to tackle ambiguous tasks. We demonstrate experimentally on the popular WSJ0-mix and LibriMix benchmarks that MCL matches the performance of PIT, while being computationally advantageous. This opens the door to a promising research direction, as MCL can be naturally extended to handle a variable number of speakers, or to tackle speech separation in the unsupervised setting.
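To make the permutation problem in the abstract above concrete, here is a minimal Python sketch of a brute-force Permutation Invariant Training loss. It is an illustration only: the helper names (`mse`, `pit_loss`) and the mean-squared-error criterion are assumptions, not the paper's implementation, and practical systems typically use other criteria and faster assignment solvers.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates: List[Sequence[float]],
             references: List[Sequence[float]]) -> Tuple[float, Tuple[int, ...]]:
    """Try every assignment of estimates to references and keep the best.

    perm[i] is the index of the estimate matched to reference i.
    The search visits n! permutations for n speakers.
    """
    n = len(references)
    best_loss = float("inf")
    best_perm: Tuple[int, ...] = tuple(range(n))
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[perm[i]], references[i]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# The model outputs the two sources in swapped order.
estimates = [[1.0, 1.0], [0.0, 0.0]]
references = [[0.0, 0.0], [1.0, 1.0]]
print(pit_loss(estimates, references))
```

The factorial growth of this exhaustive search with the number of speakers is one reason why alternatives to PIT become attractive when many speakers are present.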
Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning
Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid
ICASSP, Hyderabad (IN), India, April 2025.
```
@inproceedings{quelennec:hal-04921274,
  address = {Hyderabad (IN), India},
  author = {Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  booktitle = {{ICASSP}},
  doi = {10.1109/ICASSP49660.2025.10887666},
  hal_id = {hal-04921274},
  hal_version = {v1},
  keywords = {audio representation learning ; audio spectrogram transformers ; self-supervised ; self-supervised audio representation learning audio spectrogram transformers},
  month = apr,
  organization = {{IEEE}},
  pages = {1-5},
  pdf = {https://hal.science/hal-04921274v1/file/Icassp_2025___Camera_ready.pdf},
  title = {{Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning}},
  url = {https://hal.science/hal-04921274},
  year = {2025}
}
```
Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is unsupervised classification, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablation studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K and outperforms comparable supervised methods’ results for musical auto-tagging on Magna-tag-a-tune.
AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder
Samir Sadok, Simon Leglaive, Laurent Girin, Gaël Richard, Xavier Alameda-Pineda
ICASSP 2025 - IEEE International Conference on Acoustics, Speech, and Signal Processing, Hyderabad, India, April 2025.
```
@inproceedings{sadok:hal-04891286,
  address = {Hyderabad, India},
  author = {Sadok, Samir and Leglaive, Simon and Girin, Laurent and Richard, Ga{\"e}l and Alameda-Pineda, Xavier},
  booktitle = {{ICASSP 2025 - IEEE International Conference on Acoustics, Speech, and Signal Processing}},
  hal_id = {hal-04891286},
  hal_version = {v1},
  keywords = {Pitch estimation and modification ; Speech enhancement ; Masked autoencoder ; Speech analysis/transformation/synthesis},
  month = apr,
  pages = {1-5},
  pdf = {https://hal.science/hal-04891286v1/file/2501.05332v1.pdf},
  publisher = {{IEEE}},
  title = {{AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder}},
  url = {https://hal.science/hal-04891286},
  year = {2025}
}
```
This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement. Code and audio examples are available online.
Contrastive Knowledge Distillation for Embedding Refinement in Personalized Speech Enhancement
Thomas Serre, Mathieu Fontaine, Éric Benhaim, Slim Essid
ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 2025.
```
@inproceedings{serre:hal-05467995,
  address = {Hyderabad, India},
  author = {Serre, Thomas and Fontaine, Mathieu and Benhaim, {\'E}ric and Essid, Slim},
  booktitle = {{ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/icassp49660.2025.10887609},
  hal_id = {hal-05467995},
  hal_version = {v1},
  keywords = {knowledge distillation ; speaker embedding ; personalized speech enhancement ; Target speaker extraction},
  month = apr,
  pages = {1-5},
  pdf = {https://telecom-paris.hal.science/hal-05467995v1/file/paper.pdf},
  publisher = {{IEEE}},
  title = {{Contrastive Knowledge Distillation for Embedding Refinement in Personalized Speech Enhancement}},
  url = {https://telecom-paris.hal.science/hal-05467995},
  year = {2025}
}
```
Personalized speech enhancement (PSE) has shown convincing results when it comes to extracting a known target voice among interfering ones. The corresponding systems usually incorporate a representation of the target voice within the enhancement system, which is extracted from an enrollment clip of the target voice with upstream models. Those models are generally heavy, as the speaker embedding’s quality directly affects PSE performance. Yet, embeddings generated beforehand cannot account for the variations of the target voice during inference time. In this paper, we propose to perform on-the-fly refinement of the speaker embedding using a tiny speaker encoder. We first introduce a novel contrastive knowledge distillation methodology in order to train a 150k-parameter encoder from complex embeddings. We then use this encoder within the enhancement system during inference and show that the proposed method greatly improves PSE performance while maintaining a low computational load.

Journal Articles

PESTO: Real‑Time Pitch Estimation with Self‑Supervised Transposition‑Equivariant Objective
Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters
Transactions of the International Society for Music Information Retrieval (TISMIR), September 2025.
```
@article{riou:hal-05321595,
  author = {Riou, Alain and Torres, Bernardo and Hayes, Ben and Lattner, Stefan and Hadjeres, Ga{\"e}tan and Richard, Ga{\"e}l and Peeters, Geoffroy},
  doi = {10.5334/tismir.251},
  hal_id = {hal-05321595},
  hal_version = {v1},
  journal = {{Transactions of the International Society for Music Information Retrieval (TISMIR)}},
  keywords = {Variable-Q transform ; Pitch estimation ; Self-supervised learning ; Equivariance ; Real-time ; Toeplitz matrix ; Music information retrieval ; f0 estimation ; Lightweight ; Streamable convolutions},
  month = sep,
  number = {1},
  pages = {334-352},
  pdf = {https://hal.science/hal-05321595v1/file/68c0346a4b785.pdf},
  publisher = {{Ubiquity Press}},
  title = {{PESTO: Real‑Time Pitch Estimation with Self‑Supervised Transposition‑Equivariant Objective}},
  url = {https://hal.science/hal-05321595},
  volume = {8},
  year = {2025}
}
```
In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-Q Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight (130 k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO’s practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model’s low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
On the spectral decomposition of the complex Robin Laplacian
Roland Badeau
Journal of the Acoustical Society of America, July 2025.
```
@article{badeau:hal-05192136,
  author = {Badeau, Roland},
  doi = {10.1121/10.0037233},
  hal_id = {hal-05192136},
  hal_version = {v2},
  journal = {{Journal of the Acoustical Society of America}},
  keywords = {Statistical wave field theory ; Green's function ; Helmholtz equation ; Robin Laplacian},
  month = jul,
  number = {1},
  pages = {838-848},
  pdf = {https://telecom-paris.hal.science/hal-05192136v2/file/Badeau-JASA-2025-preprint3.pdf},
  publisher = {{Acoustical Society of America}},
  title = {{On the spectral decomposition of the complex Robin Laplacian}},
  url = {https://telecom-paris.hal.science/hal-05192136},
  volume = {158},
  year = {2025}
}
```
The mathematical properties of the Laplacian on a bounded domain are well-known when the boundary condition is of the first type (Dirichlet) or second type (Neumann). In both cases, this operator is self-adjoint and, therefore, diagonalizable, its spectrum is discrete, and the set of eigenfunctions can be chosen to form an orthonormal basis of the Hilbert space of square-integrable functions on the domain. However, in the case of the third type (Robin) boundary condition, the same is true only when the parameter is real-valued. On the contrary, when this parameter is complex-valued, the Laplacian may not even be diagonalizable. In this paper, the spectral decomposition of the complex Robin Laplacian is investigated in the most general case possible, and a formula that decomposes any square-integrable function on the set of its (generalized) eigenfunctions is provided. This result is applied to the Green’s function of the Helmholtz equation, whose existence, unicity, and closed-form expression are established in this general setting, and the statistical wave field theory, which provides the statistical laws of waves propagating in a bounded domain.
Statistical wave field theory: Curvature term
Roland Badeau
Journal of the Acoustical Society of America, March 2025.
```
@article{badeau:hal-04985263,
  author = {Badeau, Roland},
  doi = {10.1121/10.0036053},
  hal_id = {hal-04985263},
  hal_version = {v2},
  journal = {{Journal of the Acoustical Society of America}},
  keywords = {reverberation ; Helmholtz equation ; wave equation ; Statistical physics},
  month = mar,
  number = {3},
  pages = {1650-1664},
  pdf = {https://telecom-paris.hal.science/hal-04985263v2/file/Badeau-JASA-2025-preprint1.pdf},
  publisher = {{Acoustical Society of America}},
  title = {{Statistical wave field theory: Curvature term}},
  url = {https://telecom-paris.hal.science/hal-04985263},
  volume = {157},
  year = {2025}
}
```
In a recent research paper, we introduced the statistical wave field theory, which establishes the statistical laws of waves propagating in a bounded volume. These laws hold after many reflections on the boundary surface and at high frequency. The statistical wave field theory is the first statistical theory of reverberation that provides the closed-form expression of the power distribution and the correlations of the wave field jointly over time, frequency and space, in terms of the geometry and the specific admittance of the boundary surface. In this paper, we refine the theory predictions, by investigating the impact of a curved boundary surface on the wave field statistics. In particular, we provide an improved closed-form expression of the reverberation time in room acoustics that holds at lower frequency.
Statistical wave field theory: Special polyhedra
Roland Badeau
Journal of the Acoustical Society of America, March 2025.
```
@article{badeau:hal-05010187,
  author = {Badeau, Roland},
  doi = {10.1121/10.0036254},
  hal_id = {hal-05010187},
  hal_version = {v2},
  journal = {{Journal of the Acoustical Society of America}},
  keywords = {Statistical physics ; Wave equation ; Helmholtz equation ; Reverberation},
  month = mar,
  number = {3},
  pages = {2263-2278},
  pdf = {https://telecom-paris.hal.science/hal-05010187v2/file/Badeau-JASA-2025-preprint2.pdf},
  publisher = {{Acoustical Society of America}},
  title = {{Statistical wave field theory: Special polyhedra}},
  url = {https://telecom-paris.hal.science/hal-05010187},
  volume = {157},
  year = {2025}
}
```
The statistical wave field theory establishes mathematically the statistical laws of the solutions to the wave equation in a bounded volume. It provides the closed-form expression of the power distribution and the correlations of the wave field jointly over time, frequency, and space, in terms of the geometry and the specific admittance of the boundary surface. In a recent paper, we presented a mathematical approach to this theory based on the Sturm-Liouville theory and the theory of dynamical billiards. We focused on mixing billiards that generate an isotropic wave field, and we retrieved the well-known statistical properties of reverberation in room acoustics. In the present paper, we introduce a simpler geometric approach, dedicated to a particular class of non-ergodic billiards. Though limited to only a few polyhedra, this approach offers a precious insight into various aspects of the theory, including the first examples of anisotropic wave fields, whose statistical properties are related to mathematical crystallography. We also show that the formulas that we obtain in this anisotropic case are closely related to those of the mixing case, albeit based on a different mathematical approach.

2020-2024 [144 publications]

2024

Conference Articles

Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising
Yoto Fujita, Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
2024 APSIPA : Asia-Pacific Signal and Information Processing Association, Macau China, China, December 2024.
```
@inproceedings{fujita:hal-04736454,
  address = {Macau China, China},
  author = {Fujita, Yoto and Nugraha, Aditya Arie and Carlo, Diego Di and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi},
  booktitle = {{2024 APSIPA : Asia-Pacific Signal and Information Processing Association}},
  hal_id = {hal-04736454},
  hal_version = {v1},
  keywords = {speech enhancement ; dereverberation ; neural beamforming ; blind source separation},
  month = dec,
  pdf = {https://hal.science/hal-04736454v1/file/apsipa2024_fujita.pdf},
  title = {{Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising}},
  url = {https://hal.science/hal-04736454},
  year = {2024}
}
```
This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is to use neural beamforming that can work efficiently in an online manner. It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, is drastically degraded under mismatched conditions. This calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used for generating pseudo ground-truth data from a mixture. Based on this idea, a prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR), whose joint dereverberation and denoising filter is estimated using a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing
David Perera, Victor Letzelter, Théo Mariotte, Adrien Cortés, Mickael Chen, Slim Essid, Gaël Richard
NeurIPS 2024 : 38th Conference on Neural Information Processing Systems, Vancouver, Canada, December 2024.
```
@inproceedings{perera:hal-04762097,
  address = {Vancouver, Canada},
  author = {Perera, David and Letzelter, Victor and Mariotte, Th{\'e}o and Cort{\'e}s, Adrien and Chen, Mickael and Essid, Slim and Richard, Ga{\"e}l},
  booktitle = {{NeurIPS 2024 : 38th Conference on Neural Information Processing Systems}},
  hal_id = {hal-04762097},
  hal_version = {v2},
  keywords = {Winner-takes-all ; Uncertainty Quantification ; Deterministic Annealing ; Multiple Choice Learning},
  month = dec,
  pdf = {https://hal.science/hal-04762097v2/file/main.pdf},
  title = {{Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing}},
  url = {https://hal.science/hal-04762097},
  year = {2024}
}
```
We introduce Annealed Multiple Choice Learning (aMCL), which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
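As a rough illustration of the idea in this abstract, the sketch below contrasts a hard Winner-takes-all assignment with a temperature-controlled soft assignment over hypotheses. It is a toy under stated assumptions, not the authors' implementation: the scalar hypotheses, the squared-error loss, the softmin weighting, and all function names are chosen here for illustration only.

```python
import math

def squared_errors(hypotheses, target):
    """Squared error between each scalar hypothesis and the target."""
    return [(h - target) ** 2 for h in hypotheses]

def wta_weights(errors):
    """Hard Winner-takes-all: all the weight goes to the best hypothesis."""
    best = min(range(len(errors)), key=errors.__getitem__)
    return [1.0 if k == best else 0.0 for k in range(len(errors))]

def annealed_weights(errors, temperature):
    """Soft assignment (assumed form): softmin of the errors at a temperature."""
    m = min(errors)  # subtract the minimum for numerical stability
    scores = [math.exp(-(e - m) / temperature) for e in errors]
    total = sum(scores)
    return [s / total for s in scores]

def weighted_loss(errors, weights):
    """Loss in which each hypothesis contributes according to its weight."""
    return sum(w * e for w, e in zip(weights, errors))

# Hypothetical toy values: three hypotheses, one target.
hypotheses = [0.0, 1.0, 3.0]
errors = squared_errors(hypotheses, target=1.2)
hot = annealed_weights(errors, temperature=100.0)   # nearly uniform weights
cold = annealed_weights(errors, temperature=1e-3)   # close to hard WTA
```

At a high temperature every hypothesis receives a comparable share of the update, while lowering the temperature concentrates the weight on the winning hypothesis, which is how this sketch recovers the hard WTA assignment in the limit.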
Using Pairwise Link Prediction and Graph Attention Networks for Music Structure Analysis
Morgan Buisson, Brian Mcfee, Slim Essid
25th International Society for Music Information Retrieval (ISMIR) (2024), San Francisco (CA), United States, November 2024.
```
@inproceedings{buisson:hal-04665063,
  address = {San Francisco (CA), United States},
  author = {Buisson, Morgan and Mcfee, Brian and Essid, Slim},
  booktitle = {{25th International Society for Music Information Retrieval (ISMIR) (2024)}},
  hal_id = {hal-04665063},
  hal_version = {v2},
  month = nov,
  pdf = {https://hal.science/hal-04665063v2/file/ISMIR_24_camera_ready.pdf},
  title = {{Using Pairwise Link Prediction and Graph Attention Networks for Music Structure Analysis}},
  url = {https://hal.science/hal-04665063},
  year = {2024}
}
```
The task of music structure analysis has been mostly addressed as a sequential problem, by relying on the internal homogeneity of musical sections or their repetitions. In this work, we instead regard it as a pairwise link prediction task. If for any pair of time instants in a track, one can successfully predict whether they belong to the same structural entity or not, then the underlying structure can be easily recovered. Building upon this assumption, we propose a method that first learns to classify pairwise links between time frames as belonging to the same section (or segment) or not. The resulting link features, along with node-specific information, are combined through a graph attention network. The latter is regularized with a graph partitioning training objective and outputs boundary locations between musical segments and section labels. The overall system is lightweight and performs competitively with previous methods. The evaluation is done on two standard datasets for music structure analysis and an ablation study is conducted in order to gain insight on the role played by its different components.
A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning
Antonin Gagnere, Geoffroy Peeters, Slim Essid
ISMIR 2024 : 25th International Society for Music Information Retrieval (ISMIR) Conference Onsite and Virtual, San Francisco, California, United States, November 2024.
```
@inproceedings{gagnere:hal-04768296,
  address = {San Francisco, California, United States},
  author = {Gagnere, Antonin and Peeters, Geoffroy and Essid, Slim},
  booktitle = {{ISMIR 2024 : 25th International Society for Music Information Retrieval (ISMIR) Conference Onsite and Virtual}},
  hal_id = {hal-04768296},
  hal_version = {v1},
  keywords = {Self supervised learning ; Music information retrieval ; Beat-tracking ; Contrastive learning ; Few shot Learning},
  month = nov,
  pdf = {https://hal.science/hal-04768296v1/file/ISMIR2024_template.pdf},
  title = {{A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning}},
  url = {https://hal.science/hal-04768296},
  year = {2024}
}
```
In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without the knowledge of ground-truth tempo or beat positions, as we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e. with just a few annotated examples to get a competitive beat-tracking performance.
A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION
Michel Olvera, Paraskevas Stamatiadis, Slim Essid
DCASE 2024 - 9th Workshop on Detection and Classification of Acoustic Scenes and Events, Tokyo, Japan, October 2024.
```
@inproceedings{olvera:hal-04701759,
  address = {Tokyo, Japan},
  author = {Olvera, Michel and Stamatiadis, Paraskevas and Essid, Slim},
  booktitle = {{DCASE 2024 - 9th Workshop on Detection and Classification of Acoustic Scenes and Events}},
  hal_id = {hal-04701759},
  hal_version = {v1},
  keywords = {Zero-shot audio classification ; audio-text models ; contrastive language-audio pretraining ; in-context learning},
  month = oct,
  pdf = {https://hal.science/hal-04701759v1/file/main.pdf},
  title = {{A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION}},
  url = {https://hal.science/hal-04701759},
  year = {2024}
}
```
Audio-text models trained via contrastive learning offer a practical approach to perform audio classification through natural language prompts, such as "this is a sound of" followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options. First, we find that the formatting of the prompts significantly affects performance so that simply prompting the models with properly formatted class labels performs competitively with optimized prompt templates and even prompt ensembling. Moreover, we look into complementing class labels by audio-centric descriptions. By leveraging large language models, we generate textual descriptions that prioritize acoustic features of sound events to disambiguate between classes, without extensive prompt engineering. We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets. Remarkably, this method requires no additional training and remains fully zero-shot.
SALT: STANDARDIZED AUDIO EVENT LABEL TAXONOMY
Paraskevas Stamatiadis, Michel Olvera, Slim Essid
DCASE, Tokyo, Japan, October 2024.
```
@inproceedings{stamatiadis:hal-04695595,
  address = {Tokyo, Japan},
  author = {Stamatiadis, Paraskevas and Olvera, Michel and Essid, Slim},
  booktitle = {{DCASE}},
  hal_id = {hal-04695595},
  hal_version = {v1},
  keywords = {Machine listening ; DCASE ; sound taxonomy ; sound categorization ; data aggregation},
  month = oct,
  pdf = {https://hal.science/hal-04695595v1/file/main.pdf},
  title = {{SALT: STANDARDIZED AUDIO EVENT LABEL TAXONOMY}},
  url = {https://hal.science/hal-04695595},
  year = {2024}
}
```
Machine listening systems often rely on fixed taxonomies to organize and label audio data, key for training and evaluating deep neural networks (DNNs) and other supervised algorithms. However, such taxonomies face significant constraints: they are composed of application-dependent predefined categories, which hinders the integration of new or varied sounds, and they exhibit limited cross-dataset compatibility due to inconsistent labeling standards. To overcome these limitations, we introduce SALT: Standardized Audio event Label Taxonomy. Building upon the hierarchical structure of AudioSet’s ontology, our taxonomy extends and standardizes labels across 24 publicly available environmental sound datasets, allowing the mapping of class labels from diverse datasets to a unified system. Our proposal comes with a new Python package designed for navigating and utilizing this taxonomy, easing cross-dataset label searching and hierarchical exploration. Notably, our package allows effortless data aggregation from diverse sources, hence easy experimentation with combined datasets.
Speech dereverberation constrained on room impulse response characteristics
Louis Bahrman, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard
INTERSPEECH, Kos Island, Greece, September 2024.
```
@inproceedings{bahrman:hal-04640068,
  address = {Kos Island, Greece},
  author = {Bahrman, Louis and Fontaine, Mathieu and Le Roux, Jonathan and Richard, Ga{\"e}l},
  booktitle = {{INTERSPEECH}},
  hal_id = {hal-04640068},
  hal_version = {v1},
  keywords = {Speech dereverberation ; hybrid deep learning ; room acoustics ; acoustic matching ; speech processing},
  month = sep,
  pdf = {https://telecom-paris.hal.science/hal-04640068v1/file/camera_ready.pdf},
  title = {{Speech dereverberation constrained on room impulse response characteristics}},
  url = {https://telecom-paris.hal.science/hal-04640068},
  year = {2024}
}
```
Single-channel speech dereverberation aims at extracting a dry speech signal from a recording affected by the acoustic reflections in a room. However, most current deep learning-based approaches for speech dereverberation are not interpretable for room acoustics, and can be considered as black-box systems in that regard. In this work, we address this problem by regularizing the training loss using a novel physical coherence loss which encourages the room impulse response (RIR) induced by the dereverberated output of the model to match the acoustic properties of the room in which the signal was recorded. Our investigation demonstrates the preservation of the original dereverberated signal alongside the provision of a more physically coherent RIR.
WaveTransfer: A Flexible End-to-end Multi-instrument Timbre Transfer with Diffusion
Teysir Baoueb, Xiaoyu Bie, Hicham Janati, Gael Richard
2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024), London (UK), United Kingdom, September 2024.
```
@inproceedings{baoueb:hal-04685184,
  address = {London (UK), United Kingdom},
  author = {Baoueb, Teysir and Bie, Xiaoyu and Janati, Hicham and Richard, Gael},
  booktitle = {{2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024)}},
  hal_id = {hal-04685184},
  hal_version = {v1},
  keywords = {Multi-instrumental timbre transfer ; diffusion models ; music transformation ; generative AI},
  month = sep,
  note = {Accepted at MLSP 2024},
  pdf = {https://hal.science/hal-04685184v1/file/MLSP_2024%20_WaveTransfer.pdf},
  title = {{WaveTransfer: A Flexible End-to-end Multi-instrument Timbre Transfer with Diffusion}},
  url = {https://hal.science/hal-04685184},
  year = {2024}
}
```
As diffusion-based deep generative models gain prevalence, researchers are actively investigating their potential applications across various domains, including music synthesis and style alteration. Within this work, we are interested in timbre transfer, a process that involves seamlessly altering the instrumental characteristics of musical pieces while preserving essential musical elements. This paper introduces WaveTransfer, an end-to-end diffusion model designed for timbre transfer. We specifically employ the bilateral denoising diffusion model (BDDM) for noise scheduling search. Our model is capable of conducting timbre transfer between audio mixtures as well as individual instruments. Notably, it exhibits versatility in that it accommodates multiple types of timbre transfer between unique instrument pairs in a single model, eliminating the need for separate model training for each pairing. Furthermore, unlike recent works limited to 16 kHz, WaveTransfer can be trained at various sampling rates, including the industry-standard 44.1 kHz, a feature of particular interest to the music community.
RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation
Liam Kelley, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii
INTERSPEECH, Kos International Convention Center, Kos Island, Greece, September 2024.
```
@inproceedings{kelley:hal-04632526,
  address = {Kos International Convention Center, Kos Island, Greece},
  author = {Kelley, Liam and Carlo, Diego Di and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi},
  booktitle = {{INTERSPEECH}},
  hal_id = {hal-04632526},
  hal_version = {v1},
  keywords = {Spatial audio ; Room acoustics ; 3D mesh data ; Physical models ; DDSP},
  month = sep,
  pdf = {https://telecom-paris.hal.science/hal-04632526v1/file/Interspeech_RIR_in_a_Box.pdf},
  title = {{RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation}},
  url = {https://telecom-paris.hal.science/hal-04632526},
  year = {2024}
}
```
This paper describes a method for estimating the room impulse response (RIR) for a microphone and a sound source located at arbitrary positions from the 3D mesh data of the room. Simulating realistic RIRs with pure physics-driven methods often fails to balance physical consistency and computational efficiency, hindering application to real-time speech processing. Alternatively, one can use MESH2IR, a fast black-box estimator that consists of an encoder extracting latent code from mesh data with a graph convolutional network (GCN) and a decoder generating the RIR from the latent code. Combining these two approaches, we propose a fast yet physically coherent estimator with interpretable latent code based on differentiable digital signal processing (DDSP). Specifically, the encoder estimates a virtual shoebox room scene that acoustically approximates the real scene, accelerating physical simulation with the differentiable image-source model in the decoder. Our experiments showed that our method outperformed MESH2IR for real mesh data obtained with the depth scanner of Microsoft HoloLens 2, and can provide correct spatial consistency for binaural RIRs.
Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing
Martin Lebourdais, Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega
Interspeech 2024, Kos, Greece, September 2024.
```
@inproceedings{lebourdais:hal-04617131,
  address = {Kos, Greece},
  author = {Lebourdais, Martin and Mariotte, Th{\'e}o and Almud{\'e}var, Antonio and Tahon, Marie and Ortega, Alfonso},
  booktitle = {{Interspeech 2024}},
  hal_id = {hal-04617131},
  hal_version = {v1},
  keywords = {Audio segmentation ; NMF ; explainability ; probing},
  month = sep,
  organization = {{International Speech Communication Association (ISCA)}},
  pdf = {https://univ-lemans.hal.science/hal-04617131/file/nmf_probing2024-4.pdf},
  title = {{Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing}},
  url = {https://univ-lemans.hal.science/hal-04617131},
  year = {2024}
}
```
Audio segmentation is a key task for many speech technologies, most of which are based on neural networks, usually considered as black boxes, with high-level performances. However, in many domains, among which health or forensics, there is not only a need for good performance but also for explanations about the output decision. Explanations derived directly from latent representations need to satisfy "good" properties, such as informativeness, compactness, or modularity, to be interpretable. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF) which is a good candidate for the design of interpretable representations. This paper shows that our model reaches good segmentation performances, and presents deep analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives toward the evaluation of interpretable representations according to "good" properties.
Invariant Audio Prints for Music Indexing and Alignment
Rémi Mignot, Geoffroy Peeters
21st International Conference on Content-based Multimedia Indexing, Reykjavik, Iceland, September 2024.
```
@inproceedings{mignot:hal-04927568,
  address = {Reykjavik, Iceland},
  author = {Mignot, R{\'e}mi and Peeters, Geoffroy},
  booktitle = {{21st International Conference on Content-based Multimedia Indexing}},
  doi = {10.1109/CBMI62980.2024.10859214},
  hal_id = {hal-04927568},
  hal_version = {v1},
  keywords = {Content Analysis and Indexing ; Signal processing ; Machine learning},
  month = sep,
  pages = {1-7},
  pdf = {https://hal.science/hal-04927568v1/file/MIGNOT_CBMI_2024.pdf},
  publisher = {{IEEE}},
  series = {2024 International Conference on Content-Based Multimedia Indexing (CBMI)},
  title = {{Invariant Audio Prints for Music Indexing and Alignment}},
  url = {https://hal.science/hal-04927568},
  year = {2024}
}
```
This work deals with music indexing and alignment using audio codes designed to be representative of the music content and robust to sound modifications. First, based on properties of the Fourier Transform and of the logarithm, high-dimensional audio descriptors are designed. Then, a dimension reduction is learned with criteria based on sound discrimination and invariance to transformations. Finally, a binarization is computed to derive codes (integers). This last process allows a fast searching for large catalogs with a hash table, and a Hamming distance on codes makes possible the time alignment using an adapted "Dynamic Time Warping". The contributions of this paper are tested for two different tasks. The goal of the first task is to identify the segments of music medleys with the audio indexing process, and to accurately find the corresponding original time positions. The goal of the second task is to measure the accuracy of the time-alignment with synthesized MIDI files, where the tempo continuously varies, and with modified pitches and instruments. Additionally, the audio indexing is also tested for these data, in order to exhibit some properties of the used audio prints.
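The last step described in this abstract (integer codes, hash-table search, Hamming distance) can be pictured with a small toy example. This is only a sketch under assumptions: the 4-bit codes, the catalog contents, and the function names are hypothetical, and the descriptor design, learned dimension reduction, and adapted Dynamic Time Warping of the paper are not reproduced.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

def build_index(catalog):
    """Hash table mapping each code to the (track, frame) positions where it occurs."""
    index = defaultdict(list)
    for track, codes in catalog.items():
        for frame, code in enumerate(codes):
            index[code].append((track, frame))
    return index

# Hypothetical catalog: each track is a sequence of 4-bit binary codes.
catalog = {
    "track_a": [0b1011, 0b0110, 0b1100],
    "track_b": [0b0001, 0b1011],
}
index = build_index(catalog)

# Exact-match lookup is a single dictionary access.
exact_hits = index[0b1011]

# The Hamming distance gives a cheap frame-to-frame cost for alignment.
distance = hamming(0b1011, 0b0110)
```

In this toy setting the lookup cost does not grow with catalog size, and the bit-count distance could serve as the local cost inside an alignment procedure.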
EPISODIC FINE-TUNING PROTOTYPICAL NETWORKS FOR OPTIMIZATION-BASED FEW-SHOT LEARNING: APPLICATION TO AUDIO CLASSIFICATION
Xuanyu Zhuang, Geoffroy Peeters, Gaël Richard
2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024), London (UK), United Kingdom, September 2024. Accepted at ....
```
@inproceedings{zhuang:hal-04720291,
  address = {London (UK), United Kingdom},
  author = {Zhuang, Xuanyu and Peeters, Geoffroy and Richard, Ga{\"e}l},
  booktitle = {{2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024)}},
  hal_id = {hal-04720291},
  hal_version = {v1},
  keywords = {Few-shot learning ; Audio classification ; Prototypical Network ; Model-Agnostic Meta-Learning ; Meta-Curvature},
  month = sep,
  note = {Accepted at MLSP 2024},
  pdf = {https://hal.science/hal-04720291v1/file/Proto-MAML_MLSP_2024.pdf},
  title = {{EPISODIC FINE-TUNING PROTOTYPICAL NETWORKS FOR OPTIMIZATION-BASED FEW-SHOT LEARNING: APPLICATION TO AUDIO CLASSIFICATION}},
  url = {https://hal.science/hal-04720291},
  year = {2024}
}
```
The Prototypical Network (ProtoNet) has emerged as a popular choice in Few-shot Learning (FSL) scenarios due to its remarkable performance and straightforward implementation. Building upon such success, we first propose a simple (yet novel) method to fine-tune a ProtoNet on the (labeled) support set of a C-way-K-shot test episode (without using the query set, which is only used for evaluation). We then propose an algorithmic framework that combines ProtoNet with optimization-based FSL algorithms (MAML and Meta-Curvature) to work with such a fine-tuning method. Since optimization-based algorithms endow the target learner model with the ability to adapt quickly to only a few samples, we utilize ProtoNet as the target model to enhance its fine-tuning performance with the help of a specifically designed episodic fine-tuning strategy. The experimental results confirm that our proposed models, MAML-Proto and MC-Proto, combined with our unique fine-tuning method, outperform regular ProtoNet by a large margin in few-shot audio classification tasks on the ESC-50 and Speech Commands v2 datasets. We note that although we have only applied our model to the audio domain, it is a general method and can be easily extended to other domains.
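For readers unfamiliar with ProtoNet, the following sketch shows only the standard inference rule on one C-way-K-shot episode: class prototypes are the means of the support embeddings, and a query is assigned to the nearest prototype. The embeddings and labels are made up, and the episodic fine-tuning, MAML, and Meta-Curvature components of the paper are not shown.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prototypes(support):
    """support: dict mapping a label to its K support embeddings."""
    return {label: mean(vectors) for label, vectors in support.items()}

def classify(query, protos):
    """Assign the query embedding to the label of the nearest prototype."""
    return min(protos, key=lambda label: sq_dist(query, protos[label]))

# Hypothetical 2-way-2-shot episode with 2-dimensional embeddings.
support = {
    "dog_bark": [[1.0, 0.0], [0.8, 0.2]],
    "rain": [[0.0, 1.0], [0.2, 0.8]],
}
protos = prototypes(support)
prediction = classify([0.7, 0.1], protos)
```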
Multifrequency Highly Oscillating Aperiodic Amplitude Estimation for Nonlinear Chirp Signal
Anton Emelchenkov, Mathieu Fontaine, Yves Grenier, Hervé Mahé, François Roueff
European Signal Processing Conference (EUSIPCO), Lyon, France, August 2024.
```
@inproceedings{emelchenkov:hal-04614241,
  address = {Lyon, France},
  author = {Emelchenkov, Anton and Fontaine, Mathieu and Grenier, Yves and Mah{\'e}, Herv{\'e} and Roueff, Fran{\c c}ois},
  booktitle = {{European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-04614241},
  hal_version = {v1},
  keywords = {chirp signal ; amplitude estimation ; locally stationary process ; filtering ; hyperparameters estimation ; nonlinear chirp signal},
  month = aug,
  pdf = {https://hal.science/hal-04614241v1/file/EUSIPCO_2024____Anton_Emelchenkov_Capon_Campbell_MLE_final.pdf},
  title = {{Multifrequency Highly Oscillating Aperiodic Amplitude Estimation for Nonlinear Chirp Signal}},
  url = {https://hal.science/hal-04614241},
  year = {2024}
}
```
This paper addresses the challenge of estimating multiple highly oscillating amplitudes within the nonlinear chirp signal model. The problem is analogous to the mode detection task with fixed instantaneous frequencies, where the oscillating amplitudes signify mechanical vibrations concealing crucial information for predictive maintenance. Existing methods often focus on single-frequency estimation, employ simple amplitude functions, or impose strong noise assumptions. Furthermore, these methods frequently rely on arbitrarily chosen hyperparameters, leading to sub-optimal generalization for a diverse range of amplitudes. To address these limitations, our approach introduces two estimators, based on Capon filters and negative log-likelihood approaches respectively, that leverage locally stationary assumptions and incorporate hyperparameters estimation. The results demonstrate that, even under challenging conditions, these estimators yield competitive outcomes across various noisy scenarios, mitigating the drawbacks associated with existing methods.
Using Random Codebooks for Audio Neural AutoEncoders
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard
EUROPEAN SIGNAL PROCESSING CONFERENCE 2024 [EUSIPCO], Lyon, France, August 2024.
```
@inproceedings{ginies:hal-04705811,
  address = {Lyon, France},
  author = {Gini{\`e}s, Beno{\^i}t and Bie, Xiaoyu and Fercoq, Olivier and Richard, Ga{\"e}l},
  booktitle = {{EUROPEAN SIGNAL PROCESSING CONFERENCE 2024 [EUSIPCO]}},
  hal_id = {hal-04705811},
  hal_version = {v1},
  keywords = {feature extraction ; quantization ; random codebooks ; audio reconstruction},
  month = aug,
  pdf = {https://hal.science/hal-04705811v1/file/EUSIPCO___RANDOM_QUANTIZATION%20%281%29.pdf},
  title = {{Using Random Codebooks for Audio Neural AutoEncoders}},
  url = {https://hal.science/hal-04705811},
  year = {2024}
}
```
Latent representation learning has been an active field of study for decades in numerous applications. Inspired, among others, by tokenization in Natural Language Processing and motivated by the search for a simple data representation, recent works have introduced a quantization step into the feature extraction. In this work, we propose a novel strategy to build the neural discrete representation by means of random codebooks. These codebooks are obtained by randomly sampling a large, predefined fixed codebook. We experimentally show the merits and potential of our approach in a task of audio compression and reconstruction.
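As a rough illustration of the sampling step described above, the sketch below draws a small codebook from a large fixed one and quantizes a latent vector to its nearest codeword. The sizes, the Gaussian initialisation, sampling without replacement, and the nearest-neighbour rule are all assumptions made for the example, not details taken from the paper.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sample_codebook(fixed_codebook, size, rng):
    """Draw a random sub-codebook (without replacement) from the fixed one."""
    return rng.sample(fixed_codebook, size)

def quantize(vector, codebook):
    """Return (index, codeword) of the nearest codeword."""
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

rng = random.Random(0)
# Hypothetical fixed codebook: 256 codewords of dimension 4.
fixed = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(256)]
codebook = sample_codebook(fixed, 16, rng)

latent = [0.3, -1.2, 0.5, 0.0]
idx, codeword = quantize(latent, codebook)
```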
Invariance-based layer regularization for sound event detection
David Perera, Slim Essid, Gaël Richard
European Signal Processing Conference, Lyon, France, August 2024.
```
@inproceedings{perera:hal-04645968,
  address = {Lyon, France},
  author = {Perera, David and Essid, Slim and Richard, Ga{\"e}l},
  booktitle = {{European Signal Processing Conference}},
  hal_id = {hal-04645968},
  hal_version = {v1},
  keywords = {DCASE task 4 ; invariance-based learning ; semisupervised learning},
  month = aug,
  pdf = {https://hal.science/hal-04645968v1/file/eusipco.pdf},
  title = {{Invariance-based layer regularization for sound event detection}},
  url = {https://hal.science/hal-04645968},
  year = {2024}
}
```
Experimental and theoretical evidence suggests that invariance constraints can improve the performance and generalization capabilities of a classification model. While invariance-based regularization has become part of the standard tool-belt of machine learning practitioners, this regularization is usually applied near the decision layers or at the end of the feature extracting layers of a deep classification network. However, the optimal placement of invariance constraints inside a deep classifier is still an open question. In particular, it would be beneficial to link it to the structural properties of the network (e.g. its architecture), or its dynamical properties (e.g. the effectively used volume of its latent spaces). The purpose of this article is to initiate an investigation of these aspects. We use the experimental framework of the DCASE 2023 Task 4A challenge, which considers the training of a sound event classifier in a semi-supervised manner. We show that the optimal placement of invariance constraints improves the performance of the standard baseline for this task.
Winner-takes-all learners are geometry-aware conditional density estimators
Victor Letzelter, David Perera, Cédric Rommel, Mathieu Fontaine, Slim Essid, Gael Richard, Patrick Pérez
International Conference on Machine Learning, Vienna, Austria, July 2024.
```
@inproceedings{letzelter:hal-04574640,
  address = {Vienna, Austria},
  author = {Letzelter, Victor and Perera, David and Rommel, C{\'e}dric and Fontaine, Mathieu and Essid, Slim and Richard, Gael and P{\'e}rez, Patrick},
  booktitle = {{International Conference on Machine Learning}},
  hal_id = {hal-04574640},
  hal_version = {v1},
  keywords = {Conditional density estimation ; Voronoi Tesselation ; Multiple Choice Learning ; Winner-takes-all ; Uncertainty Quantification},
  month = jul,
  pdf = {https://hal.science/hal-04574640v1/file/main_paper.pdf},
  title = {{Winner-takes-all learners are geometry-aware conditional density estimators}},
  url = {https://hal.science/hal-04574640},
  year = {2024}
}
```
Winner-takes-all training is a simple learning paradigm, which handles ambiguous tasks by predicting a set of plausible hypotheses. Recently, a connection was established between Winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, hypotheses should quantize optimally the shape of the conditional distribution to predict. However, the best use of these hypotheses for uncertainty quantification is still an open question. In this work, we show how to leverage the appealing geometric properties of the Winner-takes-all learners for conditional density estimation, without modifying its original training scheme. We theoretically establish the advantages of our novel estimator both in terms of quantization and density estimation, and we demonstrate its competitiveness on synthetic and real-world datasets, including audio data.
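The Winner-takes-all training rule mentioned above can be summarised in a few lines: among several hypotheses, only the one closest to the target contributes to the loss. This is a generic sketch with a squared-error cost and made-up values; the density estimator proposed in the paper is not shown.

```python
def wta_loss(hypotheses, target):
    """Return (loss, winner index): only the closest hypothesis is penalized."""
    errors = [
        sum((h - t) ** 2 for h, t in zip(hypothesis, target))
        for hypothesis in hypotheses
    ]
    winner = min(range(len(errors)), key=errors.__getitem__)
    return errors[winner], winner

# Three hypothetical predictions for one input and one observed target.
hypotheses = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
loss, winner = wta_loss(hypotheses, target=[0.9, 1.2])
```

Because each target only updates its nearest hypothesis, the hypotheses tend to partition the output space into Voronoi cells, which is the geometric property the abstract refers to.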
Collaborating Foundation Models for Domain Generalized Semantic Segmentation
Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, United States, June 2024. https://gith....
```
@inproceedings{benigmim:hal-04611902,
  address = {Seattle, WA, United States},
  author = {Benigmim, Yasser and Roy, Subhankar and Essid, Slim and Kalogeiton, Vicky and Lathuili{\`e}re, St{\'e}phane},
  booktitle = {{The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024}},
  doi = {10.1109/CVPR52733.2024.00300},
  hal_id = {hal-04611902},
  hal_version = {v1},
  keywords = {Training ; Adaptation models ; Semantic segmentation ; Clouds ; Collaboration ; Predictive models ; Benchmark testing ; Domain Adaptation ; Domain Generalization ; Semantic Segmentation ; Foundation Models ; Computer Vision ; Deep Learning},
  month = jun,
  note = {https://github.com/yasserben/CLOUDS ; Accepted to CVPR 2024},
  pages = {3108-3119},
  pdf = {https://hal.science/hal-04611902v1/file/Benigmim_Collaborating_Foundation_Models_for_Domain_Generalized_Semantic_Segmentation_CVPR_2024_paper.pdf},
  publisher = {{IEEE}},
  series = {2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  title = {{Collaborating Foundation Models for Domain Generalized Semantic Segmentation}},
  url = {https://hal.science/hal-04611902},
  year = {2024}
}
```
Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically effectuate robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification and not content. In this work, we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail, CLOUDS is a framework that integrates FMs of various kinds: (i) CLIP backbone for its robust feature representation, (ii) generative models to diversify the content, thereby covering various modes of the possible target distribution, and (iii) Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that our CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions, notably outperforming prior methods by 5.6% and 6.7% on average mIoU, respectively. The code is available at: https://github.com/yasserben/CLOUDS

Embodied exploration of deep latent spaces in interactive dance-music performance
Sarah Nabi, Philippe Esling, Geoffroy Peeters, Frédéric Bevilacqua
9th International Conference on Movement and Computing (MOCO ’24), Utrecht, Netherlands, May 2024.

```
@inproceedings{nabi:hal-04602229,
  address = {Utrecht, Netherlands},
  author = {Nabi, Sarah and Esling, Philippe and Peeters, Geoffroy and Bevilacqua, Fr{\'e}d{\'e}ric},
  booktitle = {{9th International Conference on Movement and Computing (MOCO '24)}},
  doi = {10.1145/3658852.3659072},
  hal_id = {hal-04602229},
  hal_version = {v1},
  keywords = {dance-music-AI performance ; HCI ; motion-sound interaction ; deep learning ; generative models ; embodied exploration ; latent space ; Human-centered computing $\rightarrow$ Sound-based input / output ; Gestural input ; Auditory feedback ; Collaborative interaction ; Applied computing $\rightarrow$ Sound and music computing ; Computing methodologies $\rightarrow$ Machine learning},
  month = may,
  pdf = {https://hal.science/hal-04602229v1/file/_MOCO24__PHD_SARAH__embodied_exploration_of_deep_latent_spaces_performance.pdf},
  title = {{Embodied exploration of deep latent spaces in interactive dance-music performance}},
  url = {https://hal.science/hal-04602229},
  year = {2024}
}
```

Structure-informed Positional Encoding for Music Generation
Manvi Agarwal, Changhong Wang, Gaël Richard
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea, April 2024.
```
@inproceedings{agarwal:hal-04432659,
  address = {Seoul, South Korea},
  author = {Agarwal, Manvi and Wang, Changhong and Richard, Ga{\"e}l},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-04432659},
  hal_version = {v3},
  keywords = {symbolic music generation ; Transformers ; music structure ; positional encoding},
  month = apr,
  pdf = {https://hal.science/hal-04432659v3/file/svbwdvrdnrztpzxgdsckkhqxkjbjpfzx.pdf},
  title = {{Structure-informed Positional Encoding for Music Generation}},
  url = {https://hal.science/hal-04432659},
  year = {2024}
}
```
Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet, multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants in terms of absolute, relative and non-stationary positional information. We comprehensively test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. As a comparison, we choose multiple baselines from the literature and demonstrate the merits of our methods using several musically-motivated evaluation metrics. In particular, our methods improve the melodic and structural consistency of the generated pieces.
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis
Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gael Richard
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Seoul (Korea), South Korea, April 2024. Accepted at ....
```
@inproceedings{baoueb:hal-04423979,
  address = {Seoul (Korea), South Korea},
  author = {Baoueb, Teysir and Liu, Haocheng and Fontaine, Mathieu and Le Roux, Jonathan and Richard, Gael},
  booktitle = {{2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)}},
  hal_id = {hal-04423979},
  hal_version = {v1},
  keywords = {Generative adversarial network (GAN) ; diffusion process ; deep audio synthesis ; spectral envelope},
  month = apr,
  note = {Accepted at ICASSP 2024},
  pdf = {https://hal.science/hal-04423979v1/file/ICASSP_2024_SpecDiff_GAN___Preprint.pdf},
  title = {{SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis}},
  url = {https://hal.science/hal-04423979},
  year = {2024}
}
```
Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, the training stability is enhanced by means of a forward diffusion process which consists of injecting noise from a Gaussian distribution into both real and fake samples before inputting them to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution with the aim of making the discriminator’s task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in audio quality and efficiency.
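The noise injection described above can be sketched with the usual closed-form forward diffusion step, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, applied identically to real and generated samples before they reach the discriminator. The sketch assumes white Gaussian noise and an arbitrary value of ᾱ_t; the spectrally shaped noise and the schedule used in the paper are not reproduced.

```python
import math
import random

def diffuse(signal, alpha_bar, rng):
    """x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, with eps ~ N(0, 1)."""
    keep = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [keep * x + noise_scale * rng.gauss(0.0, 1.0) for x in signal]

rng = random.Random(0)
real = [0.5, -0.25, 0.1]   # hypothetical real waveform samples
fake = [0.4, -0.3, 0.2]    # hypothetical generator output
alpha_bar = 0.9            # assumed cumulative noise level for this step

# Both inputs are corrupted in the same way before discrimination.
noisy_real = diffuse(real, alpha_bar, rng)
noisy_fake = diffuse(fake, alpha_bar, rng)
```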
Neural Steerer: Novel steering vector synthesis with a causal neural field over frequency and direction
Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii
ICASSP, Seoul (Korea), South Korea, April 2024.
```
@inproceedings{carlo:hal-04479188,
  address = {Seoul (Korea), South Korea},
  author = {Carlo, Diego Di and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi},
  booktitle = {{ICASSP}},
  hal_id = {hal-04479188},
  hal_version = {v1},
  keywords = {Steering vector ; neural field ; spatial audio ; interpolation ; representation learning},
  month = apr,
  pdf = {https://hal.science/hal-04479188v1/file/_2024_HSCMA24__Neural_Steerer.pdf},
  title = {{Neural Steerer: Novel steering vector synthesis with a causal neural field over frequency and direction}},
  url = {https://hal.science/hal-04479188},
  year = {2024}
}
```
We address the problem of accurately interpolating measured anechoic steering vectors with a deep learning framework called the neural field. This task plays a pivotal role in reducing the resource-intensive measurements required for precise sound source separation and localization, essential as the front-end of speech recognition. Classical approaches to interpolation rely on linear weighting of nearby measurements in space on a fixed, discrete set of frequencies. Drawing inspiration from the success of neural fields for novel view synthesis in computer vision, we introduce the neural steerer, a continuous complex-valued function that takes both frequency and direction as input and produces the corresponding steering vector. Importantly, it incorporates inter-channel phase difference information and a regularization term enforcing filter causality, essential for accurate steering vector modeling. Our experiments, conducted using a dataset of real measured steering vectors, demonstrate the effectiveness of our resolution-free model in interpolating such measurements.
Adapting Pitch-Based Self Supervised Learning Models for Tempo Estimation
Antonin Gagneré, Slim Essid, Geoffroy Peeters
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea, April 2024.
```
@inproceedings{gagnere:hal-04544157,
  address = {Seoul, South Korea},
  author = {Gagner{\'e}, Antonin and Essid, Slim and Peeters, Geoffroy},
  booktitle = {{ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP48485.2024.10447129},
  hal_id = {hal-04544157},
  hal_version = {v1},
  keywords = {Tempo estimation ; Self-supervised-learning ; Training ; Adaptation models ; Zero-shot learning ; Estimation ; Training data ; Self-supervised learning ; Transforms},
  month = apr,
  pages = {956-960},
  pdf = {https://hal.science/hal-04544157v1/file/icassp__USING_PITCH_BASED_SUPERVISED_LEARNING_MODEL_FOR_TEMPO_ESTIMATION.pdf},
  publisher = {{IEEE}},
  title = {{Adapting Pitch-Based Self Supervised Learning Models for Tempo Estimation}},
  url = {https://hal.science/hal-04544157},
  year = {2024}
}
```
Tempo estimation is the task of estimating the periodicity of the dominant rhythm pulse of a music audio signal. It therefore has a close relationship with dominant pitch estimation. Recently, both tasks have been addressed in a self-supervised learning (SSL) fashion so as to leverage unlabelled data for training. In this work, we study the applicability of two successful pitch-based SSL models, SPICE and PESTO, for the purpose of tempo estimation. Both successfully exploit Siamese networks with a pitch-shifting view generation between the two branches. To apply these models to tempo estimation, we represent the audio signal by the constant-Q transform (CQT) of its onset-strength function and adapt their view generation using time-stretching (instead of pitch shifting), which is efficiently implemented by shifting the CQT. In a large experiment, we show that simply adapting PESTO in this way yields better results than the previous SSL approach to tempo estimation for most datasets used in the reference benchmark. Further, since PESTO is light-weight, requiring only little training data, we study a new learning scheme where the downstream datasets are processed directly in an SSL fashion (without access to labels), showing that this is an interesting alternative that further improves the performance for some datasets.
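The remark that time-stretching can be implemented as a shift of the CQT follows from the logarithmic frequency axis: multiplying all periodicities by a rate r moves them by B·log2(r) bins when there are B bins per octave. The sketch below illustrates this arithmetic on a toy frame; the sign convention, the zero-padding at the edges, and the bin resolution are assumptions, not details from the paper.

```python
import math

def cqt_shift_in_bins(stretch_rate, bins_per_octave):
    """Shift of a log-frequency axis when every periodicity is scaled by stretch_rate."""
    return bins_per_octave * math.log2(stretch_rate)

def shift_bins(frame, shift):
    """Shift a CQT frame by an integer number of bins, zero-padding the edges."""
    n = len(frame)
    out = [0.0] * n
    for k, value in enumerate(frame):
        if 0 <= k + shift < n:
            out[k + shift] = value
    return out

# Doubling the tempo with 12 bins per octave moves the rhythm peak up one octave.
shift = round(cqt_shift_in_bins(2.0, bins_per_octave=12))
frame = [0.0] * 24
frame[5] = 1.0                      # hypothetical tempo peak at bin 5
stretched = shift_bins(frame, shift)
```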
ONLINE SPEAKER DIARIZATION OF MEETINGS GUIDED BY SPEECH SEPARATION
Elio Gruttadauria, Mathieu Fontaine, Slim Essid
2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), Seoul (Korea), South Korea, April 2024. Accepted at ....
```
@inproceedings{gruttadauria:hal-04419041,
  address = {Seoul (Korea), South Korea},
  author = {Gruttadauria, Elio and Fontaine, Mathieu and Essid, Slim},
  booktitle = {{2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)}},
  hal_id = {hal-04419041},
  hal_version = {v1},
  keywords = {Speaker Diarization ; Source separation ; Online inference ; Overlapped speech ; AMI dataset ; Speaker embedding},
  month = apr,
  note = {Accepted at ICASSP 2024},
  pdf = {https://hal.science/hal-04419041v1/file/ICASSP_2024_ELIO_GRUTTADAURIA-final.pdf},
  title = {{ONLINE SPEAKER DIARIZATION OF MEETINGS GUIDED BY SPEECH SEPARATION}},
  url = {https://hal.science/hal-04419041},
  year = {2024}
}
```
Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We envisage ConvTasNet and DPRNN as alternatives for the separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied on each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show the strength of our system particularly on overlapped speech sections.
GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model
Haocheng Liu, Teysir Baoueb, Mathieu Fontaine, Jonathan Le Roux, Gael Richard
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Seoul (Korea), South Korea, April 2024. Accepted at ....
```
@inproceedings{liu:hal-04424100,
  address = {Seoul (Korea), South Korea},
  author = {Liu, Haocheng and Baoueb, Teysir and Fontaine, Mathieu and Le Roux, Jonathan and Richard, Gael},
  booktitle = {{2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)}},
  hal_id = {hal-04424100},
  hal_version = {v1},
  keywords = {Diffusion models ; speech generation ; Griffin-Lim algorithm ; domain adaptation},
  month = apr,
  note = {Accepted at ICASSP 2024},
  pdf = {https://hal.science/hal-04424100v1/file/ICASSP_2024_GLA_Grad___Preprint.pdf},
  title = {{GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model}},
  url = {https://hal.science/hal-04424100},
  year = {2024}
}
```
Diffusion models are receiving growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists of introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
Blind estimation of audio effects using an auto-encoder approach and differentiable digital signal processing
Côme Peladeau, Geoffroy Peeters
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea, April 2024.
```
@inproceedings{peladeau:hal-04539329,
  address = {Seoul, South Korea},
  author = {Peladeau, C{\^o}me and Peeters, Geoffroy},
  booktitle = {{ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP48485.2024.10448301},
  hal_id = {hal-04539329},
  hal_version = {v1},
  keywords = {Audio effects ; Differentiable digital signal processing ; Neural proxy ; Deep learning},
  month = apr,
  organization = {{IEEE}},
  pages = {856-860},
  pdf = {https://hal.science/hal-04539329v1/file/Peladeau%20-%20ICASSP2024%20-%20Hal%20Version.pdf},
  publisher = {{IEEE}},
  title = {{Blind estimation of audio effects using an auto-encoder approach and differentiable digital signal processing}},
  url = {https://hal.science/hal-04539329},
  year = {2024}
}
```
Blind Estimation of Audio Effects (BE-AFX) aims at estimating the audio effects (AFXs) applied to an original, unprocessed audio sample solely based on the processed audio sample. To train such a system, traditional approaches optimize a loss between ground truth and estimated AFX parameters. This involves knowing the exact implementation of the AFXs used for the process. In this work, we propose an alternative solution that eliminates the requirement for knowing this implementation. Instead, we introduce an auto-encoder approach, which optimizes an audio quality metric. We explore, suggest, and compare various implementations of commonly used mastering AFXs, using differentiable signal processing or neural approximations. Our findings demonstrate that our auto-encoder approach yields superior estimates of the audio quality produced by a chain of AFXs, compared to the traditional parameter-based approach, even if the latter provides a more accurate parameter estimation.
ON THE CHOICE OF THE OPTIMAL TEMPORAL SUPPORT FOR AUDIO CLASSIFICATION WITH PRE-TRAINED EMBEDDINGS
Aurian Quelennec, Michel Olvera, Geoffroy Peeters, Slim Essid
ICASSP, Seoul, South Korea, April 2024.
```
@inproceedings{quelennec:hal-04360221,
  address = {Seoul, South Korea},
  author = {Quelennec, Aurian and Olvera, Michel and Peeters, Geoffroy and Essid, Slim},
  booktitle = {{ICASSP}},
  hal_id = {hal-04360221},
  hal_version = {v1},
  keywords = {audio embeddings ; acoustic scene classification ; instrument recognition ; temporal support ; transformers ; Representation Model},
  month = apr,
  organization = {{IEEE}},
  pdf = {https://hal.science/hal-04360221v1/file/Pre_print_ICASSP_Paper.pdf},
  title = {{ON THE CHOICE OF THE OPTIMAL TEMPORAL SUPPORT FOR AUDIO CLASSIFICATION WITH PRE-TRAINED EMBEDDINGS}},
  url = {https://hal.science/hal-04360221},
  year = {2024}
}
```
Current state-of-the-art audio analysis systems rely on pretrained embedding models, often used off-the-shelf as (frozen) feature extractors. Choosing the best one for a set of tasks is the subject of many recent publications. However, one aspect often overlooked in these works is the influence of the duration of audio input considered to extract an embedding, which we refer to as Temporal Support (TS). In this work, we study the influence of the TS for well-established or emerging pre-trained embeddings, chosen to represent different types of architectures and learning paradigms. We conduct this evaluation using both musical instrument and environmental sound datasets, namely OpenMIC, TAU Urban Acoustic Scenes 2020 Mobile, and ESC-50. We especially highlight that Audio Spectrogram Transformer-based systems (PaSST and BEATs) remain effective with smaller TS, which therefore allows for a drastic reduction in memory and computational cost. Moreover, we show that by choosing the optimal TS we reach competitive results across all tasks. In particular, we improve the state-of-the-art results on OpenMIC, using BEATs and PaSST without any fine-tuning.
A fully differentiable model for unsupervised singing voice separation
Gael Richard, Pierre Chouteau, Bernardo Torres
IEEE International Conference on Acoustics, Speech, and Signal Processing, Seoul, South Korea, April 2024.
```
@inproceedings{richard:hal-04356813,
  address = {Seoul, South Korea},
  author = {Richard, Gael and Chouteau, Pierre and Torres, Bernardo},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing}},
  hal_id = {hal-04356813},
  hal_version = {v2},
  keywords = {Unsupervised source separation ; multiple singing voices ; differentiable models ; deep learning},
  month = apr,
  pdf = {https://telecom-paris.hal.science/hal-04356813v2/file/main.pdf},
  title = {{A fully differentiable model for unsupervised singing voice separation}},
  url = {https://telecom-paris.hal.science/hal-04356813},
  year = {2024}
}
```
A novel model was recently proposed by Schulze-Forster et al. in [1] for unsupervised music source separation. This model makes it possible to tackle some of the major shortcomings of existing source separation frameworks. Specifically, it eliminates the need for isolated sources during training, performs efficiently with limited data, and can handle homogeneous sources (such as singing voice). However, this model relies on an external multipitch estimator and incorporates an ad hoc voice assignment procedure. In this paper, we propose to extend this framework and to build a fully differentiable model by integrating a multipitch estimator and a novel differentiable assignment module within the core model. We show the merits of our approach through a set of experiments, and we highlight in particular its potential for processing diverse and unseen data.
A LIGHTWEIGHT DUAL-STAGE FRAMEWORK FOR PERSONALIZED SPEECH ENHANCEMENT BASED ON DEEPFILTERNET2
Thomas Serre, Mathieu Fontaine, Éric Benhaim, Geoffroy Dutour, Slim Essid
ICASSP, Seoul (Korea), South Korea, April 2024. Accepted at ....
```
@inproceedings{serre:hal-04541350,
  address = {Seoul (Korea), South Korea},
  author = {Serre, Thomas and Fontaine, Mathieu and Benhaim, {\'E}ric and Dutour, Geoffroy and Essid, Slim},
  booktitle = {{ICASSP}},
  hal_id = {hal-04541350},
  hal_version = {v1},
  keywords = {Target speech extraction ; speech enhancement ; real-time},
  month = apr,
  note = {Accepted at HSCMA24, Satellite workshop of ICASSP24},
  pdf = {https://telecom-paris.hal.science/hal-04541350v1/file/main.pdf},
  title = {{A LIGHTWEIGHT DUAL-STAGE FRAMEWORK FOR PERSONALIZED SPEECH ENHANCEMENT BASED ON DEEPFILTERNET2}},
  url = {https://telecom-paris.hal.science/hal-04541350},
  year = {2024}
}
```
Isolating the desired speaker’s voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker’s voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while preserving minimal computational overhead.
Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport
Bernardo Torres, Geoffroy Peeters, Gaël Richard
IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, South Korea, April 2024. Accepted in ....
```
@inproceedings{torres:hal-04358467,
  address = {Seoul, South Korea},
  author = {Torres, Bernardo and Peeters, Geoffroy and Richard, Ga{\"e}l},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing}},
  hal_id = {hal-04358467},
  hal_version = {v3},
  keywords = {Machine learning ; Frequency estimation ; Optimal transport ; Differentiable Digital Signal Processing ; differentiable signal processing},
  month = apr,
  note = {Accepted in ICASSP 2024},
  pdf = {https://hal.science/hal-04358467v3/file/ICASSP_2024_camera_ready_preprint_v3.pdf},
  title = {{Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport}},
  url = {https://hal.science/hal-04358467},
  year = {2024}
}
```
In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory that minimizes the displacement of spectral energy. We validate this approach through an unsupervised autoencoding task that fits a harmonic template to harmonic signals. We jointly estimate the fundamental frequency and amplitudes of harmonics using a lightweight encoder and reconstruct the signals using a differentiable harmonic synthesizer. The proposed approach offers a promising direction for improving unsupervised parameter estimation in neural audio applications.
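The intuition behind a transport-based spectral loss can be illustrated with a small example. The sketch below computes a 1-D Wasserstein-1 distance between two normalized magnitude spectra from the difference of their cumulative sums, and contrasts it with a bin-wise L2 distance. It assumes a uniform frequency grid and the W1 form; the function names are hypothetical and the exact loss used in the paper may differ.

```python
import math
from itertools import accumulate


def normalize(spectrum):
    """Scale a non-negative magnitude spectrum so that it sums to one."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have positive total energy")
    return [s / total for s in spectrum]


def spectral_w1(spec_a, spec_b, bin_width=1.0):
    """1-D Wasserstein-1 distance: area between the two cumulative spectra."""
    if len(spec_a) != len(spec_b):
        raise ValueError("spectra must share the same frequency grid")
    cdf_a = accumulate(normalize(spec_a))
    cdf_b = accumulate(normalize(spec_b))
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


def spectral_l2(spec_a, spec_b):
    """Bin-wise Euclidean distance between the normalized spectra."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(normalize(spec_a), normalize(spec_b))))


def single_peak(n_bins, peak_bin):
    """Toy spectrum with all its energy in one bin."""
    return [1.0 if k == peak_bin else 0.0 for k in range(n_bins)]


target = single_peak(8, 2)
for estimate_bin in (3, 5):
    estimate = single_peak(8, estimate_bin)
    print(estimate_bin, spectral_w1(target, estimate), spectral_l2(target, estimate))
```

In this toy case the transport distance grows with how far the peak has moved (1 bin, then 3 bins), whereas the bin-wise L2 distance is identical for both estimates and therefore gives no indication of which frequency estimate is closer.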
HI-AUDIO ONLINE PLATFORM: OPPORTUNITIES AND CHALLENGES OF COLLECTING VARIED MUSIC DATA ON THE WEB
José Manuel Gil Panal, Aurélien David, Gael Richard
International Society for Music Information Retrieval Conference (ISMIR), Late-Breaking Demo, San Francisco, CA, United States, 2024.
```
@inproceedings{gilpanal:hal-04809004,
  address = {San Francisco, CA, United States},
  author = {Gil Panal, Jos{\'e} Manuel and David, Aur{\'e}lien and Richard, Gael},
  booktitle = {{International Society for Music Information Retrieval Conference (ISMIR), Late-Breaking Demo}},
  hal_id = {hal-04809004},
  hal_version = {v1},
  pdf = {https://telecom-paris.hal.science/hal-04809004v1/file/448_lbd.pdf},
  title = {{HI-AUDIO ONLINE PLATFORM: OPPORTUNITIES AND CHALLENGES OF COLLECTING VARIED MUSIC DATA ON THE WEB}},
  url = {https://telecom-paris.hal.science/hal-04809004},
  year = {2024}
}
```
We present in this paper the extended online HI-AUDIO platform, which relies on a distributed and iterative music recording paradigm to asynchronously record musicians located at different remote sites. The major goal of this platform is to become a key enabling tool for building a large, varied, multi-genre, multi-track, multi-instrument music dataset, to be ultimately publicly distributed for MIR research purposes. We describe the main characteristics of the web platform and discuss some of the major challenges of collecting music data on the web. The platform will be demonstrated on site with local and remote access to illustrate its merits for recording collaborative compositions.

Journal Articles

An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard, Michel Olvera, Stéphane Lathuiliere, Slim Essid
Advances in Neural Information Processing Systems, October 2024.
```
@article{malard:hal-04729913,
  author = {Malard, Hugo and Olvera, Michel and Lathuiliere, St{\'e}phane and Essid, Slim},
  hal_id = {hal-04729913},
  hal_version = {v1},
  journal = {{Advances in Neural Information Processing Systems}},
  keywords = {Multimodal learning ; Audio captioning},
  month = oct,
  pdf = {https://hal.science/hal-04729913v1/file/2410.05997v1.pdf},
  publisher = {{Morgan Kaufmann Publishers}},
  title = {{An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment}},
  url = {https://hal.science/hal-04729913},
  year = {2024}
}
```
Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion while keeping the initial image captioning component unaltered. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performance in zero-shot audio captioning compared to existing approaches.
Statistical wave field theory
Roland Badeau
Journal of the Acoustical Society of America, July 2024.
```
@article{badeau:hal-04655069,
  author = {Badeau, Roland},
  doi = {10.1121/10.0027914},
  hal_id = {hal-04655069},
  hal_version = {v1},
  journal = {{Journal of the Acoustical Society of America}},
  keywords = {Statistical physics ; Wave equation ; Helmholtz equation ; reverberation},
  month = jul,
  number = {1},
  pages = {573--599},
  pdf = {https://telecom-paris.hal.science/hal-04655069v1/file/Badeau-JASA-2024-preprint.pdf},
  publisher = {{Acoustical Society of America}},
  title = {{Statistical wave field theory}},
  url = {https://telecom-paris.hal.science/hal-04655069},
  volume = {156},
  year = {2024}
}
```
In this paper, we introduce the foundations of the Statistical Wave Field Theory. This theory establishes the statistical laws of waves propagating in a closed bounded volume, that are mathematically implied by the boundary-value problem of the wave equation. These laws are derived from the Sturm-Liouville theory and the mathematical theory of dynamical billiards. They hold after many reflections on the boundary surface, and at high frequency. This is the first statistical theory of reverberation which provides the closed-form expression of the power distribution and the correlations of the wave field jointly over time, frequency, and space inside the bounded volume, in terms of the geometry and the specific admittance of its boundary surface. The Statistical Wave Field Theory may find applications in various science fields, including room acoustics, electromagnetic theory, and nuclear physics.
Electroencephalography Response during an Incremental Test According to the V̇O2max Plateau Incidence
Véronique Billat, Christian Berthomier, Michel Clémençon, Marie Brandewinder, Slim Essid, Cécilia Damon, François Rigaud, Alexis Bénichoux, Emmanuel Maby, Lesly Fornoni, Patrick Bouchet, Pascal van Beers, Bertrand Massot, Patrice Revol, Luc Poinsard, Thomas Creveaux, Christian Collet, Jérémie Mattout, Vincent Pialoux
Applied Sciences, June 2024.
```
@article{billat:hal-04688068,
  author = {Billat, V{\'e}ronique and Berthomier, Christian and Cl{\'e}men{\c c}on, Michel and Brandewinder, Marie and Essid, Slim and Damon, C{\'e}cilia and Rigaud, Fran{\c c}ois and B{\'e}nichoux, Alexis and Maby, Emmanuel and Fornoni, Lesly and Bouchet, Patrick and van Beers, Pascal and Massot, Bertrand and Revol, Patrice and Poinsard, Luc and Creveaux, Thomas and Collet, Christian and Mattout, J{\'e}r{\'e}mie and Pialoux, Vincent},
  doi = {10.3390/app14135411},
  hal_id = {hal-04688068},
  hal_version = {v1},
  journal = {{Applied Sciences}},
  keywords = {EEG ; exhausting exercise ; maximal oxygen consumption ; fatigue ; central governor ; endurance ; cycling},
  month = jun,
  number = {13},
  pages = {5411},
  pdf = {https://hal.science/hal-04688068v1/file/applsci-14-05411.pdf},
  publisher = {{Multidisciplinary digital publishing institute (MDPI)}},
  title = {{Electroencephalography Response during an Incremental Test According to the V̇O2max Plateau Incidence}},
  url = {https://hal.science/hal-04688068},
  volume = {14},
  year = {2024}
}
```
V̇O2max is recognized as a key measure in exercise physiology and sports medicine. However, only 20–50% of maximal incremental exercise tests (IET) result in a plateau of V̇O2 (V̇O2pl). To our knowledge, no study has yet examined the possible difference in brain activity during an IET, in V̇O2pl and non-plateau athletes with the same V̇O2max and age. This study aimed to shed light on the central governor hypothesis, namely that the inability to reach a V̇O2pl may be dictated by the brain rather than by a peripheral physical limit. This hypothesis can now be explored using electroencephalography (EEG) during IET, measuring concomitant power in specific frequency bands. Forty-two athletes were divided into two groups: those who practiced endurance sports and those who did not, and were asked to perform an IET. EEG signals and gas exchange were recorded. A V̇O2pl was observed in twenty-two subjects (52%). EEG power increased in all subjects during IET, except in the alpha band, which showed variability, but not significantly (64% increase, 34% decrease, p = 0.07). No differences were found between endurance athletes and non-endurance athletes, except for V̇O2max (60.10 ± 6.16 vs. 51.77 ± 6.41, p < 0.001). However, the baseline-corrected ratio of EEG power to V̇O2 was found to decrease in all subjects during IET, in the alpha, beta and theta bands. In conclusion, the presence or absence of a V̇O2pl is not related to the type of EEG response during an IET. Nevertheless, the decline in brain and V̇O2 powers/ratios in all frequency bands suggests that aerobic power may be constrained by brain mobilization.
Absorptive nature of scattering coefficients in stress-energy tensor formalism for room acoustics
Jean-Dominique Polack, Hugo Dujourdy, Roland Badeau
Journal of the Acoustical Society of America, April 2024.
```
@article{polack:hal-04548715,
  author = {Polack, Jean-Dominique and Dujourdy, Hugo and Badeau, Roland},
  doi = {10.1121/10.0025468},
  hal_id = {hal-04548715},
  hal_version = {v2},
  journal = {{Journal of the Acoustical Society of America}},
  month = apr,
  number = {4},
  pages = {2339--2346},
  pdf = {https://telecom-paris.hal.science/hal-04548715v2/file/JASAimpedance_preprint.pdf},
  publisher = {{Acoustical Society of America}},
  title = {{Absorptive nature of scattering coefficients in stress-energy tensor formalism for room acoustics}},
  url = {https://telecom-paris.hal.science/hal-04548715},
  volume = {155},
  year = {2024}
}
```
In the stress-energy tensor formalism, the symmetry between absorption and scattering coefficients, as proven by measurements combined with simulations, is counterintuitive. By introducing the wall admittance, we show that the scattering coefficient is partly created by the real part of the wall admittance combined with the active intensity, that is, is partly due to absorption. However, for curved surfaces or finite source distances, it also depends on the imaginary part of the wall admittance in combination with the reactive intensity, which confers its genuine scattering properties inversely proportional to the distances to the sources. Thus, for plane waves impinging on plane boundaries, or purely real admittances, scattering reduces to absorption.
Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization
Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Gael Richard, Florence d’Alché-Buc
IEEE/ACM Transactions on Audio, Speech and Language Processing, January 2024.
```
@article{parekh:hal-04539879,
  author = {Parekh, Jayneel and Parekh, Sanjeel and Mozharovskyi, Pavlo and Richard, Gael and d'Alch{\'e}-Buc, Florence},
  doi = {10.1109/TASLP.2024.3358049},
  hal_id = {hal-04539879},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {Audio interpretability ; Explainability ; By-design interpretable models ; Audio convolutional networks ; Non-negative matrix factorization ; Task analysis ; Dictionaries ; Spectrogram ; Training ; Time-frequency analysis ; Speech processing ; Prototypes},
  month = jan,
  pages = {1392--1405},
  pdf = {https://hal.science/hal-04539879v1/file/L2I_TASLP-4.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization}},
  url = {https://hal.science/hal-04539879},
  volume = {32},
  year = {2024}
}
```
This article tackles two major problem settings for interpretability of audio processing networks, post-hoc and by-design interpretation. For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. This is extended to present an inherently interpretable model with high performance. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, an interpreter is trained to generate a regularized intermediate embedding from hidden layers of a target network, learnt as time-activations of a pre-learnt NMF dictionary. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network’s decision. We demonstrate our method’s applicability on a variety of classification tasks, including multi-label data for real-world audio and music.
Self-Supervised Learning of Multi-level Audio Representations for Music Segmentation
Morgan Buisson, Brian Mcfee, Slim Essid, Hélène Crayencour
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2024.
```
@article{buisson:hal-04485065,
  author = {Buisson, Morgan and Mcfee, Brian and Essid, Slim and Crayencour, H{\'e}l{\`e}ne},
  doi = {10.1109/TASLP.2024.3379894},
  hal_id = {hal-04485065},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {Music structure analysis ; structural segmentation ; representation learning},
  pages = {1--13},
  pdf = {https://hal.science/hal-04485065v1/file/Buisson.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Self-Supervised Learning of Multi-level Audio Representations for Music Segmentation}},
  url = {https://hal.science/hal-04485065},
  year = {2024}
}
```
The task of music structure analysis refers to automatically identifying the location and the nature of musical sections within a song. In the supervised scenario, structural annotations generally result from exhaustive data collection processes, which represents one of the main challenges of this task. Moreover, both the subjectivity of music structure and the hierarchical characteristics it exhibits make the obtained structural annotations not fully reliable, in the sense that they do not convey a "universal ground-truth" unlike other tasks in music information retrieval. On the other hand, the quickly growing quantity of available music data has enabled weakly supervised and self-supervised approaches to achieve impressive results on a wide range of music-related problems. In this work, a self-supervised learning method is proposed to learn robust multi-level music representations prior to structural segmentation using contrastive learning. To this end, sets of frames sampled at different levels of detail are used to train a deep neural network in a disentangled manner. The proposed method is evaluated on both flat and multi-level segmentation. We show that each distinct sub-region of the output embeddings can efficiently account for structural similarity at their own targeted level of detail, which ultimately improves performance of downstream flat and multi-level segmentation. Finally, complementary experiments are carried out to study how the obtained representations can be further adapted to specific datasets using a supervised fine-tuning objective in order to facilitate structure retrieval in domains where human annotations remain scarce.
Model-Based Deep Learning for Music Information Research
Gael Richard, Vincent Lostanlen, Yi-Hsuan Yang, Meinard Müller
IEEE Signal Processing Magazine, 2024.
```
@article{richard:hal-04611461,
  author = {Richard, Gael and Lostanlen, Vincent and Yang, Yi-Hsuan and M{\"u}ller, Meinard},
  hal_id = {hal-04611461},
  hal_version = {v2},
  journal = {{IEEE Signal Processing Magazine}},
  pdf = {https://hal.science/hal-04611461v2/file/2024-Model-based%20Deep%20Learning%20for%20MIR.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Model-Based Deep Learning for Music Information Research}},
  url = {https://hal.science/hal-04611461},
  year = {2024}
}
```
In this article, we investigate the notion of model-based deep learning in the realm of music information research (MIR). Loosely speaking, we use the term model-based deep learning for approaches that combine traditional knowledge-based methods with data-driven techniques, especially those based on deep learning, within a differentiable computing framework. In music, prior knowledge, for instance related to sound production, music perception or music composition theory, can be incorporated into the design of neural networks and associated loss functions. We outline three specific scenarios to illustrate the application of model-based deep learning in MIR, demonstrating the implementation of such concepts and their potential.

Technical Reports

Degradation-Invariant Music Indexing
Rémi Mignot, Geoffroy Peeters
March 2024.
```
@techreport{mignot:hal-04486105,
  author = {Mignot, R{\'e}mi and Peeters, Geoffroy},
  hal_id = {hal-04486105},
  hal_version = {v1},
  institution = {{STMS - Sciences et Technologies de la Musique et du Son UMR 9912 IRCAM-CNRS-Sorbonne Universit{\'e}}},
  keywords = {Audio indexing ; Audio descriptors},
  month = mar,
  pdf = {https://hal.science/hal-04486105v1/file/Mignot_2024_Degr-Inv_Music_Index.pdf},
  title = {{Degradation-Invariant Music Indexing}},
  url = {https://hal.science/hal-04486105},
  year = {2024}
}
```
For music indexing that is robust to sound degradations and scalable to large music catalogs, this scientific report presents an approach based on audio descriptors relevant to the music content and invariant to sound transformations (e.g., noise addition, distortion, lossy coding, pitch/time transformations, or filtering). To achieve this, one of the key points of the proposed method is the definition of high-dimensional audio prints, which are intrinsically (by design) robust to some sound degradations. The high dimensionality of this first representation is then used to learn a linear projection to a significantly smaller sub-space, which further reduces the sensitivity to sound degradations using a series of discriminant analyses. Finally, by anchoring the analysis times on local maxima of a selected onset function, an approximate hashing is performed to provide better tolerance to bit corruptions and, at the same time, to make the method easier to scale.

2023

Conference Articles

Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis
Victor Letzelter, Mathieu Fontaine, Mickaël Chen, Patrick Pérez, Slim Essid, Gael Richard
Advances in neural information processing systems, New Orleans, United States, December 2023.
```
@inproceedings{letzelter:hal-04216055,
  address = {New Orleans, United States},
  author = {Letzelter, Victor and Fontaine, Mathieu and Chen, Micka{\"e}l and P{\'e}rez, Patrick and Essid, Slim and Richard, Gael},
  booktitle = {{Advances in neural information processing systems}},
  hal_id = {hal-04216055},
  hal_version = {v1},
  month = dec,
  pdf = {https://hal.science/hal-04216055v1/file/neurips_2023.pdf},
  title = {{Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis}},
  url = {https://hal.science/hal-04216055},
  year = {2023}
}
```
We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
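The Winner-Takes-All loss mentioned above can be sketched in a few lines: for each target, only the closest hypothesis (the winner) is penalized, so different hypotheses are free to specialize on different modes. The sketch below covers only this WTA term, not the learned scoring scheme that rMCL adds; the function names and the squared Euclidean error are assumptions made for illustration.

```python
def squared_error(hypothesis, target):
    """Squared Euclidean distance between two points."""
    return sum((h - t) ** 2 for h, t in zip(hypothesis, target))


def wta_loss(hypotheses, targets):
    """Winner-Takes-All loss for one input.

    For every target, find the closest hypothesis and accumulate only that
    hypothesis' error. Returns the mean winning error and the winner index
    chosen for each target.
    """
    if not hypotheses or not targets:
        raise ValueError("need at least one hypothesis and one target")
    total = 0.0
    winners = []
    for target in targets:
        errors = [squared_error(h, target) for h in hypotheses]
        winner = min(range(len(errors)), key=errors.__getitem__)
        winners.append(winner)
        total += errors[winner]
    return total / len(targets), winners


# Two hypotheses and two well-separated targets (e.g. two source positions).
hypotheses = [(0.0, 0.0), (10.0, 10.0)]
targets = [(1.0, 0.0), (9.0, 10.0)]
print(wta_loss(hypotheses, targets))  # (1.0, [0, 1])
```

Because each target only pulls on its own winner, averaging the two hypotheses into a single point between the targets is never rewarded, which is the diversity-preserving property the abstract refers to.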
A Repetition-based Triplet Mining Approach for Music Segmentation
Morgan Buisson, Brian Mcfee, Slim Essid, Helene-Camille Crayencour
Proceedings of ISMIR 2023, Milan, Italy, November 2023.
```
@inproceedings{buisson:hal-04202766,
  address = {Milan, Italy},
  author = {Buisson, Morgan and Mcfee, Brian and Essid, Slim and Crayencour, Helene-Camille},
  booktitle = {{Proceedings of ISMIR 2023}},
  hal_id = {hal-04202766},
  hal_version = {v1},
  keywords = {Music Structure Analysis ; Deep Learning},
  month = nov,
  pdf = {https://hal.science/hal-04202766v1/file/A%20Repetition-Based%20Triplet%20Mining%20Approach%20for%20Music%20Segmentation.pdf},
  title = {{A Repetition-based Triplet Mining Approach for Music Segmentation}},
  url = {https://hal.science/hal-04202766},
  year = {2023}
}
```
Contrastive learning has recently appeared as a well-suited method to find representations of music audio signals that are suitable for structural segmentation. However, most existing unsupervised training strategies omit the notion of repetition and therefore fail at encompassing this essential aspect of music structure. This work introduces a triplet mining method which explicitly considers repeating sequences occurring inside a music track by leveraging common audio descriptors. We study its impact on the learned representations through downstream music segmentation. Because musical repetitions can be of different natures, we give further insight on the role of the audio descriptors employed at the triplet mining stage as well as the trade-off existing between the quality of the triplets mined and the quantity of unlabelled data used for training. We observe that our method requires less non-annotated data while remaining competitive against other unsupervised methods trained on a larger corpus.
THE HI-AUDIO ONLINE PLATFORM FOR DISTRIBUTED MUSIC CROWDSOURCING DATABASE COLLECTION
Jose Manuel Gil Panal, Aurélien David, Gael Richard
Late Breaking Demo - International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, November 2023.
```
@inproceedings{gilpanal:hal-04265346,
  address = {Milan, Italy},
  author = {Gil Panal, Jose Manuel and David, Aur{\'e}lien and Richard, Gael},
  booktitle = {{Late Breaking Demo - International Society for Music Information Retrieval Conference (ISMIR)}},
  hal_id = {hal-04265346},
  hal_version = {v1},
  month = nov,
  pdf = {https://telecom-paris.hal.science/hal-04265346v1/file/Paper_template_for_ISMIR_LBD-5%20%281%29.pdf},
  title = {{THE HI-AUDIO ONLINE PLATFORM FOR DISTRIBUTED MUSIC CROWDSOURCING DATABASE COLLECTION}},
  url = {https://telecom-paris.hal.science/hal-04265346},
  year = {2023}
}
```
We present in this paper the recent development of an online platform for musicians, researchers and an open community of audio and music enthusiasts, with a view to building a public database of music recordings from a wide variety of styles and different cultures. The data generated and collected will primarily be audio data, coming from various sources, including field recordings, existing datasets, and users’ collaboration. The platform aims at gathering a distributed music crowdsourcing database collection where each music piece is built from asynchronous recordings of different tracks at remote sites. The complete tool and the generated databases will be openly distributed for research purposes.
Self-Similarity-Based and Novelty-based loss for music structure analysis
Geoffroy Peeters
Conference of the International Society for Music Information Retrieval, Milano, Italy, November 2023.
```
@inproceedings{peeters:hal-04155178,
  address = {Milano, Italy},
  author = {Peeters, Geoffroy},
  booktitle = {{Conference of the International Society for Music Information Retrieval}},
  hal_id = {hal-04155178},
  hal_version = {v1},
  keywords = {Music information retrieval ; Deep learning ; Feature learning},
  month = nov,
  title = {{Self-Similarity-Based and Novelty-based loss for music structure analysis}},
  url = {https://telecom-paris.hal.science/hal-04155178},
  year = {2023}
}
```
Music Structure Analysis (MSA) is the task of identifying the musical segments that compose a music track and possibly labeling them based on their similarity. In this paper, we propose a supervised approach for the task of music boundary detection. In our approach, we simultaneously learn features and convolution kernels. For this, we jointly optimize (i) a loss based on the Self-Similarity-Matrix (SSM) obtained with the learned features, denoted by SSM-loss, and (ii) a loss based on the novelty score obtained by applying the learned kernels to the estimated SSM, denoted by novelty-loss. We also demonstrate that relative feature learning, through self-attention, is beneficial for the task of MSA. Finally, we compare the performance of our approach to previously proposed approaches on the standard RWC-Pop and various subsets of SALAMI.
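As a rough illustration of the two quantities these losses are built on, the sketch below computes a cosine self-similarity matrix from frame-level features and a novelty score obtained by sliding a kernel along its diagonal. It uses a fixed checkerboard kernel, which is an assumption made for simplicity; in the paper both the features and the kernels are learned. The function names are hypothetical.

```python
import math


def cosine_ssm(features):
    """Self-similarity matrix: cosine similarity between every pair of frames."""
    # "or 1.0" avoids dividing by zero for all-zero frames.
    norms = [math.sqrt(sum(v * v for v in frame)) or 1.0 for frame in features]
    n = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             / (norms[i] * norms[j])
             for j in range(n)]
            for i in range(n)]


def checkerboard_novelty(ssm, half_width):
    """Novelty score for a boundary placed just before frame t.

    Similarity inside the past block and inside the future block counts
    positively; similarity across the two blocks counts negatively.
    """
    n = len(ssm)
    novelty = [0.0] * n
    for t in range(half_width, n - half_width + 1):
        past = range(t - half_width, t)
        future = range(t, t + half_width)
        within = (sum(ssm[i][j] for i in past for j in past)
                  + sum(ssm[i][j] for i in future for j in future))
        across = (sum(ssm[i][j] for i in past for j in future)
                  + sum(ssm[i][j] for i in future for j in past))
        novelty[t] = within - across
    return novelty


# Toy track: four frames of "section A" followed by four frames of "section B".
features = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
novelty = checkerboard_novelty(cosine_ssm(features), half_width=2)
print(novelty.index(max(novelty)))  # 4, the first frame of section B
```

A boundary detector would then pick peaks of this curve; a supervised setup like the one described above would instead compare the SSM and the novelty curve with annotated references.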
PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective
Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Geoffroy Peeters
International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy, November 2023.
```
@inproceedings{riou:hal-04260042,
  address = {Milan, Italy},
  author = {Riou, Alain and Lattner, Stefan and Hadjeres, Ga{\"e}tan and Peeters, Geoffroy},
  booktitle = {{International Society for Music Information Retrieval Conference (ISMIR 2023)}},
  doi = {10.48550/arXiv.2309.02265},
  hal_id = {hal-04260042},
  hal_version = {v1},
  keywords = {Audio and Speech Processing (eess.AS) ; Sound (cs.SD) ; Self-supervised learning ; Equivariance ; Pitch estimation ; F0 estimation ; Deep learning},
  month = nov,
  pdf = {https://hal.science/hal-04260042v1/file/PESTO.pdf},
  title = {{PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective}},
  url = {https://hal.science/hal-04260042},
  year = {2023}
}
```
In this paper, we address the problem of pitch estimation using Self Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight (< 30k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices. We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.
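Why a Toeplitz matrix preserves transpositions can be seen in a toy example. On a log-frequency axis such as the Constant-Q Transform, a pitch shift moves energy by a fixed number of bins, and a matrix whose entries depend only on the offset between row and column commutes with such shifts (up to edge effects). The sketch below uses a small fixed kernel in place of learned parameters; all names are hypothetical and this is not the PESTO implementation.

```python
def toeplitz_matrix(kernel, size):
    """Square Toeplitz matrix: entry (i, j) depends only on the offset i - j."""
    center = len(kernel) // 2
    matrix = []
    for i in range(size):
        row = []
        for j in range(size):
            k = i - j + center
            row.append(kernel[k] if 0 <= k < len(kernel) else 0)
        matrix.append(row)
    return matrix


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def shift(x, bins):
    """Move a vector up by `bins` >= 0 positions, zero-filling the start."""
    if bins == 0:
        return list(x)
    return [0] * bins + list(x[:-bins])


kernel = [1, 2, 3]            # stands in for the learnable parameters
T = toeplitz_matrix(kernel, 10)
x = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # toy spectrum with one active bin

# Transposing the input by 2 bins transposes the output by 2 bins.
print(matvec(T, shift(x, 2)) == shift(matvec(T, x), 2))  # True
```

The equality holds here because the active bin stays away from the edges of the vector; content shifted past the last bin would be truncated.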
Singer Identity Representation Learning using Self-Supervised Techniques
Bernardo Torres, Stefan Lattner, Gael Richard
International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy, November 2023.
```
@inproceedings{torres:hal-04186048,
  address = {Milan, Italy},
  author = {Torres, Bernardo and Lattner, Stefan and Richard, Gael},
  booktitle = {{International Society for Music Information Retrieval Conference (ISMIR 2023)}},
  hal_id = {hal-04186048},
  hal_version = {v1},
  month = nov,
  pdf = {https://telecom-paris.hal.science/hal-04186048v1/file/ISMIR_singer_id%20%2832%29.pdf},
  title = {{Singer Identity Representation Learning using Self-Supervised Techniques}},
  url = {https://telecom-paris.hal.science/hal-04186048},
  year = {2023}
}
```
Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training to ensure that the representations are invariant to pitch and content variations. We evaluate the quality of the resulting representations on singer similarity and identification tasks across multiple datasets, with a particular emphasis on out-of-domain generalization. Our proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice while operating at 44.1 kHz. We release our code and trained models to facilitate further research on singing voice and related areas.
Transfer Learning and Bias Correction with Pre-trained Audio Embeddings
Changhong Wang, Gaël Richard, Brian Mcfee
Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, November 2023.
```
@inproceedings{wang:hal-04160013,
  address = {Milan, Italy},
  author = {Wang, Changhong and Richard, Ga{\"e}l and Mcfee, Brian},
  booktitle = {{Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)}},
  hal_id = {hal-04160013},
  hal_version = {v1},
  keywords = {Pre-trained audio embeddings ; Bias correction ; Transfer learning ; Domain adaptation ; music information retrieval},
  month = nov,
  pdf = {https://hal.science/hal-04160013v1/file/ISMIR2023%20Transfer%20Learning%20and%20Bias%20Correction%20with%20Pre-trained%20Audio%20Embeddings.pdf},
  title = {{Transfer Learning and Bias Correction with Pre-trained Audio Embeddings}},
  url = {https://hal.science/hal-04160013},
  year = {2023}
}
```
Deep neural network models have become the dominant approach to a large variety of tasks within music information retrieval (MIR). These models generally require large amounts of (annotated) training data to achieve high accuracy. Because not all applications in MIR have sufficient quantities of training data, it is becoming increasingly common to transfer models across domains. This approach allows representations derived for one task to be applied to another, and can result in high accuracy with less stringent training data requirements for the downstream task. However, the properties of pre-trained audio embeddings are not fully understood. Specifically, and unlike traditionally engineered features, the representations extracted from pre-trained deep networks may embed and propagate biases from the model’s training regime. This work investigates the phenomenon of bias propagation in the context of pre-trained audio representations for the task of instrument recognition. We first demonstrate that three different pre-trained representations (VGGish, OpenL3, and YAMNet) exhibit comparable performance when constrained to a single dataset, but differ in their ability to generalize across datasets (OpenMIC and IRMAS). We then investigate dataset identity and genre distribution as potential sources of bias. Finally, we propose and evaluate post-processing countermeasures to mitigate the effects of bias, and improve generalization across datasets.
Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning
Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
WASPAA, New Paltz, NY, USA, October 2023.
```
@inproceedings{nugraha:hal-04172863,
  address = {New Paltz, NY, USA},
  author = {Nugraha, Aditya Arie and Carlo, Diego Di and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi},
  booktitle = {{WASPAA}},
  hal_id = {hal-04172863},
  hal_version = {v1},
  keywords = {Time-domain audio source separation ; Gaussian processes ; deep kernel learning},
  month = oct,
  pdf = {https://hal.science/hal-04172863v1/file/_WASPAA_23__Time_Domain_Audio_Source_Separation_Based_on_Gaussian_Processes_with_Deep_Kernel_Learning-1.pdf},
  title = {{Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning}},
  url = {https://hal.science/hal-04172863},
  year = {2023}
}
```
This paper revisits single-channel audio source separation based on a probabilistic generative model of a mixture signal defined in the continuous time domain. We assume that each source signal follows a non-stationary Gaussian process (GP), i.e., any finite set of sampled points follows a zero-mean multivariate Gaussian distribution whose covariance matrix is governed by a kernel function over time-varying latent variables. The mixture signal composed of such source signals thus follows a GP whose covariance matrix is given by the sum of the source covariance matrices. To estimate the latent variables from the mixture signal, we use a deep neural network with an encoder-separator-decoder architecture (e.g., Conv-TasNet) that separates the latent variables in a pseudo-time-frequency space. The key feature of our method is to feed the latent variables into the kernel function for estimating the source covariance matrices, instead of using the decoder for directly estimating the time-domain source signals. This enables the decomposition of a mixture signal into the source signals with a classical yet powerful Wiener filter that considers the full covariance structure over all samples. The kernel function and the network are trained jointly in the maximum likelihood framework. Comparative experiments using two-speech mixtures under clean, noisy, and noisy-reverberant conditions from the WSJ0-2mix, WHAM!, and WHAMR! benchmark datasets demonstrated that the proposed method performed well and outperformed the baseline method under noisy and noisy-reverberant conditions.
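As a rough illustration of the Wiener filter mentioned in the abstract above, the standard Gaussian conditioning result can be written as follows. The symbols ($x$ for the sampled mixture, $s_n$ for the $n$-th source, $K_n$ for its covariance matrix, $N$ for the number of sources) are notation assumed here, not taken from the paper, and the sources are assumed mutually independent.

```latex
% Mixture of N independent zero-mean Gaussian sources (assumed notation)
x = \sum_{n=1}^{N} s_n, \qquad s_n \sim \mathcal{N}(0, K_n)
\quad\Longrightarrow\quad
x \sim \mathcal{N}\Big(0, \sum_{m=1}^{N} K_m\Big)

% Posterior mean of source n given the mixture (multichannel-free Wiener filter)
\hat{s}_n = \mathbb{E}[s_n \mid x] = K_n \Big(\sum_{m=1}^{N} K_m\Big)^{-1} x
```

Because each $K_n$ is a full matrix over all samples, this estimate uses correlations between every pair of time points rather than a per-frame gain.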
Signal Inpainting from Fourier Magnitudes
Louis Bahrman, Marina Krémé, Paul Magron, Antoine Deleforge
European Signal Processing Conference (EUSIPCO), Helsinki, Finland, September 2023.
```
@inproceedings{bahrman:hal-03832480,
  address = {Helsinki, Finland},
  author = {Bahrman, Louis and Kr{\'e}m{\'e}, Marina and Magron, Paul and Deleforge, Antoine},
  booktitle = {{European Signal Processing Conference (EUSIPCO)}},
  doi = {10.23919/EUSIPCO58844.2023.10289727},
  hal_id = {hal-03832480},
  hal_version = {v3},
  keywords = {Signal inpainting ; phase retrieval ; audio restoration ; convex relaxation ; alternating minimization},
  month = sep,
  pdf = {https://hal.science/hal-03832480v3/file/main.pdf},
  title = {{Signal Inpainting from Fourier Magnitudes}},
  url = {https://hal.science/hal-03832480},
  year = {2023}
}
```
Signal inpainting is the task of restoring degraded or missing samples in a signal. In this paper we address signal inpainting when Fourier magnitudes are observed. We propose a mathematical formulation of the problem that highlights its connection with phase retrieval, and we introduce two methods for solving it. First, we derive an alternating minimization scheme, which shares similarities with the Gerchberg-Saxton algorithm, a classical phase retrieval method. Second, we propose a convex relaxation of the problem, which is inspired by recent approaches that reformulate phase retrieval into a semidefinite program. We assess the potential of these methods for the task of inpainting gaps in speech signals. Our methods exhibit both a high probability of recovering the original signals and robustness to magnitude noise.
Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli
INTERSPEECH 2023, Dublin, Ireland, August 2023.
```
@inproceedings{zaiem:hal-04216175,
  address = {Dublin, Ireland},
  author = {Zaiem, Salah and Kemiche, Youcef and Parcollet, Titouan and Essid, Slim and Ravanelli, Mirco},
  booktitle = {{INTERSPEECH 2023}},
  doi = {10.21437/Interspeech.2023-1087},
  hal_id = {hal-04216175},
  hal_version = {v1},
  keywords = {self-supervised learning ; representation learning},
  month = aug,
  pages = {2873-2877},
  pdf = {https://hal.science/hal-04216175v1/file/zaiem23b_interspeech.pdf},
  publisher = {{ISCA}},
  title = {{Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?}},
  url = {https://hal.science/hal-04216175},
  year = {2023}
}
```
Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, and while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.
Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations
Salah Zaiem, Titouan Parcollet, Slim Essid
INTERSPEECH 2023, Dublin, Ireland, August 2023.
```
@inproceedings{zaiem:hal-04216177,
  address = {Dublin, Ireland},
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim},
  booktitle = {{INTERSPEECH 2023}},
  doi = {10.21437/Interspeech.2023-1040},
  hal_id = {hal-04216177},
  hal_version = {v1},
  keywords = {self-supervised learning ; domain adaptation},
  month = aug,
  pages = {67-71},
  pdf = {https://hal.science/hal-04216177v1/file/zaiem23_interspeech.pdf},
  publisher = {{ISCA}},
  title = {{Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations}},
  url = {https://hal.science/hal-04216177},
  year = {2023}
}
```
Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models even with small annotated datasets. Despite this, speech SSL representations may fail while facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method, designed for cases exhibiting such a mismatch in acoustic domains. It consists in applying properly calibrated data augmentations on a large clean dataset, bringing it closer to the target domain, and using it as part of an initial fine-tuning stage. Augmentations are automatically selected through the minimization of a conditional-dependence estimator, based on the target dataset. The approach is validated during an oracle experiment with controlled distortions and on two amateur-collected low-resource domains, reaching better performances compared to the baselines in both cases.
Cosmopolite Sound Monitoring (CoSMo): A Study of Urban Sound Event Detection Systems Generalizing to Multiple Cities
Florian Angulo, Slim Essid, Geoffroy Peeters, Christophe Mietlicki
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, June 2023. Copyright 20....
```
@inproceedings{angulo:hal-04093374,
  address = {Rhodes Island, Greece},
  author = {Angulo, Florian and Essid, Slim and Peeters, Geoffroy and Mietlicki, Christophe},
  booktitle = {{ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP49357.2023.10095833},
  hal_id = {hal-04093374},
  hal_version = {v1},
  keywords = {Sound Event Detection (SED) ; Far-field urban audio recordings ; Urban Sound Monitoring},
  month = jun,
  note = {Copyright 2023 IEEE. Published in ICASSP 2023 -- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), scheduled for 4-9 June 2023 in Rhodes Island, Greece. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.},
  pages = {1-5},
  pdf = {https://hal.science/hal-04093374v1/file/CoSMo_camera_ready_ICASSP_Florian-1.pdf},
  publisher = {{IEEE}},
  title = {{Cosmopolite Sound Monitoring (CoSMo): A Study of Urban Sound Event Detection Systems Generalizing to Multiple Cities}},
  url = {https://hal.science/hal-04093374},
  year = {2023}
}
```
Measuring noise in cities and automatically identifying the corresponding sound sources is a crucial challenge for policymakers. Indeed, such information helps address noise pollution and improve the well-being of urban dwellers. In recent years, researchers have provided annotated datasets recorded in two major cities to foster the development of urban sound event detection (SED) systems. This paper presents an in-depth study of the behaviour of state-of-the-art SED systems well suited to our problem, combining three far-field real recordings datasets which can be used jointly during training. In our evaluation, we highlight the performance gaps existing between simple and hard recording examples based on the salience of sound events and the polyphony of the recordings. We provide new proximity annotations for this analysis. We evaluate the ability of urban SED systems to generalize across cities with varying degrees of training supervision. We show that such generalization is hindered mostly by the difficulties current urban SED systems have in detecting sound events with low salience, along with sound events in highly polyphonic soundscapes.
One-shot Unsupervised Domain Adaptation with Personalized Diffusion Models
Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière
IEEE/CVF Conference on Computer Vision and Pattern Recognition- Workshop on Generative Models for Computer Vision, Vancouver, Canada, June 2023. Proceedings ....
```
@inproceedings{benigmim:hal-04205024,
  address = {Vancouver, Canada},
  author = {Benigmim, Yasser and Roy, Subhankar and Essid, Slim and Kalogeiton, Vicky and Lathuili{\`e}re, St{\'e}phane},
  booktitle = {{IEEE/CVF Conference on Computer Vision and Pattern Recognition- Workshop on Generative Models for Computer Vision}},
  doi = {10.1109/CVPRW59228.2023.00077},
  hal_id = {hal-04205024},
  hal_version = {v1},
  keywords = {Training ; Adaptation models ; Semantics ; Data augmentation ; Data models ; Pattern recognition ; Task analysis},
  month = jun,
  note = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition- Workshop on Generative Models for Computer Vision (CVPR-W 2023)},
  pages = {698-708},
  publisher = {{IEEE}},
  series = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  title = {{One-shot Unsupervised Domain Adaptation with Personalized Diffusion Models}},
  url = {https://telecom-paris.hal.science/hal-04205024},
  year = {2023}
}
```
Adapting a segmentation model from a labeled source domain to a target domain, where a single unlabeled datum is available, is one of the most challenging problems in domain adaptation and is otherwise known as one-shot unsupervised domain adaptation (OSUDA). Most of the prior works have addressed the problem by relying on style transfer techniques, where the source images are stylized to have the appearance of the target domain. Departing from the common notion of transferring only the target “texture” information, we leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate a synthetic target dataset with photo-realistic images that not only faithfully depict the style of the target domain, but are also characterized by novel scenes in diverse contexts. The text interface in our method Data AugmenTation with diffUsion Models (DATUM) endows us with the possibility of guiding the generation of images towards desired semantic concepts while respecting the original spatial context of a single training image, which is not possible in existing OSUDA methods. Extensive experiments on standard benchmarks show that our DATUM surpasses the state-of-the-art OSUDA methods by up to +7.1%. The implementation is available at https://github.com/yasserben/DATUM
LEARNING INTERPRETABLE FILTERS IN WAV-UNET FOR SPEECH ENHANCEMENT
Félix Mathieu, Thomas Courtat, Gael Richard, Geoffroy Peeters
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes, Greece, June 2023.
```
@inproceedings{mathieu:hal-04048829,
  address = {Rhodes, Greece},
  author = {Mathieu, F{\'e}lix and Courtat, Thomas and Richard, Gael and Peeters, Geoffroy},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-04048829},
  hal_version = {v1},
  keywords = {Representation learning ; interpretability ; speech enhancement},
  month = jun,
  pdf = {https://telecom-paris.hal.science/hal-04048829v1/file/MATHIEU_ICASSP_2023-2.pdf},
  title = {{LEARNING INTERPRETABLE FILTERS IN WAV-UNET FOR SPEECH ENHANCEMENT}},
  url = {https://telecom-paris.hal.science/hal-04048829},
  year = {2023}
}
```
Owing to their performance, deep neural networks have emerged as a major method in nearly all modern audio processing applications. Deep neural networks can be used to estimate some parameters or hyperparameters of a model, or in some cases the entire model in an end-to-end fashion. Although deep learning can lead to state-of-the-art performance, such networks also suffer from inherent weaknesses, as they usually remain complex and largely non-interpretable. For instance, the internal filters used in each layer are chosen in an ad hoc manner, with only a loose relation to the nature of the processed signal. In this paper, we propose an approach to learn interpretable filters within a specific neural architecture, which allows a better understanding of the behaviour of the neural network and reduces its complexity. We validate the approach on a speech enhancement task and show that the gain in interpretability does not degrade the performance of the model.
Explainable Audio Classification of Playing Techniques with Layer-wise Relevance Propagation
Changhong Wang, Vincent Lostanlen, Mathieu Lagrange
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes, Greece, June 2023.
```
@inproceedings{wang:hal-04029145,
  address = {Rhodes, Greece},
  author = {Wang, Changhong and Lostanlen, Vincent and Lagrange, Mathieu},
  booktitle = {{2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP49357.2023.10095894},
  hal_id = {hal-04029145},
  hal_version = {v1},
  keywords = {Layer-wise relevance propagation ; scattering transform ; playing technique recognition ; music signal analysis},
  month = jun,
  pages = {1-5},
  pdf = {https://hal.science/hal-04029145v1/file/wang_ICASSP23_final.pdf},
  publisher = {{IEEE}},
  title = {{Explainable Audio Classification of Playing Techniques with Layer-wise Relevance Propagation}},
  url = {https://hal.science/hal-04029145},
  year = {2023}
}
```
Deep convolutional networks (convnets) in the time-frequency domain can learn an accurate and fine-grained categorization of sounds. For example, in the context of music signal analysis, this categorization may correspond to a taxonomy of playing techniques: vibrato, tremolo, trill, and so forth. However, convnets lack an explicit connection with the neurophysiological underpinnings of musical timbre perception. In this article, we propose a data-driven approach to explain audio classification in terms of physical attributes in sound production. We borrow from current literature in "explainable AI" (XAI) to study the predictions of a convnet which achieves an almost perfect score on a challenging task: i.e., the classification of five comparable real-world playing techniques from 30 instruments spanning seven octaves. Mapping the signal into the carrier-modulation domain using the scattering transform, we decompose the networks’ predictions over this domain with layer-wise relevance propagation. We find that regions highly relevant to the predictions are localized around the physical attributes with which the playing techniques are performed.
Fine-tuning strategies for faster inference using speech self-supervised models: a comparative study
Salah Zaiem, Robin Algayres, Titouan Parcollet, Slim Essid, Mirco Ravanelli
ICASSP 2023 - International Conference on Acoustics, Speech, and Signal Processing, Rhodes, Greece, June 2023.
```
@inproceedings{zaiem:hal-04076307,
  address = {Rhodes, Greece},
  author = {Zaiem, Salah and Algayres, Robin and Parcollet, Titouan and Essid, Slim and Ravanelli, Mirco},
  booktitle = {{ICASSP 2023 - International Conference on Acoustics, Speech, and Signal Processing}},
  hal_id = {hal-04076307},
  hal_version = {v1},
  keywords = {Speech recognition ; Self-supervised learning},
  month = jun,
  pdf = {https://hal.science/hal-04076307v1/file/2303.06740.pdf},
  title = {{Fine-tuning strategies for faster inference using speech self-supervised models: a comparative study}},
  url = {https://hal.science/hal-04076307},
  year = {2023}
}
```
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance may come at the cost of longer inference times. This article explores different approaches that may be deployed during fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inference. We adapt a number of existing techniques to common ASR settings and benchmark them, reporting performance drops and gains in inference time. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with a WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.

Journal Articles

Audio Signal Processing in the 21st Century
Gaël Richard, Paris Smaragdis, Sharon Gannot, Patrick A Naylor, Shoji Makino, Walter Kellermann, Akihiko Sugiyama
IEEE Signal Processing Magazine, July 2023.
```
@article{richard:hal-04112575,
  author = {Richard, Ga{\"e}l and Smaragdis, Paris and Gannot, Sharon and Naylor, Patrick A and Makino, Shoji and Kellermann, Walter and Sugiyama, Akihiko},
  doi = {10.1109/MSP.2023.3276171},
  hal_id = {hal-04112575},
  hal_version = {v1},
  journal = {{IEEE Signal Processing Magazine}},
  month = jul,
  pdf = {https://telecom-paris.hal.science/hal-04112575v1/file/MSP3276171-2.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Audio Signal Processing in the 21st Century}},
  url = {https://telecom-paris.hal.science/hal-04112575},
  year = {2023}
}
```
Hi! PARIS: IA et Sciences des données pour la société
Gael Richard, Nicolas Vieille, Eric Moulines
Télécom : revue de l’Association Amicale des ingénieurs de l’Ecole Nationale Supérieure des télécommunications, June 2023.
```
@article{richard:hal-04168456,
  author = {Richard, Gael and Vieille, Nicolas and Moulines, Eric},
  hal_id = {hal-04168456},
  hal_version = {v1},
  journal = {{T{\'e}l{\'e}com : revue de l'Association Amicale des ing{\'e}nieurs de l'Ecole Nationale Sup{\'e}rieure des t{\'e}l{\'e}communications}},
  month = jun,
  number = {\#209},
  title = {{Hi! PARIS: IA et Sciences des donn{\'e}es pour la soci{\'e}t{\'e}}},
  url = {https://hal.science/hal-04168456},
  year = {2023}
}
```
Unsupervised Music Source Separation Using Differentiable Parametric Source Models
Kilian Schulze-Forster, Gaël Richard, Liam Kelley, Clement Doire, Roland Badeau
IEEE/ACM Transactions on Audio, Speech and Language Processing, March 2023.
```
@article{schulzeforster:hal-04038023,
  author = {Schulze-Forster, Kilian and Richard, Ga{\"e}l and Kelley, Liam and Doire, Clement and Badeau, Roland},
  doi = {10.1109/TASLP.2023.3252272},
  hal_id = {hal-04038023},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  month = mar,
  pages = {1276-1289},
  pdf = {https://telecom-paris.hal.science/hal-04038023v1/file/Unsupervised_Music_Source_Separation_Using_Differentiable_Parametric_Source_Models-3.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Unsupervised Music Source Separation Using Differentiable Parametric Source Models}},
  url = {https://telecom-paris.hal.science/hal-04038023},
  volume = {31},
  year = {2023}
}
```
Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating the source models’ parameters given their fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods based on nonnegative matrix factorization and a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep learning based separation usable in scenarios where training data with ground truth is expensive or nonexistent.

2022

Conference Articles

Learning Multi-Level Representations for Hierarchical Music Structure Analysis
Morgan Buisson, Brian Mcfee, Slim Essid, Helene-Camille Crayencour
Proceedings of ISMIR 2022, Bengaluru, India, December 2022.
```
@inproceedings{buisson:hal-03780032,
  address = {Bengaluru, India},
  author = {Buisson, Morgan and Mcfee, Brian and Essid, Slim and Crayencour, Helene-Camille},
  booktitle = {{Proceedings of ISMIR 2022}},
  hal_id = {hal-03780032},
  hal_version = {v1},
  month = dec,
  pdf = {https://hal.science/hal-03780032v1/file/Morgan_Buisson_ismir.pdf},
  title = {{Learning Multi-Level Representations for Hierarchical Music Structure Analysis}},
  url = {https://hal.science/hal-03780032},
  year = {2022}
}
```
Recent work in music structure analysis has shown the potential of deep features to highlight the underlying structure of music audio signals. Despite promising results achieved by such representations, dealing with the inherent hierarchical aspect of music structure remains a challenging problem. Because different levels of segmentation can be considered as equally valid, specifically designed representations should be optimized to improve hierarchical structure analysis. In this work, unsupervised learning of such representations using a contrastive approach operating at different timescales is explored. The proposed system is evaluated on flat and multi-level music segmentation. By leveraging both time and the hierarchical organization of music structure, we show that the obtained deep embeddings can encode meaningful patterns and improve segmentation at various levels of granularity.
Exploiting device and audio data to tag music with User-Aware listening contexts
Karim M Ibrahim, Elena V. Epure, Geoffroy Peeters, Gael Richard
International Society for Music Information Retrieval Conference (ISMIR 2022), Bengaluru, India, December 2022.
```
@inproceedings{ibrahim:hal-03903647,
  address = {Bengaluru, India},
  author = {Ibrahim, Karim M and Epure, Elena V. and Peeters, Geoffroy and Richard, Gael},
  booktitle = {{International Society for Music Information Retrieval Conference (ISMIR 2022)}},
  hal_id = {hal-03903647},
  hal_version = {v1},
  month = dec,
  pdf = {https://telecom-paris.hal.science/hal-03903647v1/file/000021-1.pdf},
  title = {{Exploiting device and audio data to tag music with User-Aware listening contexts}},
  url = {https://telecom-paris.hal.science/hal-03903647},
  year = {2022}
}
```
As music has become more available, especially on music streaming platforms, people have started to have distinct preferences to fit their varying listening situations, also known as context. Hence, there has been a growing interest in considering the user’s situation when recommending music to users. Previous works have proposed user-aware autotaggers to infer situation-related tags from music content and user’s global listening preferences. However, in a practical music retrieval system, the autotagger could only be used by assuming that the context class is explicitly provided by the user. In this work, for designing a fully automatised music retrieval system, we propose to disambiguate the user’s listening information from their stream data. Namely, we propose a system which can generate a situational playlist for a user at a certain time 1) by leveraging user-aware music autotaggers, and 2) by automatically inferring the user’s situation from stream data (e.g. device, network) and user’s general profile information (e.g. age). Experiments show that such a context-aware personalized music retrieval system is feasible, but the performance decreases in the case of new users, new tracks or when the number of context classes increases.
SSM-NET: FEATURE LEARNING FOR MUSIC STRUCTURE ANALYSIS USING A SELF-SIMILARITY-MATRIX BASED LOSS
Geoffroy Peeters, Florian Angulo
Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval), Bengaluru, India, December 2022.
```
@inproceedings{peeters:hal-03860497,
  address = {Bengaluru, India},
  author = {Peeters, Geoffroy and Angulo, Florian},
  booktitle = {{Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval)}},
  hal_id = {hal-03860497},
  hal_version = {v1},
  month = dec,
  pdf = {https://telecom-paris.hal.science/hal-03860497v1/file/ISMIR2022_lbd_5.pdf},
  title = {{SSM-NET: FEATURE LEARNING FOR MUSIC STRUCTURE ANALYSIS USING A SELF-SIMILARITY-MATRIX BASED LOSS}},
  url = {https://telecom-paris.hal.science/hal-03860497},
  year = {2022}
}
```
In this paper, we propose a new paradigm to learn audio features for Music Structure Analysis (MSA). We train a deep encoder to learn features such that the Self-Similarity-Matrix (SSM) resulting from those approximates a ground-truth SSM. This is done by minimizing a loss between both SSMs. Since this loss is differentiable w.r.t. its input features, we can train the encoder in a straightforward way. We successfully demonstrate the use of this training paradigm using the Area Under the ROC Curve (AUC) on the RWC-Pop dataset.
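The idea of a loss between an estimated and a ground-truth SSM can be sketched as below. This is only an illustration under stated assumptions: the cosine similarity rescaled to [0, 1], the mean-squared-error loss, and the helper names `cosine`, `self_similarity` and `ssm_loss` are all hypothetical choices, not the paper's actual formulation, and a real training setup would compute the same quantity in an automatic-differentiation framework.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / max(nu * nv, 1e-12)

def self_similarity(features):
    """Self-similarity matrix of a list of frame features, rescaled to [0, 1]."""
    return [[0.5 * (cosine(u, v) + 1.0) for v in features] for u in features]

def ssm_loss(features, target_ssm):
    """Mean squared error between estimated and ground-truth SSMs (assumed loss)."""
    est = self_similarity(features)
    n = len(features)
    total = sum((est[i][j] - target_ssm[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)

# Toy example: four frames with segment labels A A B B.
labels = [0, 0, 1, 1]
target = [[1.0 if a == b else 0.0 for b in labels] for a in labels]

# Features that follow the segment structure, and features that do not.
good = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
bad = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

print(ssm_loss(good, target))
print(ssm_loss(bad, target))
```

Features that group frames the same way as the annotation give a lower loss than features that group them differently, which is the signal an encoder would be trained on.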
Latent and Adversarial Data Augmentation for Sound Event Detection and Classification
David Perera, Slim Essid, Gaël Richard
International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nancy, France, November 2022.
```
@inproceedings{perera:hal-03782827,
  address = {Nancy, France},
  author = {Perera, David and Essid, Slim and Richard, Ga{\"e}l},
  booktitle = {{International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)}},
  hal_id = {hal-03782827},
  hal_version = {v1},
  keywords = {sound event detection ; data augmentation ; adversarial learning},
  month = nov,
  pdf = {https://hal.science/hal-03782827v1/file/dcase.pdf},
  title = {{Latent and Adversarial Data Augmentation for Sound Event Detection and Classification}},
  url = {https://hal.science/hal-03782827},
  year = {2022}
}
```
Invariance-based learning is a promising approach in deep learning. Among other benefits, it can mitigate the lack of diversity of available datasets and increase the interpretability of trained models. To this end, practitioners often use a consistency cost penalizing the sensitivity of a model to a set of carefully selected data augmentations. However, there is no consensus about how these augmentations should be selected. In this paper, we study the behavior of several augmentation strategies. We consider the task of sound event detection and classification for our experiments. In particular, we show that transformations operating on the internal layers of a deep neural network are beneficial for this task.
The absorptive nature of the scattering coefficient in the stress-energy tensor formalism for room acoustics
Jean-Dominique Polack, Aidan Meacham, Roland Badeau
24th international congress on acoustics (ICA 2022), Gyeongju, South Korea, October 2022.
```
@inproceedings{polack:hal-03671851,
  address = {Gyeongju, South Korea},
  author = {Polack, Jean-Dominique and Meacham, Aidan and Badeau, Roland},
  booktitle = {{24th international congress on acoustics (ICA 2022)}},
  hal_id = {hal-03671851},
  hal_version = {v1},
  month = oct,
  pdf = {https://telecom-paris.hal.science/hal-03671851v1/file/Polack-ICA-2022.pdf},
  title = {{The absorptive nature of the scattering coefficient in the stress-energy tensor formalism for room acoustics}},
  url = {https://telecom-paris.hal.science/hal-03671851},
  year = {2022}
}
```
In the stress-energy tensor formalism, the symmetry between absorption and scattering coefficients, as proven by measurements combined with simulations, is counter-intuitive. By introducing the wall admittance, we show that the scattering coefficient is partly created by the real part of the wall admittance combined with the active intensity, that is, it is partly due to absorption. However, it also depends on the imaginary part of the wall admittance in combination with the reactive intensity, which gives it genuine scattering properties. In the case of plane waves impinging on a planar boundary, the admittance formalism shows that the reactive intensity vanishes in directions parallel to the wall; when the source is at a finite distance from the wall, a residual reactive intensity remains. However, for curved boundaries, the velocity in directions parallel to the wall is no longer proportional to the pressure, and scattering occurs.
Scattering at the angles of polyhedral rooms: application of stress-energy tensor conservation in Riemannian spaces
Jean-Dominique Polack, Aidan Meacham, Roland Badeau, Jean-Christophe Valière
24th international congress on acoustics, Gyeongju, South Korea, October 2022.
```
@inproceedings{polack:hal-03671852,
  address = {Gyeongju, South Korea},
  author = {Polack, Jean-Dominique and Meacham, Aidan and Badeau, Roland and Vali{\`e}re, Jean-Christophe},
  booktitle = {{24th international congress on acoustics}},
  hal_id = {hal-03671852},
  hal_version = {v1},
  keywords = {Riemannian geometry ; Polyhedral rooms ; Scattering ; Stress-energy tensor},
  month = oct,
  organization = {{International Commission for Acoustics (ICA) and The Acoustical Society of Korea (ASK)}},
  pages = {1-9},
  pdf = {https://telecom-paris.hal.science/hal-03671852v1/file/Polack-ICA-2022.pdf},
  title = {{Scattering at the angles of polyhedral rooms: application of stress-energy tensor conservation in Riemannian spaces}},
  url = {https://telecom-paris.hal.science/hal-03671852},
  year = {2022}
}
```
Riemannian spaces with negative curvature constitute the proper setting for the distribution of images created by irregular polyhedral rooms with obtuse angles. The crucial parameter is the excess angle that arises around specific edges, called hinges, when first and second order images are considered, as it governs the metric tensor of the space and all its geometrical properties. Using these geometrical properties, and complementing them with the uncertainty principle, we describe the scattering of wave packets around dihedral angles: it is proportional to the excess angle, and is best described in terms of the conservation of the stress-energy tensor. The basic elements for computing the scattering are given.
Apprentissage de bancs de filtres pour la séparation aveugle de sources sonores
Félix Mathieu, Thomas Courtat, Gaël Richard, Geoffroy Peeters
Colloque Francophone de Traitement du Signal et des Images (GRETSI), Nancy, France, September 2022.
```
@inproceedings{mathieu:hal-03759647,
  address = {Nancy, France},
  author = {Mathieu, F{\'e}lix and Courtat, Thomas and Richard, Ga{\"e}l and Peeters, Geoffroy},
  booktitle = {{Colloque Francophone de Traitement du Signal et des Images (GRETSI)}},
  hal_id = {hal-03759647},
  hal_version = {v1},
  month = sep,
  pdf = {https://telecom-paris.hal.science/hal-03759647v1/file/mathieu818-1.pdf},
  title = {{Apprentissage de bancs de filtres pour la s{\'e}paration aveugle de sources sonores}},
  url = {https://telecom-paris.hal.science/hal-03759647},
  year = {2022}
}
```
Parameterized audio encoders have proven to be a promising way to improve the interpretability and performance of end-to-end source separation models. We present properties of interest that are needed to learn the filters of these encoders, and propose a parameterization to constrain these filters. Based on the Hilbert transform and the Bedrosian theorem, we propose to build a set of phase-shifted filters by modulating sinusoids with freely learned low-pass filters. These filters provide invariance to time shifts and phase shifts while avoiding the use of complex-valued neural networks, thanks to an over-parameterization trick applied to the phase of a given waveform.
Impact de perturbations internes sur l’entraînement de réseaux profonds pour la détection d’évènements sonores
David Perera, Slim Essid, Gaël Richard
Colloque Francophone de Traitement du Signal et des Images (GRETSI), Nancy, France, September 2022.
```
@inproceedings{perera:hal-03759651,
  address = {Nancy, France},
  author = {Perera, David and Essid, Slim and Richard, Ga{\"e}l},
  booktitle = {{Colloque Francophone de Traitement du Signal et des Images (GRETSI)}},
  hal_id = {hal-03759651},
  hal_version = {v1},
  month = sep,
  pdf = {https://telecom-paris.hal.science/hal-03759651v1/file/perera927.pdf},
  title = {{Impact de perturbations internes sur l'entra{\^i}nement de r{\'e}seaux profonds pour la d{\'e}tection d'{\'e}v{\`e}nements sonores}},
  url = {https://telecom-paris.hal.science/hal-03759651},
  year = {2022}
}
```
Invariance learning is a promising training method for deep neural networks, since it both mitigates the lack of diversity of available datasets and makes trained models more interpretable. In practice, invariance learning often relies on data augmentations and consistency costs that penalize a model's sensitivity to these augmentations. However, there is no consensus on how to select these augmentations for a target task. This paper studies the impact of several types of augmentations on the training of a state-of-the-art model for sound event detection and classification. In particular, we show that perturbing the internal representations of a deep neural network is beneficial for this task.
Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning
Salah Zaiem, Titouan Parcollet, Slim Essid
Interspeech 2022, Incheon, South Korea, September 2022.
```
@inproceedings{zaiem:hal-03817736,
  address = {Incheon, South Korea},
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim},
  booktitle = {{Interspeech 2022}},
  doi = {10.21437/interspeech.2022-10191},
  hal_id = {hal-03817736},
  hal_version = {v1},
  month = sep,
  pages = {669-673},
  pdf = {https://hal.science/hal-03817736v1/file/IS2022%20%2814%29.pdf},
  publisher = {{ISCA}},
  title = {{Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning}},
  url = {https://hal.science/hal-03817736},
  year = {2022}
}
```
Contrastive learning enables learning useful audio and speech representations without ground-truth labels by maximizing the similarity between latent representations of similar signal segments. In this framework, various data augmentation techniques are usually exploited to help enforce desired invariances within the learned representations, improving performance on various audio tasks thanks to more robust embeddings. Selecting the most relevant augmentations has proven crucial for better downstream performance. Thus, this work introduces a conditional independence-based method that automatically selects a suitable distribution over the choice of augmentations and their parametrization from a set of predefined ones, for contrastive self-supervised pre-training. This is performed with respect to a downstream task of interest, hence saving a costly hyper-parameter search. Experiments performed on two different downstream tasks validate the proposed approach, showing better results than experimenting without augmentation or with baseline augmentations. We furthermore conduct a qualitative analysis of the automatically selected augmentations and their variation according to the considered final downstream dataset.
Elliptically Contoured Alpha-Stable Representation for MUSIC-Based Sound Source Localization
Mathieu Fontaine, Diego Di Carlo, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, August 2022.
```
@inproceedings{fontaine:hal-03608767,
  address = {Belgrade, Serbia},
  author = {Fontaine, Mathieu and Di Carlo, Diego and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi},
  booktitle = {{2022 30th European Signal Processing Conference (EUSIPCO)}},
  doi = {10.23919/EUSIPCO55093.2022.9909944},
  hal_id = {hal-03608767},
  hal_version = {v1},
  keywords = {sound source localization ; MUSIC ; $\alpha$-stable theory ; covariation},
  month = aug,
  pages = {26-30},
  pdf = {https://hal.science/hal-03608767v1/file/EUSIPCO_22_Mathieu_alphaMUSIC.pdf},
  publisher = {{IEEE}},
  title = {{Elliptically Contoured Alpha-Stable Representation for MUSIC-Based Sound Source Localization}},
  url = {https://hal.science/hal-03608767},
  year = {2022}
}
```
This paper introduces a theoretically rigorous sound source localization (SSL) method based on a robust extension of the classical multiple signal classification (MUSIC) algorithm. The original SSL method estimates the noise eigenvectors and the MUSIC spectrum by computing the spatial covariance matrix of the observed multichannel signal and then detects the peaks from the spectrum. In this work, the covariance matrix is replaced with the positive definite shape matrix originating from the elliptically contoured α-stable model, which is more suitable under real noisy, highly reverberant conditions. Evaluation on synthetic data shows that the proposed method outperforms baseline methods under such adverse conditions, while it is comparable on real data recorded in a mild acoustic condition.
FVTD simulation of the acoustics of the Phonocamptic Cave in Noyon
Hugo Duval, Antoine Thomas, Aidan Meacham, Roland Badeau, Jean-Christophe Valière, Jean-Dominique Polack
The Acoustics of Ancient Theatres, Verona, Italy, July 2022.
```
@inproceedings{duval:hal-03670586,
  address = {Verona, Italy},
  author = {Duval, Hugo and Thomas, Antoine and Meacham, Aidan and Badeau, Roland and Vali{\`e}re, Jean-Christophe and Polack, Jean-Dominique},
  booktitle = {{The Acoustics of Ancient Theatres}},
  hal_id = {hal-03670586},
  hal_version = {v1},
  month = jul,
  pdf = {https://telecom-paris.hal.science/hal-03670586v1/file/AAT_Verona_2020_FVTD.pdf},
  title = {{FVTD simulation of the acoustics of the Phonocamptic Cave in Noyon}},
  url = {https://telecom-paris.hal.science/hal-03670586},
  year = {2022}
}
```
Starting from new measurements of the acoustical pots and room geometry in the phonocamptic cave at the Cathedral of Noyon, a numerical study was undertaken to understand the acoustical effects at the boundaries, and to provide an auralization of the space. An implementation of the finite volume time domain (FVTD) method was used to model the cave, including fitting the impedance presented by the acoustical pots on certain boundaries. The individual impedances of the pots were estimated from impulse responses collected pot-by-pot and parameterized in terms of a Helmholtz resonator model. Then, using the electroacoustic analogy, the sum effect of the pots was modeled as an equivalent spatial distribution in the FVTD boundary conditions. Additionally, the space was discretized with an unstructured mesh in order to capture the complex geometry, minimize dispersion error, and to check the accuracy of the FVTD implementation.
Adapting the EST method to ancient theatres: a proposal
Jean-Dominique Polack, Aidan Meacham, Roland Badeau, Jean-Christophe Valière
The Acoustics of Ancient Theatres, Verona, Italy, July 2022.
```
@inproceedings{polack:hal-03670577,
  address = {Verona, Italy},
  author = {Polack, Jean-Dominique and Meacham, Aidan and Badeau, Roland and Vali{\`e}re, Jean-Christophe},
  booktitle = {{The Acoustics of Ancient Theatres}},
  hal_id = {hal-03670577},
  hal_version = {v1},
  month = jul,
  pdf = {https://telecom-paris.hal.science/hal-03670577v1/file/EST_for_ancient_theatres3.pdf},
  title = {{Adapting the EST method to ancient theatres: a proposal}},
  url = {https://telecom-paris.hal.science/hal-03670577},
  year = {2022}
}
```
The paper investigates under which assumptions the EST method, initially developed for modelling the propagation of acoustical energy in flat spaces such as hallways and open-space offices, can be adapted to unbounded spaces such as ancient theatres. It turns out that it mainly requires that the air column above any position in the open theatre contains finite acoustical energy, whatever its height. This is indeed the case since at high altitudes above the theatre, energy decreases with the square of the height due to the increasingly accurate assimilation of the theatre to a point source. In other words, one must use high enough elements, so that the intensity on the top of the elements can be considered as negligible, leading to negligible absorption and scattering on the top boundary. Therefore, one only needs to consider absorption and scattering at the bottom boundary of the elements, and the integration on the elements must be revisited to account for the decrease of intensity with altitude. The corresponding bi-dimensional equations will be presented and solved for a variety of absorption and scattering coefficients on the surface of the theatre, and compared to measurements in an actual theatre.
Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms
Milad Sefidgaran, Amin Gohari, Gaël Richard, Umut Şimşekli
COLT 2022 - 35th Annual Conference on Learning Theory, London, United Kingdom, July 2022.
```
@inproceedings{sefidgaran:hal-03759597,
  address = {London, United Kingdom},
  author = {Sefidgaran, Milad and Gohari, Amin and Richard, Ga{\"e}l and {\c S}im{\c s}ekli, Umut},
  booktitle = {{COLT 2022 - 35th Annual Conference on Learning Theory}},
  hal_id = {hal-03759597},
  hal_version = {v1},
  keywords = {Generalization error ; Rate-distortion theory ; Source coding},
  month = jul,
  pdf = {https://telecom-paris.hal.science/hal-03759597v1/file/sefidgaran22a.pdf},
  series = {Proceedings of Machine Learning Research},
  title = {{Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms}},
  url = {https://telecom-paris.hal.science/hal-03759597},
  volume = {178},
  year = {2022}
}
```
Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the 'compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the 'lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a 'lossy compression' scheme allows us to link generalization to the rate-distortion dimension, a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.
Opinions in Interactions : New Annotations of the SEMAINE Database
Valentin Barrière, Chloé Clavel, Slim Essid
LREC, Marseille, France, June 2022.
```
@inproceedings{barriere:hal-04276012,
  address = {Marseille, France},
  author = {Barri{\`e}re, Valentin and Clavel, Chlo{\'e} and Essid, Slim},
  booktitle = {{LREC}},
  hal_id = {hal-04276012},
  hal_version = {v1},
  keywords = {Opinion ; Multimodal Machine Learning ; Interactions},
  month = jun,
  pdf = {https://hal.science/hal-04276012v1/file/2022.lrec-1.762.pdf},
  title = {{Opinions in Interactions : New Annotations of the SEMAINE Database}},
  url = {https://hal.science/hal-04276012},
  year = {2022}
}
```
In this paper, we present the process we used to collect new annotations of opinions over the multimodal corpus SEMAINE, composed of dyadic interactions. The dataset had already been annotated continuously in two affective dimensions related to the emotions: Valence and Arousal. We annotated the part of SEMAINE called Solid SAL, composed of 79 interactions between a user and an operator playing the role of a virtual agent designed to engage a person in a sustained, emotionally colored conversation. We aligned the audio at the word level using the available high-quality manual transcriptions. The annotated dataset contains 5,627 speech turns for a total of 73,944 words, corresponding to 6 hours 20 minutes of dyadic interactions. Each interaction has been labeled by three annotators at the speech turn level following a three-step process. This method allows us to obtain a precise annotation regarding the opinion of a speaker. We thus obtain a dataset dense in opinions, with more than 48% of the annotated speech turns containing at least one opinion. We then propose a new baseline for the detection of opinions in interactions, slightly improving a state-of-the-art model with RoBERTa embeddings. The results obtained on the database are promising, with an F1-score of 0.72.
END-TO-END SPEECH RECOGNITION FROM FEDERATED ACOUSTIC MODELS
Yan Gao, Titouan Parcollet, Salah Zaiem, Javier Fernandez-Marques, Pedro de Gusmao, Daniel Beutel, Nicholas Lane
The International Conference on Acoustics, Speech, & Signal Processing (ICASSP), Singapore, Singapore, May 2022.
```
@inproceedings{gao:hal-03601224,
  address = {Singapore, Singapore},
  author = {Gao, Yan and Parcollet, Titouan and Zaiem, Salah and Fernandez-Marques, Javier and de Gusmao, Pedro and Beutel, Daniel and Lane, Nicholas},
  booktitle = {{The International Conference on Acoustics, Speech, \& Signal Processing (ICASSP)}},
  hal_id = {hal-03601224},
  hal_version = {v1},
  month = may,
  pdf = {https://hal.science/hal-03601224/file/2104.14297.pdf},
  title = {{END-TO-END SPEECH RECOGNITION FROM FEDERATED ACOUSTIC MODELS}},
  url = {https://hal.science/hal-03601224},
  year = {2022}
}
```
Training Automatic Speech Recognition (ASR) models under federated learning (FL) settings has attracted a lot of attention recently. However, the FL scenarios often presented in the literature are artificial and fail to capture the complexity of real FL systems. In this paper, we construct a challenging and realistic ASR federated experimental setup consisting of clients with heterogeneous data distributions using the French and Italian sets of the CommonVoice dataset, a large heterogeneous dataset containing thousands of different speakers, acoustic environments and noises. We present the first empirical study on an attention-based sequence-to-sequence End-to-End (E2E) ASR model with three aggregation weighting strategies (standard FedAvg, loss-based aggregation and a novel word error rate (WER)-based aggregation), compared in two realistic FL scenarios: cross-silo with 10 clients and cross-device with 2K and 4K clients. Our analysis on E2E ASR from heterogeneous and realistic federated acoustic models provides the foundations for future research and development of realistic FL-based ASR applications.
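The aggregation weighting strategies mentioned above all amount to a weighted average of client model parameters. The sketch below shows that general pattern with plain Python lists. The sample-count weighting is the standard FedAvg rule; `wer_weights`, which uses `1 - WER` as the weight, is a hypothetical stand-in chosen for illustration, since the abstract does not give the paper's actual WER-based formula. All function names are assumptions of this example.

```python
def aggregate(client_params, weights):
    """Weighted average of client parameter vectors (lists of floats).

    Assumes at least one weight is strictly positive.
    """
    total = sum(weights)
    dim = len(client_params[0])
    return [sum(w * p[k] for w, p in zip(weights, client_params)) / total
            for k in range(dim)]

def fedavg_weights(num_samples):
    """Standard FedAvg: weight each client by its number of training samples."""
    return [float(n) for n in num_samples]

def wer_weights(wers):
    """Hypothetical WER-based weighting: a lower WER gives a larger weight."""
    return [max(0.0, 1.0 - w) for w in wers]

# Two clients, each holding a two-parameter "model".
clients = [[1.0, 2.0], [3.0, 4.0]]
global_fedavg = aggregate(clients, fedavg_weights([100, 300]))
global_wer = aggregate(clients, wer_weights([0.5, 0.75]))

print(global_fedavg, global_wer)
```

Keeping the weighting rule separate from the averaging step makes it easy to swap strategies, for example a loss-based rule, without touching the aggregation itself.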
PHASE SHIFTED BEDROSIAN FILTERBANK: AN INTERPRETABLE AUDIO FRONT-END FOR TIME-DOMAIN AUDIO SOURCE SEPARATION
Félix Mathieu, Thomas Courtat, Gaël Richard, Geoffroy Peeters
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, May 2022.
```
@inproceedings{mathieu:hal-03708610,
  address = {Singapore, Singapore},
  author = {Mathieu, F{\'e}lix and Courtat, Thomas and Richard, Ga{\"e}l and Peeters, Geoffroy},
  booktitle = {{ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP43922.2022.9746122},
  hal_id = {hal-03708610},
  hal_version = {v1},
  month = may,
  pdf = {https://hal.science/hal-03708610v1/file/Mathieu.pdf},
  title = {{PHASE SHIFTED BEDROSIAN FILTERBANK: AN INTERPRETABLE AUDIO FRONT-END FOR TIME-DOMAIN AUDIO SOURCE SEPARATION}},
  url = {https://hal.science/hal-03708610},
  year = {2022}
}
```
The use of parameterized encoders, or audio front-ends, has shown promise in improving the interpretability of time-domain single-channel source separation models such as Conv-TasNet. This type of filter also allows a potential reduction of the computational cost, since larger encoder filters can be used. In this work, we propose a new parameterization of such an encoder filterbank, which improves interpretability while keeping flexibility. Based on the Hilbert transform and the Bedrosian theorem, we propose to build a phase-shifted set of filters by modulating sinusoids through freely learned low-pass filters. We show that these filters maintain the same performance when small filters are used and even improve it when large filters are used.
Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation
Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii
2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), Singapore, Singapore, May 2022.
```
@inproceedings{nugraha:hal-03637425,
  address = {Singapore, Singapore},
  author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi},
  booktitle = {{2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022)}},
  hal_id = {hal-03637425},
  hal_version = {v1},
  keywords = {normalizing flow ; multichannel nonnegative matrix factorization ; joint diagonalization ; blind source separation},
  month = may,
  pdf = {https://telecom-paris.hal.science/hal-03637425v1/file/_ICASSP_22__NF_FastMNMF.pdf},
  title = {{Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation}},
  url = {https://telecom-paris.hal.science/hal-03637425},
  year = {2022}
}
```
This paper describes a blind source separation method for multichannel audio signals, called NF-FastMNMF, based on the integration of the normalizing flow (NF) into the multichannel nonnegative matrix factorization with jointly-diagonalizable spatial covariance matrices, a.k.a. FastMNMF. Whereas the NF of flow-based independent vector analysis, called NF-IVA, acts as the demixing matrices to transform an M-channel mixture into M independent sources, the NF of NF-FastMNMF acts as the diagonalization matrices to transform an M-channel mixture into a spatially-independent M-channel mixture represented as a weighted sum of N source images. This diagonalization enables the NF, which has been used only for determined separation because of its bijective nature, to be applicable to non-determined separation. NF-FastMNMF has time-varying diagonalization matrices that are potentially better at handling dynamical data variation than the time-invariant ones in FastMNMF. To have an NF with richer expression capability, the dimension-wise scalings using diagonal matrices originally used in NF-IVA are replaced with linear transformations using upper triangular matrices; in both cases, the diagonal and upper triangular matrices are estimated by neural networks. The evaluation shows that NF-FastMNMF performs well for both determined and non-determined separations of multiple speech utterances by stationary or non-stationary speakers from a noisy reverberant mixture.
Algorithmes rapides pour la modélisation d’une réponse de salle dont l’atténuation dépend de la fréquence
Achille Aknin, Roland Badeau
16e Congrès Français d’Acoustique (CFA 2022), Marseille, France, April 2022.
```
@inproceedings{aknin:hal-03559398,
  address = {Marseille, France},
  author = {Aknin, Achille and Badeau, Roland},
  booktitle = {{16e Congr{\`e}s Fran{\c c}ais d'Acoustique (CFA 2022)}},
  hal_id = {hal-03559398},
  hal_version = {v1},
  month = apr,
  organization = {{Soci{\'e}t{\'e} Fran{\c c}aise d'Acoustique and Laboratoire de M{\'e}canique et d'Acoustique}},
  pdf = {https://telecom-paris.hal.science/hal-03559398v1/file/Algorithmes_rapides_pour_la_mod_lisation_d_une_r_ponse_de_salle_dont_l_att_nuation_d_pend_de_la_fr_quence___CFA_2022.pdf},
  title = {{Algorithmes rapides pour la mod{\'e}lisation d'une r{\'e}ponse de salle dont l'att{\'e}nuation d{\'e}pend de la fr{\'e}quence}},
  url = {https://telecom-paris.hal.science/hal-03559398},
  year = {2022}
}
```
In audio signal processing, mathematical modeling of the room response improves the estimation of source signals from reverberant signals, for example to perform dereverberation or to separate a mixture of sound sources. In a previous paper, we worked on the implementation of a stochastic room impulse response model in which the exponential decay of power over time depends on frequency. This characteristic is particularly important for taking into account the frequency dependence of wall absorption, which is generally higher at high frequencies than at low frequencies. We presented a new matrix structure, parameterized by a single filter $p$, that implements this frequency-dependent exponential decay, and we showed that it could be used to estimate the parameters of a room response. However, since this matrix $P$ is of size $T \times T$, where $T$ is the length of the room impulse response (typically on the order of a thousand or tens of thousands of samples in practice), we cannot directly compute matrix products involving $P$ under real conditions because of the computational cost. In this paper, we therefore present several fast matrix-vector product algorithms that we have developed, which exploit the particular structure of this matrix and whose complexity is only $O(T \log(T))$ or $O(T \log^2(T))$ instead of $O(T^2)$. With these algorithms, it becomes possible to use the matrix $P$ to estimate the parameters of real room responses without being limited by computational complexity.
Confirming dimensional reduction assumptions for the energy-stress tensor through comparison with high-frequency wave-based pressure simulations
Aidan Meacham, Roland Badeau, Jean-Dominique Polack
16ème Congrès Français d’Acoustique, CFA2022, Marseille, France, April 2022.
```
@inproceedings{meacham:hal-03848224,
  address = {Marseille, France},
  author = {Meacham, Aidan and Badeau, Roland and Polack, Jean-Dominique},
  booktitle = {{16{\`e}me Congr{\`e}s Fran{\c c}ais d'Acoustique, CFA2022}},
  hal_id = {hal-03848224},
  hal_version = {v1},
  month = apr,
  organization = {{Soci{\'e}t{\'e} Fran{\c c}aise d'Acoustique and Laboratoire de M{\'e}canique et d'Acoustique}},
  pdf = {https://hal.science/hal-03848224v1/file/CFA2022_Comm254.pdf},
  title = {{Confirming dimensional reduction assumptions for the energy-stress tensor through comparison with high-frequency wave-based pressure simulations}},
  url = {https://hal.science/hal-03848224},
  year = {2022}
}
```
In room acoustics, the energy-stress tensor represents the conservative relationships between the acoustic energy density, sound intensity, and the symmetric wave-stress tensor. In real rooms, the off-diagonal components of the wave-stress tensor are non-zero, implying the existence of shear stresses acting upon the energetic quantities. Assumptions regarding these terms in 1- and 2-dimensional spaces [Dujourdy et al. 2017, 2019] were used to reduce the energy-stress tensor relationships to a tractable system capable of predicting frequency-dependent stochastic reverberation decays in those spaces [Meacham et al. 2019]. Direct verification of those assumptions at a single location in a real space would require more measurements at varying positions than can be reliably captured without robotization, let alone in acoustically distinct regions of a room. Therefore, in this work, we aim to verify the 1-dimensional reduction assumptions by examining a high-frequency wave-based pressure simulation, allowing averaging over a large number of sampling positions at multiple locations throughout a space, providing insight into the relationship between room geometry and the terms of the energy-stress tensor.

Confirming dimensional reduction assumptions for the energy-stress tensor through comparison with high-frequency wave-based pressure simulations
Jean-Dominique Polack, Aidan Meacham, Roland Badeau
16e Congrès Français d’Acoustique (CFA 2022), Marseille, France, April 2022.
```
@inproceedings{polack:hal-03559400,
  address = {Marseille, France},
  author = {Polack, Jean-Dominique and Meacham, Aidan and Badeau, Roland},
  booktitle = {{16e Congr{\`e}s Fran{\c c}ais d'Acoustique (CFA 2022)}},
  hal_id = {hal-03559400},
  hal_version = {v1},
  month = apr,
  title = {{Confirming dimensional reduction assumptions for the energy-stress tensor through comparison with high-frequency wave-based pressure simulations}},
  url = {https://telecom-paris.hal.science/hal-03559400},
  year = {2022}
}
```
Riemannian space tessellation with polyhedral room images
Jean-Dominique Polack, Aidan Meacham, Roland Badeau, Jean-Christophe Valière
16e Congrès Français d’Acoustique (CFA 2022), Marseille, France, April 2022.
```
@inproceedings{polack:hal-03559402,
  address = {Marseille, France},
  author = {Polack, Jean-Dominique and Meacham, Aidan and Badeau, Roland and Vali{\`e}re, Jean-Christophe},
  booktitle = {{16e Congr{\`e}s Fran{\c c}ais d'Acoustique (CFA 2022)}},
  hal_id = {hal-03559402},
  hal_version = {v1},
  month = apr,
  title = {{Riemannian space tessellation with polyhedral room images}},
  url = {https://telecom-paris.hal.science/hal-03559402},
  year = {2022}
}
```
Riemannian space tessellation with polyhedral room images
Jean-Dominique Polack, Aidan Meacham, Roland Badeau, Jean-Christophe Valière
16ème Congrès Français d’Acoustique, CFA2022, Marseille, France, April 2022.
```
@inproceedings{polack:hal-03848222,
  address = {Marseille, France},
  author = {Polack, Jean-Dominique and Meacham, Aidan and Badeau, Roland and Vali{\`e}re, Jean-Christophe},
  booktitle = {{16{\`e}me Congr{\`e}s Fran{\c c}ais d'Acoustique, CFA2022}},
  hal_id = {hal-03848222},
  hal_version = {v1},
  month = apr,
  organization = {{Soci{\'e}t{\'e} Fran{\c c}aise d'Acoustique and Laboratoire de M{\'e}canique et d'Acoustique}},
  pdf = {https://hal.science/hal-03848222v1/file/CFA2022_Comm255.pdf},
  title = {{Riemannian space tessellation with polyhedral room images}},
  url = {https://hal.science/hal-03848222},
  year = {2022}
}
```
Counting the image sources of rectangular rooms is a well-known technique, based on mirroring the original room on all its walls in order to tessellate the Euclidean space, leading to a quadratic increase with layer order. We show that a similar mirroring technique can be applied to polygonal and polyhedral rooms of arbitrary shapes, leading to the tessellation of a Riemannian space with negative curvature. From this tessellation we derive a closed-form formulation for counting the number of image sources, which increases exponentially with layer order. Thus, a bridge between rooms with flat walls and generic mixing rooms with partially curved walls is obtained.
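As a rough illustration of the rectangular-room baseline mentioned in the abstract above, the sketch below counts image rooms per layer by brute force. It assumes that the layer order of the image room at integer lattice offset (i, j, k) is |i| + |j| + |k|, the usual reflection count for a shoebox room. The function name is hypothetical, and the exponential count for polyhedral rooms derived in the paper is not reproduced here.

```python
from itertools import product


def count_image_rooms(order: int) -> int:
    """Count lattice cells (i, j, k) with |i| + |j| + |k| == order.

    Assumption: each cell of the mirrored tessellation holds one image
    source, and its layer order equals the L1 norm of its lattice offset.
    """
    span = range(-order, order + 1)
    return sum(
        1
        for i, j, k in product(span, repeat=3)
        if abs(i) + abs(j) + abs(k) == order
    )


if __name__ == "__main__":
    for n in range(1, 6):
        # For n >= 1 the brute-force count agrees with the closed form 4n^2 + 2.
        print(n, count_image_rooms(n), 4 * n * n + 2)
```

Under this assumption the count per layer grows as 4n² + 2, which is the quadratic growth the abstract refers to for the Euclidean case.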
Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments
Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
INTERSPEECH, Incheon, South Korea, 2022.
```
@inproceedings{du:hal-03727181,
  address = {Incheon, South Korea},
  author = {Du, Yicheng and Nugraha, Aditya Arie and Sekiguchi, Kouhei and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi},
  booktitle = {{INTERSPEECH}},
  hal_id = {hal-03727181},
  hal_version = {v1},
  keywords = {speech enhancement ; speech recognition ; human-computer interaction ; run-time adaptation},
  pdf = {https://telecom-paris.hal.science/hal-03727181v1/file/interspeech_2022.pdf},
  title = {{Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments}},
  url = {https://telecom-paris.hal.science/hal-03727181},
  year = {2022}
}
```
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication in real multiparty conversational environments. A major approach that has actively been studied in simulated environments is to sequentially perform speech enhancement and automatic speech recognition (ASR) based on deep neural networks (DNNs) trained in a supervised manner. In our task, however, such a pretrained system fails to work due to the mismatch between the training and test conditions and the head movements of the user. To enhance only the utterances of a target speaker, we use beamforming based on a DNN-based speech mask estimator that can adaptively extract the speech components corresponding to a particular head-relative direction. We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions. Comparative experiments using the state-of-the-art distant speech recognition system show that the proposed method significantly improves the ASR performance.
Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF
Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Florence d’Alché-Buc, Gael Richard
Advances in Neural Information Processing Systems, New Orleans, United States, 2022.
```
@inproceedings{jayneel:hal-04168435,
  address = {New Orleans, United States},
  author = {Parekh, Jayneel and Parekh, Sanjeel and Mozharovskyi, Pavlo and d'Alch{\'e}-Buc, Florence and Richard, Gael},
  booktitle = {{Advances in Neural Information Processing Systems}},
  hal_id = {hal-04168435},
  hal_version = {v1},
  pdf = {https://hal.science/hal-04168435v1/file/NeurIPS-2022-listen-to-interpret-post-hoc-interpretability-for-audio-networks-with-nmf-Paper-Conference.pdf},
  title = {{Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF}},
  url = {https://hal.science/hal-04168435},
  year = {2022}
}
```
This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a trained network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network’s decision. We demonstrate our method’s applicability on popular benchmarks, including a real-world multi-label classification task.
DNN-FREE LOW-LATENCY ADAPTIVE SPEECH ENHANCEMENT BASED ON FRAME-ONLINE BEAMFORMING POWERED BY BLOCK-ONLINE FASTMNMF
Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii
17th International Workshop on Acoustic Signal Enhancement (IWAENC 2022), Bamberg, Germany, 2022.
```
@inproceedings{nugraha:hal-03821095,
  address = {Bamberg, Germany},
  author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi},
  booktitle = {{17th International Workshop on Acoustic Signal Enhancement (IWAENC 2022)}},
  hal_id = {hal-03821095},
  hal_version = {v1},
  keywords = {speech enhancement ; beamforming ; blind source separation ; automatic speech recognition},
  pdf = {https://telecom-paris.hal.science/hal-03821095v1/file/2207.10934.pdf},
  title = {{DNN-FREE LOW-LATENCY ADAPTIVE SPEECH ENHANCEMENT BASED ON FRAME-ONLINE BEAMFORMING POWERED BY BLOCK-ONLINE FASTMNMF}},
  url = {https://telecom-paris.hal.science/hal-03821095},
  year = {2022}
}
```
This paper describes a practical dual-process speech enhancement system that adapts environment-sensitive frame-online beamforming (front-end) with help from environment-free block-online source separation (back-end). To use minimum variance distortionless response (MVDR) beamforming, one may train a deep neural network (DNN) that estimates time-frequency masks used for computing the covariance matrices of sources (speech and noise). Backpropagation-based runtime adaptation of the DNN was proposed for dealing with the mismatched training-test conditions. Instead, one may try to directly estimate the source covariance matrices with a state-of-the-art blind source separation method called fast multichannel non-negative matrix factorization (FastMNMF). In practice, however, neither the DNN nor the FastMNMF can be updated in a frame-online manner due to its computationally-expensive iterative nature. Our DNN-free system leverages the posteriors of the latest source spectrograms given by block-online FastMNMF to derive the current source covariance matrices for frame-online beamforming. The evaluation shows that our frame-online system can quickly respond to scene changes caused by interfering speaker movements and outperformed an existing block-online system with DNN-based beamforming by 5.0 points in terms of the word error rate.
Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments
Kouhei Sekiguchi, Aditya Arie Nugraha, Yicheng Du, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022), Kyoto, Japan, 2022.
```
@inproceedings{sekiguchi:hal-03727169,
  address = {Kyoto, Japan},
  author = {Sekiguchi, Kouhei and Nugraha, Aditya Arie and Du, Yicheng and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi},
  booktitle = {{2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022)}},
  hal_id = {hal-03727169},
  hal_version = {v1},
  pdf = {https://telecom-paris.hal.science/hal-03727169v1/file/_IROS_22__Direction_Aware_Adaptive_Online_Neural_Speech_Enhancement.pdf},
  title = {{Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments}},
  url = {https://telecom-paris.hal.science/hal-03727169},
  year = {2022}
}
```
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) that works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing. In contrast, a supervised beamforming method that uses a deep neural network (DNN) for estimating spatial information of speech and noise readily fits real-time processing, but suffers from drastic performance degradation in mismatched conditions. Given such complementary characteristics, we propose a dual-process robust online speech enhancement method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF (back end) is performed in a mini-batch style and the noisy and enhanced speech pairs are used together with the original parallel training data for updating the direction-aware DNN (front end) with backpropagation at a computationally-allowable interval. This method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker, which can be detected from video or selected by a user’s hand gesture or eye gaze, in a streaming manner and spatially showing the transcriptions with an AR technique. Our experiment showed that the word error rate was improved by more than 10 points with the runtime adaptation using only twelve minutes of observation.

Journal Articles

The Jazz Ontology: A semantic model and large-scale RDF repositories for jazz
Polina Proutskova, Daniel Wolff, György Fazekas, Klaus Frieler, Frank Höger, Olga Velichkina, Gabriel Solis, Tillman Weyde, Martin Pfleiderer, Hélène Camille Crayencour, Geoffroy Peeters, Simon Dixon
Journal of Web Semantics, October 2022.

@article{proutskova:hal-03860468,
  author = {Proutskova, Polina and Wolff, Daniel and Fazekas, Gy{\"o}rgy and Frieler, Klaus and H{\"o}ger, Frank and Velichkina, Olga and Solis, Gabriel and Weyde, Tillman and Pfleiderer, Martin and Crayencour, H{\'e}l{\`e}ne Camille and Peeters, Geoffroy and Dixon, Simon},
  doi = {10.1016/j.websem.2022.100735},
  hal_id = {hal-03860468},
  hal_version = {v1},
  journal = {{Journal of Web Semantics}},
  month = oct,
  pages = {100735},
  publisher = {{Elsevier}},
  title = {{The Jazz Ontology: A semantic model and large-scale RDF repositories for jazz}},
  url = {https://telecom-paris.hal.science/hal-03860468},
  volume = {74},
  year = {2022}
}

Pretext Tasks selection for multitask self-supervised speech representation learning
Salah Zaiem, Titouan Parcollet, Slim Essid, Abdelwahab Heba
IEEE Journal of Selected Topics in Signal Processing, October 2022.
```
@article{zaiem:hal-03601330,
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim and Heba, Abdelwahab},
  doi = {10.1109/JSTSP.2022.3195430},
  hal_id = {hal-03601330},
  hal_version = {v1},
  journal = {{IEEE Journal of Selected Topics in Signal Processing}},
  month = oct,
  number = {6},
  pages = {1439-1453},
  pdf = {https://hal.science/hal-03601330v1/file/2107.00594.pdf},
  publisher = {{IEEE}},
  title = {{Pretext Tasks selection for multitask self-supervised speech representation learning}},
  url = {https://hal.science/hal-03601330},
  volume = {16},
  year = {2022}
}
```
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations which prove to be effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable with the increase of the number of pretext tasks. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
The Jazz Ontology: A semantic model and large-scale RDF repositories for jazz
Polina Proutskova, Daniel Wolff, György Fazekas, Klaus Frieler, Frank Höger, Olga Velichkina, Gabriel Solis, Tillman Weyde, Martin Pfleiderer, Hélène Camille Crayencour, Geoffroy Peeters, Simon Dixon
Journal of Web Semantics, June 2022.
```
@article{proutskova:hal-03864122,
  author = {Proutskova, Polina and Wolff, Daniel and Fazekas, Gy{\"o}rgy and Frieler, Klaus and H{\"o}ger, Frank and Velichkina, Olga and Solis, Gabriel and Weyde, Tillman and Pfleiderer, Martin and Crayencour, H{\'e}l{\`e}ne Camille and Peeters, Geoffroy and Dixon, Simon},
  doi = {10.1016/j.websem.2022.100735},
  hal_id = {hal-03864122},
  hal_version = {v1},
  journal = {{Journal of Web Semantics}},
  month = jun,
  pdf = {https://hal.science/hal-03864122v1/file/The%20Jazz%20Ontology-%20A%20semantic%20model%20and%20large-scale%20RDF%20repositories%20for%20jazz.pdf},
  publisher = {{Elsevier}},
  title = {{The Jazz Ontology: A semantic model and large-scale RDF repositories for jazz}},
  url = {https://hal.science/hal-03864122},
  volume = {74},
  year = {2022}
}
```
Jazz is a musical tradition that is just over 100 years old; unlike in other Western musical traditions, improvisation plays a central role in jazz. Modelling the domain of jazz poses some ontological challenges due to specificities in musical content and performance practice, such as band lineup fluidity and importance of short melodic patterns for improvisation. This paper presents the Jazz Ontology, a semantic model that addresses these challenges. The model also describes workflows for annotating recordings with melody transcriptions and for pattern search. The Jazz Ontology incorporates existing standards and ontologies such as FRBR and the Music Ontology. The ontology has been assessed by examining how well it supports describing and merging existing datasets and whether it facilitates novel discoveries in a music browsing application. The utility of the ontology is also demonstrated in a novel framework for managing jazz-related music information. This involves the population of the Jazz Ontology with the metadata from large-scale audio and bibliographic corpora (the Jazz Encyclopedia and the Jazz Discography). The resulting RDF datasets were merged and linked to existing Linked Open Data resources. These datasets are publicly available and are driving an online application that is being used by jazz researchers and music lovers for the systematic study of jazz.
Lyrics segmentation via bimodal text–audio representation
Michael Fell, Yaroslav Nechaev, Gabriel Meseguer-Brocal, Elena Cabrio, Fabien Gandon, Geoffroy Peeters
Natural Language Engineering, 2022.
```
@article{fell:hal-03295581,
  author = {Fell, Michael and Nechaev, Yaroslav and Meseguer-Brocal, Gabriel and Cabrio, Elena and Gandon, Fabien and Peeters, Geoffroy},
  doi = {10.1017/S1351324921000024},
  hal_id = {hal-03295581},
  hal_version = {v1},
  journal = {{Natural Language Engineering}},
  keywords = {Natural Language in Multimodal and Multimedia Systems ; Text Segmentation ; Artificial Intelligence ; Natural Language ; Processing Music ; Information Retrieval},
  number = {3},
  pages = {317 - 336},
  pdf = {https://hal.science/hal-03295581v1/file/Bi_modal_Lyrics_Segmentation__NLE_journal__minor_revision_.pdf},
  publisher = {{Cambridge University Press (CUP)}},
  title = {{Lyrics segmentation via bimodal text--audio representation}},
  url = {https://hal.science/hal-03295581},
  volume = {28},
  year = {2022}
}
```
Song lyrics contain repeated patterns that have been proven to facilitate automated lyrics segmentation, with the final goal of detecting the building blocks (e.g., chorus, verse) of a song text. Our contribution in this article is twofold. First, we introduce a convolutional neural network (CNN)-based model that learns to segment the lyrics based on their repetitive text structure. We experiment with novel features to reveal different kinds of repetitions in the lyrics, for instance based on phonetical and syntactical properties. Second, using a novel corpus where the song text is synchronized to the audio of the song, we show that the text and audio modalities capture complementary structure of the lyrics and that combining both is beneficial for lyrics segmentation performance. For the purely text-based lyrics segmentation on a dataset of 103k lyrics, we achieve an F-score of 67.4%, improving on the state of the art (59.2% F-score). On the synchronized text–audio dataset of 4.8k songs, we show that the additional audio features improve segmentation performance to 75.3% F-score, significantly outperforming the purely text-based approaches.
Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation
Mathieu Fontaine, Kouhei Sekiguchi, Aditya Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2022.
```
@article{fontaine:hal-03657196,
  author = {Fontaine, Mathieu and Sekiguchi, Kouhei and Nugraha, Aditya and Bando, Yoshiaki and Yoshii, Kazuyoshi},
  doi = {10.1109/TASLP.2022.3172631},
  hal_id = {hal-03657196},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {expectation-maximization ; probabilistic framework ; blind source separation ; Nonnegative matrix factorization},
  pages = {1-1},
  pdf = {https://telecom-paris.hal.science/hal-03657196v1/file/main.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation}},
  url = {https://telecom-paris.hal.science/hal-03657196},
  year = {2022}
}
```
This paper describes heavy-tailed extensions of a state-of-the-art versatile blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) from a unified point of view. The common way of deriving such an extension is to replace the multivariate complex Gaussian distribution in the likelihood function with its heavy-tailed generalization, e.g., the multivariate complex Student’s t and leptokurtic generalized Gaussian distributions, and tailor-make the corresponding parameter optimization algorithm. Using a wider class of heavy-tailed distributions called a Gaussian scale mixture (GSM), i.e., a mixture of Gaussian distributions whose variances are perturbed by positive random scalars called impulse variables, we propose GSM-FastMNMF and develop an expectation-maximization algorithm that works even when the probability density function of the impulse variables has no analytical expression. We show that existing heavy-tailed FastMNMF extensions are instances of GSM-FastMNMF and derive a new instance based on the generalized hyperbolic distribution that includes the normal-inverse Gaussian, Student’s t, and Gaussian distributions as special cases. Our experiments show that the normal-inverse Gaussian FastMNMF outperforms the state-of-the-art FastMNMF extensions and ILRMA model in speech enhancement and separation in terms of the signal-to-distortion ratio.

Video-to-Music Recommendation using Temporal Alignment of Segments
Laure Prétet, Gael Richard, Clément Souchier, Geoffroy Peeters
IEEE Transactions on Multimedia, 2022.

@article{pretet:hal-03562371,
  author = {Pr{\'e}tet, Laure and Richard, Gael and Souchier, Cl{\'e}ment and Peeters, Geoffroy},
  hal_id = {hal-03562371},
  hal_version = {v1},
  journal = {{IEEE Transactions on Multimedia}},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Video-to-Music Recommendation using Temporal Alignment of Segments}},
  url = {https://telecom-paris.hal.science/hal-03562371},
  year = {2022}
}

Autoregressive Moving Average Jointly-Diagonalizable Spatial Covariance Analysis for Joint Source Separation and Dereverberation
Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii, Tatsuya Kawahara
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2022.
```
@article{sekiguchi:hal-03821125,
  author = {Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi and Kawahara, Tatsuya},
  doi = {10.1109/taslp.2022.3190734},
  hal_id = {hal-03821125},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {Multichannel audio signal processing ; source separation ; dereverberation ; joint diagonalization},
  pages = {2368 - 2382},
  pdf = {https://telecom-paris.hal.science/hal-03821125v1/file/09829286.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Autoregressive Moving Average Jointly-Diagonalizable Spatial Covariance Analysis for Joint Source Separation and Dereverberation}},
  url = {https://telecom-paris.hal.science/hal-03821125},
  volume = {30},
  year = {2022}
}
```
This article describes a computationally-efficient statistical approach to joint (semi-)blind source separation and dereverberation for multichannel noisy reverberant mixture signals. A standard approach to source separation is to formulate a generative model of a multichannel mixture spectrogram that consists of source and spatial models representing the time-frequency power spectral densities (PSDs) and spatial covariance matrices (SCMs) of source images, respectively, and find the maximum-likelihood estimates of these parameters. A state-of-the-art blind source separation method in this thread of research is fast multichannel nonnegative matrix factorization (FastMNMF) based on the low-rank PSDs and jointly-diagonalizable full-rank SCMs. To perform mutually-dependent separation and dereverberation jointly, in this paper we integrate both moving average (MA) and autoregressive (AR) models that represent the early reflections and late reverberations of sources, respectively, into the FastMNMF formalism. Using a pretrained deep generative model of speech PSDs as a source model, we realize semi-blind joint speech separation and dereverberation. We derive an iterative optimization algorithm based on iterative projection or iterative source steering for jointly and efficiently updating the AR parameters and the SCMs. Our experimental results showed the superiority of the proposed ARMA extension over its AR- or MA-ablated version in a speech separation and/or dereverberation task.

Comparing Deep Models and Evaluation Strategies for Multi-Pitch Estimation in Music Recordings
Christof Weis, Geoffroy Peeters
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2022.

@article{weis:hal-03860460,
  author = {Weis, Christof and Peeters, Geoffroy},
  doi = {10.1109/TASLP.2022.3200547},
  hal_id = {hal-03860460},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  pages = {2814-2827},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Comparing Deep Models and Evaluation Strategies for Multi-Pitch Estimation in Music Recordings}},
  url = {https://telecom-paris.hal.science/hal-03860460},
  volume = {30},
  year = {2022}
}

2021

Conference Articles

Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks
Melih Barsbey, Milad Sefidgaran, Murat A Erdogdu, Gael Richard, Umut Şimşekli
35th Conference on Neural Information Processing Systems (NeurIPS), Online, United States, December 2021.
```
@inproceedings{barsbey:hal-03413484,
  address = {Online, United States},
  author = {Barsbey, Melih and Sefidgaran, Milad and Erdogdu, Murat A and Richard, Gael and {\c S}im{\c s}ekli, Umut},
  booktitle = {{35th Conference on Neural Information Processing Systems (NeurIPS)}},
  hal_id = {hal-03413484},
  hal_version = {v1},
  month = dec,
  pdf = {https://telecom-paris.hal.science/hal-03413484v1/file/HT_and_Compressibility.pdf},
  title = {{Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks}},
  url = {https://telecom-paris.hal.science/hal-03413484},
  year = {2021}
}
```
Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying causes that make the networks amenable to such simple compression schemes is still missing. In this study, focusing our attention on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently [DBDFŞ20], (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution [HM20, GŞZ21]. Assuming that both of these phenomena occur simultaneously, we prove that the networks are guaranteed to be ‘ℓ_p-compressible’, and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which are consistent with the observation that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy tails, which, in combination with overparametrization, result in compressibility.
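To make the notion of a compression error concrete, here is a minimal sketch of magnitude pruning on a flat weight vector. The function names and the choice of relative L2 error are illustrative assumptions, not the paper's definitions, and the sketch does not reproduce the paper's theoretical setting.

```python
import math


def magnitude_prune(weights: list[float], keep_ratio: float) -> list[float]:
    """Zero out all but the largest-magnitude fraction of the weights."""
    n_keep = max(1, round(keep_ratio * len(weights)))
    # Indices of the n_keep entries with the largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:n_keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]


def relative_error(original: list[float], pruned: list[float]) -> float:
    """Relative L2 error ||w - w_pruned|| / ||w|| (assumes w is not all zeros)."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, pruned)))
    den = math.sqrt(sum(a * a for a in original))
    return num / den


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]      # no dominant entry
    peaked = [100.0, 1.0, 1.0, 1.0]  # one dominant entry, a crude stand-in for heavy tails
    for w in (flat, peaked):
        print(relative_error(w, magnitude_prune(w, 0.25)))
```

In this toy comparison, keeping a quarter of the entries loses much less of the peaked vector than of the flat one, which is the intuition behind relating heavy tails to compressibility.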
Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections
Kimia Nadjahi, Alain Durmus, Pierre E. Jacob, Roland Badeau, Umut Şimşekli
35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online, France, December 2021.
```
@inproceedings{nadjahi:hal-03494781,
  address = {Online, France},
  author = {Nadjahi, Kimia and Durmus, Alain and Jacob, Pierre E. and Badeau, Roland and {\c S}im{\c s}ekli, Umut},
  booktitle = {{35th Conference on Neural Information Processing Systems (NeurIPS 2021)}},
  doi = {10.5555/3540261.3541211},
  hal_id = {hal-03494781},
  hal_version = {v1},
  month = dec,
  pdf = {https://telecom-paris.hal.science/hal-03494781v1/file/fast_approximation_of_the_slic.pdf},
  title = {{Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections}},
  url = {https://telecom-paris.hal.science/hal-03494781},
  year = {2021}
}
```
The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive nonasymptotical guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.
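For context, the usual Monte Carlo approximation that the abstract contrasts with can be sketched as follows. This is the sampling baseline, not the deterministic approximation proposed in the paper; the function name, the restriction to equally sized samples, and the order-2 distance are assumptions made for brevity.

```python
import math
import random


def sliced_wasserstein_mc(xs, ys, n_projections=200, seed=0):
    """Monte Carlo estimate of the order-2 Sliced-Wasserstein distance.

    xs, ys: equally sized lists of points, each a list of floats of the
    same dimension (an assumption of this sketch).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Uniform direction on the unit sphere: normalise a Gaussian vector.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in g))
        theta = [c / norm for c in g]
        # Project and sort: in one dimension the optimal coupling
        # matches order statistics.
        px = sorted(sum(t * c for t, c in zip(theta, x)) for x in xs)
        py = sorted(sum(t * c for t, c in zip(theta, y)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)
```

The estimate depends on the number of sampled directions, which is the cost the deterministic approximation described in the abstract is meant to avoid.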
DARKGAN: EXPLOITING KNOWLEDGE DISTILLATION FOR COMPREHENSIBLE AUDIO SYNTHESIS WITH GANS
Javier Nistal Hurlé, Stefan Lattner, Gael Richard
International Society for Music Information Retrieval, Virtual, France, November 2021.
```
@inproceedings{nistalhurle:hal-03349492,
  address = {Virtual, France},
  author = {Nistal Hurl{\'e}, Javier and Lattner, Stefan and Richard, Gael},
  booktitle = {{International Society for Music Information Retrieval}},
  hal_id = {hal-03349492},
  hal_version = {v1},
  month = nov,
  pdf = {https://hal.science/hal-03349492v1/file/2108.01216%20%281%29.pdf},
  title = {{DARKGAN: EXPLOITING KNOWLEDGE DISTILLATION FOR COMPREHENSIBLE AUDIO SYNTHESIS WITH GANS}},
  url = {https://hal.science/hal-03349492},
  year = {2021}
}
```
Generative Adversarial Networks (GANs) have achieved excellent audio synthesis quality in recent years. However, making them operable with semantically meaningful controls remains an open challenge. An obvious approach is to control the GAN by conditioning it on metadata contained in audio datasets. Unfortunately, audio datasets often lack the desired annotations, especially in the musical domain. A way to circumvent this lack of annotations is to generate them, for example, with an automatic audio tagging system. The output probabilities of such systems (so-called "soft labels") carry rich information about the characteristics of the respective audio signals and can be used to distill the knowledge from a teacher model into a student model. In this work, we perform knowledge distillation from a large audio tagging system into an adversarial audio synthesizer that we call DarkGAN. Results show that DarkGAN can synthesize musical audio with acceptable quality and exhibits moderate attribute control even with out-of-distribution input conditioning. We release the code and provide audio examples on the accompanying website.
Is There a "Language of Music-Video Clips"? A Qualitative and Quantitative Study
Laure Prétet, Gaël Richard, Geoffroy Peeters
ISMIR, Virtual Event, France, November 2021.
```
@inproceedings{pretet:hal-03330800,
  address = {Virtual Event, France},
  author = {Pr{\'e}tet, Laure and Richard, Ga{\"e}l and Peeters, Geoffroy},
  booktitle = {{ISMIR}},
  hal_id = {hal-03330800},
  hal_version = {v1},
  month = nov,
  pdf = {https://telecom-paris.hal.science/hal-03330800v1/file/camera_ready.pdf},
  title = {{Is There a ''Language of Music-Video Clips'' ? A Qualitative and Quantitative Study}},
  url = {https://telecom-paris.hal.science/hal-03330800},
  year = {2021}
}
```
Automatically recommending a video for a given piece of music, or music for a given video, has become an important asset for the audiovisual industry, with user-generated or professional content. While both music and video have specific temporal organizations, most current works do not consider those and only focus on globally recommending a media item. As a first step toward the improvement of these recommendation systems, we study in this paper the relationship between music and video temporal organization. We do this for the case of official music videos, with a quantitative and a qualitative approach. Our assumption is that the movements in the music are correlated with those in the video. To validate this, we first interview a set of internationally recognized music video experts. We then perform a large-scale analysis of official music-video clips (which we manually annotated into video genres) using MIR description tools (downbeats and functional segments estimation) and Computer Vision tools (shot detection). Our study confirms that a "language of music-video clips" exists; i.e. editors favor the co-occurrence of music and video events using strategies such as anticipation. It also highlights that the amount of co-occurrence depends on the music and video genres.
THE WORDS REMAIN THE SAME: COVER DETECTION WITH LYRICS TRANSCRIPTION
Andrea Vaglio, Romain Hennequin, Manuel Moussallam, Gael Richard
22nd International Society for Music Information Retrieval Conference ISMIR 2021, Online, India, November 2021.
```
@inproceedings{vaglio:hal-03356164,
  address = {Online, India},
  author = {Vaglio, Andrea and Hennequin, Romain and Moussallam, Manuel and Richard, Gael},
  booktitle = {{22nd International Society for Music Information Retrieval Conference ISMIR 2021}},
  hal_id = {hal-03356164},
  hal_version = {v1},
  month = nov,
  pdf = {https://telecom-paris.hal.science/hal-03356164v1/file/PAPER_ISMIR2021_COVER_DETECTION.pdf},
  title = {{THE WORDS REMAIN THE SAME: COVER DETECTION WITH LYRICS TRANSCRIPTION}},
  url = {https://telecom-paris.hal.science/hal-03356164},
  year = {2021}
}
```
Cover detection has gained sustained interest in the scientific community and has recently made significant progress both in terms of scalability and accuracy. However, most approaches are based on the estimation of harmonic and melodic features and neglect lyrics information, although it is an important invariant across covers. In this work, we propose a novel approach leveraging lyrics without requiring access to full texts through the use of lyrics recognition on audio. Our approach relies on the fusion of a singing voice recognition framework and a more classic tonal-based cover detection method. To the best of our knowledge, this is the first time that lyrics estimation from audio has been explicitly used for cover detection. Furthermore, we exploit efficient string matching and an approximate nearest neighbor search algorithm, which lead to a scalable system able to operate on very large databases. Extensive experiments on the largest publicly available cover detection dataset demonstrate the validity of using lyrics information for this task.
Training Deep Pitch-Class Representations With a Multi-Label CTC Loss
Christof Weiss, Geoffroy Peeters
Proceedings of the 22nd International Society for Music Information Retrieval Conference, Virtual Event, France, November 2021.
```
@inproceedings{weiss:hal-03349734,
  address = {Virtual Event, France},
  author = {Weiss, Christof and Peeters, Geoffroy},
  booktitle = {{Proceedings of the 22nd International Society for Music Information Retrieval Conference}},
  hal_id = {hal-03349734},
  hal_version = {v1},
  keywords = {Music transcription ; Harmony ; Chords and tonality ; CTC},
  month = nov,
  pdf = {https://hal.science/hal-03349734v1/file/WeissP21_PitchClassMCTC_ISMIR.pdf},
  title = {{Training Deep Pitch-Class Representations With a Multi-Label CTC Loss}},
  url = {https://hal.science/hal-03349734},
  year = {2021}
}
```
Despite the success of end-to-end approaches, chroma (or pitch-class) features remain a useful mid-level representation of music audio recordings due to their direct interpretability. Since traditional chroma variants obtained with signal processing suffer from timbral artifacts such as overtones or vibrato, they do not directly reflect the pitch classes notated in the score. For this reason, training a chroma representation using deep learning ("deep chroma") has become an interesting strategy. Existing approaches involve the use of supervised learning with strongly aligned labels for which, however, only a few datasets are available. Recently, the Connectionist Temporal Classification (CTC) loss, initially proposed for speech, has been adopted to learn monophonic (single-label) pitch-class features using weakly aligned labels based on corresponding score–audio segment pairs. To exploit this strategy for the polyphonic case, we propose the use of a multi-label variant of this CTC loss, the MCTC, and formalize this loss for the pitch-class scenario. Our experiments demonstrate that the weakly aligned approach achieves pitch-class estimates almost equivalent to those obtained by training with strongly aligned annotations. We then study the sensitivity of our approach to segment duration and mismatch. Finally, we compare the learned features with other pitch-class representations and demonstrate their use for chord and local key recognition on classical music datasets.
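As a rough illustration of the weak-alignment idea behind CTC-style losses, the sketch below applies the standard CTC collapsing rule (merge repeated frame labels, then drop blanks) to frame-wise sets of pitch classes. Representing each frame as a `frozenset`, using `None` as the blank symbol, and the function name `ctc_collapse` are assumptions made here for illustration; the MCTC formalization in the paper may differ.

```python
from itertools import groupby

BLANK = None  # hypothetical blank symbol, distinct from an empty (silent) frame


def ctc_collapse(frames):
    """Apply the standard CTC collapsing rule to a frame-wise label sequence.

    Consecutive identical labels are merged first, then blanks are dropped.
    """
    merged = [label for label, _ in groupby(frames)]
    return [label for label in merged if label is not BLANK]


# Frame-wise pitch-class sets (0 = C, 4 = E, 7 = G); the values are illustrative.
c_major = frozenset({0, 4, 7})
g_only = frozenset({7})
frames = [BLANK, c_major, c_major, BLANK, g_only, g_only, BLANK, g_only]
sequence = ctc_collapse(frames)
```

Many different frame-level alignments collapse to the same label sequence, which is why only the order of the score events, and not their exact timing, is needed as a training target.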

On the topic of frequency dependent exponential decay matrices and Lie groups
Achille Aknin, Roland Badeau
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, United States, October 2021.
```
@inproceedings{aknin:hal-03298695,
  address = {New Paltz, NY, United States},
  author = {Aknin, Achille and Badeau, Roland},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics}},
  hal_id = {hal-03298695},
  hal_version = {v1},
  keywords = {Reverberation ; room impulse response ; probabilistic modeling ; expectation-maximization algorithm ; artificial reverberation},
  month = oct,
  pdf = {https://hal.science/hal-03298695v1/file/Waspaa_2021.pdf},
  title = {{On the topic of frequency dependent exponential decay matrices and Lie groups}},
  url = {https://hal.science/hal-03298695},
  year = {2021}
}
```
User-guided one-shot deep model adaptation for music source separation
Giorgia Cantisani, Alexey Ozerov, Slim Essid, Gael Richard
2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, United States, October 2021.
```
@inproceedings{cantisani:hal-03219350,
  address = {New Paltz, NY, United States},
  author = {Cantisani, Giorgia and Ozerov, Alexey and Essid, Slim and Richard, Gael},
  booktitle = {{2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-03219350},
  hal_version = {v3},
  keywords = {Music Source Separation ; User-guided Source Separation ; One-shot Domain Adaptation},
  month = oct,
  organization = {{IEEE}},
  pdf = {https://telecom-paris.hal.science/hal-03219350v3/file/UGOSA_Hal.pdf},
  title = {{User-guided one-shot deep model adaptation for music source separation}},
  url = {https://telecom-paris.hal.science/hal-03219350},
  year = {2021}
}
```
Music source separation is the task of isolating individual instruments which are mixed in a musical piece. This task is particularly challenging, and even state-of-the-art models can hardly generalize to unseen test data. Nevertheless, prior knowledge about individual sources can be used to better adapt a generic source separation model to the observed signal. In this work, we propose to exploit a temporal segmentation provided by the user, which indicates when each instrument is active, in order to fine-tune a pre-trained deep model for source separation and adapt it to one specific mixture. This paradigm can be referred to as user-guided one-shot deep model adaptation for music source separation, as the adaptation acts on the target song instance only. Our results are promising and show that state-of-the-art source separation models have large margins for improvement, especially for those instruments which are underrepresented in the training data.
VQCPC-GAN: VARIABLE-LENGTH ADVERSARIAL AUDIO SYNTHESIS USING VECTOR-QUANTIZED CONTRASTIVE PREDICTIVE CODING
Javier Nistal Hurlé, Cyran Aouameur, Stefan Lattner, Gael Richard
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, United States, October 2021.
```
@inproceedings{nistalhurle:hal-03413460,
  address = {New Paltz, United States},
  author = {Nistal Hurl{\'e}, Javier and Aouameur, Cyran and Lattner, Stefan and Richard, Gael},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-03413460},
  hal_version = {v1},
  month = oct,
  pdf = {https://telecom-paris.hal.science/hal-03413460v1/file/2021-Waspaa-nistal.pdf},
  title = {{VQCPC-GAN: VARIABLE-LENGTH ADVERSARIAL AUDIO SYNTHESIS USING VECTOR-QUANTIZED CONTRASTIVE PREDICTIVE CODING}},
  url = {https://telecom-paris.hal.science/hal-03413460},
  year = {2021}
}
```
Influenced by the field of Computer Vision, Generative Adversarial Networks (GANs) are often adopted for the audio domain using fixed-size two-dimensional spectrogram representations as the "image data". However, in the (musical) audio domain, it is often desired to generate output of variable duration. This paper presents VQCPC-GAN, an adversarial framework for synthesizing variable-length audio by exploiting Vector-Quantized Contrastive Predictive Coding (VQCPC). A sequence of VQCPC tokens extracted from real audio data serves as conditional input to a GAN architecture, providing step-wise time-dependent features of the generated content. The input noise z (characteristic in adversarial architectures) remains fixed over time, ensuring temporal consistency of global features. We evaluate the proposed model by comparing a diverse set of metrics against various strong baselines. Results show that, even though the baselines score best, VQCPC-GAN achieves comparable performance even when generating variable-length audio. Numerous sound examples are provided on the accompanying website (sonycslparis.github.io/vqcpc-gan.io), and we release the code for reproducibility (github.com/SonyCSLParis/vqcpc-gan). Index Terms: Generative Adversarial Networks, Audio Synthesis, Vector-Quantized Contrastive Predictive Coding. Nistal received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
Learning Multi-Pitch Estimation From Weakly Aligned Score-Audio Pairs Using a Multi-Label CTC Loss
Christof Weiss, Geoffroy Peeters
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Mohonk Mountain House, New Paltz, NY, United States, October 2021.
```
@inproceedings{weiss:hal-03349673,
  address = {Mohonk Mountain House, New Paltz, NY, United States},
  author = {Weiss, Christof and Peeters, Geoffroy},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-03349673},
  hal_version = {v1},
  keywords = {Music processing ; convolutional neural networks ; CTC ; multi-pitch estimation ; music transcription},
  month = oct,
  title = {{Learning Multi-Pitch Estimation From Weakly Aligned Score-Audio Pairs Using a Multi-Label CTC Loss}},
  url = {https://telecom-paris.hal.science/hal-03349673},
  year = {2021}
}
```
Detecting the simultaneous activity of pitches in music audio recordings is a central task within music processing, commonly known as multi-pitch estimation or frame-wise polyphonic music transcription. Deep-learning approaches recently achieved major improvements for this task, but the lack of large annotated datasets beyond the piano solo scenario is still a limitation for fully exploiting their potential. In this paper, we propose a strategy for training a CNN-based multi-pitch estimator on weakly aligned score–audio pairs of pieces in different instrumentations. To this end, we make use of a multi-label variant of the connectionist temporal classification loss (MCTC), recently proposed for image recognition tasks. We re-formalize the MCTC loss to be applicable for multi-pitch estimation and perform several systematic experiments to analyze its behavior and robustness to training conditions. Finally, we report on multi-pitch estimation results for common datasets using weakly aligned training with MCTC, which performs similarly to systems trained on strongly aligned scores.
Damped Chirp Mixture Estimation via Nonlinear Bayesian Regression
Julian Neri, Philippe Depalle, Roland Badeau
23rd International Conference on Digital Audio Effects (DAFx2020), Vienne, Austria, September 2021.
```
@inproceedings{neri:hal-03255349,
  address = {Vienne, Austria},
  author = {Neri, Julian and Depalle, Philippe and Badeau, Roland},
  booktitle = {{23rd International Conference on Digital Audio Effects (DAFx2020)}},
  hal_id = {hal-03255349},
  hal_version = {v1},
  month = sep,
  pdf = {https://telecom-paris.hal.science/hal-03255349v1/file/2021_DAFx_Damped_Chirp_Mixture_Neri_Julian_S3.pdf},
  title = {{Damped Chirp Mixture Estimation via Nonlinear Bayesian Regression}},
  url = {https://telecom-paris.hal.science/hal-03255349},
  year = {2021}
}
```
Estimating mixtures of damped chirp sinusoids in noise is a problem that affects audio analysis, coding, and synthesis applications. Phase-based non-stationary parameter estimators assume that sinusoids can be resolved in the Fourier transform domain, whereas high-resolution methods estimate superimposed components with accuracy close to the theoretical limits, but only for sinusoids with constant frequencies. We present a new method for estimating the parameters of superimposed damped chirps that has an accuracy competitive with existing non-stationary estimators but also has high resolution, like subspace techniques. After providing the analytical expression for a Gaussian-windowed damped chirp signal’s Fourier transform, we propose an efficient variational EM algorithm for nonlinear Bayesian regression that jointly estimates the amplitudes, phases, frequencies, chirp rates, and decay rates of multiple non-stationary components that may be obfuscated under the same local maximum in the frequency spectrum. Quantitative results show that the new method not only has an estimation accuracy that is close to the Cramér-Rao bound, but also a high resolution that outperforms the state-of-the-art.
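For orientation, a mixture of damped chirps can be written as a sum of sinusoids whose amplitudes decay exponentially and whose frequencies change linearly over time. The sketch below synthesizes such a noiseless signal from the five parameter types named in the abstract. The real-valued parameterization, the field names, and the sampling rate are assumptions for illustration, and no estimation is performed.

```python
import math
from dataclasses import dataclass


@dataclass
class DampedChirp:
    amplitude: float   # initial amplitude
    phase: float       # initial phase in radians
    frequency: float   # initial frequency in Hz
    chirp_rate: float  # linear frequency slope in Hz per second
    decay_rate: float  # exponential decay in 1/s


def synthesize(components, num_samples, sample_rate):
    """Return the noiseless sum of the damped chirps as a list of samples."""
    signal = []
    for n in range(num_samples):
        t = n / sample_rate
        value = 0.0
        for c in components:
            envelope = c.amplitude * math.exp(-c.decay_rate * t)
            angle = c.phase + 2.0 * math.pi * (
                c.frequency * t + 0.5 * c.chirp_rate * t * t
            )
            value += envelope * math.cos(angle)
        signal.append(value)
    return signal


# Two components that are close in frequency (hypothetical values).
mixture = [
    DampedChirp(1.0, 0.0, 440.0, 200.0, 3.0),
    DampedChirp(0.5, math.pi / 2, 450.0, -100.0, 5.0),
]
samples = synthesize(mixture, num_samples=1024, sample_rate=16000)
```

With two components only 10 Hz apart, both fall under a single spectral peak for short analysis windows, which is the kind of situation the abstract describes as components obfuscated under the same local maximum.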
Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes
Nicolas Furnon, Romain Serizel, Slim Essid, Irina Illina
European Signal Processing Conference (EUSIPCO), Dublin / Virtual, Ireland, August 2021.
```
@inproceedings{furnon:hal-03259801,
  address = {Dublin / Virtual, Ireland},
  author = {Furnon, Nicolas and Serizel, Romain and Essid, Slim and Illina, Irina},
  booktitle = {{European Signal Processing Conference (EUSIPCO)}},
  doi = {10.23919/EUSIPCO54536.2021.9616358},
  hal_id = {hal-03259801},
  hal_version = {v1},
  keywords = {Speech enhancement ; distributed processing ; attention mechanisms ; ad-hoc microphone arrays},
  month = aug,
  organization = {{IEEE}},
  pdf = {https://hal.science/hal-03259801v1/file/eusipco2021.pdf},
  title = {{Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes}},
  url = {https://hal.science/hal-03259801},
  year = {2021}
}
```
Speech enhancement promises higher efficiency in ad-hoc microphone arrays than in constrained microphone arrays thanks to the wide spatial coverage of the devices in the acoustic scene. However, speech enhancement in ad-hoc microphone arrays still raises many challenges. In particular, the algorithms should be able to handle a variable number of microphones, as some devices in the array might appear or disappear. In this paper, we propose a solution that can efficiently process the spatial information captured by the different devices of the microphone array, while being robust to a link failure. To do this, we use an attention mechanism in order to put more weight on the relevant signals sent throughout the array and to neglect the redundant or empty channels.
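One way to picture an attention mechanism that tolerates a varying number of nodes is a masked softmax over per-channel relevance scores, where missing channels receive zero weight. The sketch below is a generic illustration with hypothetical scores and function names; it is not the architecture used in the paper.

```python
import math


def masked_attention_weights(scores, available):
    """Softmax over channel scores, ignoring channels flagged as unavailable."""
    active = [s for s, ok in zip(scores, available) if ok]
    if not active:
        raise ValueError("at least one channel must be available")
    peak = max(active)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


def combine(channels, weights):
    """Weighted sum of equally long channel signals."""
    return [
        sum(w * ch[i] for w, ch in zip(weights, channels))
        for i in range(len(channels[0]))
    ]


scores = [2.0, 0.5, 1.0]         # hypothetical relevance scores per node
available = [True, False, True]  # the second node has dropped out
weights = masked_attention_weights(scores, available)
mixed = combine([[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]], weights)
```

Because the weights are renormalized over the available channels only, a dropped node changes the weighting but does not require a different model input size.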
Unsupervised Blind Source Separation with Variational Auto-Encoders
Julian Neri, Roland Badeau, Philippe Depalle
29th European Signal Processing Conference (EUSIPCO 2021), Dublin, Ireland, August 2021.
```
@inproceedings{neri:hal-03255341,
  address = {Dublin, Ireland},
  author = {Neri, Julian and Badeau, Roland and Depalle, Philippe},
  booktitle = {{29th European Signal Processing Conference (EUSIPCO 2021)}},
  hal_id = {hal-03255341},
  hal_version = {v1},
  keywords = {blind source separation ; Bayesian inference ; unmixing ; latent variable model ; universal sound separation},
  month = aug,
  pdf = {https://telecom-paris.hal.science/hal-03255341v1/file/2021_eusipco_vae_camera_ready.pdf},
  title = {{Unsupervised Blind Source Separation with Variational Auto-Encoders}},
  url = {https://telecom-paris.hal.science/hal-03255341},
  year = {2021}
}
```
Supervised source separation requires expensive synthetic datasets containing clean, ground-truth source signals, while unsupervised separation requires only data mixtures. Existing unsupervised methods still use supervision to avoid over-separation and compete with fully supervised methods. We present a new method of completely unsupervised single-channel blind source separation, based on variational auto-encoding, that automatically learns the correct number of sources in data mixtures and quantitatively outperforms the existing methods. A deep inference network disentangles (separates) data mixtures into low-dimensional latent source variables. A deep generative network individually decodes each latent source into its source signal, such that their sum represents the given mixture. Qualitative and quantitative results from separation experiments on pairs of randomly mixed MNIST handwritten digits and mixed audio spectrograms demonstrate that our method outperforms state-of-the-art unsupervised and semi-supervised methods, showing promise as a solution to this long-standing problem in computer vision and audition.
Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning
Salah Zaiem, Titouan Parcollet, Slim Essid
Interspeech 2021, Brno, Czech Republic, August 2021.
```
@inproceedings{zaiem:hal-03601265,
  address = {Brno, Czech Republic},
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim},
  booktitle = {{Interspeech 2021}},
  doi = {10.21437/interspeech.2021-1027},
  hal_id = {hal-03601265},
  hal_version = {v1},
  month = aug,
  pages = {2851-2855},
  publisher = {{ISCA}},
  title = {{Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning}},
  url = {https://hal.science/hal-03601265},
  year = {2021}
}
```
Through solving pretext tasks, self-supervised learning (SSL) leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. A common pretext task consists of pretraining an SSL model on pseudo-labels derived from the original signal. This technique is particularly relevant for speech data where various meaningful signal processing features may serve as pseudo-labels. However, the process of selecting pseudo-labels, for speech or other types of data, remains mostly unexplored and currently relies on observing the results on the final downstream task. Nevertheless, this methodology is not sustainable at scale due to substantial computational (hence carbon) costs. Thus, this paper introduces a practical and theoretical framework to select relevant pseudo-labels with respect to a given downstream task. More precisely, we propose a functional estimator of the pseudo-label utility grounded in the conditional independence theory, which does not require any training. The experiments conducted on speaker recognition and automatic speech recognition validate our estimator, showing a significant correlation between the performance observed on the downstream task and the utility estimates obtained with our approach, facilitating the search for relevant pseudo-labels for self-supervised speech representation learning.
Relative Positional Encoding for Transformers with Linear Complexity
Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang, Gael Richard
Proceedings of the 38th International Conference on Machine Learning, Virtual Only, United States, July 2021.
```
@inproceedings{liutkus:hal-03256451,
  address = {Virtual Only, United States},
  author = {Liutkus, Antoine and C{\'i}fka, Ond{\v r}ej and Wu, Shih-Lun and {\c S}im{\c s}ekli, Umut and Yang, Yi-Hsuan and Richard, Gael},
  booktitle = {{Proceedings of the 38th International Conference on Machine Learning}},
  hal_id = {hal-03256451},
  hal_version = {v1},
  month = jul,
  number = {139},
  pages = {7067-7079},
  pdf = {https://telecom-paris.hal.science/hal-03256451v1/file/spe.pdf},
  publisher = {{PMLR}},
  series = {Proceedings of the 38th International Conference on Machine Learning},
  title = {{Relative Positional Encoding for Transformers with Linear Complexity}},
  url = {https://telecom-paris.hal.science/hal-03256451},
  volume = {Proceedings of Machine Learning Research},
  year = {2021}
}
```
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists of exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.

Cross-Modal Music-Video Recommendation: A Study of Design Choices
Laure Prétet, Gael Richard, Geoffroy Peeters
Special Session of the International Joint Conference on Neural Networks (IJCNN 2021), Shenzhen, China, July 2021.
```
@inproceedings{pretet:hal-03208323,
  address = {Shenzhen, China},
  author = {Pr{\'e}tet, Laure and Richard, Gael and Peeters, Geoffroy},
  booktitle = {{Special Session of the International Joint Conference on Neural Networks (IJCNN 2021)}},
  hal_id = {hal-03208323},
  hal_version = {v1},
  month = jul,
  pdf = {https://telecom-paris.hal.science/hal-03208323v1/file/2021075874.pdf},
  title = {{Cross-Modal Music-Video Recommendation: A Study of Design Choices}},
  url = {https://telecom-paris.hal.science/hal-03208323},
  year = {2021}
}
```
NEURO-STEERED MUSIC SOURCE SEPARATION WITH EEG-BASED AUDITORY ATTENTION DECODING AND CONTRASTIVE-NMF
Giorgia Cantisani, Slim Essid, Gael Richard
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto (virtual conference), Canada, June 2021.
```
@inproceedings{cantisani:hal-02978978,
  address = {Toronto (virtual conference), Canada},
  author = {Cantisani, Giorgia and Essid, Slim and Richard, Gael},
  booktitle = {{ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP39728.2021.9413841},
  hal_id = {hal-02978978},
  hal_version = {v4},
  keywords = {Audio source separation ; Auditory attention decoding ; Polyphonic music ; EEG},
  month = jun,
  pdf = {https://telecom-paris.hal.science/hal-02978978v4/file/C-NMF-Hal.pdf},
  title = {{NEURO-STEERED MUSIC SOURCE SEPARATION WITH EEG-BASED AUDITORY ATTENTION DECODING AND CONTRASTIVE-NMF}},
  url = {https://telecom-paris.hal.science/hal-02978978},
  year = {2021}
}
```
We propose a novel informed music source separation paradigm, which can be referred to as neuro-steered music source separation. More precisely, the source separation process is guided by the user’s selective auditory attention decoded from his/her EEG response to the stimulus. This high-level prior information is used to select the desired instrument to isolate and to adapt the generic source separation model to the observed signal. To this end, we leverage the fact that the attended instrument’s neural encoding is substantially stronger than that of the unattended sources left in the mixture. This "contrast" is extracted using an attention decoder and used to inform a source separation model based on non-negative matrix factorization named Contrastive-NMF. The results are promising and show that the EEG information can automatically select the desired source to enhance and improve the separation quality.
Self-Supervised VQ-VAE for One-Shot Music Style Transfer
Ondřej Cífka, Alexey Ozerov, Umut Şimşekli, Gael Richard
ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto / Virtual, Canada, June 2021.
```
@inproceedings{cifka:hal-03132940,
  address = {Toronto / Virtual, Canada},
  author = {C{\'i}fka, Ond{\v r}ej and Ozerov, Alexey and {\c S}im{\c s}ekli, Umut and Richard, Gael},
  booktitle = {{ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing}},
  doi = {10.1109/ICASSP39728.2021.9414235},
  hal_id = {hal-03132940},
  hal_version = {v1},
  month = jun,
  pdf = {https://telecom-paris.hal.science/hal-03132940v1/file/paper.pdf},
  title = {{Self-Supervised VQ-VAE for One-Shot Music Style Transfer}},
  url = {https://telecom-paris.hal.science/hal-03132940},
  year = {2021}
}
```
Neural style transfer, which applies the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.
Distributed speech separation in spatially unconstrained microphone arrays
Nicolas Furnon, Romain Serizel, Irina Illina, Slim Essid
ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing, Toronto / Virtual, Canada, June 2021.
```
@inproceedings{furnon:hal-02985794,
  address = {Toronto / Virtual, Canada},
  author = {Furnon, Nicolas and Serizel, Romain and Illina, Irina and Essid, Slim},
  booktitle = {{ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing}},
  doi = {10.1109/ICASSP39728.2021.9414758},
  hal_id = {hal-02985794},
  hal_version = {v3},
  keywords = {Speech separation ; Microphone arrays ; Distributed processing},
  month = jun,
  pdf = {https://hal.science/hal-02985794v3/file/icassp2021.pdf},
  title = {{Distributed speech separation in spatially unconstrained microphone arrays}},
  url = {https://hal.science/hal-02985794},
  year = {2021}
}
```
Speech separation with several speakers is a challenging task because of the non-stationarity of the speech and the strong signal similarity between interfering sources. Current state-of-the-art solutions can separate the different sources well using sophisticated deep neural networks which are very tedious to train. When several microphones are available, spatial information can be exploited to design much simpler algorithms to discriminate speakers. We propose a distributed algorithm that can process spatial information in a spatially unconstrained microphone array. The algorithm relies on a convolutional recurrent neural network that can exploit the signal diversity from the distributed nodes. In a typical case of a meeting room, this algorithm can capture an estimate of each source in a first step and propagate it over the microphone array in order to increase the separation performance in a second step. We show that this approach performs even better when the number of sources and nodes increases. We also study the influence of a mismatch in the number of sources between the training and testing conditions.
Comparing Representations for Audio Synthesis Using Generative Adversarial Networks
Javier Nistal Hurlé, Stefan Lattner, Gael Richard
2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam (virtual), Netherlands, January 2021.
```
@inproceedings{nistalhurle:hal-03233340,
  address = {Amsterdam (virtual), Netherlands},
  author = {Nistal Hurl{\'e}, Javier and Lattner, Stefan and Richard, Gael},
  booktitle = {{2020 28th European Signal Processing Conference (EUSIPCO)}},
  doi = {10.23919/Eusipco47968.2020.9287799},
  hal_id = {hal-03233340},
  hal_version = {v1},
  keywords = {audio ; representations ; synthesis ; generative ; adversarial},
  month = jan,
  pages = {161-165},
  pdf = {https://telecom-paris.hal.science/hal-03233340/file/2021-Eusipco-Nistal.pdf},
  publisher = {{IEEE}},
  title = {{Comparing Representations for Audio Synthesis Using Generative Adversarial Networks}},
  url = {https://telecom-paris.hal.science/hal-03233340},
  year = {2021}
}
```
In this paper, we compare different audio signal representations, including the raw audio waveform and a variety of time-frequency representations, for the task of audio synthesis with Generative Adversarial Networks (GANs). We conduct the experiments on a subset of the NSynth dataset. The architecture follows the benchmark Progressive Growing Wasserstein GAN. We perform experiments both in a fully non-conditional manner as well as conditioning the network on the pitch information. We quantitatively evaluate the generated material utilizing standard metrics for assessing generative models, and compare training and sampling times. We show that complex-valued as well as the magnitude and Instantaneous Frequency of the Short-Time Fourier Transform achieve the best results, and yield fast generation and inversion times. The code for feature extraction, training and evaluating the model is available online.
Comparing Representations for Audio Synthesis Using Generative Adversarial Networks
Gaël Richard, Javier Nistal, Stefan Plattner
2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam (Virtual), Netherlands, January 2021.
```
@inproceedings{richard:hal-03073936,
  address = {Amsterdam (Virtual), Netherlands},
  author = {Richard, Ga{\"e}l and Nistal, Javier and Plattner, Stefan},
  booktitle = {{2020 28th European Signal Processing Conference (EUSIPCO)}},
  doi = {10.23919/Eusipco47968.2020.9287799},
  hal_id = {hal-03073936},
  hal_version = {v1},
  keywords = {Audio ; Representations ; Synthesis ; Generative ; Adversarial},
  month = jan,
  pages = {161-165},
  pdf = {https://hal.science/hal-03073936v1/file/2006.09266.pdf},
  publisher = {{IEEE}},
  title = {{Comparing Representations for Audio Synthesis Using Generative Adversarial Networks}},
  url = {https://hal.science/hal-03073936},
  year = {2021}
}
```
In this paper, we compare different audio signal representations, including the raw audio waveform and a variety of time-frequency representations, for the task of audio synthesis with Generative Adversarial Networks (GANs). We conduct the experiments on a subset of the NSynth dataset. The architecture follows the benchmark Progressive Growing Wasserstein GAN. We perform experiments both in a fully non-conditional manner as well as conditioning the network on the pitch information. We quantitatively evaluate the generated material utilizing standard metrics for assessing generative models, and compare training and sampling times. We show that complex-valued as well as the magnitude and Instantaneous Frequency of the Short-Time Fourier Transform achieve the best results, and yield fast generation and inversion times. The code for feature extraction, training and evaluating the model is available online.

Theses

Personalized audio auto-tagging as proxy for contextual music recommendation
Karim Magdi Abdelfattah Ibrahim
December 2021.
```
@phdthesis{ibrahim:tel-03633097,
  author = {Ibrahim, Karim Magdi Abdelfattah},
  hal_id = {tel-03633097},
  hal_version = {v1},
  keywords = {Music auto-tagging ; Context-aware ; Music recommendation ; Auto-tagging musical ; Context-aware ; Recommandation musicale},
  month = dec,
  number = {2021IPPAT039},
  pdf = {https://tel.archives-ouvertes.fr/tel-03633097/file/103328_IBRAHIM_2021_archivage.pdf},
  school = {{Institut Polytechnique de Paris}},
  title = {{Personalized audio auto-tagging as proxy for contextual music recommendation}},
  type = {Theses},
  url = {https://tel.archives-ouvertes.fr/tel-03633097},
  year = {2021}
}
```
The exponential growth of online services and user data changed how we interact with various services, and how we explore and select new products. Hence, there is a growing need for methods to recommend the appropriate items for each user. In the case of music, it is more important to recommend the right items at the right moment. It has been well documented that the context, i.e. the listening situation of the users, strongly influences their listening preferences. Hence, there has been increasing attention towards developing recommendation systems. State-of-the-art approaches are sequence-based models aiming at predicting the tracks in the next session using available contextual information. However, these approaches lack interpretability and serve as a hit-or-miss with no room for user involvement. Additionally, few previous approaches focused on studying how the audio content relates to these situational influences, and even to a lesser extent making use of the audio content in providing contextual recommendations. Hence, these approaches suffer from a lack of interpretability. In this dissertation, we study the potential of using the audio content primarily to disambiguate the listening situations, providing a pathway for interpretable recommendations based on the situation. First, we study the potential listening situations that influence or change the listening preferences of the users. We developed a semi-automated approach to link the listened tracks to the listening situation using playlist titles as a proxy. Through this approach, we were able to collect datasets of music tracks labelled with their situational use. We proceeded with studying the use of music auto-taggers to identify potential listening situations using the audio content. These studies led to the conclusion that the situational use of a track is highly user-dependent. Hence, we proceeded with extending the music auto-taggers to a user-aware model to make personalized predictions. Our studies showed that including the user in the loop significantly improves the performance of predicting the situations. This user-aware music auto-tagger enabled us to tag a given track through the audio content with potential situational use, according to a given user by leveraging their listening history. Finally, to successfully employ this approach for a recommendation task, we needed a different method to predict the potential current situations of a given user. To this end, we developed a model to predict the situation given the data transmitted from the user’s device to the service, and the demographic information of the given user. Our evaluations show that the models can successfully learn to discriminate the potential situations and rank them accordingly. By combining the two models, the auto-tagger and the situation predictor, we developed a framework to generate situational sessions in real-time and propose them to the user. This framework provides an alternative pathway to recommending situational sessions, aside from the primary sequential recommendation system deployed by the service, which is both interpretable and addresses the cold-start problem in terms of recommending tracks based on their content.

Patents

Conversion de la parole par apprentissage statistique avec modélisation complexe des modifications temporelles
Enguerrand Gentet, Sebastien Denjean, Vincent Roussarie, David Bertrand, Gael Richard
France, July 2021.
```
@patent{gentet:hal-03413450,
  address = {France},
  author = {Gentet, Enguerrand and Denjean, Sebastien and Roussarie, Vincent and Bertrand, David and Richard, Gael},
  hal_id = {hal-03413450},
  hal_version = {v1},
  month = jul,
  number = {FR3106691},
  title = {{Conversion de la parole par apprentissage statistique avec mod{\'e}lisation complexe des modifications temporelles}},
  url = {https://telecom-paris.hal.science/hal-03413450},
  year = {2021}
}
```

Journal Articles

DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays
Nicolas Furnon, Romain Serizel, Slim Essid, Irina Illina
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2021.
```
@article{furnon:hal-02985867,
  author = {Furnon, Nicolas and Serizel, Romain and Essid, Slim and Illina, Irina},
  doi = {10.1109/TASLP.2021.3092838},
  hal_id = {hal-02985867},
  hal_version = {v3},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  pages = {2310 - 2323},
  pdf = {https://hal.science/hal-02985867v3/file/furnon.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays}},
  url = {https://hal.science/hal-02985867},
  volume = {29},
  year = {2021}
}
```
Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals, which are pre-filtered target estimations. We study the performance of this algorithm, named Tango, under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking advantage of their spatial coverage in the room. We also propose to use the compressed signals not only to convey the target estimation but also the noise estimation in order to exploit the acoustic diversity recorded throughout the microphone array.
Approximate Inference and Learning of State Space Models with Laplace Noise
Julian Neri, Philippe Depalle, Roland Badeau
IEEE Transactions on Signal Processing, 2021.
```
@article{neri:hal-03255319,
  author = {Neri, Julian and Depalle, Philippe and Badeau, Roland},
  doi = {10.1109/tsp.2021.3075146},
  hal_id = {hal-03255319},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  keywords = {Bayesian inference ; time series ; heavy-tailed noise ; EM algorithm ; machine learning ; expectation propagation ; Kalman filter ; state estimation ; Laplace distribution},
  pages = {3176 - 3189},
  pdf = {https://telecom-paris.hal.science/hal-03255319v1/file/2020_IEEE_Laplace_Inference_Neri_Julian.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Approximate Inference and Learning of State Space Models with Laplace Noise}},
  url = {https://telecom-paris.hal.science/hal-03255319},
  volume = {69},
  year = {2021}
}
```
State space models have been extensively applied to model and control dynamical systems in disciplines including neuroscience, target tracking, and audio processing. A common modeling assumption is that both the state and data noise are Gaussian because it simplifies the estimation of the system’s state and model parameters. However, in many real-world scenarios where the noise is heavy-tailed or includes outliers, this assumption does not hold, and the performance of the model degrades. In this paper, we present a new approximate inference algorithm for state space models with Laplace-distributed multivariate data that is robust to a wide range of non-Gaussian noise. Exact inference is combined with an expectation propagation algorithm, leading to filtering and smoothing that outperforms existing approximate inference methods for Laplace-distributed data, while retaining a fast speed similar to the Kalman filter. Further, we present a maximum posterior expectation-maximization (EM) algorithm that learns the parameters of the model in an unsupervised way, automatically avoids over-fitting the data, and provides better model estimation than existing methods for the Gaussian model. The quality of the inference and learning algorithms is exemplified through a diverse set of experiments and an application to non-linear tracking of audio frequency.
Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation
Kilian Schulze-Forster, Clement S J Doire, Gael Richard, Roland Badeau
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2021.
```
@article{schulzeforster:hal-03255334,
  author = {Schulze-Forster, Kilian and Doire, Clement S J and Richard, Gael and Badeau, Roland},
  doi = {10.1109/TASLP.2021.3091817},
  hal_id = {hal-03255334},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {monotonic attention mechanism ; Singing voice separation ; lyrics alignment},
  pdf = {https://telecom-paris.hal.science/hal-03255334v1/file/2021_Phoneme_level_lyrics_alignment_and_text-informed_singing_voice_separation.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation}},
  url = {https://telecom-paris.hal.science/hal-03255334},
  year = {2021}
}
```
The goal of singing voice separation is to recover the vocals signal from music mixtures. State-of-the-art performance is achieved by deep neural networks trained in a supervised fashion. Since training data are scarce and music signals are extremely diverse, it remains challenging to achieve high separation quality across various recording and mixing conditions as well as music styles. In this paper, we investigate to what extent the separation can be improved when lyrics transcripts are used as additional information. To this end, we propose a joint approach to phoneme level lyrics alignment and text-informed singing voice separation. It is based on DTW-attention, a new monotonic attention mechanism including a differentiable approximation of dynamic time warping. Experimental results show that the method can align phonemes with mixed singing voice with high precision given accurate transcripts. It also achieves competitive results on challenging word level alignment test sets using less training data than state-of-the-art methods. Sequential alignment and informed separation lead to improved separation quality according to objective measures. Text information helps preserve spectral phoneme properties in the separated voice signals.
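As an illustration of the dynamic time warping that DTW-attention approximates, the following minimal Python sketch computes the classic, non-differentiable DTW cost between two numeric sequences. It is not the authors' implementation: the function name `dtw_cost` and the absolute-difference local cost are assumptions made for this example. A differentiable variant, as mentioned in the abstract, would replace the hard `min` with a smooth approximation.

```python
from typing import List, Sequence


def dtw_cost(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the minimal cumulative alignment cost between a and b.

    Hypothetical helper for illustration: classic DTW with an
    absolute-difference local cost and steps (i-1, j), (i, j-1), (i-1, j-1).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] holds the best cost of aligning a[:i] with b[:j].
    acc: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # The alignment is monotonic: only the three predecessor cells
            # above, to the left, and diagonally above-left are allowed.
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]
```

A sequence that only repeats an element of the other one aligns at zero cost, for example `dtw_cost([1, 2, 3], [1, 2, 2, 3])`.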

2020

Conference Articles

Auralization of a Hybrid Sound Field using a Wave-Stress Tensor Based Model
Aidan Meacham, Roland Badeau, Jean-Dominique Polack
Forum Acusticum, Lyon, France, December 2020.
```
@inproceedings{meacham:hal-03235295,
  address = {Lyon, France},
  author = {Meacham, Aidan and Badeau, Roland and Polack, Jean-Dominique},
  booktitle = {{Forum Acusticum}},
  doi = {10.48465/fa.2020.0833},
  hal_id = {hal-03235295},
  hal_local_reference = {Wave-based room simulations},
  hal_version = {v1},
  keywords = {hybrid ; wave-stress tensor ; wave-based methods ; auralization},
  month = dec,
  pages = {523-529},
  pdf = {https://hal.science/hal-03235295v1/file/000833.pdf},
  title = {{Auralization of a Hybrid Sound Field using a Wave-Stress Tensor Based Model}},
  url = {https://hal.science/hal-03235295},
  year = {2020}
}
```
A hybrid approach to room impulse response synthesis and auralization is developed in the context of a wave-stress tensor based model of late reverberation. This method for efficiently computing spatially varying energy envelopes has been demonstrated to represent the sound field in a sufficiently-diffusing 1-dimensional hallway above 250 Hz. To synthesize a realistic impulse response from the computed decay curves, the direct path, early reflections, and low frequency portion of the sound field must be calculated separately and then combined with the late field to form a hybrid scheme. In this work, we propose one strategy for generating the late field from the aforementioned energy envelopes and suggest the use of a typical pressure-velocity wave-based scheme to generate the other necessary sound field components. Because of the efficiency of the wave-stress tensor based method and the reduced demands on the secondary simulation technique, such a hybridization presents a promising architecture for future real-time auralization in large spaces that may be difficult to model using only a single method.
Extending Deep Rhythm for Tempo and Genre Estimation Using Complex Convolutions, Multitask Learning and Multi-input Network
Hadrien Foroughmand, Geoffroy Peeters
The 2020 Joint Conference on AI Music Creativity, Stockholm, Sweden, October 2020.
```
@inproceedings{foroughmand:hal-03127155,
  address = {Stockholm, Sweden},
  author = {Foroughmand, Hadrien and Peeters, Geoffroy},
  booktitle = {{The 2020 Joint Conference on AI Music Creativity}},
  hal_id = {hal-03127155},
  hal_version = {v1},
  keywords = {Tempo estimation ; genre classification ; deep-learning ; complex network ; multitask ; multi-input},
  month = oct,
  organization = {{Bob Sturm}},
  pdf = {https://hal.science/hal-03127155v1/file/2020_AIMUSIC.pdf},
  title = {{Extending Deep Rhythm for Tempo and Genre Estimation Using Complex Convolutions, Multitask Learning and Multi-input Network}},
  url = {https://hal.science/hal-03127155},
  year = {2020}
}
```
Tempo and genre are two interleaved aspects of music: genres are often associated with rhythm patterns that are played in specific tempo ranges. In this paper, we focus on the recent Deep Rhythm system based on a harmonic representation of rhythm used as an input to a convolutional neural network. To consider the relationships between frequency bands, we process complex-valued inputs through complex convolutions. We also study the joint estimation of tempo/genre using a multitask learning approach. Finally, we study the addition of a second input branch to the system based on a VGG-like architecture applied to a mel-spectrogram input. This multi-input approach improves performance for both tempo and genre estimation.
SHOULD WE CONSIDER THE USERS IN CONTEXTUAL MUSIC AUTO-TAGGING MODELS?
Karim M Ibrahim, Elena V Epure, Geoffroy Peeters, Gael Richard
21st International Society for Music Information Retrieval Conference, Montreal, Canada, October 2020.
```
@inproceedings{ibrahim:hal-02934433,
  address = {Montreal, Canada},
  author = {Ibrahim, Karim M and Epure, Elena V and Peeters, Geoffroy and Richard, Gael},
  booktitle = {{21st International Society for Music Information Retrieval Conference}},
  doi = {10.5281/zenodo.3961560},
  hal_id = {hal-02934433},
  hal_version = {v1},
  month = oct,
  pdf = {https://telecom-paris.hal.science/hal-02934433v1/file/ISMIR2020_V3.1.pdf},
  title = {{SHOULD WE CONSIDER THE USERS IN CONTEXTUAL MUSIC AUTO-TAGGING MODELS?}},
  url = {https://telecom-paris.hal.science/hal-02934433},
  year = {2020}
}
```
Music tags are commonly used to describe and categorize music. Various auto-tagging models and datasets have been proposed for the automatic music annotation with tags. However, the past approaches often neglect the fact that many of these tags largely depend on the user, especially the tags related to the context of music listening. In this paper, we address this problem by proposing a user-aware music auto-tagging system and evaluation protocol. Specifically, we use both the audio content and user information extracted from the user listening history to predict contextual tags for a given user/track pair. We propose a new dataset of music tracks annotated with contextual tags per user. We compare our model to the traditional audio-based model and study the influence of user embeddings on the classification quality. Our work shows that explicitly modeling the user listening history into the automatic tagging process could lead to more accurate estimation of contextual tags.
CONTENT BASED SINGING VOICE SOURCE SEPARATION VIA STRONG CONDITIONING USING ALIGNED PHONEMES
Gabriel Meseguer-Brocal, Geoffroy Peeters
21st International Society for Music Information Retrieval Conference, Montréal (virtual), Canada, October 2020.
```
@inproceedings{meseguerbrocal:hal-03200161,
  address = {Montr{\'e}al (virtual), Canada},
  author = {Meseguer-Brocal, Gabriel and Peeters, Geoffroy},
  booktitle = {{21st International Society for Music Information Retrieval Conference}},
  hal_id = {hal-03200161},
  hal_version = {v1},
  month = oct,
  pdf = {https://hal.science/hal-03200161v1/file/2008.02070.pdf},
  title = {{CONTENT BASED SINGING VOICE SOURCE SEPARATION VIA STRONG CONDITIONING USING ALIGNED PHONEMES}},
  url = {https://hal.science/hal-03200161},
  year = {2020}
}
```
Informed source separation has recently gained renewed interest with the introduction of neural networks and the availability of large multitrack datasets containing both the mixture and the separated sources. These approaches use prior information about the target source to improve separation. Historically, Music Information Retrieval researchers have focused primarily on score-informed source separation, but more recent approaches explore lyrics-informed source separation. However, because of the lack of multitrack datasets with time-aligned lyrics, models use weak conditioning with non-aligned lyrics. In this paper, we present a multimodal multitrack dataset with lyrics aligned in time at the word level with phonetic information as well as explore strong conditioning using the aligned phonemes. Our model follows a U-Net architecture and takes as input both the magnitude spectrogram of a musical mixture and a matrix with aligned phonetic information. The phoneme matrix is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. These layers condition the U-Net feature maps to adapt the separation process to the presence of different phonemes via affine transformations. We show that phoneme conditioning can be successfully applied to improve singing voice source separation.
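The affine conditioning performed by a FiLM layer can be sketched in a few lines of plain Python. This is only an illustration of the general idea (one scale `gamma` and one shift `beta` per feature-map channel), not the authors' model: the function name `film` and the list-of-lists layout are assumptions, and in the paper the parameters would be produced by the embedded phoneme matrix rather than passed in by hand.

```python
from typing import List, Sequence


def film(
    feature_maps: Sequence[Sequence[float]],
    gamma: Sequence[float],
    beta: Sequence[float],
) -> List[List[float]]:
    """Apply a per-channel affine transform gamma[c] * x + beta[c].

    Hypothetical helper: feature_maps[c] is the flattened feature map of
    channel c; gamma and beta hold one conditioning value per channel.
    """
    if not (len(feature_maps) == len(gamma) == len(beta)):
        raise ValueError("one (gamma, beta) pair is needed per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(feature_maps, gamma, beta)
    ]
```

With `gamma` set to ones and `beta` to zeros the layer is the identity, so the conditioning can only change the features when the conditioning network moves these parameters away from those values.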
MULTILINGUAL LYRICS-TO-AUDIO ALIGNMENT
Andrea Vaglio, Romain Hennequin, Manuel Moussallam, Gael Richard, Florence d’Alché-Buc
International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada, October 2020.
```
@inproceedings{vaglio:hal-02996940,
  address = {Montreal, Canada},
  author = {Vaglio, Andrea and Hennequin, Romain and Moussallam, Manuel and Richard, Gael and d'Alch{\'e}-Buc, Florence},
  booktitle = {{International Society for Music Information Retrieval Conference (ISMIR)}},
  hal_id = {hal-02996940},
  hal_version = {v1},
  month = oct,
  pdf = {https://hal.science/hal-02996940v1/file/101.pdf},
  title = {{MULTILINGUAL LYRICS-TO-AUDIO ALIGNMENT}},
  url = {https://hal.science/hal-02996940},
  year = {2020}
}
```
Lyrics-to-audio alignment methods have recently reported impressive results, opening the door to practical applications such as karaoke and within song navigation. However, most studies focus on a single language, usually English, for which annotated data are abundant. The question of their ability to generalize to other languages, especially in low (or even zero) training resource scenarios has been so far left unexplored. In this paper, we address the lyrics-to-audio alignment task in a generalized multilingual setup. More precisely, this investigation presents the first (to the best of our knowledge) attempt to create a language-independent lyrics-to-audio alignment system. Building on a Recurrent Neural Network (RNN) model trained with a Connectionist Temporal Classification (CTC) algorithm, we study the relevance of different intermediate representations, either character or phoneme, along with several strategies to design a training set. The evaluation is conducted on multiple languages with a varying amount of data available, from plenty to zero. Results show that learning from diverse data and using a universal phoneme set as an intermediate representation yield the best generalization performances.
EVALUATION OF A STOCHASTIC REVERBERATION MODEL BASED ON THE IMAGE SOURCE PRINCIPLE
Achille Aknin, Théophile Dupré, Roland Badeau
International Conference on Digital Audio Effects, Vienne, Austria, September 2020.
```
@inproceedings{aknin:hal-02932485,
  address = {Vienne, Austria},
  author = {Aknin, Achille and Dupr{\'e}, Th{\'e}ophile and Badeau, Roland},
  booktitle = {{International Conference on Digital Audio Effects}},
  hal_id = {hal-02932485},
  hal_version = {v1},
  month = sep,
  pdf = {https://telecom-paris.hal.science/hal-02932485v1/file/Dafx_2020.pdf},
  title = {{EVALUATION OF A STOCHASTIC REVERBERATION MODEL BASED ON THE IMAGE SOURCE PRINCIPLE}},
  url = {https://telecom-paris.hal.science/hal-02932485},
  year = {2020}
}
```
Various audio signal processing applications, such as source separation and dereverberation, require an accurate mathematical modeling of the input audio data. In the literature, many works have focused on source signal modeling, while the reverberation model is often kept very simplistic. This paper aims to investigate a stochastic room impulse response model presented in a previous article: this model is first adapted to discrete time, then we propose a parametric estimation algorithm, that we evaluate experimentally. Our results show that this algorithm is able to efficiently estimate the model parameters, in various experimental settings (various signal-to-noise ratios and absorption coefficients of the room walls).
DrumGAN: Synthesis of drum sounds with timbral feature conditioning using Generative Adversarial Networks
Javier Nistal Hurlé, Stefan Lattner, Gael Richard
21st International Society for Music Information Retrieval Conference (ISMIR), Toronto, Canada, August 2020.
```
@inproceedings{nistalhurle:hal-03233337,
  address = {Toronto, Canada},
  author = {Nistal Hurl{\'e}, Javier and Lattner, Stefan and Richard, Gael},
  booktitle = {{21st International Society for Music Information Retrieval Conference (ISMIR)}},
  hal_id = {hal-03233337},
  hal_version = {v1},
  month = aug,
  pdf = {https://telecom-paris.hal.science/hal-03233337v1/file/2020-ISMIR_DrumGAN.pdf},
  title = {{DrumGAN: Synthesis of drum sounds with timbral feature conditioning using Generative Adversarial Networks}},
  url = {https://telecom-paris.hal.science/hal-03233337},
  year = {2020}
}
```
Synthetic creation of drum sounds (e.g., in drum machines) is commonly performed using analog or digital synthesis, allowing a musician to sculpt the desired timbre modifying various parameters. Typically, such parameters control low-level features of the sound and often have no musical meaning or perceptual correspondence. With the rise of Deep Learning, data-driven processing of audio emerges as an alternative to traditional signal processing. This new paradigm allows controlling the synthesis process through learned high-level features or by conditioning a model on musically relevant information. In this paper, we apply a Generative Adversarial Network to the task of audio synthesis of drum sounds. By conditioning the model on perceptual features computed with a publicly available feature-extractor, intuitive control is gained over the generation process. The experiments are carried out on a large collection of kick, snare, and cymbal sounds. We show that, compared to a specific prior work based on a U-Net architecture, our approach considerably improves the quality of the generated drum samples, and that the conditional input indeed shapes the perceptual characteristics of the sounds. Also, we provide audio examples and release the code for reproducibility.
Confidence-based Weighted Loss for Multi-label Classification with Missing Labels
Karim M Ibrahim, Elena Epure, Geoffroy Peeters, Gael Richard
The 2020 International Conference on Multimedia Retrieval (ICMR ’20), Dublin, Ireland, June 2020.
```
@inproceedings{ibrahim:hal-02547012,
  address = {Dublin, Ireland},
  author = {Ibrahim, Karim M and Epure, Elena and Peeters, Geoffroy and Richard, Gael},
  booktitle = {{The 2020 International Conference on Multimedia Retrieval (ICMR '20)}},
  doi = {10.1145/3372278.3390728},
  hal_id = {hal-02547012},
  hal_version = {v1},
  month = jun,
  pdf = {https://hal.science/hal-02547012v1/file/ICMR_paper_v5.3.pdf},
  title = {{Confidence-based Weighted Loss for Multi-label Classification with Missing Labels}},
  url = {https://hal.science/hal-02547012},
  year = {2020}
}
```
The problem of multi-label classification with missing labels (MLML) is a common challenge in several domains, e.g. image annotation and auto-tagging. In multi-label classification, each instance may belong to multiple class labels simultaneously. Due to the nature of the dataset collection and labelling procedure, it is common to have incomplete annotations in the dataset, i.e. not all samples are labelled with all the corresponding labels. However, the incomplete data labelling hinders the training of classification models. MLML has received much attention from the research community. However, in cases where a pre-trained model is fine-tuned on an MLML dataset, there has been no straightforward approach to tackle the missing labels, specifically when there is no information about which are the missing ones. In this paper, we propose a weighted loss function to account for the confidence in each label/sample pair that can easily be incorporated to fine-tune a pre-trained model on an incomplete dataset. Our experimental results show that using the proposed loss function improves the performance of the model as the ratio of missing labels increases.
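The idea of weighting each label/sample pair by a confidence value can be sketched as a weighted binary cross-entropy. The snippet below is a generic illustration under stated assumptions, not the loss defined in the paper: the abstract does not give the weighting scheme, so the function name `weighted_bce`, the `weights` argument, and the averaging over all pairs are hypothetical choices.

```python
import math
from typing import Sequence


def weighted_bce(
    probs: Sequence[float],
    labels: Sequence[int],
    weights: Sequence[float],
    eps: float = 1e-7,
) -> float:
    """Mean binary cross-entropy where each pair is scaled by a confidence.

    Hypothetical helper: probs[i] is the predicted probability for pair i,
    labels[i] is 0 or 1, and weights[i] in [0, 1] says how much the
    annotation of pair i is trusted (a low weight for a doubtful negative).
    """
    if not (len(probs) == len(labels) == len(weights)) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

Giving a low weight to a negative label that may in fact be a missing positive reduces the penalty the model receives for predicting that label, which is the intuition the abstract describes.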

Neutral to Lombard Speech Conversion with Deep Learning
Bertrand David, Enguerrand Gentet, Sebastien Denjean, Gael Richard, Vincent Roussarie
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
```
@inproceedings{david:hal-02713204,
  address = {Barcelona, Spain},
  author = {David, Bertrand and Gentet, Enguerrand and Denjean, Sebastien and Richard, Gael and Roussarie, Vincent},
  booktitle = {{ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP40776.2020.9053006},
  hal_id = {hal-02713204},
  hal_version = {v1},
  month = may,
  pages = {7739-7743},
  publisher = {{IEEE}},
  title = {{Neutral to Lombard Speech Conversion with Deep Learning}},
  url = {https://inria.hal.science/hal-02713204},
  year = {2020}
}
```
A Prototypical Triplet Loss for Cover Detection
Guillaume Doras, Geoffroy Peeters
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
```
@inproceedings{doras:hal-04448257,
  address = {Barcelona, Spain},
  author = {Doras, Guillaume and Peeters, Geoffroy},
  booktitle = {{ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP40776.2020.9054619},
  hal_id = {hal-04448257},
  hal_version = {v1},
  keywords = {Industries ; Training ; Signal processing ; Data models ; Task analysis ; Speech processing ; Standards},
  month = may,
  pages = {3797-3801},
  publisher = {{IEEE}},
  title = {{A Prototypical Triplet Loss for Cover Detection}},
  url = {https://hal.science/hal-04448257},
  year = {2020}
}
```
Automatic cover detection - the task of finding in an audio dataset all covers of a query track - has long been a challenging theoretical problem in the MIR community. It has also become a practical need for music composers' societies, which must automatically detect whether an audio excerpt embeds musical content belonging to their catalog. In a recent work, we addressed this problem with a convolutional neural network mapping each track’s dominant melody to an embedding vector, and trained to minimize cover pairs distance in the embedding space, while maximizing it for non-covers. We showed in particular that training this model with enough works having five or more covers yields state-of-the-art results. This however does not reflect the realistic use case, where music catalogs typically contain works with zero or at most one or two covers. We thus introduce here a new test set incorporating these constraints, and propose two contributions to improve our model’s accuracy under these stricter conditions: we replace dominant melody with multi-pitch representation as input data, and describe a novel prototypical triplet loss designed to improve covers clustering. We show that these changes improve results significantly for two concrete use cases, large dataset lookup and live songs identification.
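For reference, the standard triplet margin loss that such embedding models build on can be sketched as follows. This is the generic formulation (pull a cover towards its anchor, push a non-cover away by at least a margin), not the prototypical variant introduced in the paper, whose details are not given in the abstract. The function name `triplet_loss` and the default margin are assumptions made for this example.

```python
import math
from typing import Sequence


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Standard triplet margin loss on Euclidean distances.

    Hypothetical helper: anchor and positive are embeddings of two versions
    of the same work, negative is the embedding of a different work.
    """
    d_pos = math.dist(anchor, positive)  # distance to the cover
    d_neg = math.dist(anchor, negative)  # distance to the non-cover
    # The loss is zero once the non-cover is farther than the cover
    # by at least the margin.
    return max(0.0, d_pos - d_neg + margin)
```

The loss vanishes for triplets that are already well separated, so training focuses on the triplets where a non-cover sits too close to the anchor.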
DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays
Nicolas Furnon, Romain Serizel, Irina Illina, Slim Essid
ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020. Submitted to....
```
@inproceedings{furnon:hal-02389159,
  address = {Barcelona, Spain},
  author = {Furnon, Nicolas and Serizel, Romain and Illina, Irina and Essid, Slim},
  booktitle = {{ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing}},
  hal_id = {hal-02389159},
  hal_version = {v3},
  keywords = {Speech enhancement ; distributed processing ; microphone arrays},
  month = may,
  note = {Submitted to ICASSP2020},
  pdf = {https://hal.science/hal-02389159v3/file/icassp2020.pdf},
  title = {{DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays}},
  url = {https://hal.science/hal-02389159},
  year = {2020}
}
```
Multichannel processing is widely used for speech enhancement, but several limitations appear when trying to deploy these solutions in the real world. Distributed sensor arrays, which combine several devices each equipped with a few microphones, are a viable alternative that exploits the many microphone-equipped devices we use in everyday life. In this context, we propose to extend the distributed adaptive node-specific signal estimation approach to a neural network framework. At each node, local filtering is performed to send one signal to the other nodes, where a mask is estimated by a neural network in order to compute a global multi-channel Wiener filter. In an array of two nodes, we show that this additional signal can be efficiently taken into account to predict the masks and leads to better speech enhancement performance than when the mask estimation relies only on the local signals.

Speech Intelligibility Enhancement by Equalization for in-Car Applications
Enguerrand Gentet, David Bertrand, Sebastien Denjean, Gael Richard, Vincent Roussarie
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.

@inproceedings{gentet:hal-02713178,
  address = {Barcelona, Spain},
  author = {Gentet, Enguerrand and Bertrand, David and Denjean, Sebastien and Richard, Gael and Roussarie, Vincent},
  booktitle = {{ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP40776.2020.9053537},
  hal_id = {hal-02713178},
  hal_version = {v1},
  month = may,
  pages = {6934-6938},
  publisher = {{IEEE}},
  title = {{Speech Intelligibility Enhancement by Equalization for in-Car Applications}},
  url = {https://inria.hal.science/hal-02713178},
  year = {2020}
}

AUDIO-BASED AUTO-TAGGING WITH CONTEXTUAL TAGS FOR MUSIC
Karim M Ibrahim, Jimena Royo-Letelier, Elena V. Epure, Geoffroy Peeters, Gael Richard
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
```
@inproceedings{ibrahim:hal-02481374,
  address = {Barcelona, Spain},
  author = {Ibrahim, Karim M and Royo-Letelier, Jimena and Epure, Elena V. and Peeters, Geoffroy and Richard, Gael},
  booktitle = {{International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  doi = {10.5281/zenodo.3648287},
  hal_id = {hal-02481374},
  hal_version = {v1},
  keywords = {music auto-tagging ; user context ; dataset collection ; multi-label classification ; missing labels},
  month = may,
  pdf = {https://hal.science/hal-02481374v1/file/ICASSP2020_context_v5.1.pdf},
  title = {{AUDIO-BASED AUTO-TAGGING WITH CONTEXTUAL TAGS FOR MUSIC}},
  url = {https://hal.science/hal-02481374},
  year = {2020}
}
```
Music listening context such as location or activity has been shown to greatly influence the users’ musical tastes. In this work, we study the relationship between user context and audio content in order to enable context-aware music recommendation agnostic to user data. For that, we propose a semi-automatic procedure to collect track sets which leverages playlist titles as a proxy for context labelling. Using this, we create and release a dataset of ∼50k tracks labelled with 15 different contexts. Then, we present benchmark classification results on the created dataset using an audio auto-tagging model. As the training and evaluation of these models are impacted by missing negative labels due to incomplete annotations, we propose a sample-level weighted cross entropy loss to account for the confidence in missing labels and show improved context prediction results.
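As a rough illustration of the weighted loss mentioned in this abstract, the sketch below computes a binary cross-entropy in which every (track, tag) entry carries its own confidence weight. The function name `weighted_bce`, the `weights` array and the normalization by the sum of weights are assumptions made for illustration; the abstract does not specify the exact formulation used in the paper.

```python
import numpy as np

def weighted_bce(probs, labels, weights, eps=1e-7):
    """Binary cross-entropy with one weight per (sample, label) entry.

    probs, labels, weights: arrays of shape (n_samples, n_labels).
    A low weight down-weights an entry whose label is uncertain,
    e.g. a negative label that may simply be missing (assumed scheme).
    """
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    bce = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(np.sum(weights * bce) / np.sum(weights))

# Hypothetical example: one track, two context tags.
probs = np.array([[0.9, 0.2]])
labels = np.array([[1.0, 0.0]])
full = weighted_bce(probs, labels, np.array([[1.0, 1.0]]))
# The second (negative) label is treated as unreliable and ignored.
masked = weighted_bce(probs, labels, np.array([[1.0, 0.0]]))
```

Setting a weight to zero removes the corresponding entry from the loss entirely, while intermediate values express partial confidence in a label.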

Approximate Bayesian computation with the sliced-Wasserstein distance
Kimia Nadjahi, Valentin de Bortoli, Alain Durmus, Roland Badeau, Umut Şimşekli
45th International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.

@inproceedings{nadjahi:hal-02457063,
  address = {Barcelona, Spain},
  author = {Nadjahi, Kimia and de Bortoli, Valentin and Durmus, Alain and Badeau, Roland and {\c S}im{\c s}ekli, Umut},
  booktitle = {{45th International Conference on Acoustics, Speech, and Signal Processing}},
  doi = {10.1109/icassp40776.2020.9054735},
  hal_id = {hal-02457063},
  hal_version = {v1},
  month = may,
  title = {{Approximate Bayesian computation with the sliced-Wasserstein distance}},
  url = {https://telecom-paris.hal.science/hal-02457063},
  year = {2020}
}

Laplace state space filter with exact inference and moment matching
Julian Neri, Philippe Depalle, Roland Badeau
45th International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.

@inproceedings{neri:hal-02456643,
  address = {Barcelona, Spain},
  author = {Neri, Julian and Depalle, Philippe and Badeau, Roland},
  booktitle = {{45th International Conference on Acoustics, Speech, and Signal Processing}},
  hal_id = {hal-02456643},
  hal_version = {v1},
  month = may,
  pdf = {https://telecom-paris.hal.science/hal-02456643v1/file/2020_ICASSP_LSSF_Neri_Final.pdf},
  title = {{Laplace state space filter with exact inference and moment matching}},
  url = {https://telecom-paris.hal.science/hal-02456643},
  year = {2020}
}

Probabilistic filter and smoother for variational inference of Bayesian linear dynamical systems
Julian Neri, Roland Badeau, Philippe Depalle
45th International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.

@inproceedings{neri:hal-02456651,
  address = {Barcelona, Spain},
  author = {Neri, Julian and Badeau, Roland and Depalle, Philippe},
  booktitle = {{45th International Conference on Acoustics, Speech, and Signal Processing}},
  hal_id = {hal-02456651},
  hal_version = {v1},
  month = may,
  pdf = {https://telecom-paris.hal.science/hal-02456651v1/file/2020_ICASSP_VBLDS_Neri_Julian.pdf},
  title = {{Probabilistic filter and smoother for variational inference of Bayesian linear dynamical systems}},
  url = {https://telecom-paris.hal.science/hal-02456651},
  year = {2020}
}

LEARNING TO RANK MUSIC TRACKS USING TRIPLET LOSS
Laure Prétet, Gael Richard, Geoffroy Peeters
ICASSP, Barcelona, Spain, May 2020.
```
@inproceedings{pretet:hal-02477242,
  address = {Barcelona, Spain},
  author = {Pr{\'e}tet, Laure and Richard, Gael and Peeters, Geoffroy},
  booktitle = {{ICASSP}},
  hal_id = {hal-02477242},
  hal_version = {v1},
  keywords = {deep learning ; triplet loss ; triplet mining ; audio music similarity},
  month = may,
  pdf = {https://telecom-paris.hal.science/hal-02477242v1/file/camera_ready.pdf},
  title = {{LEARNING TO RANK MUSIC TRACKS USING TRIPLET LOSS}},
  url = {https://telecom-paris.hal.science/hal-02477242},
  year = {2020}
}
```
Most music streaming services rely on automatic recommendation algorithms to exploit their large music catalogs. These algorithms aim at retrieving a ranked list of music tracks based on their similarity with a target music track. In this work, we propose a method for direct recommendation based on the audio content without explicitly tagging the music tracks. To that aim, we propose several strategies to perform triplet mining from ranked lists. We train a Convolutional Neural Network to learn the similarity via triplet loss. These different strategies are compared and validated on a large-scale experiment against an auto-tagging based approach. The results obtained highlight the efficiency of our system, especially when associated with an Auto-pooling layer.
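For readers unfamiliar with the triplet loss named in this abstract, here is a minimal NumPy sketch of its standard form: the loss is zero once the anchor is closer to the positive than to the negative by at least a margin. The squared Euclidean distance, the `margin` value and the function name are assumptions for illustration; the triplet mining strategies and network described in the paper are not reproduced here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge-style triplet loss over a batch of embeddings.

    anchor, positive, negative: arrays of shape (batch, dim).
    Uses squared Euclidean distances (an assumed choice).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Hypothetical 2-D embeddings for a single triplet.
a = np.array([[0.0, 0.0]])
near = np.array([[0.0, 0.0]])
far = np.array([[3.0, 4.0]])
well_ranked = triplet_loss(a, near, far)   # positive is the closer track
mis_ranked = triplet_loss(a, far, near)    # positive is the farther track
```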
Joint phoneme alignment and text-informed speech separation on highly corrupted speech
Kilian Schulze-Forster, Clément Doire, Gael Richard, Roland Badeau
45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020), Barcelona, Spain, May 2020.
```
@inproceedings{schulzeforster:hal-02457075,
  address = {Barcelona, Spain},
  author = {Schulze-Forster, Kilian and Doire, Cl{\'e}ment and Richard, Gael and Badeau, Roland},
  booktitle = {{45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)}},
  hal_id = {hal-02457075},
  hal_version = {v1},
  keywords = {speech separation ; phoneme alignment ; informed source separation ; attention},
  month = may,
  pdf = {https://telecom-paris.hal.science/hal-02457075v1/file/ICASSP_2020_paper_HAL.pdf},
  title = {{Joint phoneme alignment and text-informed speech separation on highly corrupted speech}},
  url = {https://telecom-paris.hal.science/hal-02457075},
  year = {2020}
}
```
Speech separation quality can be improved by exploiting textual information. However, this usually requires text-to-speech alignment at phoneme level. Classical alignment methods are made for rather clean speech and do not work as well on corrupted speech. We propose to perform text-informed speech-music separation and phoneme alignment jointly using recurrent neural networks and the attention mechanism. We show that it leads to benefits for both tasks. In experiments, phoneme transcripts are used to improve the perceived quality of separated speech over a non-informed baseline. Moreover, our novel phoneme alignment method based on the attention mechanism achieves state-of-the-art alignment accuracy on clean and on heavily corrupted speech.

Audio-Based Detection of Explicit Content in Music
Andrea Vaglio, Romain Hennequin, Manuel Moussallam, Gael Richard, Florence d’Alché-Buc
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.

@inproceedings{vaglio:hal-02747449,
  address = {Barcelona, Spain},
  author = {Vaglio, Andrea and Hennequin, Romain and Moussallam, Manuel and Richard, Gael and d'Alch{\'e}-Buc, Florence},
  booktitle = {{ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP40776.2020.9054278},
  hal_id = {hal-02747449},
  hal_version = {v1},
  month = may,
  pages = {526-530},
  publisher = {{IEEE}},
  title = {{Audio-Based Detection of Explicit Content in Music}},
  url = {https://hal.science/hal-02747449},
  year = {2020}
}

Unsupervised Robust Speech Enhancement Based on Alpha-Stable Fast Multichannel Nonnegative Matrix Factorization
Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Kazuyoshi Yoshii
Proc. Interspeech 2020, 2020.

@inproceedings{fontaine_unsupervised_2020,
  author = {Fontaine, Mathieu and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Yoshii, Kazuyoshi},
  booktitle = {Proc. {Interspeech} 2020},
  copyright = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA)},
  pages = {4541--4545},
  title = {Unsupervised {Robust} {Speech} {Enhancement} {Based} on {Alpha}-{Stable} {Fast} {Multichannel} {Nonnegative} {Matrix} {Factorization}},
  year = {2020}
}

Matrix Factorization for High Frequency Non Intrusive Load Monitoring
Simon Henriet, Benoît Fuentes, Umut Şimşekli, Gael Richard
BuildSys ’20: The 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, Virtual Event, Japan, 2020.
```
@inproceedings{henriet:hal-03162808,
  address = {Virtual Event, Japan},
  author = {Henriet, Simon and Fuentes, Beno{\^i}t and {\c S}im{\c s}ekli, Umut and Richard, Gael},
  booktitle = {{BuildSys '20: The 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation}},
  doi = {10.1145/3427771.3427847},
  hal_id = {hal-03162808},
  hal_version = {v1},
  keywords = {$\bullet$ Computing methodologies $\rightarrow$ Factorization methods ; Source separation ; $\bullet$ Hardware $\rightarrow$ Energy metering ; NILM ; Energy Disaggregation ; High Frequency Data ; Matrix Factorization ; Source Separation},
  pages = {20-24},
  publisher = {{ACM}},
  title = {{Matrix Factorization for High Frequency Non Intrusive Load Monitoring}},
  url = {https://telecom-paris.hal.science/hal-03162808},
  year = {2020}
}
```
Non Intrusive Load Monitoring was introduced 30 years ago in order to monitor the electric consumption of specific equipment inside a building without the need to install multiple sensors. Over three decades, researchers and industry practitioners have described the NILM problem according to the electric data available, the desired quantity to be monitored and the application it was used for. As a consequence of the multitude of choices, a lot of different formulations can be found in the literature. This diversity makes it difficult for researchers from general domains such as machine learning to tackle the NILM problem. In this paper we aim at defining the NILM problem as a Matrix Factorization task using high frequency measurements and also at reviewing methods to solve this problem. We start by defining the general concepts driving the NILM problem and then show how to cast high frequency NILM into a Matrix Factorization problem. Once it is cast as a machine learning problem, we review general-purpose algorithms applicable to this problem such as Independent Component Analysis, Sparse Coding or Semi Non-negative Matrix Factorization and specific NILM methods such as BOLT and IVMF.
The POTUS Corpus, a database of weekly addresses for the study of stance in politics and virtual agents
Thomas Janssoone, Kevin Bailly, Gael Richard, Chloé Clavel
Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, 2020.
```
@inproceedings{janssoone:hal-02873020,
  address = {Marseille, France},
  author = {Janssoone, Thomas and Bailly, Kevin and Richard, Gael and Clavel, Chlo{\'e}},
  booktitle = {{Conference on Language Resources and Evaluation (LREC 2020)}},
  hal_id = {hal-02873020},
  hal_version = {v1},
  keywords = {Multi-modal Social Signal ; Signal Processing ; Embodied Conversational Agent ; Audio Video Corpus ; POTUS},
  pages = {11 - 16},
  pdf = {https://telecom-paris.hal.science/hal-02873020v1/file/2020.lrec-1.193.pdf},
  title = {{The POTUS Corpus, a database of weekly addresses for the study of stance in politics and virtual agents}},
  url = {https://telecom-paris.hal.science/hal-02873020},
  year = {2020}
}
```
One of the main challenges in the field of Embodied Conversational Agents (ECAs) is to generate socially believable agents. The common strategy for agent behaviour synthesis is to rely on dedicated corpus analysis. Such a corpus is composed of multimedia files of socio-emotional behaviors which have been annotated by external observers. The underlying idea is to identify interaction information for the agent’s socio-emotional behavior by checking whether the intended socio-emotional behavior is actually perceived by humans. Then, the annotations can be used as learning classes for machine learning algorithms applied to the social signals. This paper introduces the POTUS Corpus composed of high-quality audio-video files of political addresses to the American people. Two protagonists are present in this database. First, it includes speeches of former president Barack Obama to the American people. Secondly, it provides videos of these same speeches given by a virtual agent named Rodrigue. The ECA reproduces the original address as closely as possible using social signals automatically extracted from the original one. Both are annotated for social attitudes, providing information about the stance observed in each file. The corpus also provides the social signals automatically extracted from Obama’s addresses, which were used to generate those of Rodrigue.
Statistical and Topological Properties of Sliced Probability Divergences
Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, Umut Şimşekli
Advances in Neural Information Processing Systems, Online, France, 2020.
```
@inproceedings{nadjahi:hal-03269119,
  address = {Online, France},
  author = {Nadjahi, Kimia and Durmus, Alain and Chizat, L{\'e}na{\"i}c and Kolouri, Soheil and Shahrampour, Shahin and {\c S}im{\c s}ekli, Umut},
  booktitle = {{Advances in Neural Information Processing Systems}},
  hal_id = {hal-03269119},
  hal_version = {v1},
  title = {{Statistical and Topological Properties of Sliced Probability Divergences}},
  url = {https://hal.archives-ouvertes.fr/hal-03269119},
  year = {2020}
}
```
The idea of slicing divergences has proven successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a ‘base divergence’ between one-dimensional random projections of the two measures. However, the computational and statistical consequences of such a technique have not yet been well-established. In this paper, we aim at bridging this gap and derive some properties of sliced divergence functions. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then refine these results in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of the sliced divergence does not depend on the dimension, even when the base divergence suffers from the curse of dimensionality. We finally apply our general results to the Wasserstein distance and Sinkhorn divergences, and illustrate our theory on both synthetic and real data experiments.
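To make the slicing idea concrete, the following NumPy sketch gives a Monte Carlo estimate of a sliced divergence whose base divergence is the one-dimensional p-Wasserstein distance. It assumes two empirical samples of equal size, for which the 1D distance reduces to comparing sorted projections. The function name, the number of projections and the fixed seed are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    x, y: samples of shape (n, d) with the same n (assumed here so that
    the 1D optimal coupling is given by sorting).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random directions, normalized to lie on the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # One-dimensional projections, sorted along the sample axis.
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    # Base divergence per direction: 1D p-Wasserstein distance to the power p.
    w_p = np.mean(np.abs(x_proj - y_proj) ** p, axis=0)
    # Average over directions approximates the expectation over projections.
    return float(np.mean(w_p) ** (1.0 / p))

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))
shift = np.array([1.0, 0.0, 0.0])
same = sliced_wasserstein(x, x)
moved = sliced_wasserstein(x, x + shift)
```

Each projection costs only a sort, which is why slicing is attractive computationally; the estimate for a pure translation is bounded by the norm of the shift.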

Patents

Method and System for Broadcasting a Multichannel Audio Stream to Terminals of Spectators Attending a Sports Event
Raphael Blouet, Slim Essid
September 2020.

@patent{SE:patent20,
  author = {Blouet, Raphael and Essid, Slim},
  month = sep,
  number = {US 2021/0014627 A1},
  title = {Method and System for Broadcasting a Multichannel Audio Stream to Terminals of Spectators Attending a Sports Event},
  url = {https://perso.telecom-paristech.fr/essid/papers/US20210014627A1.pdf},
  year = {2020}
}

Journal Articles

Creating DALI, a Large Dataset of Synchronized Audio, Lyrics, and Notes
Gabriel Meseguer-Brocal, Alice Cohen-Hadria, Geoffroy Peeters
Transactions of the International Society for Music Information Retrieval (TISMIR), June 2020.

@article{meseguerbrocal:hal-03985545,
  author = {Meseguer-Brocal, Gabriel and Cohen-Hadria, Alice and Peeters, Geoffroy},
  doi = {10.5334/tismir.30},
  hal_id = {hal-03985545},
  hal_version = {v1},
  journal = {{Transactions of the International Society for Music Information Retrieval (TISMIR)}},
  month = jun,
  number = {1},
  pages = {55-67},
  publisher = {{Ubiquity Press}},
  title = {{Creating DALI, a Large Dataset of Synchronized Audio, Lyrics, and Notes}},
  url = {https://hal.science/hal-03985545},
  volume = {3},
  year = {2020}
}

Separation of Alpha-Stable Random Vectors
Mathieu Fontaine, Roland Badeau, Antoine Liutkus
Signal Processing, January 2020.
```
@article{fontaine:hal-02433213,
  author = {Fontaine, Mathieu and Badeau, Roland and Liutkus, Antoine},
  doi = {10.1016/j.sigpro.2020.107465},
  hal_id = {hal-02433213},
  hal_version = {v1},
  journal = {{Signal Processing}},
  keywords = {alpha-stable distribution ; separation theory ; additive models ; measure theory ; optimization},
  month = jan,
  pages = {107465},
  pdf = {https://inria.hal.science/hal-02433213v1/file/AlphaStableVector_final.pdf},
  publisher = {{Elsevier}},
  title = {{Separation of Alpha-Stable Random Vectors}},
  url = {https://inria.hal.science/hal-02433213},
  year = {2020}
}
```
Source separation aims at decomposing a vector into additive components. This is often done by first estimating source parameters before feeding them into a filtering method, often based on ratios of covariances. The whole pipeline is traditionally rooted in some probabilistic framework providing both the likelihood for parameter estimation and the separation method. While Gaussians are ubiquitous for this purpose, many studies showed the benefit of heavy-tailed models for estimation. However, there is no counterpart filtering method to date exploiting such formalism, so that related studies revert to covariance-based filtering after estimation is finished. Here, we introduce a new multivariate separation technique, that fully exploits the flexibility of α-stable heavy-tailed distributions. We show how a spatial representation can be exploited, which decomposes the observation as an infinite sum of contributions originating from all directions. Two methods for separation are derived. The first one is non-linear and similar to a beamforming technique, while the second one is linear, but minimizes a covariation criterion, which is the counterpart of the covariance for α-stable vectors. We evaluate the proposed techniques in a large number of challenging and adverse situations on synthetic experiments, demonstrating their performance for the extraction of signals from strong interferences.
Groove2Groove: One-Shot Music Style Transfer with Supervision from Synthetic Data
Ondřej Cífka, Umut Şimşekli, Gael Richard
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020.
```
@article{cifka:hal-02923548,
  author = {C{\'i}fka, Ond{\v r}ej and {\c S}im{\c s}ekli, Umut and Richard, Gael},
  doi = {10.1109/TASLP.2020.3019642},
  hal_id = {hal-02923548},
  hal_version = {v2},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {style transfer ; symbolic music ; synthetic data ; deep learning ; recurrent neural networks},
  pages = {2638-2650},
  pdf = {https://hal.science/hal-02923548v2/file/Groove2Groove.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Groove2Groove: One-Shot Music Style Transfer with Supervision from Synthetic Data}},
  url = {https://hal.science/hal-02923548},
  volume = {28},
  year = {2020}
}
```
Style transfer is the process of changing the style of an image, video, audio clip or musical piece so as to match the style of a given example. Even though the task has interesting practical applications within the music industry, it has so far received little attention from the audio and music processing community. In this paper, we present Groove2Groove, a one-shot style transfer method for symbolic music, focusing on the case of accompaniment styles in popular music and jazz. We propose an encoder-decoder neural network for the task, along with a synthetic data generation scheme to supply it with parallel training examples. This synthetic parallel data allows us to tackle the style transfer problem using end-to-end supervised learning, employing powerful techniques used in natural language processing. We experimentally demonstrate the performance of the model on style transfer using existing and newly proposed metrics, and also explore the possibility of style interpolation.

2015 - 2019 [102 publications]

2019

Conference Articles

Generalized Sliced Wasserstein Distances
Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, Gustavo K.
NeurIPS 2019, Vancouver, Canada, December 2019.

@inproceedings{kolouri:hal-02280948,
  address = {Vancouver, Canada},
  author = {Kolouri, Soheil and Nadjahi, Kimia and Simsekli, Umut and Badeau, Roland and K., Gustavo},
  booktitle = {{NeurIPS 2019}},
  hal_id = {hal-02280948},
  hal_version = {v1},
  month = dec,
  title = {{Generalized Sliced Wasserstein Distances}},
  url = {https://hal.telecom-paris.fr/hal-02280948},
  year = {2019}
}

Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance
Kimia Nadjahi, Alain Durmus, Umut Simsekli, Roland Badeau
NeurIPS 2019, Vancouver, Canada, December 2019.

@inproceedings{nadjahi:hal-02280944,
  address = {Vancouver, Canada},
  author = {Nadjahi, Kimia and Durmus, Alain and Simsekli, Umut and Badeau, Roland},
  booktitle = {{NeurIPS 2019}},
  hal_id = {hal-02280944},
  hal_version = {v1},
  month = dec,
  title = {{Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance}},
  url = {https://hal.telecom-paris.fr/hal-02280944},
  year = {2019}
}

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
Thanh Huy Nguyen, Umut Simsekli, Mert Gürbüzbalaban, Gael Richard
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, December 2019.
```
@inproceedings{nguyen:hal-02372376,
  address = {Vancouver, Canada},
  author = {Nguyen, Thanh Huy and Simsekli, Umut and G{\"u}rb{\"u}zbalaban, Mert and Richard, Gael},
  booktitle = {{33rd Conference on Neural Information Processing Systems (NeurIPS 2019)}},
  hal_id = {hal-02372376},
  hal_version = {v1},
  month = dec,
  pdf = {https://hal.telecom-paris.fr/hal-02372376/file/first-exit-time-analysis-of-stochastic-gradient-descent-under-heavy-tailed-gradient-noise.pdf},
  title = {{First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise}},
  url = {https://hal.telecom-paris.fr/hal-02372376},
  year = {2019}
}
```
Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using α-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of ‘preferring wide minima’. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding of how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide a formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.
Supervised Symbolic Music Style Translation Using Synthetic Data
Ondřej Cífka, Umut Şimşekli, Gael Richard
20th International Society for Music Information Retrieval Conference (ISMIR), Delft, Netherlands, November 2019.
```
@inproceedings{cifka:hal-02366954,
  address = {Delft, Netherlands},
  author = {C{\'i}fka, Ond{\v r}ej and {\c S}im{\c s}ekli, Umut and Richard, Gael},
  booktitle = {{20th International Society for Music Information Retrieval Conference (ISMIR)}},
  doi = {10.5281/zenodo.3527878},
  hal_id = {hal-02366954},
  hal_version = {v1},
  month = nov,
  pdf = {https://hal.archives-ouvertes.fr/hal-02366954/file/ismir2019_paper_000071.pdf},
  title = {{Supervised Symbolic Music Style Translation Using Synthetic Data}},
  url = {https://hal.archives-ouvertes.fr/hal-02366954},
  year = {2019}
}
```
Research on style transfer and domain translation has clearly demonstrated the ability of deep learning-based algorithms to manipulate images in terms of artistic style. More recently, several attempts have been made to extend such approaches to music (both symbolic and audio) in order to enable transforming musical style in a similar manner. In this study, we focus on symbolic music with the goal of altering the ‘style’ of a piece while keeping its original ‘content’. As opposed to the current methods, which are inherently restricted to be unsupervised due to the lack of ‘aligned’ data (i.e. the same musical piece played in multiple styles), we develop the first fully supervised algorithm for this task. At the core of our approach lies a synthetic data generation scheme which allows us to produce virtually unlimited amounts of aligned data, and hence avoid the above issue. In view of this data generation scheme, we propose an encoder-decoder model for translating symbolic music accompaniments between a number of different styles. Our experiments show that our models, although trained entirely on synthetic data, are capable of producing musically meaningful accompaniments even for real (non-synthetic) MIDI recordings.

TRACKING BEATS AND MICROTIMING IN AFRO-LATIN AMERICAN MUSIC USING CONDITIONAL RANDOM FIELDS AND DEEP LEARNING
Magdalena Fuentes, Lucas S Maia, Martín Rocamora, Luiz W P Biscainho, Hélène C Crayencour, Slim Essid, Juan P. Bello
ISMIR, Delft, Netherlands, November 2019.

@inproceedings{fuentes:hal-02419361,
  address = {Delft, Netherlands},
  author = {Fuentes, Magdalena and Maia, Lucas S and Rocamora, Mart{\'i}n and Biscainho, Luiz W P and Crayencour, H{\'e}l{\`e}ne C and Essid, Slim and Bello, Juan P.},
  booktitle = {{ISMIR}},
  hal_id = {hal-02419361},
  hal_version = {v1},
  month = nov,
  pdf = {https://hal.archives-ouvertes.fr/hal-02419361/file/2_Microtiming_tracking_final_2.pdf},
  title = {{TRACKING BEATS AND MICROTIMING IN AFRO-LATIN AMERICAN MUSIC USING CONDITIONAL RANDOM FIELDS AND DEEP LEARNING}},
  url = {https://hal.archives-ouvertes.fr/hal-02419361},
  year = {2019}
}

From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining
Alexandre Garcia, Pierre Colombo, Slim Essid, Florence d’Alché-Buc, Chloe Clavel
2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, November 2019.
```
@inproceedings{garcia:hal-02371140,
  address = {Hong Kong, China},
  author = {Garcia, Alexandre and Colombo, Pierre and Essid, Slim and d'Alch{\'e}-Buc, Florence and Clavel, Chloe},
  booktitle = {{2019 Conference on Empirical Methods in Natural Language Processing}},
  hal_id = {hal-02371140},
  hal_version = {v1},
  month = nov,
  pdf = {https://hal.archives-ouvertes.fr/hal-02371140/file/1908.11216.pdf},
  title = {{From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining}},
  url = {https://hal.archives-ouvertes.fr/hal-02371140},
  year = {2019}
}
```
The task of predicting fine-grained user opinion based on spontaneous spoken language is a key problem arising in the development of Computational Agents as well as in the development of social-network-based opinion miners. Unfortunately, gathering reliable data on which a model can be trained is notoriously difficult and existing works rely only on coarsely labeled opinions. In this work we aim at bridging the gap separating fine-grained opinion models already developed for written language and coarse-grained models developed for spontaneous multimodal opinion mining. We take advantage of the implicit hierarchical structure of opinions to build a joint fine- and coarse-grained opinion model that exploits different views of the opinion expression. The resulting model shares some properties with attention-based models and is shown to provide competitive results on a recently released multimodal fine-grained annotated corpus.
SAMBASET: A DATASET OF HISTORICAL SAMBA DE ENREDO RECORDINGS FOR COMPUTATIONAL MUSIC ANALYSIS
Lucas S Maia, Magdalena Fuentes, Luiz W P Biscainho, Martín Rocamora, Slim Essid
The 20th International Society for Music Information Retrieval Conference, Delft, Netherlands, November 2019.
```
@inproceedings{maia:hal-02943462,
  address = {Delft, Netherlands},
  author = {Maia, Lucas S and Fuentes, Magdalena and Biscainho, Luiz W P and Rocamora, Mart{\'i}n and Essid, Slim},
  booktitle = {{The 20th International Society for Music Information Retrieval Conference}},
  hal_id = {hal-02943462},
  hal_version = {v1},
  month = nov,
  pdf = {https://hal.archives-ouvertes.fr/hal-02943462/file/LM_ISMIR-19.pdf},
  title = {{SAMBASET: A DATASET OF HISTORICAL SAMBA DE ENREDO RECORDINGS FOR COMPUTATIONAL MUSIC ANALYSIS}},
  url = {https://hal.archives-ouvertes.fr/hal-02943462},
  year = {2019}
}
```
In the last few years, several datasets have been released to meet the requirements of "hungry" yet promising data-driven approaches in music technology research. Since, for historical reasons, most investigations conducted in the field still revolve around music of the so-called "Western" tradition, the corresponding data, methodology and conclusions carry a strong cultural bias. Music of non-"Western" background, whenever present, is usually underrepresented, poorly labeled, or even mislabeled, the exception being projects that aim at specifically describing such music. In this paper we present SAMBASET, a dataset of Brazilian samba music that contains over 40 hours of historical and modern samba de enredo commercial recordings. To the best of our knowledge, this is the first dataset of this genre. We describe the collection of metadata (e.g. artist, composer, release date) and outline our semiautomatic approach to the challenging task of annotating beats in this large dataset, which includes the assessment of the performance of state-of-the-art beat tracking algorithms for this specific case. Finally, we present a study on tempo and beat tracking that illustrates SAMBASET’s value, and we comment on other tasks for which it could be used.
CONDITIONED-U-NET: INTRODUCING A CONTROL MECHANISM IN THE U-NET FOR MULTIPLE SOURCE SEPARATIONS
Gabriel Meseguer-Brocal, Geoffroy Peeters
Proceedings of the 20th International Society for Music Information Retrieval Conference, Delft, Netherlands, November 2019.
```
@inproceedings{meseguerbrocal:hal-02448917,
  address = {Delft, Netherlands},
  author = {Meseguer-Brocal, Gabriel and Peeters, Geoffroy},
  booktitle = {{Proceedings of the 20th International Society for Music Information Retrieval Conference}},
  doi = {10.5281/zenodo.3527766},
  hal_id = {hal-02448917},
  hal_version = {v1},
  month = nov,
  pdf = {https://hal.archives-ouvertes.fr/hal-02448917/file/1907.01277.pdf},
  title = {{CONDITIONED-U-NET: INTRODUCING A CONTROL MECHANISM IN THE U-NET FOR MULTIPLE SOURCE SEPARATIONS}},
  url = {https://hal.archives-ouvertes.fr/hal-02448917},
  year = {2019}
}
```
Data-driven models for audio source separation such as U-Net or Wave-U-Net are usually models dedicated to and specifically trained for a single task, e.g. a particular instrument isolation. Training them for various tasks at once commonly results in worse performances than training them for a single specialized task. In this work, we introduce the Conditioned-U-Net (C-U-Net) which adds a control mechanism to the standard U-Net. The control mechanism allows us to train a unique and generic U-Net to perform the separation of various instruments. The C-U-Net decides the instrument to isolate according to a one-hot-encoding input vector. The input vector is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. FiLM layers modify the U-Net feature maps in order to separate the desired instrument via affine transformations. The C-U-Net performs different instrument separations, all with a single model achieving the same performances as the dedicated ones at a lower cost.
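The conditioning mechanism described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the lookup tables and the toy feature maps are hypothetical, and the embedding is assumed to be a single linear map, which for a one-hot input reduces to selecting one row of each table.

```python
def film_parameters(one_hot, gamma_table, beta_table):
    """Linearly embed a one-hot condition vector into per-channel (gamma, beta).

    gamma_table[k][c] and beta_table[k][c] are hypothetical learned parameters
    for instrument k and feature channel c.
    """
    n_channels = len(gamma_table[0])
    gamma = [sum(h * row[c] for h, row in zip(one_hot, gamma_table))
             for c in range(n_channels)]
    beta = [sum(h * row[c] for h, row in zip(one_hot, beta_table))
            for c in range(n_channels)]
    return gamma, beta


def film(features, gamma, beta):
    """Apply the affine transformation gamma * x + beta to each channel.

    features is a list of channels, each a flat list of feature-map values.
    """
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy example: two instruments, two feature channels (all values made up).
gamma_table = [[1.0, 0.0], [2.0, -1.0]]
beta_table = [[0.0, 0.5], [1.0, 0.0]]
features = [[1.0, 2.0], [3.0, 4.0]]

gamma, beta = film_parameters([0, 1], gamma_table, beta_table)
modulated = film(features, gamma, beta)
```

Changing the one-hot vector changes only `gamma` and `beta`, so the same feature maps are modulated differently for each requested instrument.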
EEG-BASED DECODING OF AUDITORY ATTENTION TO A TARGET INSTRUMENT IN POLYPHONIC MUSIC
Giorgia Cantisani, Slim Essid, Gael Richard
2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, United States, October 2019. Accepted for....
```
@inproceedings{cantisani:hal-02291896,
  address = {New Paltz, NY, United States},
  author = {Cantisani, Giorgia and Essid, Slim and Richard, Gael},
  booktitle = {{2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-02291896},
  hal_version = {v1},
  keywords = {Stimulus reconstruction model ; Polyphonic music ; EEG ; Auditory attention decoding},
  month = oct,
  note = {Accepted for publication at WASPAA 2019},
  pdf = {https://hal.archives-ouvertes.fr/hal-02291896/file/MUSIC_AAD.pdf},
  title = {{EEG-BASED DECODING OF AUDITORY ATTENTION TO A TARGET INSTRUMENT IN POLYPHONIC MUSIC}},
  url = {https://hal.archives-ouvertes.fr/hal-02291896},
  year = {2019}
}
```
Auditory attention decoding aims at determining which sound source a subject is "focusing on". In this work, we address the problem of EEG-based decoding of auditory attention to a target instrument in realistic polyphonic music. To this end, we exploit a stimulus reconstruction model which was proven to decode successfully the attention to speech in multi-speaker environments. To our knowledge, this model was never applied to musical stimuli for decoding attention. The task we consider here is quite complex as the stimuli used are polyphonic, including duets and trios, and are reproduced using loudspeakers instead of headphones. We consider the decoding of three different audio representations and investigate the influence on the decoding performance of multiple variants of musical stimuli, such as the number and type of instruments in the mixture, the spatial rendering, the music genre and the melody/rhythmical pattern that is played. We obtain promising results, comparable to those obtained on speech data in previous works, and confirm that it is possible to correlate the human brain activity with musically relevant features of the attended source.

IDENTIFY, LOCATE AND SEPARATE: AUDIO-VISUAL OBJECT EXTRACTION IN LARGE VIDEO COLLECTIONS USING WEAK SUPERVISION
Sanjeel Parekh, Alexey Ozerov, Slim Essid, Ngoc Duong, Patrick Pérez, Gael Richard
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, United States, October 2019.

```
@inproceedings{parekh:hal-02380780,
  address = {New Paltz, United States},
  author = {Parekh, Sanjeel and Ozerov, Alexey and Essid, Slim and Duong, Ngoc and P{\'e}rez, Patrick and Richard, Gael},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-02380780},
  hal_version = {v1},
  month = oct,
  pdf = {https://hal.telecom-paris.fr/hal-02380780/file/waspaa2019_cdraft-final.pdf},
  title = {{IDENTIFY, LOCATE AND SEPARATE: AUDIO-VISUAL OBJECT EXTRACTION IN LARGE VIDEO COLLECTIONS USING WEAK SUPERVISION}},
  url = {https://hal.telecom-paris.fr/hal-02380780},
  year = {2019}
}
```

Weakly informed audio source separation
Kilian Schulze-Forster, Clément Doire, Gael Richard, Roland Badeau
WASPAA, New Paltz, New York, United States, October 2019.

```
@inproceedings{schulzeforster:hal-02280472,
  address = {New Paltz, New York, United States},
  author = {Schulze-Forster, Kilian and Doire, Cl{\'e}ment and Richard, Gael and Badeau, Roland},
  booktitle = {{WASPAA}},
  hal_id = {hal-02280472},
  hal_version = {v1},
  month = oct,
  title = {{Weakly informed audio source separation}},
  url = {https://hal.telecom-paris.fr/hal-02280472},
  year = {2019}
}
```

MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music
Giorgia Cantisani, Gabriel Trégoat, Slim Essid, Gael Richard
Speech, Music and Mind (SMM), Satellite Workshop of Interspeech 2019, Vienna, Austria, September 2019.
```
@inproceedings{cantisani:hal-02291882,
  address = {Vienna, Austria},
  author = {Cantisani, Giorgia and Tr{\'e}goat, Gabriel and Essid, Slim and Richard, Gael},
  booktitle = {{Speech, Music and Mind (SMM), Satellite Workshop of Interspeech 2019}},
  hal_id = {hal-02291882},
  hal_version = {v1},
  keywords = {Polyphonic music ; EEG ; Auditory attention},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-02291882/file/MAD-EEG.pdf},
  title = {{MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music}},
  url = {https://hal.archives-ouvertes.fr/hal-02291882},
  year = {2019}
}
```
We present MAD-EEG, a new, freely available dataset for studying EEG-based auditory attention decoding considering the challenging case of subjects attending to a target instrument in polyphonic music. The dataset represents the first music-related EEG dataset of its kind, enabling, in particular, studies on single-trial EEG-based attention decoding, while also opening the path for research on other EEG-based music analysis tasks. MAD-EEG has so far collected 20-channel EEG signals recorded from 8 subjects listening to solo, duo and trio music excerpts and attending to one pre-specified instrument. The proposed experimental setting differs from the ones previously considered as the stimuli are polyphonic and are played to the subject using speakers instead of headphones. The stimuli were designed considering variations in terms of number and type of instruments in the mixture, spatial rendering, music genre and melody that is played. Preliminary results obtained with a state-of-the-art stimulus reconstruction algorithm commonly used for speech stimuli show that the audio representation reconstructed from the EEG response is more correlated with that of the attended source than with the one of the unattended source, proving the dataset to be suitable for such kind of studies.
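The decision rule implied by this abstract (the representation reconstructed from EEG correlates more with the attended source than with the unattended one) can be sketched as follows. This is only an illustration under stated assumptions: the reconstruction is taken as already computed by some decoder, the function names are hypothetical, and the numbers are made up rather than taken from MAD-EEG.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences.

    Constant (zero-variance) inputs are not handled in this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def decode_attention(reconstruction, candidates):
    """Return the candidate source most correlated with the reconstruction."""
    scores = {name: pearson(reconstruction, feature)
              for name, feature in candidates.items()}
    attended = max(scores, key=scores.get)
    return attended, scores


# Made-up audio representations for two instruments in a duet.
candidates = {
    "flute": [0.0, 1.0, 0.0, 1.0],
    "cello": [1.0, 0.0, 1.0, 0.0],
}
# Made-up output of an EEG-based stimulus reconstruction.
reconstruction = [0.1, 0.9, 0.2, 0.8]

attended, scores = decode_attention(reconstruction, candidates)
```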
Cauchy Multichannel Speech Enhancement with a Deep Speech Prior
Mathieu Fontaine, Aditya Arie Nugraha, Roland Badeau, Kazuyoshi Yoshii, Antoine Liutkus
EUSIPCO 2019 - 27th European Signal Processing Conference, Coruña, Spain, September 2019.
```
@inproceedings{fontaine:hal-02288063,
  address = {Coru{\~n}a, Spain},
  author = {Fontaine, Mathieu and Nugraha, Aditya Arie and Badeau, Roland and Yoshii, Kazuyoshi and Liutkus, Antoine},
  booktitle = {{EUSIPCO 2019 - 27th European Signal Processing Conference}},
  hal_id = {hal-02288063},
  hal_local_reference = {MF:EUSIPCO-19},
  hal_version = {v1},
  keywords = {nonnegative matrix factorization ; multivariate complex Cauchy distribution ; Multichannel speech enhancement ; variational autoencoder},
  month = sep,
  pdf = {https://hal.telecom-paris.fr/hal-02288063/file/eusipco-2019-fontaine.pdf},
  title = {{Cauchy Multichannel Speech Enhancement with a Deep Speech Prior}},
  url = {https://hal.telecom-paris.fr/hal-02288063},
  year = {2019}
}
```
We propose a semi-supervised multichannel speech enhancement system based on a probabilistic model which assumes that both speech and noise follow the heavy-tailed multivariate complex Cauchy distribution. As we advocate, this allows handling strong and adverse noisy conditions. Consequently, the model is parameterized by the source magnitude spectrograms and the source spatial scatter matrices. To deal with the non-additivity of scatter matrices, our first contribution is to perform the enhancement on a projected space. Then, our second contribution is to combine a latent variable model for speech, which is trained by following the variational autoencoder framework, with a low-rank model for the noise source. At test time, an iterative inference algorithm is applied, which produces estimated parameters to use for separation. The speech latent variables are estimated first from the noisy speech and then updated by a gradient descent method, while a majorization-equalization strategy is used to update both the noise and the spatial parameters of both sources. Our experimental results show that the Cauchy model outperforms the state-of-the-art methods. The standard deviation scores also reveal that the proposed method is more robust against non-stationary noise.
Lower Bound on Frequency Validity of Energy-Stress Tensor Based Diffuse Sound Field Model
Aidan Meacham, Roland Badeau, Jean-Dominique Polack
ICA 2019, Aachen, Germany, September 2019.
```
@inproceedings{meacham:hal-02288565,
  address = {Aachen, Germany},
  author = {Meacham, Aidan and Badeau, Roland and Polack, Jean-Dominique},
  booktitle = {{ICA 2019}},
  doi = {10.18154/RWTH-CONV-239324},
  hal_id = {hal-02288565},
  hal_local_reference = {AM:ICA-19},
  hal_version = {v1},
  keywords = {diffuse field ; room acoustics ; finite difference methods},
  month = sep,
  title = {{Lower Bound on Frequency Validity of Energy-Stress Tensor Based Diffuse Sound Field Model}},
  url = {https://hal.telecom-paris.fr/hal-02288565},
  year = {2019}
}
```
A lower bound on the frequency validity limit is established for an energetic wave equation derived from the energy-stress tensor, examined in the one-dimensional case [Dujourdy et al., Acta Acustica united with Acustica 103:480-491, 2017]. The method efficiently models diffuse sound fields that dominate reverberation at higher frequencies and larger distances. Initially noted in the course of an exhaustive search of the solution space of all valid model parameters, the low-frequency cutoff has implications for the utility of the method in a hybridization context. In practice, the bound is encountered when determining the absorption and diffusion coefficients by iteratively approaching the temporal and spatial decay of measured data. As the test frequency decreases, the ranges of coefficient combinations that result in less than 10% variation from each decay measure can diverge until the region where both measures are satisfactory (the intersection of the two domains) disappears. Further evidence for the bound is provided through comparison with measurements of a long hallway, and stability concerns in the cases where both coefficients are very small are addressed.
Factorisation Matricielle Semi Non-Négative: Application à la Décomposition de Consommations Electriques
Simon Henriet, Umut Simsekli, Sérgio F. Santos, Benoît Fuentes, Gael Richard
Colloque francophone de traitement du signal et des images (GRETSI), Lille, France, August 2019.
```
@inproceedings{henriet:hal-02381367,
  address = {Lille, France},
  author = {Henriet, Simon and Simsekli, Umut and Santos, S{\'e}rgio F. and Fuentes, Beno{\^i}t and Richard, Gael},
  booktitle = {{Colloque francophone de traitement du signal et des images (GRETSI)}},
  hal_id = {hal-02381367},
  hal_version = {v1},
  month = aug,
  pdf = {https://hal.telecom-paris.fr/hal-02381367/file/2019_GRETSI.pdf},
  title = {{Factorisation Matricielle Semi Non-N{\'e}gative: Application {\`a} la D{\'e}composition de Consommations Electriques}},
  url = {https://hal.telecom-paris.fr/hal-02381367},
  year = {2019}
}
```
In recent years, there has been an increasing academic and industrial interest in analysing the electrical consumption of commercial buildings. One approach to enable energy efficiency is to disaggregate the total energy consumption into the consumption of individual pieces of equipment. This problem is also called Non-Intrusive Load Monitoring (NILM). While several approaches have been studied to solve it for residential buildings using high-frequency current and voltage measurements, none of them seems efficient when applied to commercial buildings. Among the NILM methods for residential buildings, matrix factorization approaches showed promising results. In this paper, we propose a novel method as an extension of factorization techniques based on Semi Non-Negative Matrix Factorization constrained with a total variation penalization (TV-SNMF). To solve this constrained optimization problem, we rely on an alternating minimization strategy involving a quasi-Newton algorithm. The experiments on a simulated commercial building dataset demonstrate clear improvements compared to other approaches such as Independent Component Analysis (ICA) and classic SNMF.
Generalized formulation of acoustics
Jean-Dominique Polack, Aidan Meacham, Roland Badeau
Congrès Français de Mécanique, Brest, France, August 2019.
```
@inproceedings{polack:hal-02288067,
  address = {Brest, France},
  author = {Polack, Jean-Dominique and Meacham, Aidan and Badeau, Roland},
  booktitle = {{Congr{\`e}s Fran{\c c}ais de M{\'e}canique}},
  hal_id = {hal-02288067},
  hal_local_reference = {JDP:CFM-19},
  hal_version = {v1},
  keywords = {general linear acoustics ; generalized coordinates ; absorption ; conservation of energy},
  month = aug,
  title = {{Generalized formulation of acoustics}},
  url = {https://hal.telecom-paris.fr/hal-02288067},
  year = {2019}
}
```
In 1937, Janowski and Spandöck experimentally demonstrated the curvature of sound rays travelling at grazing incidence above absorbing materials. 30 years later, Cremer and Müller showed that this curvature was created by the surface admittance of the material; however, they were not able to demonstrate the curvature of the rays. By making use of the stress-energy tensor, introduced in Acoustics by Morse and Ingard, we show that this formalism, borrowed from the general relativity theory, makes it possible to derive the curvature of grazing rays, provided that general coordinates are adopted so that the metric tensor adapts itself to the admittance at the boundary. We show that the metric tensor can be arbitrarily defined, with the only condition that the normal derivative of the velocity potential be null. Absorption is then taken into account by the residual Christoffel symbols that define normal derivation of the stress-energy tensor at the boundary. A generalized formulation of energy conservation is then obtained. It generalizes earlier work on energy propagation in corridors and flat rooms.
Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization
Thanh Huy Nguyen, Umut Şimşekli, Gael Richard
International Conference on Machine Learning (ICML), Long Beach, United States, June 2019.
```
@inproceedings{nguyen:hal-02346147,
  address = {Long Beach, United States},
  author = {Nguyen, Thanh Huy and {\c S}im{\c s}ekli, Umut and Richard, Gael},
  booktitle = {{International Conference on Machine Learning (ICML)}},
  hal_id = {hal-02346147},
  hal_version = {v1},
  month = jun,
  pdf = {https://hal.telecom-paris.fr/hal-02346147/file/nguyen19c.pdf},
  title = {{Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization}},
  url = {https://hal.telecom-paris.fr/hal-02346147},
  year = {2019}
}
```
Recent studies on diffusion-based sampling methods have shown that Langevin Monte Carlo (LMC) algorithms can be beneficial for non-convex optimization, and rigorous theoretical guarantees have been proven for both asymptotic and finite-time regimes. Algorithmically, LMC-based algorithms resemble the well-known gradient descent (GD) algorithm, where the GD recursion is perturbed by an additive Gaussian noise whose variance has a particular form. Fractional Langevin Monte Carlo (FLMC) is a recently proposed extension of LMC, where the Gaussian noise is replaced by a heavy-tailed α-stable noise. As opposed to its Gaussian counterpart, these heavy-tailed perturbations can incur large jumps and it has been empirically demonstrated that the choice of α-stable noise can provide several advantages in modern machine learning problems, both in optimization and sampling contexts. However, as opposed to LMC, only asymptotic convergence properties of FLMC have been established so far. In this study, we analyze the non-asymptotic behavior of FLMC for non-convex optimization and prove finite-time bounds for its expected suboptimality. Our results show that the weak-error of FLMC increases faster than LMC, which suggests using smaller step-sizes in FLMC. We finally extend our results to the case where the exact gradients are replaced by stochastic gradients and show that similar results hold in this setting as well.
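The LMC recursion mentioned in this abstract (gradient descent perturbed by Gaussian noise) can be sketched in one dimension as below. The abstract does not spell out the noise variance; the scaling `sqrt(2 * step / beta)` with an inverse temperature `beta` is the commonly used form and is an assumption here, as are the function names. FLMC would replace the Gaussian draw with an α-stable one and use a different scaling, which this sketch does not attempt to reproduce.

```python
import math
import random


def lmc_step(theta, grad, step, beta, rng):
    """One LMC update: a gradient descent step plus scaled Gaussian noise."""
    noise = rng.gauss(0.0, 1.0)
    return theta - step * grad(theta) + math.sqrt(2.0 * step / beta) * noise


def run_lmc(theta0, grad, step, beta, n_steps, seed=0):
    """Iterate the LMC recursion from theta0 with a seeded random generator."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(n_steps):
        theta = lmc_step(theta, grad, step, beta, rng)
    return theta


def grad_quadratic(x):
    """Gradient of the toy objective f(x) = x**2."""
    return 2.0 * x


# A large beta means little noise, so the iterates stay close to plain GD.
theta_final = run_lmc(theta0=1.0, grad=grad_quadratic, step=0.1,
                      beta=1e6, n_steps=200)
```

As `beta` grows the noise term vanishes and the recursion reduces to gradient descent; smaller values of `beta` inject more noise into each step.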

A Music Structure Informed Downbeat Tracking System Using Skip-chain Conditional Random Fields and Deep Learning
Magdalena Fuentes, Brian McFee, Helene-Camille Crayencour, Slim Essid, Juan P. Bello
ICASSP, Brighton, United Kingdom, May 2019.

```
@inproceedings{fuentes:hal-02420403,
  address = {Brighton, United Kingdom},
  author = {Fuentes, Magdalena and McFee, Brian and Crayencour, Helene-Camille and Essid, Slim and Bello, Juan P.},
  booktitle = {{ICASSP}},
  doi = {10.1109/icassp.2019.8682870},
  hal_id = {hal-02420403},
  hal_version = {v1},
  month = may,
  pdf = {https://hal.archives-ouvertes.fr/hal-02420403/file/6_Structure_downbeat_cready.pdf},
  title = {{A Music Structure Informed Downbeat Tracking System Using Skip-chain Conditional Random Fields and Deep Learning}},
  url = {https://hal.archives-ouvertes.fr/hal-02420403},
  year = {2019}
}
```

Singing Voice Separation: A Study on Training Data
Laure Prétet, Romain Hennequin, Jimena Royo-Letelier, Andrea Vaglio
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, May 2019.
```
@inproceedings{pretet:hal-02372076,
  address = {Brighton, United Kingdom},
  author = {Pr{\'e}tet, Laure and Hennequin, Romain and Royo-Letelier, Jimena and Vaglio, Andrea},
  booktitle = {{ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2019.8683555},
  hal_id = {hal-02372076},
  hal_version = {v1},
  keywords = {source separation ; supervised learning ; training data ; data augmentation},
  month = may,
  pages = {506-510},
  pdf = {https://hal.telecom-paris.fr/hal-02372076/file/Singing_voice_separation_a_study_on_training_data_camera_ready_version.pdf},
  publisher = {{IEEE}},
  title = {{Singing Voice Separation: A Study on Training Data}},
  url = {https://hal.telecom-paris.fr/hal-02372076},
  year = {2019}
}
```
In recent years, singing voice separation systems have shown increased performance due to the use of supervised training. The design of training datasets is known to be a crucial factor in the performance of such systems. We investigate how the characteristics of the training dataset impact the separation performance of state-of-the-art singing voice separation algorithms. We show that the separation quality and diversity are two important and complementary assets of a good training dataset. We also provide insights on possible transforms to perform data augmentation for this task.

mirdata: Software for Reproducible Usage of Datasets
R. M. Bittner, M. Fuentes, D. Rubinstein, A. Jansson, K. Choi, T. Kell
20th International Society for Music Information Retrieval Conference, 2019.

```
@inproceedings{bittner2019_mirdata,
  author = {Bittner, R. M. and Fuentes, M. and Rubinstein, D. and Jansson, A. and Choi, K. and Kell, T.},
  year = {2019},
  title = {mirdata: Software for Reproducible Usage of Datasets},
  booktitle = {20th International Society for Music Information Retrieval Conference},
  series = {ISMIR}
}
```

Journal Articles

Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez, Gael Richard
IEEE/ACM Transactions on Audio, Speech and Language Processing, December 2019.
```
@article{parekh:hal-02399993,
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Ngoc Q. K. and P{\'e}rez, Patrick and Richard, Gael},
  hal_id = {hal-02399993},
  hal_version = {v1},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {Multimodal classification ; sound event detection ; object localization ; multiple instance learning ; deep learning ; audio-visual fusion},
  month = dec,
  pdf = {https://hal.telecom-paris.fr/hal-02399993/file/2019-IEEE_TASLP_Parekh.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Weakly Supervised Representation Learning for Audio-Visual Scene Analysis}},
  url = {https://hal.telecom-paris.fr/hal-02399993},
  year = {2019}
}
```
Audiovisual (AV) representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. Specifically, we develop methods that identify events and localize corresponding AV cues in unconstrained videos. Importantly, this is done using weak labels where only video-level event labels are known without any information about their location in time. We show that the learnt representations are useful for performing several tasks such as event/object classification, audio event detection, audio source separation and visual object localization. An important feature of our method is its capacity to learn from unsynchronized audiovisual events. We also demonstrate our framework’s ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization. State-of-the-art classification results, with an F1-score of 65.0, are achieved on DCASE 2017 smart cars challenge data with promising generalization to diverse object types such as musical instruments. Visualizations of localized visual regions and audio segments substantiate our system’s efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
Independent-Variation Matrix Factorization With Application to Energy Disaggregation
Simon Henriet, Umut Simsekli, Sérgio F. Santos, Benoît Fuentes, Gael Richard
IEEE Signal Processing Letters, November 2019.
```
@article{henriet:hal-02307248,
  author = {Henriet, Simon and Simsekli, Umut and Santos, S{\'e}rgio F. and Fuentes, Beno{\^i}t and Richard, Gael},
  doi = {10.1109/LSP.2019.2941428},
  hal_id = {hal-02307248},
  hal_version = {v1},
  journal = {{IEEE Signal Processing Letters}},
  keywords = {Dictionary Learning ; Semi-Nonnegative Matrix Factorization ; Independent Component Analysis ; Total Variation ; Non-Intrusive Load Monitoring},
  month = nov,
  number = {11},
  pages = {1643-1647},
  pdf = {https://hal.telecom-paris.fr/hal-02307248/file/2019_TVSNMF_IEEE_SPL.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Independent-Variation Matrix Factorization With Application to Energy Disaggregation}},
  url = {https://hal.telecom-paris.fr/hal-02307248},
  volume = {26},
  year = {2019}
}
```
Matrix factorization techniques have proven to be useful in many unsupervised learning applications. Such techniques have been recently applied to Non Intrusive Load Monitoring (NILM), the process of breaking down the total electric consumption of a building into consumptions of individual appliances. While several studies addressed the NILM problem for small-scale buildings, only few studies considered the problem for large buildings, where the signals exhibit significantly different behavior. To overcome the unaddressed difficulties of processing high frequency current signals that are measured in large buildings, we propose a novel technique called Independent-Variation Matrix Factorization (IVMF), which expresses an observation matrix as the product of two matrices: the signature and the activation. Motivated by the nature of the current signals, it uses a regularization term on the temporal variations of the activation matrix and a positivity constraint, and the columns of the signature matrix are constrained to lie in a specific set. To solve the resulting optimization problem, we rely on an alternating minimization strategy involving dual optimization and quasi-Newton algorithms. The algorithm is tested against Independent Component Analysis (ICA) and Semi Nonnegative Matrix Factorization (SNMF) on a synthetic source separation problem and on a realistic NILM application for large commercial buildings. We show that IVMF outperforms competing methods and is particularly appropriate to recover positive sources that have a strong temporal dependency and sources whose variations are independent from each other.
Common mathematical framework for stochastic reverberation models
Roland Badeau
Journal of the Acoustical Society of America, April 2019.
```
@article{badeau:hal-01958485,
  author = {Badeau, Roland},
  doi = {10.1121/1.5096153},
  hal_id = {hal-01958485},
  hal_version = {v1},
  journal = {{Journal of the Acoustical Society of America}},
  keywords = {Reverberation ; Diffusion ; Room impulse response ; Stochastic models},
  month = apr,
  number = {4},
  pages = {2733-2745},
  pdf = {https://hal.archives-ouvertes.fr/hal-01958485/file/Badeau-JASA-2019.pdf},
  publisher = {{Acoustical Society of America}},
  title = {{Common mathematical framework for stochastic reverberation models}},
  url = {https://hal.archives-ouvertes.fr/hal-01958485},
  volume = {145},
  year = {2019}
}
```
In the field of room acoustics, it is well known that reverberation can be characterized statistically in a particular region of the time-frequency domain (after the transition time and above Schroeder’s frequency). Since the 1950s, various formulas have been established, focusing on particular aspects of reverberation: exponential decay over time, correlations between frequencies, correlations between sensors at each frequency, and time-frequency distribution. In this paper, we introduce a stochastic reverberation model, that permits us to retrieve all these well-known results within a common mathematical framework. To the best of our knowledge, this is the first time that such a unification work is presented. The benefits are multiple: several formulas generalizing the classical results are established, that jointly characterize the spatial, temporal and spectral properties of late reverberation.

De Fourier à la reconnaissance musicale
Gael Richard, Sebastien Fenet, Yves Grenier
Interstices, February 2019.

```
@article{richard:hal-02068670,
  author = {Richard, Gael and Fenet, Sebastien and Grenier, Yves},
  hal_id = {hal-02068670},
  hal_version = {v1},
  journal = {{Interstices}},
  month = feb,
  publisher = {{INRIA}},
  title = {{De Fourier {\`a} la reconnaissance musicale}},
  url = {https://hal.telecom-paris.fr/hal-02068670},
  year = {2019}
}
```

Early Detection of User Engagement Breakdown in Spontaneous Human-Humanoid Interaction
Atef Ben Youssef, Chloé Clavel, Slim Essid
IEEE Transactions on Affective Computing, January 2019.
```
@article{benyoussef:hal-02288043,
  author = {Ben Youssef, Atef and Clavel, Chlo{\'e} and Essid, Slim},
  hal_id = {hal-02288043},
  hal_local_reference = {ABY:IEEE-2019},
  hal_version = {v1},
  journal = {{IEEE Transactions on Affective Computing}},
  keywords = {User engagement ; prediction of engagement breakdown ; HRI ; spontaneous interaction ; real-time prediction},
  month = jan,
  title = {{Early Detection of User Engagement Breakdown in Spontaneous Human-Humanoid Interaction}},
  url = {https://hal.telecom-paris.fr/hal-02288043},
  year = {2019}
}
```
This paper presents a supervised classification system for forecasting a potential user engagement breakdown in human-robot interaction. We define engagement breakdown as a failure to successfully complete a predefined interaction scenario, where the user leaves before the expected end. The goal is thus to detect as early as possible such a potential engagement breakdown during the interaction between a human and a humanoid robot. To this end, we exploit a dataset that we have collected in real-world conditions where a set of participants were left to spontaneously engage in an interaction with the robot. The dataset is labeled according to the presence/absence of engagement breakdown. This study investigates the use of a multimodal approach to this problem, where a set of non-verbal features is considered to characterize the users’ behavior. The use of combined multimodal features is found to effectively improve the performance of the system. The optimal set of data streams useful for this task is the combination of the distance to the robot, gaze and head motion, as well as facial expressions and speech. We study the time extent over which a user’s departure can be anticipated. We find that this ability to anticipate the departure depends on the window during which we observe the user behavior.
On-the-fly Detection of User Engagement Decrease in Spontaneous Human-Robot Interaction
Atef Ben Youssef, Giovanna Varni, Slim Essid, Chloé Clavel
International Journal of Social Robotics, January 2019.
```
@article{benyoussef:hal-02288044,
  author = {Ben Youssef, Atef and Varni, Giovanna and Essid, Slim and Clavel, Chlo{\'e}},
  hal_id = {hal-02288044},
  hal_local_reference = {ABY:IJSR-2019},
  hal_version = {v1},
  journal = {{International Journal of Social Robotics}},
  keywords = {User engagement decrease ; Socially assistive robot ; HRI in public space ; Real-time detection},
  month = jan,
  title = {{On-the-fly Detection of User Engagement Decrease in Spontaneous Human-Robot Interaction}},
  url = {https://hal.telecom-paris.fr/hal-02288044},
  year = {2019}
}
```
In this paper, we address the detection of engagement decrease of users spontaneously interacting with a socially assistive robot in a public space. We first describe the UE-HRI dataset that collects spontaneous Human-Robot Interactions following the guidelines provided by the Affective Computing research community to collect data "in-the-wild". We then analyze the users’ behaviors focusing on proxemics, gaze, head motion, facial expressions and speech during interactions with the robot. Engaged behaviors versus signs of engagement decrease exhibited by the users were annotated and analyzed. Finally, we investigate the use of deep learning techniques (Recurrent and Deep Neural Networks) to detect user engagement decrease in real-time. The results of this work particularly highlight the relevance of taking into account temporal dynamics of the user’s behavior. Allowing 1 to 2 seconds as buffer delay improves the performance of taking a decision on user engagement.

Audiovisual Analysis of Music Performances: Overview of an Emerging Field
Zhiyao Duan, Slim Essid, Cynthia Liem, Gael Richard, Gaurav Sharma
IEEE Signal Processing Magazine, January 2019.

@article{duan:hal-02287983,
  author = {Duan, Zhiyao and Essid, Slim and Liem, Cynthia and Richard, Gael and Sharma, Gaurav},
  hal_id = {hal-02287983},
  hal_local_reference = {duan:hal-01893410},
  hal_version = {v1},
  journal = {{IEEE Signal Processing Magazine}},
  month = jan,
  number = {1},
  pages = {63-73},
  title = {{Audiovisual Analysis of Music Performances: Overview of an Emerging Field}},
  url = {https://hal.telecom-paris.fr/hal-02287983},
  volume = {36},
  year = {2019}
}

Technical Reports

Stochastic reverberation model for uniform and non-diffuse acoustic fields
Roland Badeau
April 2019.
```
@techreport{badeau:hal-02127799,
  author = {Badeau, Roland},
  hal_id = {hal-02127799},
  hal_version = {v1},
  institution = {{T{\'e}l{\'e}com ParisTech}},
  keywords = {Reverberation ; Diffusion ; Room impulse response ; Stochastic models ; R{\'e}ponse impulsionnelle de salle ; Mod{\`e}les stochastiques},
  month = apr,
  pdf = {https://hal.telecom-paris.fr/hal-02127799/file/publication-344.pdf},
  title = {{Stochastic reverberation model for uniform and non-diffuse acoustic fields}},
  type = {Research Report},
  url = {https://hal.telecom-paris.fr/hal-02127799},
  year = {2019}
}
```
In a recent research report, we introduced a general stochastic reverberation model that aims to represent the statistical properties of reverberation in a broad variety of acoustic environments. A simplified version of this model, dedicated to the particular case of diffuse (i.e. uniform and isotropic) acoustic fields, omnidirectional sources and microphones, and constant attenuation w.r.t. frequency, has been investigated both mathematically and experimentally in a recent research paper. We showed that this model provides a common mathematical framework that unifies several well-known results regarding the statistical properties of reverberation in the space, time and frequency domains. In this research report, we aim to extend this mathematical analysis to uniform and non-diffuse acoustic fields, and directive sources and microphones. We show that the predictions of the general stochastic model experimentally match the observations, based on both synthetic and real room impulse responses, measured in various acoustic environments.
General stochastic reverberation model
Roland Badeau
February 2019.
```
@techreport{badeau:hal-02049987,
  author = {Badeau, Roland},
  hal_id = {hal-02049987},
  hal_version = {v1},
  institution = {{T{\'e}l{\'e}com ParisTech}},
  keywords = {Stochastic models ; Room impulse response ; R{\'e}verb{\'e}ration ; Diffusion ; R{\'e}ponse impulsionnelle de salle ; Mod{\`e}les stochastiques},
  month = feb,
  pdf = {https://hal.archives-ouvertes.fr/hal-02049987/file/TechReport-Badeau-2019.pdf},
  title = {{General stochastic reverberation model}},
  type = {Research Report},
  url = {https://hal.archives-ouvertes.fr/hal-02049987},
  year = {2019}
}
```
In a recent research paper, we proposed a common mathematical framework for stochastic reverberation models, that aimed to unify several well-known results regarding the statistical properties of reverberation, in the spatial, spectral and temporal domains. This model was dedicated to diffuse (i.e. isotropic and uniform) acoustic fields, omnidirectional sources and microphones, and constant attenuation coefficients w.r.t. the frequency. In this technical report, we introduce several extensions of this model, that aim to model reverberation more realistically, by considering anisotropic and non-uniform acoustic fields, directive sources and microphones, and frequency-varying attenuation coefficients.

Theses

Processus alpha-stables pour le traitement du signal
Mathieu Fontaine
2019.

@phdthesis{fontaine_processus_2019,
  type = {{PhD} {Thesis}},
  title = {Processus alpha-stables pour le traitement du signal},
  copyright = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA)},
  school = {Université de Lorraine},
  author = {Fontaine, Mathieu},
  year = {2019}
}

2018

Patents

Procédé de traitement d’un signal audio et dispositif électronique correspondant, produit-programme lisible par ordinateur non transitoire et support d’informations lisible par ordinateur
Sanjeel Parekh, Alexey Ozerov, Quang-Khanh-Ngoc Duong, Gael Richard, Slim Essid, Patrick Pérez
France, October 2018.

@patent{parekh:hal-02651234,
  address = {France},
  author = {Parekh, Sanjeel and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Richard, Gael and Essid, Slim and P{\'e}rez, Patrick},
  hal_id = {hal-02651234},
  hal_version = {v1},
  month = oct,
  number = {EP3392882 A1},
  title = {{Proc{\'e}d{\'e} de traitement d'un signal audio et dispositif {\'e}lectronique correspondant, produit-programme lisible par ordinateur non transitoire et support d'informations lisible par ordinateur}},
  url = {https://hal.telecom-paris.fr/hal-02651234},
  year = {2018}
}

Procédé de classification et de localisation d’événements audiovisuels et appareil correspondant, produit-programme lisible par ordinateur et support d’informations lisible par ordinateur
Quang-Khanh-Ngoc Duong, Alexey Ozerov, Sanjeel Parekh, Slim Essid, Gael Richard, Patrick Pérez
France, March 2018.

@patent{duong:hal-02651256,
  address = {France},
  author = {Duong, Quang-Khanh-Ngoc and Ozerov, Alexey and Parekh, Sanjeel and Essid, Slim and Richard, Gael and P{\'e}rez, Patrick},
  hal_id = {hal-02651256},
  hal_version = {v1},
  month = mar,
  number = {EP3540634},
  title = {{Proc{\'e}d{\'e} de classification et de localisation d'{\'e}v{\'e}nements audiovisuels et appareil correspondant, produit-programme lisible par ordinateur et support d'informations lisible par ordinateur}},
  url = {https://hal.telecom-paris.fr/hal-02651256},
  year = {2018}
}

Procede et Systeme de Diffusion d un Flux Audio Multicanal a des terminaux de spectateurs assistant a un evenement sportif
Raphael Blouet, Slim Essid
March 2018.

@patent{SE:patent18,
  author = {Blouet, Raphael and Essid, Slim},
  title = {Procede et Systeme de Diffusion d un Flux Audio Multicanal a des terminaux de spectateurs assistant a un evenement sportif},
  year = {2018},
  month = mar,
  url = {https://perso.telecom-paristech.fr/essid/FR3079706B1.pdf},
  number = {1852774}
}

Conference Articles

Unified Stochastic Reverberation Modeling
Roland Badeau
26th European Signal Processing Conference (EUSIPCO), Rome, Italy, September 2018.
```
@inproceedings{badeau:hal-01795319,
  address = {Rome, Italy},
  author = {Badeau, Roland},
  booktitle = {{26th European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-01795319},
  hal_version = {v1},
  keywords = {Reverberation ; room impulse response ; room frequency response ; stochastic models ; Poisson processes ; stationary processes ; Wigner distribution},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-01795319/file/Badeau-EUSIPCO-2018.pdf},
  series = {Proc. 26th European Signal Processing Conference (EUSIPCO 2018)},
  title = {{Unified Stochastic Reverberation Modeling}},
  url = {https://hal.archives-ouvertes.fr/hal-01795319},
  year = {2018}
}
```
In the field of room acoustics, it is well known that reverberation can be characterized statistically in a particular region of the time-frequency domain (after the transition time and above Schroeder’s frequency). Since the 1950s, various formulas have been established, focusing on particular aspects of reverberation: exponential decay over time, correlations between frequencies, correlations between sensors at each frequency, and time-frequency distribution. In this paper, we introduce a new stochastic reverberation model, that permits us to retrieve all these well-known results within a common mathematical framework. To the best of our knowledge, this is the first time that such a unification work is presented. The benefits are multiple: several new formulas generalizing the classical results are established, that jointly characterize the spatial, temporal and spectral properties of late reverberation.
MAIN MELODY EXTRACTION WITH SOURCE-FILTER NMF AND CRNN
Dogac Basaran, Slim Essid, Geoffroy Peeters
19th International Society for Music Information Retrieval, Paris, France, September 2018.
```
@inproceedings{basaran:hal-02019103,
  address = {Paris, France},
  author = {Basaran, Dogac and Essid, Slim and Peeters, Geoffroy},
  booktitle = {{19th International Society for Music Information Retrieval}},
  hal_id = {hal-02019103},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-02019103/file/273_Paper.pdf},
  title = {{MAIN MELODY EXTRACTION WITH SOURCE-FILTER NMF AND CRNN}},
  url = {https://hal.archives-ouvertes.fr/hal-02019103},
  year = {2018}
}
```
Estimating the main melody of a polyphonic audio recording remains a challenging task. We approach the task from a classification perspective and adopt a convolutional recurrent neural network (CRNN) architecture that relies on a particular form of pretraining by source-filter nonnegative matrix factorisation (NMF). The source-filter NMF decomposition is chosen for its ability to capture the pitch and timbre content of the leading voice/instrument, providing a better initial pitch salience than standard time-frequency representations. Starting from such a musically motivated representation, we propose to further enhance the NMF-based salience representations with CNN layers, then to model the temporal structure by an RNN network and to estimate the dominant melody with a final classification layer. The results show that such a system achieves state-of-the-art performance on the MedleyDB dataset without any augmentation methods or large training sets.
ANALYSIS OF COMMON DESIGN CHOICES IN DEEP LEARNING SYSTEMS FOR DOWNBEAT TRACKING
Magdalena Fuentes, Brian Mcfee, Hélène C Crayencour, Slim Essid, Juan P Bello
The 19th International Society for Music Information Retrieval Conference, Paris, France, September 2018.
```
@inproceedings{fuentes:hal-02943467,
  address = {Paris, France},
  author = {Fuentes, Magdalena and Mcfee, Brian and Crayencour, H{\'e}l{\`e}ne C and Essid, Slim and Bello, Juan P},
  booktitle = {{The 19th International Society for Music Information Retrieval Conference}},
  hal_id = {hal-02943467},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-02943467/file/MF_ISMIR-18.pdf},
  title = {{ANALYSIS OF COMMON DESIGN CHOICES IN DEEP LEARNING SYSTEMS FOR DOWNBEAT TRACKING}},
  url = {https://hal.archives-ouvertes.fr/hal-02943467},
  year = {2018}
}
```
Downbeat tracking consists of annotating a piece of musical audio with the estimated position of the first beat of each bar. In recent years, increasing attention has been paid to applying deep learning models to this task, and various architectures have been proposed, leading to a significant improvement in accuracy. However, there are few insights about the role of the various design choices and the delicate interactions between them. In this paper we offer a systematic investigation of the impact of widely adopted variants. We study the effects of the temporal granularity of the input representation (i.e. beat-level vs tatum-level) and the encoding of the network outputs. We also investigate the potential of convolutional-recurrent networks, which have not been explored in previous downbeat tracking systems. To this end, we exploit a state-of-the-art recurrent neural network where we introduce those variants, while keeping the training data, network learning parameters and post-processing stages fixed. We find that temporal granularity has a significant impact on performance, and we analyze its interaction with the encoding of the network outputs.

Multi-task Feature Learning for EEG-based Emotion Recognition Using Group Nonnegative Matrix Factorization
Ayoub Hajlaoui, Mohamed Chetouani, Slim Essid
2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, September 2018.

@inproceedings{hajlaoui:hal-02422892,
  address = {Rome, Italy},
  author = {Hajlaoui, Ayoub and Chetouani, Mohamed and Essid, Slim},
  booktitle = {{2018 26th European Signal Processing Conference (EUSIPCO)}},
  doi = {10.23919/EUSIPCO.2018.8553390},
  hal_id = {hal-02422892},
  hal_version = {v1},
  month = sep,
  pages = {91-95},
  publisher = {{IEEE}},
  title = {{Multi-task Feature Learning for EEG-based Emotion Recognition Using Group Nonnegative Matrix Factorization}},
  url = {https://hal.sorbonne-universite.fr/hal-02422892},
  year = {2018}
}

Non-linear auto-regressive models for cross-frequency coupling in neural time series
Tom Dupré La Tour, Lucile Tallot, Laeticia Grabot, Valérie Doyère, Virginie Van Wassenhove, Yves Grenier, Alexandre Gramfort
BIOMAG, Philadelphia, USA, August 2018.

@inproceedings{TD:BIOMAG-2018,
  author = {{Dupr{\'e} La Tour}, Tom and Tallot, Lucile and Grabot, Laeticia and Doy{\`e}re, Val{\'e}rie and Van Wassenhove, Virginie and Grenier, Yves and Gramfort, Alexandre},
  title = {Non-linear auto-regressive models for cross-frequency coupling in neural time series},
  booktitle = {BIOMAG},
  address = {Philadelphia, USA},
  year = {2018},
  month = aug
}

Multichannel Audio Modeling with Elliptically Stable Tensor Decomposition
Mathieu Fontaine, Fabian-Robert Stöter, Antoine Liutkus, Umut Simsekli, Romain Serizel, Roland Badeau
LVA/ICA: Latent Variable Analysis and Signal Separation, Surrey, United Kingdom, July 2018.
```
@inproceedings{fontaine:lirmm-01766795,
  address = {Surrey, United Kingdom},
  author = {Fontaine, Mathieu and St{\"o}ter, Fabian-Robert and Liutkus, Antoine and Simsekli, Umut and Serizel, Romain and Badeau, Roland},
  booktitle = {{LVA/ICA: Latent Variable Analysis and Signal Separation}},
  doi = {10.1007/978-3-319-93764-9\_2},
  editor = {Deville, Y. and Gannot, S. and Mason, R. and Plumbley, M. and Ward, D.},
  hal_id = {lirmm-01766795},
  hal_version = {v1},
  month = jul,
  number = {10891},
  pages = {13-23},
  pdf = {https://hal-lirmm.ccsd.cnrs.fr/lirmm-01766795/file/LVA-ICA2018_046_original_v5.pdf},
  publisher = {{Springer}},
  title = {{Multichannel Audio Modeling with Elliptically Stable Tensor Decomposition}},
  url = {https://hal-lirmm.ccsd.cnrs.fr/lirmm-01766795},
  volume = {LNCS},
  year = {2018}
}
```
This paper introduces a new method for multichannel speech enhancement based on a versatile modeling of the residual noise spectrogram. Such a model has already been presented before in the single channel case where the noise component is assumed to follow an alpha-stable distribution for each time-frequency bin, whereas the speech spectrogram, supposed to be more regular, is modeled as Gaussian. In this paper, we describe a multichannel extension of this model, as well as a Monte Carlo Expectation-Maximisation algorithm for parameter estimation. In particular, a multichannel extension of the Itakura-Saito nonnegative matrix factorization is exploited to estimate the spectral parameters for speech, and a Metropolis-Hastings algorithm is proposed to estimate the noise contribution. We evaluate the proposed method in a challenging multichannel denoising application and compare it to other state-of-the-art algorithms.

Attitude Classification in Adjacency Pairs of a Human-Agent Interaction with Hidden Conditional Random Fields
Valentin Barriere, Chloe Clavel, Slim Essid
ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, April 2018.

@inproceedings{barriere:hal-02943469,
  address = {Calgary, Canada},
  author = {Barriere, Valentin and Clavel, Chloe and Essid, Slim},
  booktitle = {{ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2018.8462160},
  hal_id = {hal-02943469},
  hal_version = {v1},
  month = apr,
  pages = {4949-4953},
  publisher = {{IEEE}},
  title = {{Attitude Classification in Adjacency Pairs of a Human-Agent Interaction with Hidden Conditional Random Fields}},
  url = {https://hal.archives-ouvertes.fr/hal-02943469},
  year = {2018}
}

Driver estimation in non-linear autoregressive models
Tom Dupré la Tour, Yves Grenier, Alexandre Gramfort
43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, Canada, April 2018.
```
@inproceedings{duprelatour:hal-01696786,
  address = {Calgary, Canada},
  author = {Dupr{\'e} la Tour, Tom and Grenier, Yves and Gramfort, Alexandre},
  booktitle = {{43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)}},
  hal_id = {hal-01696786},
  hal_version = {v1},
  keywords = {cross-frequency coupling ; non-linear autoregressive models ; spectrum estimation ; electrophysiology},
  month = apr,
  pdf = {https://hal.archives-ouvertes.fr/hal-01696786/file/duprelatour2018icassp.pdf},
  title = {{Driver estimation in non-linear autoregressive models}},
  url = {https://hal.archives-ouvertes.fr/hal-01696786},
  year = {2018}
}
```
In non-linear autoregressive models, the time dependency of coefficients is often driven by a particular time-series which is not given and thus has to be estimated from the data. To allow model evaluation on a validation set, we describe a parametric approach for such driver estimation. After estimating the driver as a weighted sum of potential drivers, we use it in a non-linear autoregressive model with a polynomial parametrization. Using gradient descent, we optimize the linear filter extracting the driver, outperforming a typical grid-search on predefined filters.
Optimisation d’un critère d’Intelligibilité de la Parole dans un Contexte Bruité Automobile
Enguerrand Gentet, Bertrand David, Sébastien Denjean, Gael Richard, Vincent Roussarie
CFA 2018, Le Havre, France, April 2018.
```
@inproceedings{gentet:hal-02287919,
  address = {Le Havre, France},
  author = {Gentet, Enguerrand and David, Bertrand and Denjean, S{\'e}bastien and Richard, Gael and Roussarie, Vincent},
  booktitle = {{CFA 2018}},
  hal_id = {hal-02287919},
  hal_local_reference = {EG:OPTISII-2018},
  hal_version = {v1},
  keywords = {intelligibilit{\'e} ; parole ; {\'e}galiseur ; SII},
  month = apr,
  title = {{Optimisation d'un crit{\`e}re d'Intelligibilit{\'e} de la Parole dans un Contexte Bruit{\'e} Automobile}},
  url = {https://hal.telecom-paris.fr/hal-02287919},
  year = {2018}
}
```
This work is part of an effort to improve the intelligibility of speech signals in noisy automotive environments. Unlike enhancement approaches, the aim here is to reallocate the energy of the speech to improve its comprehension without modifying the overall SNR (signal-to-noise ratio). The approach studied performs this reallocation by maximizing an objective function based on the SII (Speech Intelligibility Index). This criterion is computed from a weighted sum of SNRs over different frequency channels, which makes it possible to apply a dynamic equalizer to the speech signals under the given constraint. Unlike other methods, we propose an exact solution of the SII optimization problem with dynamic adaptation to the signals (speech and noise). It takes into account the user’s acoustic sensitivity and uses a power scale adapted to the context so as not to increase the perceived loudness. Preliminary results show a clear improvement in intelligibility, and subjective listening tests will be carried out to validate the method. The influence of the method’s various tuning parameters will be detailed in order, on the one hand, to highlight important aspects related to equalization and, on the other hand, to identify certain limitations of an SII-based approach.
Alpha-stable low-rank plus residual decomposition for speech enhancement
Umut Simsekli, Halil Erdogan, Simon Leglaive, Antoine Liutkus, Roland Badeau, Gael Richard
ICASSP: International Conference on Acoustics, Speech, and Signal Processing, Calgary, Canada, April 2018.
```
@inproceedings{simsekli:hal-01714909,
  address = {Calgary, Canada},
  author = {Simsekli, Umut and Erdogan, Halil and Leglaive, Simon and Liutkus, Antoine and Badeau, Roland and Richard, Gael},
  booktitle = {{ICASSP: International Conference on Acoustics, Speech, and Signal Processing}},
  doi = {10.1109/ICASSP.2018.8461539},
  hal_id = {hal-01714909},
  hal_version = {v1},
  keywords = {Speech enhancement ; Monte Carlo Expectation-Maximization ; Alpha-stable distributions ; Audio source separation},
  month = apr,
  pages = {651-655},
  pdf = {https://hal.inria.fr/hal-01714909/file/2017102794510_839706_2832.pdf},
  publisher = {{IEEE}},
  title = {{Alpha-stable low-rank plus residual decomposition for speech enhancement}},
  url = {https://hal.inria.fr/hal-01714909},
  year = {2018}
}
```
In this study, we propose a novel probabilistic model for separating clean speech signals from noisy mixtures by decomposing the mixture spectrograms into a structured speech part and a more flexible residual part. The main novelty in our model is that it uses a family of heavy-tailed distributions, the so-called α-stable distributions, for modeling the residual signal. We develop an expectation-maximization algorithm for parameter estimation and a Monte Carlo scheme for posterior estimation of the clean speech. Our experiments show that the proposed method outperforms relevant factorization-based algorithms by a significant margin.

Energy Disaggregation for Commercial Buildings: A Statistical Analysis
Simon Henriet, Umut Simsekli, Gael Richard, Benoît Fuentes
International Workshop on Non-Intrusive Load Monitoring (NILM2018), Austin, Tx, United States, March 2018.

@inproceedings{henriet:hal-02288000,
  address = {Austin, Tx, United States},
  author = {Henriet, Simon and Simsekli, Umut and Richard, Gael and Fuentes, Beno{\^i}t},
  booktitle = {{International Workshop on Non-Intrusive Load Monitoring (NILM2018)}},
  hal_id = {hal-02288000},
  hal_local_reference = {SH:NILM-18},
  hal_version = {v1},
  month = mar,
  title = {{Energy Disaggregation for Commercial Buildings: A Statistical Analysis}},
  url = {https://hal.telecom-paris.fr/hal-02288000},
  year = {2018}
}

Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q K Duong, Patrick Pérez, Gael Richard
CVPR Workshop, Salt Lake City, United States, 2018.
```
@inproceedings{parekh:hal-02713307,
  address = {Salt Lake City, United States},
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Ngoc Q K and P{\'e}rez, Patrick and Richard, Gael},
  booktitle = {{CVPR Workshop}},
  hal_id = {hal-02713307},
  hal_version = {v1},
  keywords = {Audio-visual fusion ; multimodal deep learning ; multiple instance learning ; event classification ; audio-visual localization},
  pdf = {https://hal.archives-ouvertes.fr/hal-02713307/file/1804.07345.pdf},
  title = {{Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events}},
  url = {https://hal.archives-ouvertes.fr/hal-02713307},
  year = {2018}
}
```
Audiovisual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audiovisual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capacity to learn from unsynchronized audiovisual events. We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system’s efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.

A Novel Database of Brazilian Rhythmic Instruments and Some Experiments in Computational Rhythm Analysis
L.S. Maia, P. D. Tomaz Jr., M. Fuentes, M. Rocamora, L. W. P. Biscainho, M. V. M. Costa, S. Cohen
Audio Engineering Society Latin American Conference, 2018.

@inproceedings{maia2018_brid,
  author = {Maia, L.S. and de Tomaz Jr., P. D. and Fuentes, M. and Rocamora, M. and Biscainho, L. W. P. and da Costa, M. V. M. and Cohen, S.},
  year = {2018},
  title = {A Novel Database of Brazilian Rhythmic Instruments and Some Experiments in Computational Rhythm Analysis},
  booktitle = {Audio Engineering Society Latin American Conference},
  series = {AES}
}

An ENF-Based Audio Authenticity Method Robust to MP3 Compression
P. Zinemanas, M. Fuentes, P. Cancela, J. A. Apolinário Jr.
Circuits, Systems and Signal Processing Springer, 2018.

@inproceedings{zinemanas2018_enf,
  author = {Zinemanas, P. and Fuentes, M. and Cancela, P. and Apolin\'{a}rio Jr., J. A.},
  year = {2018},
  title = {An ENF-Based Audio Authenticity Method Robust to MP3 Compression},
  booktitle = {Circuits, Systems and Signal Processing Springer}
}

Journal Articles

Student’s t Source and Mixing Models for Multichannel Audio Source Separation
Simon Leglaive, Roland Badeau, Gael Richard
IEEE/ACM Transactions on Audio, Speech and Language Processing, June 2018.
```
@article{leglaive:hal-01584755,
  author = {Leglaive, Simon and Badeau, Roland and Richard, Gael},
  hal_id = {hal-01584755},
  hal_version = {v2},
  journal = {{IEEE/ACM Transactions on Audio, Speech and Language Processing}},
  keywords = {statistical room acoustics ; Student's t distribution ; Audio source separation ; multichannel reverberant mixtures ; non-negative matrix factorization ; variational inference},
  month = jun,
  number = {6},
  pages = {1150-1164},
  pdf = {https://hal.inria.fr/hal-01584755v2/file/FinalManuscript.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Student's t Source and Mixing Models for Multichannel Audio Source Separation}},
  url = {https://hal.inria.fr/hal-01584755},
  volume = {26},
  year = {2018}
}
```
This paper presents a Bayesian framework for under-determined audio source separation in multichannel reverberant mixtures. We model the source signals as Student’s t latent random variables in a time-frequency domain. The specific structure of musical signals in this domain is exploited by means of a non-negative matrix factorization model. Conversely, we design the mixing model in the time domain. In addition to leading to an exact representation of the convolutive mixing process, this approach allows us to develop simple probabilistic priors for the mixing filters. Indeed, as those filters correspond to room responses they exhibit a simple characteristic structure in the time domain that can be used to guide their estimation. We also rely on the Student’s t distribution for modeling the impulse response of the mixing filters. From this model, we develop a variational inference algorithm in order to perform source separation. The experimental evaluation demonstrates the potential of this approach for separating multichannel reverberant mixtures.
Model-based STFT phase recovery for audio source separation
Paul Magron, Roland Badeau, Bertrand David
IEEE Transactions on Audio, Speech and Language Processing, June 2018.
```
@article{magron:hal-01718718,
  author = {Magron, Paul and Badeau, Roland and David, Bertrand},
  hal_id = {hal-01718718},
  hal_version = {v2},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  keywords = {Phase recovery ; sinusoidal modeling ; phase unwrapping ; auxiliary function method ; audio source separation},
  month = jun,
  pdf = {https://hal.archives-ouvertes.fr/hal-01718718v2/file/main.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Model-based STFT phase recovery for audio source separation}},
  url = {https://hal.archives-ouvertes.fr/hal-01718718},
  volume = {26},
  year = {2018}
}
```
For audio source separation applications, it is common to estimate the magnitude of the short-time Fourier transform (STFT) of each source. In order to then synthesize time-domain signals, it is necessary to recover the phase of the corresponding complex-valued STFT. Most authors in this field choose a Wiener-like filtering approach which boils down to using the phase of the original mixture. In this paper, a different standpoint is adopted. Many music events are partially composed of slowly varying sinusoids and the STFT phase increment over time of those frequency components takes a specific form. This allows phase recovery by an unwrapping technique once a short-term frequency estimate has been obtained. Herein, a novel iterative source separation procedure is proposed which builds upon these results. It consists in minimizing the mixing error by means of the auxiliary function method. This procedure is initialized by exploiting the unwrapping technique in order to generate estimates that benefit from a temporal continuity property. Experiments conducted on realistic music pieces show that, given accurate magnitude estimates, this procedure outperforms the state-of-the-art consistent Wiener filter.
Training and Compensation of Class-conditioned NMF Bases for Speech Enhancement
Hanwook Chung, Roland Badeau, Eric Plourde, Benoît Champagne
Neurocomputing, 2018.
```
@article{chung:hal-01682750,
  author = {Chung, Hanwook and Badeau, Roland and Plourde, Eric and Champagne, Beno{\^i}t},
  hal_id = {hal-01682750},
  hal_version = {v1},
  journal = {{Neurocomputing}},
  keywords = {classification ; variational Bayesian expectation-maximization ; Single-channel speech enhancement ; non-negative matrix factorization ; probabilistic generative model},
  pdf = {https://hal.archives-ouvertes.fr/hal-01682750/file/NEUCOMP_HC.pdf},
  publisher = {{Elsevier}},
  title = {{Training and Compensation of Class-conditioned NMF Bases for Speech Enhancement}},
  url = {https://hal.archives-ouvertes.fr/hal-01682750},
  year = {2018}
}
```
In this paper, we introduce a training and compensation algorithm for the class-conditioned basis vectors in the non-negative matrix factorization (NMF) model for single-channel speech enhancement. The main goal is to estimate the basis vectors of different signal sources in a way that prevents them from representing other sources, in order to reduce the residual noise components that have features similar to the speech signal. During the proposed training stage, the basis matrices for the clean speech and noises are estimated jointly by constraining them to belong to different classes. To this end, we employ the probabilistic generative model (PGM) of classification, specified by class-conditional densities, as an a priori distribution for the basis vectors. The update rules of the NMF and the PGM parameters of classification are jointly obtained by using the variational Bayesian expectation-maximization (VBEM) algorithm, which guarantees convergence to a stationary point. Another goal of the proposed algorithm is to handle a mismatch between the characteristics of the training and test data. This is accomplished during the proposed enhancement stage, where we implement a basis compensation scheme. Specifically, we use extra free basis vectors to capture the features which are not included in the training data. Objective experimental results for different combinations of speaker and noise types show that the proposed algorithm can provide better speech enhancement performance than the benchmark algorithms under various conditions.
A Generative Model for Non-Intrusive Load Monitoring in Commercial Buildings
Simon Henriet, Umut Şimşekli, Benoît Fuentes, Gael Richard
Energy and Buildings, 2018.
```
@article{henriet:hal-02705056,
  author = {Henriet, Simon and {\c S}im{\c s}ekli, Umut and Fuentes, Beno{\^i}t and Richard, Gael},
  hal_id = {hal-02705056},
  hal_version = {v1},
  journal = {{Energy and Buildings}},
  keywords = {NILM ; Commercial buildings ; Synthetic data generation ; Source separation ; Matrix factorization ; Statistical analysis},
  pdf = {https://hal.archives-ouvertes.fr/hal-02705056/file/1803.00515%281%29.pdf},
  publisher = {{Elsevier}},
  title = {{A Generative Model for Non-Intrusive Load Monitoring in Commercial Buildings}},
  url = {https://hal.archives-ouvertes.fr/hal-02705056},
  year = {2018}
}
```
In recent years, there has been increasing academic and industrial interest in analyzing the electrical consumption of commercial buildings. While it has similarities with the Non-Intrusive Load Monitoring (NILM) tasks for residential buildings, the nature of the signals collected from large commercial buildings introduces additional difficulties to NILM research, causing existing NILM approaches to fail. In addition, the number of publicly available datasets collected from commercial buildings is very limited, which makes NILM research even more challenging for this type of large building. In this study, we aim at addressing these issues. We first present an extensive statistical analysis of both commercial and residential measurements from public and private datasets and show important differences. Secondly, we develop an algorithm for generating synthetic current waveforms. We then demonstrate, using real measurements and quantitative metrics, that both our device model and our simulations are realistic and can be used to evaluate NILM algorithms. Finally, to encourage research on commercial buildings, we release a synthesized dataset.

Technical Reports

Research report on unified stochastic reverberation modeling
Roland Badeau
February 2018.
```
@techreport{badeau:hal-01715431,
  author = {Badeau, Roland},
  hal_id = {hal-01715431},
  hal_version = {v1},
  institution = {{T{\'e}l{\'e}com ParisTech}},
  keywords = {reverberation ; room impulse response ; room frequency response ; stochastic models ; Poisson processes ; stationary processes ; Wigner distribution ; mod{\`e}les stochastiques ; r{\'e}ponse fr{\'e}quentielle de salle ; r{\'e}ponse impulsionnelle de salle ; processus de Poisson ; processus stationnaires ; distribution de Wigner-Ville},
  month = feb,
  number = {2018D001},
  pdf = {https://hal.archives-ouvertes.fr/hal-01715431/file/Reverberation_Report_Badeau.pdf},
  title = {{Research report on unified stochastic reverberation modeling}},
  type = {Research Report},
  url = {https://hal.archives-ouvertes.fr/hal-01715431},
  year = {2018}
}
```
In the field of room acoustics, it is well known that reverberation can be characterized statistically in a particular region of the time-frequency domain (after the mixing time and above Schroeder’s frequency). Since the 1950s, various formulas have been established, focusing on particular aspects of reverberation: exponential decay over time, correlations between frequencies, correlations between sensors at each frequency, and time-frequency distribution. In this report, we introduce a new stochastic reverberation model that allows us to retrieve all these well-known results within a common mathematical framework. To the best of our knowledge, this is the first time that such a unification is presented. The benefits are multiple: several new formulas generalizing the classical results are established, which jointly characterize the spatial, temporal and spectral properties of late reverberation.

2017

Journal Articles

Non-linear auto-regressive models for cross-frequency coupling in neural time series
Tom Dupré La Tour, Lucille Tallot, Laetitia Grabot, Valérie Doyère, Virginie Van Wassenhove, Yves Grenier, Alexandre Gramfort
PLoS Computational Biology, December 2017.
```
@article{duprelatour:hal-01679078,
  author = {Dupr{\'e} La Tour, Tom and Tallot, Lucille and Grabot, Laetitia and Doy{\`e}re, Val{\'e}rie and Van Wassenhove, Virginie and Grenier, Yves and Gramfort, Alexandre},
  doi = {10.1371/journal.pcbi.1005893},
  hal_id = {hal-01679078},
  hal_version = {v1},
  journal = {{PLoS Computational Biology}},
  month = dec,
  number = {12},
  pages = {e1005893},
  publisher = {{Public Library of Science}},
  title = {{Non-linear auto-regressive models for cross-frequency coupling in neural time series}},
  url = {https://hal.archives-ouvertes.fr/hal-01679078},
  volume = {13},
  year = {2017}
}
```
We address the issue of reliably detecting and quantifying cross-frequency coupling (CFC) in neural time series. Based on non-linear auto-regressive models, the proposed method provides a generative and parametric model of the time-varying spectral content of the signals. As this method models the entire spectrum simultaneously, it avoids the pitfalls related to incorrect filtering or the use of the Hilbert transform on wide-band signals. As the model is probabilistic, it also provides a score of the model "goodness of fit" via the likelihood, enabling easy and legitimate model selection and parameter comparison; this data-driven feature is unique to our model-based approach. Using three datasets obtained with invasive neurophysiological recordings in humans and rodents, we demonstrate that these models are able to replicate previous results obtained with other metrics, but also reveal new insights such as the influence of the amplitude of the slow oscillation. Using simulations, we demonstrate that our parametric method can reveal neural couplings with shorter signals than non-parametric methods. We also show how the likelihood can be used to find optimal filtering parameters, suggesting new properties on the spectrum of the driving signal, but also to estimate the optimal delay between the coupled signals, enabling a directionality estimation in the coupling.
SMART : Règles d’associations temporelles de signaux sociaux pour la synthèse d’un Agent Conversationnel Animé avec une attitude spécifique
Kévin Bailly, Chloé Clavel, Thomas Janssoone, Gael Richard
Revue des Sciences et Technologies de l’Information - Série RIA : Revue d’Intelligence Artificielle, July 2017.
```
@article{bailly:hal-02287610,
  author = {Bailly, K{\'e}vin and Clavel, Chlo{\'e} and Janssoone, Thomas and Richard, Gael},
  hal_id = {hal-02287610},
  hal_local_reference = {TJ:RIA-17},
  hal_version = {v1},
  journal = {{Revue des Sciences et Technologies de l'Information - S{\'e}rie RIA : Revue d'Intelligence Artificielle}},
  month = jul,
  publisher = {{Lavoisier}},
  title = {{SMART : R{\`e}gles d'associations temporelles de signaux sociaux pour la synth{\`e}se d'un Agent Conversationnel Anim{\'e} avec une attitude sp{\'e}cifique}},
  url = {https://hal.telecom-paris.fr/hal-02287610},
  year = {2017}
}
```
To improve interaction between humans and Embodied Conversational Agents (ECAs), one of the major challenges in the field is to generate socially believable agents. In this article, we present a method, called SMART for Social Multimodal Association Rules with Timing, that automatically finds temporal associations between the use of social signals (head movements, facial expressions, prosody, ...) in videos of interactions between humans expressing different affective states (behaviour, attitude, emotions, ...). Our system is based on a sequence mining algorithm that lets it find temporal association rules between social signals automatically extracted from audio-video streams. SMART also analyzes the link between these rules and each affective state so as to keep only the relevant ones. Finally, SMART enriches them so that an ECA can easily be animated to express the desired state.

In this paper, we therefore formalize the implementation of SMART and justify its value through several studies. First, we show that the computed rules agree with the literature in psychology and sociology. We then present the results of perceptual evaluations that we conducted following studies of corpora featuring the expression of marked social attitudes.

Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification
Victor Bisot, Romain Serizel, Slim Essid, Gael Richard
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2017.

```
@article{Bisot2017,
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  journal = {IEEE Transactions on Audio, Speech, and Language Processing (TASLP)},
  title = {Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification},
  year = {2017},
  pdf = {https://perso.telecom-paristech.fr/essid/papers/VB_TASLP-17.pdf}
}
```

Règles d’Associations Temporelles de signaux sociaux pour la synthèse d’Agents Conversationnels Animés : Application aux attitudes sociales
Thomas Janssoone, Chloé Clavel, Kevin Bailly, Gael Richard
Revue des Sciences et Technologies de l’Information - Série RIA : Revue d’Intelligence Artificielle, 2017.
```
@article{janssoone:hal-02704911,
  author = {Janssoone, Thomas and Clavel, Chlo{\'e} and Bailly, Kevin and Richard, Gael},
  doi = {10.3166/RIA.31.511-537},
  hal_id = {hal-02704911},
  hal_version = {v1},
  journal = {{Revue des Sciences et Technologies de l'Information - S{\'e}rie RIA : Revue d'Intelligence Artificielle}},
  keywords = {R{\`e}gles d'Association Temporelle ; TITARL ; Agents virtuels ; attitudes sociales ; traitement du signal social ; Temporal Association Rules ; Virtual Agent ; interpersonal stance ; social signal processing},
  publisher = {{Lavoisier}},
  title = {{R{\`e}gles d'Associations Temporelles de signaux sociaux pour la synth{\`e}se d'Agents Conversationnels Anim{\'e}s : Application aux attitudes sociales}},
  url = {https://hal.inria.fr/hal-02704911},
  year = {2017}
}
```
To improve interaction between humans and embodied conversational agents (ECAs), one of the major challenges in the field is to generate socially believable agents. In this article, we present a method, called SMART for Social Multimodal Association Rules with Timing, that automatically finds temporal associations between the use of social signals (head movements, facial expressions, prosody, ...) in videos of interactions between humans expressing different affective states (behaviour, attitude, emotions, ...). Our system is based on a sequence mining algorithm that lets it find temporal association rules between social signals automatically extracted from audio-video streams. SMART also analyzes the link between these rules and each affective state so as to keep only the relevant ones. Finally, SMART enriches them so that an ECA can easily be animated to express the desired state. In this paper, we therefore formalize the implementation of SMART and justify its value through several studies. First, we show that the computed rules agree with the literature in psychology and sociology. We then present the results of perceptual evaluations that we conducted following studies of corpora featuring the expression of marked social attitudes.

ABSTRACT. In the field of Embodied Conversational Agents (ECAs), one of the main challenges is to generate socially believable agents. The long-run objective of the present study is to infer rules for the multimodal generation of agents’ socio-emotional behaviour. In this paper, we introduce the Social Multimodal Association Rules with Timing (SMART) algorithm. It proposes to learn the rules from the analysis of a multimodal corpus composed of audio-video recordings of human-human interactions. The proposed methodology consists in applying a sequence mining algorithm using automatically extracted social signals such as prosody, head movements and facial muscle activation as an input. This allows us to infer temporal association rules for the behaviour generation. We show that this method can automatically compute temporal association rules coherent with prior results found in the literature, especially in the psychology and sociology fields. The results of a perceptual evaluation confirm the ability of an agent based on temporal association rules to express a specific stance.

Conference Articles

UE-HRI: a new dataset for the study of user engagement in spontaneous human-robot interactions
Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, Angelica Lim
the 19th ACM International Conference, Glasgow, United Kingdom, November 2017.

```
@inproceedings{benyoussef:hal-02943475,
  address = {Glasgow, United Kingdom},
  author = {Ben-Youssef, Atef and Clavel, Chlo{\'e} and Essid, Slim and Bilac, Miriam and Chamoux, Marine and Lim, Angelica},
  booktitle = {{the 19th ACM International Conference}},
  doi = {10.1145/3136755.3136814},
  hal_id = {hal-02943475},
  hal_version = {v1},
  month = nov,
  pages = {464-472},
  publisher = {{ACM Press}},
  title = {{UE-HRI: a new dataset for the study of user engagement in spontaneous human-robot interactions}},
  url = {https://hal.archives-ouvertes.fr/hal-02943475},
  year = {2017}
}
```

Amplitude and Phase Dereverberation of Harmonic Signals
Arthur Belhomme, Roland Badeau, Yves Grenier, Eric Humbert
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, October 2017.
```
@inproceedings{belhomme:hal-01548475,
  address = {New Paltz, New York, United States},
  author = {Belhomme, Arthur and Badeau, Roland and Grenier, Yves and Humbert, Eric},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-01548475},
  hal_version = {v1},
  keywords = {dereverberation ; phase ; sinusoidal modeling},
  month = oct,
  pdf = {https://hal.archives-ouvertes.fr/hal-01548475/file/Belhomme-WASPAA-2017.pdf},
  series = {Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  title = {{Amplitude and Phase Dereverberation of Harmonic Signals}},
  url = {https://hal.archives-ouvertes.fr/hal-01548475},
  year = {2017}
}
```
While most dereverberation methods focus on how to estimate the magnitude of an anechoic signal in the time-frequency domain, we propose a method which also takes the phase into account. By applying a harmonic model to the anechoic signal, we derive a formulation to compute the amplitude and phase of each harmonic. These parameters are then estimated by our method in the presence of reverberation. As we jointly estimate the amplitude and phase of the clean signal, we achieve a very strong dereverberation, resulting in a significant improvement of standard dereverberation objective measures over the state-of-the-art.
Explaining the Parameterized Wiener Filter with Alpha-Stable Processes
Mathieu Fontaine, Antoine Liutkus, Laurent Girin, Roland Badeau
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, October 2017.
```
@inproceedings{fontaine:hal-01548508,
  address = {New Paltz, New York, United States},
  author = {Fontaine, Mathieu and Liutkus, Antoine and Girin, Laurent and Badeau, Roland},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-01548508},
  hal_version = {v1},
  keywords = {denoising ; Wiener filtering ; alpha-stable processes ; probability theory},
  month = oct,
  pdf = {https://hal.archives-ouvertes.fr/hal-01548508/file/explaining-parameterized-wiener%284%29.pdf},
  series = {Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  title = {{Explaining the Parameterized Wiener Filter with Alpha-Stable Processes}},
  url = {https://hal.archives-ouvertes.fr/hal-01548508},
  year = {2017}
}
```
This paper introduces a new method for single-channel denoising that sheds new light on classical early developments on this topic that occurred in the 1970s and 1980s with Wiener filtering and spectral subtraction. Both operating in the short-time Fourier transform domain, these methods consist in estimating the power spectral density (PSD) of the noise without speech. Then, the clean speech signal is obtained by manipulating the corrupted time-frequency bins using these noise PSD estimates. Theoretically grounded when using power spectra, these methods were subsequently generalized to magnitude spectra, or shown to yield better performance by weighting the PSDs in the so-called parameterized Wiener filter. Both these strategies were long considered ad hoc. To the best of our knowledge, while we recently proposed an interpretation of magnitude processing, there is still no theoretical result that would justify the better performance of parameterized Wiener filters. Here, we show how the α-stable probabilistic model for waveforms naturally leads to these weighted filters and we provide a grounded and fast algorithm to enhance corrupted audio that compares favorably with classical denoising methods.
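The weighted filters mentioned in the abstract above can be illustrated with a short Python sketch of an exponent-weighted soft mask. This is an assumption-laden illustration and not the paper's estimator: the function names and the `alpha / 2` exponent on the PSDs are hypothetical choices, picked so that `alpha = 2` gives the classical Wiener gain on power spectra and `alpha = 1` gives a ratio of magnitude spectra.

```python
import numpy as np

def weighted_wiener_gain(speech_psd, noise_psd, alpha=2.0, eps=1e-12):
    """Soft mask built from PSDs raised to the exponent alpha / 2 (illustrative)."""
    s = np.power(np.asarray(speech_psd, dtype=float), alpha / 2.0)
    n = np.power(np.asarray(noise_psd, dtype=float), alpha / 2.0)
    # eps guards against division by zero in silent bins.
    return s / (s + n + eps)

def denoise(noisy_stft, speech_psd, noise_psd, alpha=2.0):
    """Apply the mask to each time-frequency bin of the noisy STFT."""
    return weighted_wiener_gain(speech_psd, noise_psd, alpha) * noisy_stft
```

Lowering `alpha` flattens the mask, so bins with a moderate speech-to-noise ratio are attenuated more than with the power-spectrum version.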
Separating Time-Frequency Sources from Time-Domain Convolutive Mixtures Using Non-negative Matrix Factorization
Simon Leglaive, Roland Badeau, Gael Richard
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, October 2017.
```
@inproceedings{leglaive:hal-01548469,
  address = {New Paltz, New York, United States},
  author = {Leglaive, Simon and Badeau, Roland and Richard, Gael},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-01548469},
  hal_version = {v1},
  keywords = {Audio source separation ; reverberant mixtures ; non-negative matrix factorization ; variational inference},
  month = oct,
  pdf = {https://hal.archives-ouvertes.fr/hal-01548469/file/LeglaiveBadeauRichard.pdf},
  series = {Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  title = {{Separating Time-Frequency Sources from Time-Domain Convolutive Mixtures Using Non-negative Matrix Factorization}},
  url = {https://hal.archives-ouvertes.fr/hal-01548469},
  year = {2017}
}
```
This paper addresses the problem of under-determined audio source separation in multichannel reverberant mixtures. We target a semi-blind scenario assuming that the mixing filters are known. Source separation is performed from the time-domain mixture signals in order to accurately model the convolutive mixing process. The source signals are however modeled as latent variables in a time-frequency domain. In a previous paper we proposed to use the modified discrete cosine transform. The present paper generalizes the method to the use of the odd-frequency short-time Fourier transform. In this domain, the source coefficients are modeled as centered complex Gaussian random variables whose variances are structured by means of a non-negative matrix factorization model. The inference procedure relies on a variational expectation-maximization algorithm. In the experiments we discuss the choice of the source representation and we show that the proposed approach outperforms two methods from the literature.
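As a rough illustration of variances structured by a non-negative matrix factorization, the Python sketch below fits a rank-K model to a power spectrogram with the well-known multiplicative updates for the Itakura-Saito divergence. This is a deliberately simpler stand-in: it is not the variational expectation-maximization algorithm of the paper, it ignores the time-domain convolutive mixing model, and all names and defaults are hypothetical.

```python
import numpy as np

def is_divergence(V, V_hat):
    """Itakura-Saito divergence between a power spectrogram and its model."""
    R = V / V_hat
    return float(np.sum(R - np.log(R) - 1.0))

def is_nmf(V, rank, n_iter=100, seed=0, eps=1e-12):
    """Fit V ~ W @ H with heuristic multiplicative updates (illustrative)."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps    # spectral templates
    H = rng.random((rank, n_frames)) + eps  # temporal activations
    for _ in range(n_iter):
        V_hat = W @ H + eps
        H *= (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat))
        V_hat = W @ H + eps
        W *= ((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T)
    return W, H
```

Under a centered complex Gaussian model, each entry of `W @ H` plays the role of the variance of one time-frequency coefficient, which is the link to the source model described above.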
Lévy NMF for Robust Nonnegative Source Separation
Paul Magron, Roland Badeau, Antoine Liutkus
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2017), New Paltz, NY, United States, October 2017.
```
@inproceedings{magron:hal-01548488,
  address = {New Paltz, NY, United States},
  author = {Magron, Paul and Badeau, Roland and Liutkus, Antoine},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2017)}},
  hal_id = {hal-01548488},
  hal_version = {v1},
  keywords = {audio source separation ; L{\'e}vy distribution ; Positive alpha-stable distribution ; nonnegative matrix factorization},
  month = oct,
  organization = {{IEEE}},
  pdf = {https://hal.archives-ouvertes.fr/hal-01548488/file/levy_waspaa17.pdf},
  title = {{L{\'e}vy NMF for Robust Nonnegative Source Separation}},
  url = {https://hal.archives-ouvertes.fr/hal-01548488},
  year = {2017}
}
```
Source separation, which consists in decomposing data into meaningful structured components, is an active research topic in music signal processing. In this paper, we introduce the Positive α-stable (PαS) distributions to model the latent sources, which are a subclass of the stable distributions family. They notably permit us to model random variables that are both nonnegative and impulsive. Considering the Lévy distribution, the only PαS distribution whose density is tractable, we propose a mixture model called Lévy Non-negative Matrix Factorization (Lévy NMF). This model accounts for low-rank structures in nonnegative data that possibly has high variability or is corrupted by very adverse noise. The model parameters are estimated in a maximum-likelihood sense. We also derive an estimator of the sources, which extends the validity of the Wiener filtering to the PαS case. Experiments on synthetic data and realistic music signals show that Lévy NMF compares favorably with state-of-the-art techniques in terms of robustness to impulsive noise and highlight its potential for decomposing nonnegative data.
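To make the "nonnegative and impulsive" property concrete, the Python sketch below draws Lévy-distributed samples whose scale follows a low-rank product. It relies on the standard fact that `c / Z**2` follows a Lévy distribution with location 0 and scale `c` when `Z` is standard normal. The matrices `W` and `H`, their sizes, and the use of `W @ H` directly as the scale are assumptions for illustration; the paper's exact parameterization and its estimation procedure are not reproduced here.

```python
import numpy as np

def sample_levy(scale, rng):
    """Draw Levy(0, scale) samples via X = scale / Z**2 with Z ~ N(0, 1)."""
    scale = np.asarray(scale, dtype=float)
    z = rng.standard_normal(scale.shape)
    return scale / z**2

rng = np.random.default_rng(0)
W = rng.random((8, 2)) + 0.1   # hypothetical nonnegative dictionary
H = rng.random((2, 50)) + 0.1  # hypothetical nonnegative activations
X = sample_levy(W @ H, rng)    # nonnegative data with heavy-tailed outliers
```

Every entry of `X` is positive, and a few entries are typically orders of magnitude larger than the median, which is the kind of data a heavy-tailed model is meant to handle.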

Guiding Audio Source Separation by Video Object Information
Sanjeel Parekh, Slim Essid, Alexey Ozerov, Quang-Khanh-Ngoc Duong, Patrick Perez, Gael Richard
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, October 2017.

```
@inproceedings{parekh:hal-02287698,
  address = {New Paltz, New York, United States},
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Perez, Patrick and Richard, Gael},
  booktitle = {{IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-02287698},
  hal_local_reference = {Parekh2017b},
  hal_version = {v1},
  month = oct,
  title = {{Guiding Audio Source Separation by Video Object Information}},
  url = {https://hal.telecom-paris.fr/hal-02287698},
  year = {2017}
}
```

Amplitude and Phase Dereverberation of Harmonic Signals
Arthur Belhomme, Roland Badeau, Yves Grenier, Éric Humbert
WASPAA, New Paltz, New York, USA, October 2017.
```
@inproceedings{AB:WASPAA-17,
  author = {Belhomme, Arthur and Badeau, Roland and Grenier, Yves and Humbert, {\'E}ric},
  title = {Amplitude and Phase Dereverberation of Harmonic Signals},
  booktitle = {WASPAA},
  publisher = {IEEE},
  address = {New Paltz, New York, USA},
  year = {2017},
  month = oct,
  keywords = {dereverberation, phase, sinusoidal modeling}
}
```
While most dereverberation methods focus on how to estimate the magnitude of an anechoic signal in the time-frequency domain, we propose a method which also takes the phase into account. By applying a harmonic model to the anechoic signal, we derive a formulation to compute the amplitude and phase of each harmonic. These parameters are then estimated by our method in presence of reverberation. As we jointly estimate the amplitude and phase of the clean signal, we achieve a very strong dereverberation, resulting in a significant improvement of standard dereverberation objective measures over the state-of-the-art.
Séparation de sources audio en milieu réverbérant : Factorisation en matrices non-négatives et représentation temporelle du mélange convolutif
Simon Leglaive, Roland Badeau, Gael Richard
Colloque GRETSI, Juan-Les-Pins, France, September 2017.
```
@inproceedings{leglaive:hal-01540481,
  address = {Juan-Les-Pins, France},
  author = {Leglaive, Simon and Badeau, Roland and Richard, Gael},
  booktitle = {{Colloque GRETSI}},
  hal_id = {hal-01540481},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-01540481/file/LeglaiveBadeauRichard_final.pdf},
  series = {Actes du XXVI{\`e}me Colloque GRETSI},
  title = {{S{\'e}paration de sources audio en milieu r{\'e}verb{\'e}rant : Factorisation en matrices non-n{\'e}gatives et repr{\'e}sentation temporelle du m{\'e}lange convolutif}},
  url = {https://hal.archives-ouvertes.fr/hal-01540481},
  year = {2017}
}
```
This article addresses the problem of under-determined audio source separation for multichannel reverberant mixtures. We target a semi-blind application in which the mixing filters are known. The proposed method works directly with the time-domain mixture signals. This approach represents the convolutive mixing process exactly and is therefore suited to separating highly reverberant mixtures. The source signals are represented in the modified discrete cosine transform domain, using a Gaussian model based on non-negative matrix factorization. Source inference relies on a variational expectation-maximization algorithm. We show experimentally the benefit of jointly using a time-domain representation of the convolutive mixture and a source model based on non-negative matrix factorization.
Lévy NMF : un modèle robuste de séparation de sources non-négatives
Paul Magron, Roland Badeau, Antoine Liutkus
Colloque GRETSI, Juan-Les-Pins, France, September 2017.
```
@inproceedings{magron:hal-01540484,
  address = {Juan-Les-Pins, France},
  author = {Magron, Paul and Badeau, Roland and Liutkus, Antoine},
  booktitle = {{Colloque GRETSI}},
  hal_id = {hal-01540484},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-01540484/file/levy-nmf.pdf},
  series = {Actes du XXVI{\`e}me Colloque GRETSI},
  title = {{L{\'e}vy NMF : un mod{\`e}le robuste de s{\'e}paration de sources non-n{\'e}gatives}},
  url = {https://hal.archives-ouvertes.fr/hal-01540484},
  year = {2017}
}
```
In this article, we address the robust separation of nonnegative sources. We introduce the Positive α-stable (PαS) distributions, a subset of the family of stable distributions, which model nonnegative latent variables. Because these distributions are heavy-tailed, they are naturally robust to outliers. Focusing on the Lévy distribution, the only PαS distribution whose density has a simple closed form, we develop a mixture model in which the dispersion parameters of the Lévy variables are structured by a nonnegative matrix factorization (NMF) model. This model, called Lévy NMF, is estimated in the maximum-likelihood sense. We also obtain an estimator of the sources that generalizes Wiener filtering to PαS distributions. Experiments conducted on music spectrograms and fluorescence spectra demonstrate the potential of this model for decomposing nonnegative data.
Histoire de la transformée de Mellin
Jean-Marie Nicolas, Roland Badeau
Colloque GRETSI, Juan-Les-Pins, France, September 2017.
```
@inproceedings{nicolas:hal-01540479,
  address = {Juan-Les-Pins, France},
  author = {Nicolas, Jean-Marie and Badeau, Roland},
  booktitle = {{Colloque GRETSI}},
  hal_id = {hal-01540479},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-01540479/file/articlemellin.pdf},
  series = {Actes du XXVI{\`e}me Colloque GRETSI},
  title = {{Histoire de la transform{\'e}e de Mellin}},
  url = {https://hal.archives-ouvertes.fr/hal-01540479},
  year = {2017}
}
```
The Mellin transform is probably the least known integral transform, yet also one of the most fundamental in many fields. Its genesis was very long, and it is difficult to give a precise reference for its introduction into the scientific literature. Long confined to a secondary role relative to the Fourier and Laplace transforms, it perhaps deserves to be restored to its rightful place among “modern” signal processing tools.

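Non-linear auto-regressive models for cross-frequency coupling in neural time series
Tom Dupré La Tour, Lucille Tallot, Laetitia Grabot, Valérie Doyère, Virginie Van Wassenhove, Yves Grenier, Alexandre Gramfort
C3S, Cologne, Germany, September 2017.
```
@inproceedings{TD:CS-2017,
  author = {{Dupr{\'e} La Tour}, Tom and Tallot, Lucille and Grabot, Laetitia and Doy{\`e}re, Val{\'e}rie and Van Wassenhove, Virginie and Grenier, Yves and Gramfort, Alexandre},
  title = {Non-linear auto-regressive models for cross-frequency coupling in neural time series},
  booktitle = {C3S},
  address = {Cologne, Germany},
  year = {2017},
  month = sep
}
```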

Amplitude and Phase Dereverberation of Monocomponent Signals
Arthur Belhomme, Roland Badeau, Yves Grenier, Eric Humbert
25th European Signal Processing Conference (EUSIPCO), Kos, Greece, August 2017.
```
@inproceedings{belhomme:hal-01531259,
  address = {Kos, Greece},
  author = {Belhomme, Arthur and Badeau, Roland and Grenier, Yves and Humbert, Eric},
  booktitle = {{25th European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-01531259},
  hal_version = {v1},
  month = aug,
  pages = {1320-1324},
  pdf = {https://hal.archives-ouvertes.fr/hal-01531259/file/Belhomme-EUSIPCO-17.pdf},
  series = {Proc. of 25th European Signal Processing Conference (EUSIPCO)},
  title = {{Amplitude and Phase Dereverberation of Monocomponent Signals}},
  url = {https://hal.archives-ouvertes.fr/hal-01531259},
  year = {2017}
}
```
While most dereverberation methods focus on how to estimate the amplitude of an anechoic signal, we propose a method which also takes the phase into account. By applying a sinusoidal model to the anechoic signal, we derive a formulation to compute the amplitude and phase of each sinusoid. These parameters are then estimated by our method in the reverberant case. As we jointly estimate the amplitude and phase of the clean signal, we achieve a very strong dereverberation, resulting in a significant improvement of dereverberation objective measures over the state-of-the-art.

EMOEEG: A new multimodal dataset for dynamic EEG-based emotion recognition with audiovisual elicitation
Anne-Claire Conneau, Ayoub Hajlaoui, Mohamed Chetouani, Slim Essid
2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, August 2017.

@inproceedings{conneau:hal-02422947,
  address = {Kos, Greece},
  author = {Conneau, Anne-Claire and Hajlaoui, Ayoub and Chetouani, Mohamed and Essid, Slim},
  booktitle = {{2017 25th European Signal Processing Conference (EUSIPCO)}},
  doi = {10.23919/EUSIPCO.2017.8081305},
  hal_id = {hal-02422947},
  hal_version = {v1},
  month = aug,
  pages = {738-742},
  publisher = {{IEEE}},
  title = {{EMOEEG: A new multimodal dataset for dynamic EEG-based emotion recognition with audiovisual elicitation}},
  url = {https://hal.sorbonne-universite.fr/hal-02422947},
  year = {2017}
}

Scalable Source Localization with Multichannel Alpha-Stable Distributions
Mathieu Fontaine, Charles Vanwynsberghe, Antoine Liutkus, Roland Badeau
25th European Signal Processing Conference (EUSIPCO), Kos, Greece, August 2017.
```
@inproceedings{fontaine:hal-01531252,
  address = {Kos, Greece},
  author = {Fontaine, Mathieu and Vanwynsberghe, Charles and Liutkus, Antoine and Badeau, Roland},
  booktitle = {{25th European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-01531252},
  hal_version = {v1},
  keywords = {sketching ; source localization ; acoustic modeling ; alpha-stable random variables ; spectral measure},
  month = aug,
  pages = {11-15},
  pdf = {https://hal.archives-ouvertes.fr/hal-01531252/file/EUSIPCO.pdf},
  series = {Proc. of 25th European Signal Processing Conference (EUSIPCO)},
  title = {{Scalable Source Localization with Multichannel Alpha-Stable Distributions}},
  url = {https://hal.archives-ouvertes.fr/hal-01531252},
  year = {2017}
}
```
In this paper, we focus on the problem of sound source localization and we propose a technique that exploits the known and arbitrary geometry of the microphone array. While most probabilistic techniques presented in the past rely on Gaussian models, we go further in this direction and detail a method for source localization that is based on the recently proposed alpha-stable harmonizable processes. They include Cauchy and Gaussian as special cases and their remarkable feature is to allow a simple modeling of impulsive and real-world sounds with few parameters. The approach we present builds on the classical convolutive mixing model. It requires only a single pass through the data, also works in the underdetermined case of more sources than microphones, and allows massively parallelizable implementations operating in the time-frequency domain. We show that the method yields interesting performance for acoustic imaging in realistic simulations.
Semi-Blind Student’s t Source Separation for Multichannel Audio Convolutive Mixtures
Simon Leglaive, Roland Badeau, Gael Richard
25th European Signal Processing Conference (EUSIPCO), Kos, Greece, August 2017.
```
@inproceedings{leglaive:hal-01531243,
  address = {Kos, Greece},
  author = {Leglaive, Simon and Badeau, Roland and Richard, Gael},
  booktitle = {{25th European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-01531243},
  hal_version = {v1},
  keywords = {Under-determined audio source separation ; multichannel convolutive mixture ; Student's t distribution ; non-negative matrix factorization ; variational inference},
  month = aug,
  pages = {2323-2327},
  pdf = {https://hal.archives-ouvertes.fr/hal-01531243/file/LeglaiveBadeauRichard_final.pdf},
  series = {Proc. of 25th European Signal Processing Conference (EUSIPCO)},
  title = {{Semi-Blind Student's t Source Separation for Multichannel Audio Convolutive Mixtures}},
  url = {https://hal.archives-ouvertes.fr/hal-01531243},
  year = {2017}
}
```
This paper addresses the problem of multichannel audio source separation in under-determined convolutive mixtures. We target a semi-blind scenario assuming that the mixing filters are known. The convolutive mixing process is exactly modeled using the time-domain impulse responses of the mixing filters. We propose a Student’s t time-frequency source model based on non-negative matrix factorization (NMF). The Student’s t distribution being heavy-tailed with respect to the Gaussian, it provides some flexibility in the modeling of the sources. We also study a simpler Student’s t sparse source model within the same general source separation framework. The inference procedure relies on a variational expectation-maximization algorithm. Experiments show the advantage of using an NMF model compared with the sparse source model. While the Student’s t NMF source model leads to slightly better results than our previous Gaussian one, we demonstrate the superiority of our method over two other approaches from the literature.
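
The heavy-tail remark in the abstract above can be checked numerically. The sketch below is only an illustration of the two densities, not the source model of the paper: it compares the log-density of a zero-mean Student's t distribution with that of a unit-variance Gaussian. The function names and the choice of 3 degrees of freedom are assumptions made for this example.

```python
import math


def gaussian_logpdf(x, sigma=1.0):
    """Log-density of a zero-mean Gaussian with standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - x * x / (2.0 * sigma ** 2)


def student_t_logpdf(x, nu, scale=1.0):
    """Log-density of a zero-mean Student's t with nu degrees of freedom."""
    z = x / scale
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(scale)
        - (nu + 1.0) / 2.0 * math.log1p(z * z / nu)
    )


if __name__ == "__main__":
    # Far from the mode, the Student's t density decays polynomially,
    # so it assigns much more mass to outliers than the Gaussian does.
    for x in (0.0, 2.0, 5.0):
        print(x, gaussian_logpdf(x), student_t_logpdf(x, nu=3.0))
```

With `nu=1.0` the Student's t reduces to the Cauchy distribution, and as `nu` grows it approaches the Gaussian.
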
Amplitude and Phase Dereverberation of Monocomponent Signals
Arthur Belhomme, Roland Badeau, Yves Grenier, Éric Humbert
EUSIPCO, Kos, Greece, August 2017.
```
@inproceedings{AB:EUSIPCO-17,
  author = {Belhomme, Arthur and Badeau, Roland and Grenier, Yves and Humbert, {\'E}ric},
  title = {Amplitude and Phase Dereverberation of Monocomponent Signals},
  booktitle = {EUSIPCO},
  address = {Kos, Greece},
  year = {2017},
  month = aug,
  pages = {1320--1324}
}
```

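Non-linear auto-regressive models for cross-frequency coupling in neural time series
Tom Dupré La Tour, Lucile Tallot, Laeticia Grabot, Valérie Doyère, Virginie Van Wassenhove, Yves Grenier, Alexandre Gramfort
OHBM, Vancouver, Canada, June 2017.

@inproceedings{TD:OHBM-2017,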
  author = {{Dupr{\'e} La Tour}, Tom and Tallot, Lucile and Grabot, Laeticia and Doy{\`e}re, Val{\'e}rie and Van Wassenhove, Virginie and Grenier, Yves and Gramfort, Alexandre},
  title = {Non-linear auto-regressive models for cross-frequency coupling in neural time series},
  booktitle = {OHBM},
  address = {Vancouver, Canada},
  year = {2017},
  month = jun
}

Overlapping sound event detection with supervised Nonnegative Matrix Factorization
Victor Bisot, Slim Essid, Gael Richard
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, United States, March 2017.

@inproceedings{bisot:hal-02713341,
  address = {New Orleans, United States},
  author = {Bisot, Victor and Essid, Slim and Richard, Gael},
  booktitle = {{2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2017.7951792},
  hal_id = {hal-02713341},
  hal_version = {v1},
  month = mar,
  pages = {31-35},
  publisher = {{IEEE}},
  title = {{Overlapping sound event detection with supervised Nonnegative Matrix Factorization}},
  url = {https://hal.inria.fr/hal-02713341},
  year = {2017}
}

Parametric estimation of spectrum driven by an exogenous signal
Tom Dupré La Tour, Yves Grenier, Alexandre Gramfort
42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, United States, March 2017.
```
@inproceedings{duprelatour:hal-01448603,
  address = {New Orleans, LA, United States},
  author = {Dupr{\'e} La Tour, Tom and Grenier, Yves and Gramfort, Alexandre},
  booktitle = {{42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017)}},
  hal_id = {hal-01448603},
  hal_version = {v2},
  keywords = {non-linear auto-regressive models ; spectrum estimation ; electrophysiology ; phase-amplitude coupling},
  month = mar,
  pdf = {https://hal.archives-ouvertes.fr/hal-01448603v2/file/duprelatour2017.pdf},
  series = {Proc. in ICASSP},
  title = {{Parametric estimation of spectrum driven by an exogenous signal}},
  url = {https://hal.archives-ouvertes.fr/hal-01448603},
  year = {2017}
}
```
In this paper, we introduce new parametric generative driven auto-regressive (DAR) models. DAR models provide a non-linear and non-stationary spectral estimation of a signal, conditionally on another, exogenous signal. We detail how inference can be done efficiently while guaranteeing model stability. We show how model comparison and hyper-parameter selection can be done using likelihood estimates. We also point out the limits of DAR models when the exogenous signal contains frequencies that are too high. Finally, we illustrate how DAR models can be applied to neurophysiological signals to characterize phase-amplitude coupling.
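
To make the idea of an auto-regressive model driven by an exogenous signal concrete, here is a minimal sketch. It assumes, purely for illustration, an order-1 model whose single AR coefficient depends linearly on the driver, `y[t] = (b0 + b1 * x[t]) * y[t-1] + noise`, fitted by ordinary least squares. This parametrization, the function names, and the sinusoidal driver are all assumptions of the example. It is not the inference procedure of the paper and does not enforce the stability guarantee mentioned in the abstract.

```python
import math
import random


def simulate_dar1(n, b0, b1, driver, seed=0):
    """Simulate y[t] = (b0 + b1 * driver[t]) * y[t-1] + unit-variance Gaussian noise."""
    rng = random.Random(seed)
    y = [0.0] * n
    for t in range(1, n):
        a_t = b0 + b1 * driver[t]  # time-varying AR coefficient
        y[t] = a_t * y[t - 1] + rng.gauss(0.0, 1.0)
    return y


def fit_dar1(y, driver):
    """Least-squares estimate of (b0, b1) from the signal and its driver."""
    suu = suv = svv = suy = svy = 0.0
    for t in range(1, len(y)):
        u = y[t - 1]              # regressor attached to b0
        v = driver[t] * y[t - 1]  # regressor attached to b1
        suu += u * u
        suv += u * v
        svv += v * v
        suy += u * y[t]
        svy += v * y[t]
    # Solve the 2x2 normal equations by Cramer's rule.
    det = suu * svv - suv * suv
    b0 = (suy * svv - suv * svy) / det
    b1 = (suu * svy - suv * suy) / det
    return b0, b1


if __name__ == "__main__":
    n = 5000
    driver = [math.sin(2.0 * math.pi * t / 200.0) for t in range(n)]
    y = simulate_dar1(n, b0=0.5, b1=0.4, driver=driver)
    print(fit_dar1(y, driver))
```

Because the coefficient `b0 + b1 * x[t]` stays within (0.1, 0.9) for a driver in [-1, 1], the simulated signal remains stable, while its spectrum changes with the phase of the driver.
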

Parametric estimation of spectrum driven by an exogenous signal
Tom Dupré La Tour, Yves Grenier, Alexandre Gramfort
ICASSP, New Orleans, March 2017.

@inproceedings{TD:ICASSP-17,
  author = {{Dupr{\'e} La Tour}, Tom and Grenier, Yves and Gramfort, Alexandre},
  title = {Parametric estimation of spectrum driven by an exogenous signal},
  booktitle = {ICASSP},
  address = {New Orleans},
  year = {2017},
  month = mar,
  keywords = {non-linear auto-regressive models, nonstationary, spectrum estimation, electrophysiology, phase-amplitude coupling}
}

Nonnegative Matrix Factorisation for multimodal data analysis
Slim Essid
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, Milan, Italy, February 2017.

@inproceedings{essid:hal-02288528,
  address = {Milan, Italy},
  author = {Essid, Slim},
  booktitle = {{Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano}},
  hal_id = {hal-02288528},
  hal_local_reference = {SE:POLIMI17},
  hal_version = {v1},
  month = feb,
  title = {{Nonnegative Matrix Factorisation for multimodal data analysis}},
  url = {https://hal.telecom-paris.fr/hal-02288528},
  year = {2017}
}

Parametric models of phase-amplitude coupling in neural time series
Tom Dupré La Tour, Yves Grenier, Alexandre Gramfort
BASP, Villars-sur-Ollon, Switzerland, January 2017.

@inproceedings{duprelatour:hal-02287880,
  address = {Villars-sur-Ollon, Switzerland},
  author = {Dupr{\'e} La Tour, Tom and Grenier, Yves and Gramfort, Alexandre},
  booktitle = {{BASP}},
  hal_id = {hal-02287880},
  hal_local_reference = {TD:BASP-2017},
  hal_version = {v1},
  month = jan,
  title = {{Parametric models of phase-amplitude coupling in neural time series}},
  url = {https://hal.telecom-paris.fr/hal-02287880},
  year = {2017}
}

EMOEEG: a New Multimodal Dataset for Dynamic EEG-based Emotion Recognition with Audiovisual Elicitation
Anne-Claire Conneau, Ayoub Hajlaoui, Mohamed Chetouani, Slim Essid
The European Signal Processing Conference (EUSIPCO), Kos island, Greece, 2017.

@inproceedings{conneau:hal-02288498,
  address = {Kos island, Greece},
  author = {Conneau, Anne-Claire and Hajlaoui, Ayoub and Chetouani, Mohamed and Essid, Slim},
  booktitle = {{The European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-02288498},
  hal_local_reference = {Conneau2017},
  hal_version = {v1},
  title = {{EMOEEG: a New Multimodal Dataset for Dynamic EEG-based Emotion Recognition with Audiovisual Elicitation}},
  url = {https://hal.telecom-paris.fr/hal-02288498},
  year = {2017}
}

Sketching for nearfield acoustic imaging of heavy-tailed sources
Mathieu Fontaine, Charles Vanwynsberghe, Antoine Liutkus, Roland Badeau
International Conference on Latent Variable Analysis and Signal Separation, 2017.

@inproceedings{fontaine_sketching_2017,
  title = {Sketching for nearfield acoustic imaging of heavy-tailed sources},
  copyright = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA)},
  booktitle = {International {Conference} on {Latent} {Variable} {Analysis} and {Signal} {Separation}},
  publisher = {Springer},
  author = {Fontaine, Mathieu and Vanwynsberghe, Charles and Liutkus, Antoine and Badeau, Roland},
  year = {2017},
  pages = {80--88}
}

Patents

Procédé et dispositif pour estimer un signal déréverbéré
Arthur Belhomme, Roland Badeau, Yves Grenier, Eric Humbert
France, May 2017.

@patent{belhomme:hal-02287693,
  address = {France},
  author = {Belhomme, Arthur and Badeau, Roland and Grenier, Yves and Humbert, Eric},
  hal_id = {hal-02287693},
  hal_local_reference = {ABE-17},
  hal_version = {v1},
  month = may,
  number = {FR1751073},
  pages = {28},
  title = {{Proc{\'e}d{\'e} et dispositif pour estimer un signal d{\'e}r{\'e}verb{\'e}r{\'e}}},
  url = {https://hal.telecom-paris.fr/hal-02287693},
  year = {2017}
}

2016

Patents

Procédé et dispositif pour estimer la réverbération acoustique
Arthur Belhomme, Roland Badeau, Yves Grenier, Eric Humbert
France, December 2016.

@patent{belhomme:hal-02287692,
  address = {France},
  author = {Belhomme, Arthur and Badeau, Roland and Grenier, Yves and Humbert, Eric},
  hal_id = {hal-02287692},
  hal_local_reference = {ABE:16},
  hal_version = {v1},
  month = dec,
  number = {PCT/FR2016/053034},
  pages = {20},
  title = {{Proc{\'e}d{\'e} et dispositif pour estimer la r{\'e}verb{\'e}ration acoustique}},
  url = {https://hal.telecom-paris.fr/hal-02287692},
  year = {2016}
}

Dispositif a Casque Audio Perfectionne
Slim Essid, Raphael Blouet
November 2016.

@patent{SE:patent16,
  author = {Essid, Slim and Blouet, Raphael},
  title = {Dispositif a Casque Audio Perfectionne},
  year = {2016},
  month = nov,
  url = {https://perso.telecom-paristech.fr/essid/papers/FR3059191A1.pdf},
  number = {1661324}
}

Conference Articles

Anechoic phase estimation from reverberant signals
Arthur Belhomme, Yves Grenier, Roland Badeau, Eric Humbert
15th International Workshop on Acoustic Signal Enhancement (IWAENC), Xi’an, China, September 2016.
```
@inproceedings{belhomme:hal-01337860,
  address = {Xi'an, China},
  author = {Belhomme, Arthur and Grenier, Yves and Badeau, Roland and Humbert, Eric},
  booktitle = {{15th International Workshop on Acoustic Signal Enhancement (IWAENC)}},
  hal_id = {hal-01337860},
  hal_version = {v1},
  keywords = {Dereverberation ; phase ; reassignment ; sinusoidal modeling},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-01337860/file/abe_iwaenc2016_paper.pdf},
  series = {Proc. of 15th International Workshop on Acoustic Signal Enhancement (IWAENC)},
  title = {{Anechoic phase estimation from reverberant signals}},
  url = {https://hal.archives-ouvertes.fr/hal-01337860},
  year = {2016}
}
```
Most dereverberation methods aim to reconstruct the anechoic magnitude spectrogram, given a reverberant signal. Regardless of the method, the dereverberated signal is systematically synthesized with the reverberant phase. This corrupted phase reintroduces reverberation and distortion in the signal. This is why we intend to also reconstruct the anechoic phase, given a reverberant signal. Before processing speech signals, we propose in this paper a method for estimating the anechoic phase of reverberant chirp signals. Our method presents an accurate estimation of the instantaneous phase and improves objective measures of dereverberation.
SUPERVISED NONNEGATIVE MATRIX FACTORIZATION FOR ACOUSTIC SCENE CLASSIFICATION
Victor Bisot, Romain Serizel, Slim Essid, Gael Richard
IEEE international evaluation campaign on detection and classification of acoustic scenes and events (DCASE 2016), Budapest, Hungary, September 2016.
```
@inproceedings{bisot:hal-02943480,
  address = {Budapest, Hungary},
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  booktitle = {{IEEE international evaluation campaign on detection and classification of acoustic scenes and events (DCASE 2016)}},
  hal_id = {hal-02943480},
  hal_version = {v1},
  keywords = {Acoustic Scene Classification ; Feature learning ; Matrix Factorization},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-02943480/file/VB_DCASE-16.pdf},
  title = {{SUPERVISED NONNEGATIVE MATRIX FACTORIZATION FOR ACOUSTIC SCENE CLASSIFICATION}},
  url = {https://hal.archives-ouvertes.fr/hal-02943480},
  year = {2016}
}
```
This report describes our contribution to the 2016 IEEE AASP DCASE challenge for the acoustic scene classification task. We propose a feature learning approach following the idea of decomposing time-frequency representations with nonnegative matrix factorization. We aim at learning a common dictionary representing the data and use projections on this dictionary as features for classification. Our system is based on a novel supervised extension of nonnegative matrix factorization. In the approach we propose, the dictionary and the classifier are optimized jointly in order to find a representation suited to minimizing the classification cost. The proposed method significantly outperforms the baseline and provides improved results compared to unsupervised nonnegative matrix factorization.
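
As background for the dictionary-learning idea above, the following sketch shows plain, unsupervised nonnegative matrix factorization with the classic multiplicative updates for the squared Euclidean cost. It is not the supervised extension proposed in the report: there is no classifier and no joint optimization. The helper names, the toy matrix, and the rank are assumptions made for the example. The columns of `h` play the role of the projections that would be used as features.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of v - w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a nonnegative matrix v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update the activations h with the dictionary w fixed.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)] for k in range(rank)]
        # Update the dictionary w with the activations h fixed.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)] for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Toy "spectrogram": 4 frequency bins, 6 frames, built from two patterns.
    v = [[1, 0, 2, 0, 1, 0],
         [2, 0, 4, 0, 2, 0],
         [0, 3, 0, 1, 0, 2],
         [0, 6, 0, 2, 0, 4]]
    w, h = nmf(v, rank=2)
    print(frobenius_error(v, w, h))
```
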
Feature Adapted Convolutional Neural Networks for Downbeat Tracking
Simon Durand, Juan P. Bello, Bertrand David, Gael Richard
ICASSP 2016, Shanghai, China, March 2016.
```
@inproceedings{durand:hal-02287268,
  address = {Shanghai, China},
  author = {Durand, Simon and Bello, Juan P. and David, Bertrand and Richard, Gael},
  booktitle = {{ICASSP 2016}},
  hal_id = {hal-02287268},
  hal_local_reference = {DBDR:ICASSP-16},
  hal_version = {v1},
  month = mar,
  title = {{Feature Adapted Convolutional Neural Networks for Downbeat Tracking}},
  url = {https://hal.telecom-paris.fr/hal-02287268},
  year = {2016}
}
```
We define a novel system for the automatic estimation of downbeat positions from audio music signals. New rhythmic and melodic features are introduced and feature adapted convolutional neural networks are used to take advantage of their specificity. Indeed, invariance to melody transposition, chroma data augmentation and length-specific rhythmic patterns prove to be useful to learn downbeat likelihood. After the data is segmented into tatums, complementary features related to melody, rhythm and harmony are extracted and the likelihood of a tatum being at a downbeat position is computed with the aforementioned neural networks. The downbeat sequence is then extracted with a flexible temporal hidden Markov model. We then show the efficiency and robustness of our approach with a comparative evaluation conducted on 9 datasets.
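
The last decoding step, turning per-tatum downbeat likelihoods into a downbeat sequence with a hidden Markov model, can be sketched with a toy model. The example below is an assumption-laden illustration, not the temporal model of the paper: states are positions within a bar of fixed length, the model advances cyclically with high probability, and the likelihood `d` of a tatum is used directly as an emission score (`d` for the downbeat state, `1 - d` for the others). All names and parameter values are hypothetical.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for an HMM given log-probabilities."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def decode_downbeats(likelihoods, bar_length=4, p_advance=0.97):
    """Return the tatum indices decoded as downbeats (bar position 0)."""
    n = bar_length
    p_other = (1.0 - p_advance) / (n - 1)
    log_trans = [[math.log(p_advance if s == (p + 1) % n else p_other) for s in range(n)]
                 for p in range(n)]
    log_init = [math.log(1.0 / n)] * n
    log_emit = [[math.log(d) if s == 0 else math.log(1.0 - d) for s in range(n)]
                for d in likelihoods]
    path = viterbi(log_init, log_trans, log_emit)
    return [t for t, s in enumerate(path) if s == 0]


if __name__ == "__main__":
    # Likelihoods strictly between 0 and 1; the downbeat at tatum 4 is weak.
    likelihoods = [0.9, 0.2, 0.1, 0.3, 0.4, 0.1, 0.2, 0.1, 0.8, 0.1, 0.2, 0.1]
    print(decode_downbeats(likelihoods))
```

The temporal model lets the decoder keep the weak downbeat at tatum 4, because leaving the bar cycle costs more than one low likelihood.
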
Using Temporal Association Rules For the synthesis of Embodied Conversational Agent With a specific stance.
Thomas Janssoone, Chloé Clavel, Kévin Bailly, Gael Richard
International Conference on Intelligent Virtual Agents, Los Angeles, United States, September 2016.
```
@inproceedings{janssoone:hal-02287401,
  address = {Los Angeles, United States},
  author = {Janssoone, Thomas and Clavel, Chlo{\'e} and Bailly, K{\'e}vin and Richard, Gael},
  booktitle = {{International Conference on Intelligent Virtual Agents}},
  hal_id = {hal-02287401},
  hal_local_reference = {TJ:IVA-16},
  hal_version = {v1},
  month = sep,
  number = {16th},
  title = {{Using Temporal Association Rules For the synthesis of Embodied Conversational Agent With a specific stance.}},
  url = {https://hal.telecom-paris.fr/hal-02287401},
  year = {2016}
}
```
In the field of Embodied Conversational Agents (ECAs), one of the main challenges is to generate socially believable agents. The long-run objective of the present study is to infer rules for the multimodal generation of agents’ socio-emotional behaviour. In this paper, we introduce the Social Multimodal Association Rules with Timing (SMART) algorithm. It proposes to learn the rules from the analysis of a multimodal corpus composed of audio-video recordings of human-human interactions. The proposed methodology consists in applying a Sequence Mining algorithm using automatically extracted Social Signals such as prosody, head movements and facial muscle activation as an input. This allows us to infer Temporal Association Rules for the behaviour generation. We show that this method can automatically compute Temporal Association Rules coherent with prior results found in the literature, especially in the psychology and sociology fields. The results of a perceptual evaluation confirm the ability of an agent based on Temporal Association Rules to express a specific stance.
Downbeat Detection with Conditional Random Fields and Deep Learned Features
Simon Durand, Slim Essid
International Society for Music Information Retrieval (ISMIR), New York City, United States, August 2016.
```
@inproceedings{durand:hal-02288480,
  address = {New York City, United States},
  author = {Durand, Simon and Essid, Slim},
  booktitle = {{International Society for Music Information Retrieval (ISMIR)}},
  hal_id = {hal-02288480},
  hal_local_reference = {SD:ISMIR-16},
  hal_version = {v1},
  month = aug,
  pages = {386-392},
  title = {{Downbeat Detection with Conditional Random Fields and Deep Learned Features}},
  url = {https://hal.telecom-paris.fr/hal-02288480},
  year = {2016}
}
```
In this paper, we introduce a novel Conditional Random Field (CRF) system that detects the downbeat sequence of musical audio signals. Feature functions are computed from four deep learned representations based on harmony, rhythm, melody and bass content to take advantage of the high-level and multi-faceted aspect of this task. Downbeats being dynamic, the powerful CRF classification system allows us to combine our features with an adapted temporal model in a fully data-driven fashion. Some meters being under-represented in our training set, we show that data augmentation enables a statistically significant improvement of the results by taking into account class imbalance. An evaluation of different configurations of our system on nine datasets shows its efficiency and potential over a heuristic-based approach and four downbeat tracking algorithms.

Research on Nonnegative Matrix Factorisation at Telecom ParisTech
Slim Essid
Spotify Research Seminar, New York, United States, August 2016.

@inproceedings{essid:hal-02288525,
  address = {New York, United States},
  author = {Essid, Slim},
  booktitle = {{Spotify Research Seminar}},
  hal_id = {hal-02288525},
  hal_local_reference = {SE:Spotify16},
  hal_version = {v1},
  month = aug,
  title = {{Research on Nonnegative Matrix Factorisation at Telecom ParisTech}},
  url = {https://hal.telecom-paris.fr/hal-02288525},
  year = {2016}
}

Analyse et reconnaissance multimodale de signaux sociaux : application à la synthèse d’attitudes sociales d’un agent conversationnel animé
Thomas Janssoone, Chloé Clavel, Kévin Bailly, Gael Richard
WACAI, Brest, France, June 2016.

@inproceedings{janssoone:hal-02287370,
  address = {Brest, France},
  author = {Janssoone, Thomas and Clavel, Chlo{\'e} and Bailly, K{\'e}vin and Richard, Gael},
  booktitle = {{WACAI}},
  hal_id = {hal-02287370},
  hal_local_reference = {TJ:WACAI-16},
  hal_version = {v1},
  month = jun,
  title = {{Analyse et reconnaissance multimodale de signaux sociaux : application {\`a} la synth{\`e}se d'attitudes sociales d'un agent conversationnel anim{\'e}}},
  url = {https://hal.telecom-paris.fr/hal-02287370},
  year = {2016}
}

Acoustic scene classification with matrix factorization for unsupervised feature learning
Victor Bisot, Romain Serizel, Slim Essid, Gael Richard
ICASSP, Shanghai, China, March 2016.

@inproceedings{bisot:hal-02287267,
  address = {Shanghai, China},
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  booktitle = {{ICASSP}},
  hal_id = {hal-02287267},
  hal_local_reference = {bisot2016unsupervised},
  hal_version = {v1},
  keywords = {Acoustic scene classification ; unsupervised feature learning ; matrix factorization},
  month = mar,
  title = {{Acoustic scene classification with matrix factorization for unsupervised feature learning}},
  url = {https://hal.telecom-paris.fr/hal-02287267},
  year = {2016}
}

Formant shifting for speech Intelligibility improvement in car noise environment
Karan Nathwani, Morgane Daniel, Gael Richard, Bertrand David, Vincent Roussarie
ICASSP, Shanghai, China, March 2016.

@inproceedings{nathwani:hal-02287313,
  address = {Shanghai, China},
  author = {Nathwani, Karan and Daniel, Morgane and Richard, Gael and David, Bertrand and Roussarie, Vincent},
  booktitle = {{ICASSP}},
  hal_id = {hal-02287313},
  hal_local_reference = {KN:Icassp16},
  hal_version = {v1},
  month = mar,
  title = {{Formant shifting for speech Intelligibility improvement in car noise environment}},
  url = {https://hal.telecom-paris.fr/hal-02287313},
  year = {2016}
}

Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification
Romain Serizel, Slim Essid, Gael Richard
ICASSP, Shanghai, China, March 2016.
```
@inproceedings{serizel:hal-02288453,
  address = {Shanghai, China},
  author = {Serizel, Romain and Essid, Slim and Richard, Gael},
  booktitle = {{ICASSP}},
  hal_id = {hal-02288453},
  hal_local_reference = {Serizel2016a},
  hal_version = {v1},
  keywords = {Nonnegative matrix factorisation ; spectrogram factorisation ; feature learning ; speaker variability ; speaker identification},
  month = mar,
  pages = {5470-5474},
  title = {{Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification}},
  url = {https://hal.telecom-paris.fr/hal-02288453},
  year = {2016}
}
```
This paper presents a feature learning approach for speaker identification that is based on nonnegative matrix factorisation. Recent studies have shown that with such models, the dictionary atoms can represent the speaker identity well. The approaches proposed so far focused only on speaker variability and not on session variability. However, this latter point is a crucial aspect in the success of the I-vector approach that is now the state-of-the-art in speaker identification.

This paper proposes a method that relies on group nonnegative matrix factorisation and that is inspired by the I-vector training procedure. By doing so, the proposed approach intends to capture both the speaker variability and the session variability. Results on a small corpus show that the proposed approach can be competitive with I-vectors.
Blind estimation of room acoustic parameters using kernel regression
Arthur Belhomme, Yves Grenier, Roland Badeau, Eric Humbert
AES 60th Conference, Leuven, Belgium, February 2016.
```
@inproceedings{belhomme:hal-01248010,
  address = {Leuven, Belgium},
  author = {Belhomme, Arthur and Grenier, Yves and Badeau, Roland and Humbert, Eric},
  booktitle = {{AES 60th Conference}},
  hal_id = {hal-01248010},
  hal_version = {v1},
  month = feb,
  pdf = {https://hal.archives-ouvertes.fr/hal-01248010/file/paper_abe_aes60.pdf},
  title = {{Blind estimation of room acoustic parameters using kernel regression}},
  url = {https://hal.archives-ouvertes.fr/hal-01248010},
  year = {2016}
}
```
Room acoustic parameters are key information for dereverberation or speech recognition. Usually, when one needs to assess the level of reverberation, only the reverberation time RT60 or a direct-to-reverberant sound index Dτ is estimated. Yet, methods which blindly estimate the reverberation time from reverberant recorded speech do not always differentiate the RT60 from the Dτ to evaluate the level of reverberation. That is why we propose a method to blindly and jointly estimate these parameters, from the signal energy decay rate distribution, by means of kernel regression. Evaluation is carried out with real and simulated room impulse responses to generate noise-free reverberant speech signals. The results show this new method outperforms baseline approaches in our evaluation.
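
Kernel regression itself can be sketched in a few lines. The example below uses a Nadaraya-Watson estimator with a Gaussian kernel, which is one common form of kernel regression and only an assumption here: the abstract does not say which estimator, kernel, or features the authors use. The scalar feature and the two-valued targets (standing in for a pair such as RT60 and Dτ) are toy numbers, not measurements.

```python
import math


def kernel_regression(query, features, targets, bandwidth):
    """Nadaraya-Watson estimate of several targets at once from one scalar feature."""
    # Gaussian kernel weights: training points close to the query count more.
    weights = [math.exp(-0.5 * ((query - f) / bandwidth) ** 2) for f in features]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from every training feature for this bandwidth")
    n_out = len(targets[0])
    # The same weights are shared by all outputs, so the targets are estimated jointly.
    return tuple(sum(w * t[k] for w, t in zip(weights, targets)) / total
                 for k in range(n_out))


if __name__ == "__main__":
    features = [-8.0, -6.0, -4.0]                     # hypothetical decay-rate statistic
    targets = [(0.3, 0.8), (0.5, 0.6), (0.7, 0.4)]    # hypothetical parameter pairs
    print(kernel_regression(-6.0, features, targets, bandwidth=1.0))
```

The bandwidth controls the smoothness of the estimate: a small value follows the nearest training points, a large value averages over all of them.
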

Technical Reports

An iterative algorithm for recovering the phase of complex components from their mixture
Paul Magron, Roland Badeau, Bertrand David
June 2016.
```
@techreport{magron:hal-01325625,
  author = {Magron, Paul and Badeau, Roland and David, Bertrand},
  hal_id = {hal-01325625},
  hal_version = {v5},
  institution = {{T{\'e}l{\'e}com ParisTech}},
  keywords = {linear unwrapping ; source separation ; auxiliary function method ; Phase reconstruction ; s{\'e}paration de sources ; Reconstruction de phase ; m{\'e}thode de la fonction auxiliaire ; d{\'e}roul{\'e} lin{\'e}aire},
  month = jun,
  pdf = {https://hal.archives-ouvertes.fr/hal-01325625v5/file/source_sep_algo.pdf},
  title = {{An iterative algorithm for recovering the phase of complex components from their mixture}},
  type = {Research Report},
  url = {https://hal.archives-ouvertes.fr/hal-01325625},
  year = {2016}
}
```
This report addresses the problem of estimating complex components from their mixture in the Time-Frequency (TF) domain. Traditional techniques, which consist in non-iteratively optimizing a cost function measuring the difference between the mixture and the model, do not lead to satisfactory-sounding results. Thus, we propose to optimize this cost function by means of an iterative algorithm, which allows us to incorporate some prior phase information in the procedure. We provide a mathematical proof that the error function is non-increasing under the update rules of this algorithm. In addition, we show that the algorithm must be carefully initialized to avoid getting stuck in a local minimum and to output satisfying results.

2015

Conference Articles

MELODY EXTRACTION BY CONTOUR CLASSIFICATION
Rachel M Bittner, Justin Salamon, Slim Essid, Juan P Bello
International Conference on Music Information Retrieval (ISMIR), Malaga, Spain, September 2015.
```
@inproceedings{bittner:hal-02943532,
  address = {Malaga, Spain},
  author = {Bittner, Rachel M and Salamon, Justin and Essid, Slim and Bello, Juan P},
  booktitle = {{International Conference on Music Information Retrieval (ISMIR)}},
  hal_id = {hal-02943532},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-02943532/file/RB_ISMIR-15.pdf},
  title = {{MELODY EXTRACTION BY CONTOUR CLASSIFICATION}},
  url = {https://hal.archives-ouvertes.fr/hal-02943532},
  year = {2015}
}
```
Due to the scarcity of labeled data, most melody extraction algorithms do not rely on fully data-driven processing blocks but rather on careful engineering. For example, the Melodia melody extraction algorithm employs a pitch contour selection stage that relies on a number of heuristics for selecting the melodic output. In this paper we explore the use of a discriminative model to perform purely data-driven melodic contour selection. Specifically, a discriminative binary classifier is trained to distinguish melodic from non-melodic contours. This classifier is then used to predict likelihoods for a track’s extracted contours, and these scores are decoded to generate a single melody output. The results are compared with the Melodia algorithm and with a generative model used in a previous study. We show that the discriminative model outperforms the generative model in terms of contour classification accuracy, and the melody output from our proposed system performs comparably to Melodia. The results are complemented with error analysis and avenues for future improvements.

Multipitch estimation using a PLCA-based model: Impact of partial user annotation
Camila Andrade Scatolini, Gael Richard, Benoît Fuentes
ICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, April 2015.

@inproceedings{deandradescatolini:hal-02713516,
  address = {South Brisbane, Australia},
  author = {de Andrade Scatolini, Camila and Richard, Gael and Fuentes, Beno{\^i}t},
  booktitle = {{ICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2015.7177957},
  hal_id = {hal-02713516},
  hal_version = {v1},
  month = apr,
  pages = {186-190},
  publisher = {{IEEE}},
  title = {{Multipitch estimation using a PLCA-based model: Impact of partial user annotation}},
  url = {https://hal.inria.fr/hal-02713516},
  year = {2015}
}

A conditional random field system for beat tracking
Thomas Fillon, C. Joder, Simon Durand, Slim Essid
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, April 2015.

@inproceedings{fillon:hal-02288433,
  address = {Brisbane, Australia},
  author = {Fillon, Thomas and Joder, C. and Durand, Simon and Essid, Slim},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-02288433},
  hal_local_reference = {TF:ICASSP-15},
  hal_version = {v1},
  month = apr,
  title = {{A conditional random field system for beat tracking}},
  url = {https://hal.telecom-paris.fr/hal-02288433},
  year = {2015}
}

Nonnegative matrix Factorisation for Audiovisual Document Analysis
Slim Essid
Seminaire Traitement du Langage Parle, LIMSI, Orsay, France, 2015.

@inproceedings{essid:hal-02287882,
  address = {Orsay, France},
  author = {Essid, Slim},
  booktitle = {{Seminaire Traitement du Langage Parle, LIMSI}},
  hal_id = {hal-02287882},
  hal_local_reference = {SE:LIMSI15},
  hal_version = {v1},
  title = {{Nonnegative matrix Factorisation for Audiovisual Document Analysis}},
  url = {https://hal.telecom-paris.fr/hal-02287882},
  year = {2015}
}

Technical Reports

Phase reconstruction of spectrograms with linear unwrapping : application to audio signal restoration
Paul Magron, Roland Badeau, Bertrand David
April 2015.

@techreport{magron:hal-02287339,
  author = {Magron, Paul and Badeau, Roland and David, Bertrand},
  hal_id = {hal-02287339},
  hal_version = {v1},
  institution = {{T{\'e}l{\'e}com ParisTech}},
  keywords = {Phase reconstruction ; sinusoidal modeling ; linear unwrapping ; phase consistency ; audio restoration},
  month = apr,
  number = {2015D002},
  title = {{Phase reconstruction of spectrograms with linear unwrapping : application to audio signal restoration}},
  type = {Research Report},
  url = {https://hal.telecom-paris.fr/hal-02287339},
  year = {2015}
}

Journal Articles

TPT-Dance&Actions : un corpus multimodal d’activités humaines
Aymeric Masurelle, Ahmed Rida Sekkat, Slim Essid, Gael Richard
Revue Traitement du Signal (Presse universitaire de Grenoble), April 2015.

@article{masurelle:hal-02704820,
  author = {Masurelle, Aymeric and Sekkat, Ahmed Rida and Essid, Slim and Richard, Gael},
  doi = {10.3166/TS.32.443-475},
  hal_id = {hal-02704820},
  hal_version = {v1},
  journal = {{Revue Traitement du Signal (Presse universitaire de Grenoble)}},
  month = apr,
  title = {{TPT-Dance\&Actions : un corpus multimodal d'activit{\'e}s humaines}},
  url = {https://hal.inria.fr/hal-02704820},
  year = {2015}
}

Patents

Procédé de suppression de la réverbération tardive d’un signal sonore
Nicolás López, Yves Grenier, Gael Richard
France, January 2015.

@patent{lopez:hal-02412540,
  address = {France},
  author = {L{\'o}pez, Nicol{\'a}s and Grenier, Yves and Richard, Gael},
  hal_id = {hal-02412540},
  hal_local_reference = {NL-Brevet-2015},
  hal_version = {v1},
  month = jan,
  number = {WO2015011078},
  title = {{Proc{\'e}d{\'e} de suppression de la r{\'e}verb{\'e}ration tardive d'un signal sonore}},
  url = {https://hal.telecom-paris.fr/hal-02412540},
  year = {2015}
}

2010 - 2014 [104 publications]

2014

Journal Articles

Multichannel high resolution NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain
Roland Badeau, Mark D. Plumbley
IEEE Transactions on Audio, Speech and Language Processing, November 2014.
```
@article{badeau:hal-01061578,
  author = {Badeau, Roland and Plumbley, Mark D.},
  hal_id = {hal-01061578},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  keywords = {Non-stationary signal modelling ; Time-frequency analysis ; Nonnegative matrix factorisation ; Multichannel signal analysis ; Variational EM algorithm},
  month = nov,
  number = {11},
  pages = {1670-1680},
  pdf = {https://hal.archives-ouvertes.fr/hal-01061578/file/Badeau-Plumbley-HRNMF.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Multichannel high resolution NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain}},
  url = {https://hal.archives-ouvertes.fr/hal-01061578},
  volume = {22},
  year = {2014}
}
```
Several probabilistic models involving latent components have been proposed for modelling time-frequency (TF) representations of audio signals such as spectrograms, notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. The new model can represent a variety of stationary and non-stationary signals, including autoregressive moving average (ARMA) processes and mixtures of damped sinusoids. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to piano signals, and proves capable of accurately modelling reverberation, restoring missing observations, and separating pure tones with close frequencies.

Conference Articles

Romeo2 Project: Humanoid Robot Assistant and Companion for Everyday Life: I. Situation Assessment for Social Intelligence
Amit Kumar Pandey, Rodolphe Gelin, Rachid Alami, Renaud Viry, Axel Buendia, Roland Meertens, Mohamed Chetouani, Laurence Devillers, Marie Tahon, David Filliat, Yves Grenier, Mounira Maazaoui, Abderrahmane Kheddar, Frédéric Lerasle, Laurent Fitte-Duval
AIC: Artificial Intelligence and Cognition, Torino, Italy, November 2014.
```
@inproceedings{kumarpandey:hal-01096094,
  address = {Torino, Italy},
  author = {Kumar Pandey, Amit and Gelin, Rodolphe and Alami, Rachid and Viry, Renaud and Buendia, Axel and Meertens, Roland and Chetouani, Mohamed and Devillers, Laurence and Tahon, Marie and Filliat, David and Grenier, Yves and Maazaoui, Mounira and Kheddar, Abderrahmane and Lerasle, Fr{\'e}d{\'e}ric and Fitte-Duval, Laurent},
  booktitle = {{AIC: Artificial Intelligence and Cognition}},
  hal_id = {hal-01096094},
  hal_local_reference = {Rapport LAAS n{\textdegree} 14665},
  hal_version = {v1},
  keywords = {Robot Companion ; Human Robot Interaction ; Socially Intelligent Robot ; Situation Assessment},
  month = nov,
  pages = {140-147},
  pdf = {https://hal.archives-ouvertes.fr/hal-01096094/file/Romeo2_perception_AIC_2014_Camera_Ready_Final.pdf},
  publisher = {{CEUR Workshop Proceedings (CEUR-WS.org)}},
  title = {{Romeo2 Project: Humanoid Robot Assistant and Companion for Everyday Life: I. Situation Assessment for Social Intelligence}},
  url = {https://hal.archives-ouvertes.fr/hal-01096094},
  volume = {1315},
  year = {2014}
}
```
For a socially intelligent robot, different levels of situation assessment are required, ranging from basic processing of sensor input to high-level analysis of semantics and intention. However, the attempt to combine them all prompts new research challenges and the need for a coherent framework and architecture. This paper presents the situation assessment aspect of Romeo2, a unique project aiming to bring multi-modal and multi-layered perception on a single system and targeting a unified theoretical and functional framework for a robot companion for everyday life. It also discusses some of the innovation potentials, which the combination of these various perception abilities adds into the robot’s socio-cognitive capabilities.

Template adaptation for improving automatic music transcription
Emmanouil Benetos, Roland Badeau, Tillman Weyde, Gael Richard
ISMIR 2014 The 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, October 2014.
```
@inproceedings{benetos:hal-01083552,
  address = {Taipei, Taiwan},
  author = {Benetos, Emmanouil and Badeau, Roland and Weyde, Tillman and Richard, Gael},
  booktitle = {{ISMIR 2014 The 15th International Society for Music Information Retrieval Conference}},
  hal_id = {hal-01083552},
  hal_version = {v1},
  month = oct,
  pages = {6},
  pdf = {https://hal.archives-ouvertes.fr/hal-01083552/file/ISMIR14_template_adaptation.pdf},
  title = {{Template adaptation for improving automatic music transcription}},
  url = {https://hal.archives-ouvertes.fr/hal-01083552},
  year = {2014}
}
```
In this work, we propose a system for automatic music transcription which adapts dictionary templates so that they closely match the spectral shape of the instrument sources present in each recording. Current dictionary-based automatic transcription systems keep the input dictionary fixed, thus the spectral shape of the dictionary components might not match the shape of the test instrument sources. By performing a conservative transcription pre-processing step, the spectral shape of detected notes can be extracted and utilized in order to adapt the template dictionary. We propose two variants for adaptive transcription, namely for single-instrument transcription and for multiple-instrument transcription. Experiments are carried out using the MAPS and Bach10 databases. Results in terms of multi-pitch detection and instrument assignment show that there is a clear and consistent improvement when adapting the dictionary in contrast with keeping the dictionary fixed.

Controlling the Convergence Rate to Help Parameter Estimation in a PLCA-based Model
Benoît Fuentes, Roland Badeau, Gael Richard
EUSIPCO, Lisbon, Portugal, September 2014.
```
@inproceedings{fuentes:hal-01061572,
  address = {Lisbon, Portugal},
  author = {Fuentes, Beno{\^i}t and Badeau, Roland and Richard, Gael},
  booktitle = {{EUSIPCO}},
  hal_id = {hal-01061572},
  hal_version = {v1},
  keywords = {PLCA ; NMF ; EM algorithm ; multipitch estimation},
  month = sep,
  pages = {5 pages},
  pdf = {https://hal.archives-ouvertes.fr/hal-01061572/file/Fuentes-EUSIPCO-2014.pdf},
  title = {{Controlling the Convergence Rate to Help Parameter Estimation in a PLCA-based Model}},
  url = {https://hal.archives-ouvertes.fr/hal-01061572},
  year = {2014}
}
```
Probabilistic Latent Component Analysis (PLCA) is a tool similar to Non-negative Matrix Factorization (NMF), which is used to model non-negative data such as non-negative time-frequency representations of audio. In this paper, we put forward a trick to help the corresponding parameter estimation algorithm to converge toward more meaningful solutions, based on the new concept of brakes. The idea is to control the convergence rate of the parameters of a PLCA-based model within the estimation algorithm: the parameters which are known to be properly initialized are braked in order to stay close to their initial values, whereas the other ones keep a regular convergence rate. This is an effective way to better account for a relevant initialization. In this paper, these brakes are implemented in the framework of PLCA, and they are tested in an application of multipitch estimation. Results show that the use of brakes can significantly influence the decomposition and thus the performance, making them a powerful tool to boost any kind of PLCA-based algorithm.

A tutorial on Nonnegative Matrix Factorisation with applications to audiovisual content analysis
Slim Essid, Alexey Ozerov
Tutorial at ICME 2014, Chengdu, China, July 2014.

@inproceedings{essid:hal-02287869,
  address = {Chengdu, China},
  author = {Essid, Slim and Ozerov, Alexey},
  booktitle = {{Tutorial at ICME 2014}},
  hal_id = {hal-02287869},
  hal_local_reference = {SE:ICME14},
  hal_version = {v1},
  month = jul,
  title = {{A tutorial on Nonnegative Matrix Factorisation with applications to audiovisual content analysis}},
  url = {https://hal.telecom-paris.fr/hal-02287869},
  year = {2014}
}

Assessment of new spectral features for EEG-based emotion recognition
Anne-Claire Conneau, Slim Essid
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, May 2014.

@inproceedings{conneau:hal-02287334,
  address = {Florence, Italy},
  author = {Conneau, Anne-Claire and Essid, Slim},
  booktitle = {{International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-02287334},
  hal_local_reference = {ACC:ICASSP-14},
  hal_version = {v1},
  month = may,
  title = {{Assessment of new spectral features for EEG-based emotion recognition}},
  url = {https://hal.telecom-paris.fr/hal-02287334},
  year = {2014}
}

Enhancing downbeat detection when facing different music styles
Simon Durand, Bertrand David, Gael Richard
ICASSP, Florence, Italy, May 2014.
```
@inproceedings{durand:hal-02286904,
  address = {Florence, Italy},
  author = {Durand, Simon and David, Bertrand and Richard, Gael},
  booktitle = {{ICASSP}},
  hal_id = {hal-02286904},
  hal_local_reference = {SD:ICASSP-14},
  hal_version = {v1},
  keywords = {Downbeat tracking ; Music information retrieval ; Music signal processing},
  month = may,
  pages = {3152-3156},
  title = {{Enhancing downbeat detection when facing different music styles}},
  url = {https://hal.telecom-paris.fr/hal-02286904},
  year = {2014}
}
```
This paper focuses on the automatic rhythm analysis of musical audio at the bar level. We propose a novel approach for robust downbeat detection. It uses well-chosen complementary features, inspired by musical considerations. In particular, a note accentuation model and a detection of pattern changes are introduced. We estimate the time signature by examining the similarity of frames at the beat level. The features are selected through a linear SVM model or a weighted sum. The whole system is evaluated on five different datasets of various musical styles and shows improvement over the state of the art.

Towards complex matrix decomposition of spectrograms based on the relative phase offsets of harmonic sounds
Holger Kirchhoff, Roland Badeau, Simon Dixon
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014.
```
@inproceedings{kirchhoff:hal-00945295,
  address = {Florence, Italy},
  author = {Kirchhoff, Holger and Badeau, Roland and Dixon, Simon},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945295},
  hal_version = {v1},
  keywords = {nonnegative matrix factorisation ; Harmonic signals ; relative phase offsets of partials ; complex matrix decomposition},
  month = may,
  pages = {1591-1595},
  pdf = {https://hal.inria.fr/hal-00945295/file/Kirchhoff-ICASSP2014.pdf},
  publisher = {{IEEE}},
  title = {{Towards complex matrix decomposition of spectrograms based on the relative phase offsets of harmonic sounds}},
  url = {https://hal.inria.fr/hal-00945295},
  year = {2014}
}
```
In this paper we study the relative phase offsets between partials in the sustained part of harmonic sounds and investigate their suitability for complex matrix decomposition of spectrograms. We formally introduce this property in a sinusoidal model and visualise the phase relations of a musical instrument. A model of complex matrix decomposition in the time-frequency domain is derived and equations for the estimation of the model parameters are provided in the monophonic case. We illustrate the model with the analysis of a monophonic saxophone signal. The results suggest that the phase offset is able to capture inherent time-invariant phase properties of harmonic sounds and outline its potential use for complex matrix decomposition.

Single Channel Reverberation Suppression Based on Sparse Linear Prediction
Nicolás López, Yves Grenier, Gael Richard, Ivan Bourmeyster
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, May 2014.
```
@inproceedings{lopez:hal-02286852,
  address = {Florence, Italy},
  author = {L{\'o}pez, Nicol{\'a}s and Grenier, Yves and Richard, Gael and Bourmeyster, Ivan},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-02286852},
  hal_local_reference = {NL:ICASSP-14},
  hal_version = {v1},
  keywords = {Single Channel Speech Enhancement ; Late Reverberation Estimation ; Lasso ; Sparse Linear Prediction},
  month = may,
  publisher = {{IEEE}},
  title = {{Single Channel Reverberation Suppression Based on Sparse Linear Prediction}},
  url = {https://hal.telecom-paris.fr/hal-02286852},
  year = {2014}
}
```
Reverberation degrades speech intelligibility in telecommunications and increases the word error rate in automatic speech recognition tasks. Several dereverberation methods have been proposed recently in order to counter these effects. In the single microphone case, the dereverberation problem is underdetermined and reverberation suppression approaches are preferred. In this paper we propose a novel method for single channel reverberation suppression. Late reverberation is estimated in the time-frequency domain as a sparse linear combination of previous frames. The predictors associated with the model are determined in a Lasso framework and a spectral subtraction filter is designed to produce the enhanced signal. This model does not require any additional information about the room acoustics and it is well suited for real-time applications. The method has state-of-the-art performance in terms of both reverberation suppression and spectral distortion.

Piecewise constant nonnegative matrix factorization
N. Seichepine, Slim Essid, C. Fevotte, O. Cappe
ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014.

@inproceedings{seichepine:hal-02943536,
  address = {Florence, Italy},
  author = {Seichepine, N. and Essid, Slim and Fevotte, C. and Cappe, O.},
  booktitle = {{ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2014.6854901},
  hal_id = {hal-02943536},
  hal_version = {v1},
  month = may,
  pages = {6721-6725},
  publisher = {{IEEE}},
  title = {{Piecewise constant nonnegative matrix factorization}},
  url = {https://hal.telecom-paris.fr/hal-02943536},
  year = {2014}
}

Single Channel Reverberation Suppression Based on Sparse Linear Prediction
Nicolas López, Yves Grenier, Gaël Richard, Ivan Bourmeyster
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, May 2014.
```
@inproceedings{NL:ICASSP-14,
  author = {L{\'o}pez, Nicolas and Grenier, Yves and Richard, Ga{\"e}l and Bourmeyster, Ivan},
  title = {Single Channel Reverberation Suppression Based on Sparse Linear Prediction},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  publisher = {IEEE},
  address = {Florence, Italy},
  year = {2014},
  month = may,
  keywords = {Single Channel Speech Enhancement; Late Reverberation Estimation; Lasso; Sparse Linear Prediction}
}
```
Reverberation degrades speech intelligibility in telecommunications and increases the word error rate in automatic speech recognition tasks. Several dereverberation methods have been proposed recently in order to counter these effects. In the single microphone case, the dereverberation problem is underdetermined and reverberation suppression approaches are preferred. In this paper we propose a novel method for single channel reverberation suppression. Late reverberation is estimated in the time-frequency domain as a sparse linear combination of previous frames. The predictors associated with the model are determined in a Lasso framework and a spectral subtraction filter is designed to produce the enhanced signal. This model does not require any additional information about the room acoustics and it is well suited for real-time applications. The method has state-of-the-art performance in terms of both reverberation suppression and spectral distortion.

Informed Audio source Separation
Gael Richard
AES International Conference on Semantic Audio, London, United Kingdom, 2014.

@inproceedings{richard:hal-02713605,
  address = {London, United Kingdom},
  author = {Richard, Gael},
  booktitle = {{AES International Conference on Semantic Audio}},
  hal_id = {hal-02713605},
  hal_version = {v1},
  title = {{Informed Audio source Separation}},
  url = {https://hal.inria.fr/hal-02713605},
  year = {2014}
}

Gesture recognition using a NMF-based representation of motion-traces extracted from depth silhouettes
A. Masurelle, S. Essid, G. Richard
Proceedings of conference on Acoustics, Speech, and Signal Processing, 2014.

@inproceedings{Masurelle14,
  title = {Gesture recognition using a NMF-based representation of motion-traces extracted from depth silhouettes},
  author = {Masurelle, A. and Essid, S. and Richard, G.},
  booktitle = {Proceedings of conference on Acoustics, Speech, and Signal Processing},
  year = {2014},
  series = {ICASSP'14},
  pages = {1275--1279}
}

Technical Reports

Proof of Wiener-like linear regression of isotropic complex symmetric alpha-stable random variables
Roland Badeau, Antoine Liutkus
September 2014.
```
@techreport{badeau:hal-01069612,
  author = {Badeau, Roland and Liutkus, Antoine},
  hal_id = {hal-01069612},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-01069612/file/proof_alpha_wiener.pdf},
  title = {{Proof of Wiener-like linear regression of isotropic complex symmetric alpha-stable random variables}},
  url = {https://hal.archives-ouvertes.fr/hal-01069612},
  year = {2014}
}
```
This document features supplementary materials to the reference paper [1]. It provides the proof of equation (8) in [1]. This proof concerns a particular regression property of complex isotropic symmetric alpha-stable random variables (see [2]). In [1], this property is shown paramount in building efficient filters for separating symmetric alpha-stable processes. Such processes exhibit very large dynamic ranges while being locally stationary, and have been shown appropriate for audio modeling.

Scale-invariant probabilistic latent component analysis
Romain Hennequin, Bertrand David, Roland Badeau
March 2014. Rapport inte....

@techreport{hennequin:hal-00960765,
  author = {Hennequin, Romain and David, Bertrand and Badeau, Roland},
  hal_id = {hal-00960765},
  hal_version = {v1},
  keywords = {shift-invariant decomposition ; Non-negative decomposition ; Nonnegative matrix factorization ; probabilistic latent component analysis},
  month = mar,
  note = {Rapport interne T{\'e}l{\'e}com ParisTech n{\textdegree}2011D003},
  pdf = {https://hal.archives-ouvertes.fr/hal-00960765/file/publication-227.pdf},
  title = {{Scale-invariant probabilistic latent component analysis}},
  url = {https://hal.archives-ouvertes.fr/hal-00960765},
  year = {2014}
}

2013

Conference Articles

Multimodal Signal Analysis at Telecom ParisTech
Slim Essid
Seminaire scientifique de Technicolor R&D, Rennes, France, December 2013.

@inproceedings{essid:hal-02288526,
  address = {Rennes, France},
  author = {Essid, Slim},
  booktitle = {{Seminaire scientifique de Technicolor R\&D}},
  hal_id = {hal-02288526},
  hal_local_reference = {SE:Technicolor13},
  hal_version = {v1},
  month = dec,
  title = {{Multimodal Signal Analysis at Telecom ParisTech}},
  url = {https://hal.telecom-paris.fr/hal-02288526},
  year = {2013}
}

An Extended Audio-Fingerprint Method with Capabilities for Similar Music Detection
Sébastien Fenet, Yves Grenier, Gael Richard
ISMIR, Curitiba, Brazil, November 2013.

@inproceedings{fenet:hal-02286820,
  address = {Curitiba, Brazil},
  author = {Fenet, S{\'e}bastien and Grenier, Yves and Richard, Gael},
  booktitle = {{ISMIR}},
  hal_id = {hal-02286820},
  hal_local_reference = {SF:ISMIR-13},
  hal_version = {v1},
  month = nov,
  pages = {569-574},
  title = {{An Extended Audio-Fingerprint Method with Capabilities for Similar Music Detection}},
  url = {https://hal.telecom-paris.fr/hal-02286820},
  year = {2013}
}

Nonnegative Tensor Factorization for Single-Channel EEG Artifact Rejection
Cécilia Damon, Antoine Liutkus, Alexandre Gramfort, Slim Essid
IEEE International Workshop on Machine Learning for Signal Processing, Southampton, United Kingdom, September 2013.

@inproceedings{damon:hal-02288386,
  address = {Southampton, United Kingdom},
  author = {Damon, C{\'e}cilia and Liutkus, Antoine and Gramfort, Alexandre and Essid, Slim},
  booktitle = {{IEEE International Workshop on Machine Learning for Signal Processing}},
  hal_id = {hal-02288386},
  hal_local_reference = {CD:MLSP-13},
  hal_version = {v1},
  keywords = {EEG ; NTF ; NMF},
  month = sep,
  title = {{Nonnegative Tensor Factorization for Single-Channel EEG Artifact Rejection}},
  url = {https://hal.telecom-paris.fr/hal-02288386},
  year = {2013}
}

Does dereverberation help multichannel blind source separation? A study case
Nicolás López, Mounira Maazaoui, Yves Grenier, Gael Richard, Ivan Bourmeyster
European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, September 2013.
```
@inproceedings{lopez:hal-02286737,
  address = {Marrakech, Morocco},
  author = {L{\'o}pez, Nicol{\'a}s and Maazaoui, Mounira and Grenier, Yves and Richard, Gael and Bourmeyster, Ivan},
  booktitle = {{European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-02286737},
  hal_local_reference = {NL:EUSIPCO-13},
  hal_version = {v1},
  keywords = {Blind source separation ; speech dereverberation ; spectral subtraction ; microphone array.},
  month = sep,
  title = {{Does dereverberation help multichannel blind source separation? A study case}},
  url = {https://hal.telecom-paris.fr/hal-02286737},
  year = {2013}
}
```
Multichannel blind source separation performance degrades rapidly when the mixtures are highly reverberated. In fact, blind source separation algorithms usually focus on the separation task without dealing with the dereverberation problem. Some recent studies attempted to reduce the reverberation by introducing a dereverberation module before or after the blind source separation but only limited success was obtained in improving the separation performance in highly reverberant rooms. In this article, we conduct a number of experiments combining state-of-the-art spectral enhancement-based dereverberation and source separation algorithms showing that, in this particular case, speech enhancement does not improve the performance of blind source separation.

Co-factorisation douce en matrices non-négatives. Application au regroupement multimodal de locuteurs
Nicolas Seichepine, Slim Essid, Cédric Févotte, Olivier Cappé
GRETSI, Brest, France, September 2013.
```
@inproceedings{seichepine:hal-02286798,
  address = {Brest, France},
  author = {Seichepine, Nicolas and Essid, Slim and F{\'e}votte, C{\'e}dric and Capp{\'e}, Olivier},
  booktitle = {{GRETSI}},
  hal_id = {hal-02286798},
  hal_local_reference = {gretsi-13},
  hal_version = {v1},
  keywords = {NMF ; multimodal},
  month = sep,
  title = {{Co-factorisation douce en matrices non-n{\'e}gatives. Application au regroupement multimodal de locuteurs}},
  url = {https://hal.telecom-paris.fr/hal-02286798},
  year = {2013}
}
```
We present a new method for bimodal co-factorization into nonnegative matrices. The method is suited to situations where two modalities are linked by the same underlying information. It allows a so-called soft co-factorization, which takes into account the relationship between the modalities while avoiding the strong assumption of a single shared factor. The method does not require the data associated with each modality to have the same dimension or to be expressed in the same space. The co-factorization is obtained by solving an optimization problem with a majorization-minimization method; an application to multimodal speaker clustering is then presented, where soft co-factorization is used to exploit the correlation between the audio and video tracks in televised debates.

Probabilistic dance performance alignment by fusion of multimodal features
Angelique Dremeau, Slim Essid
IEEE Int’l Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013.

@inproceedings{dremeau:hal-02288353,
  address = {Vancouver, Canada},
  author = {Dremeau, Angelique and Essid, Slim},
  booktitle = {{IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-02288353},
  hal_local_reference = {Dremeau2013a},
  hal_version = {v1},
  month = may,
  title = {{Probabilistic dance performance alignment by fusion of multimodal features}},
  url = {https://hal.telecom-paris.fr/hal-02288353},
  year = {2013}
}

Soft nonnegative matrix co-factorization with application to multimodal speaker diarization
N. Seichepine, Slim Essid, C. Fevotte, O. Cappe
ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013.

@inproceedings{seichepine:hal-02943543,
  address = {Vancouver, Canada},
  author = {Seichepine, N. and Essid, Slim and Fevotte, C. and Cappe, O.},
  booktitle = {{ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2013.6638316},
  hal_id = {hal-02943543},
  hal_version = {v1},
  month = may,
  pages = {3537-3541},
  publisher = {{IEEE}},
  title = {{Soft nonnegative matrix co-factorization with application to multimodal speaker diarization}},
  url = {https://hal.archives-ouvertes.fr/hal-02943543},
  year = {2013}
}

Variational Bayesian EM algorithm for modeling mixtures of non-stationary signals in the time-frequency domain (HR-NMF)
Roland Badeau, Angélique Dremeau
ICASSP, Vancouver, Canada, 2013.
```
@inproceedings{badeau:hal-00945276,
  address = {Vancouver, Canada},
  author = {Badeau, Roland and Dremeau, Ang{\'e}lique},
  booktitle = {{ICASSP}},
  hal_id = {hal-00945276},
  hal_version = {v1},
  keywords = {Variational inference ; Nonnegative Matrix Factorization ; High Resolution methods ; Expectation-Maximization algorithm},
  pages = {6171--6175},
  pdf = {https://hal.inria.fr/hal-00945276/file/badeau-icassp2013.pdf},
  publisher = {{IEEE}},
  title = {{Variational Bayesian EM algorithm for modeling mixtures of non-stationary signals in the time-frequency domain (HR-NMF)}},
  url = {https://hal.inria.fr/hal-00945276},
  year = {2013}
}
```
We recently introduced the high-resolution nonnegative matrix factorization (HR-NMF) model for analyzing mixtures of nonstationary signals in the time-frequency domain, and highlighted its capability to both reach high spectral resolution and reconstruct high quality audio signals. In order to estimate the model parameters and the latent components, we proposed to resort to an expectation-maximization (EM) algorithm based on a Kalman filter/ smoother. The approach proved to be appropriate for modeling audio signals in applications such as source separation and audio inpainting. However, its computational cost is high, dominated by the Kalman filter/smoother, and may be prohibitive when dealing with high-dimensional signals. In this paper, we consider two different alternatives, using the variational Bayesian EM algorithm and two mean-field approximations. We show that, while significantly reducing the complexity of the estimation, these novel approaches do not alter its quality.
Probabilistic Time-Frequency Source-Filter Decomposition of Non-Stationary Signals
Roland Badeau, Mark D. Plumbley
EUSIPCO, Marrakech, Morocco, 2013.
```
@inproceedings{badeau:hal-00945277,
  address = {Marrakech, Morocco},
  author = {Badeau, Roland and Plumbley, Mark D.},
  booktitle = {{EUSIPCO}},
  hal_id = {hal-00945277},
  hal_version = {v1},
  keywords = {Nonnegative matrix factorisation ; Probabilistic modelling ; Non-stationary processes ; Time-frequency analysis ; Source-filter models},
  pdf = {https://hal.inria.fr/hal-00945277/file/Badeau-Plumbley-EUSIPCO-2013.pdf},
  title = {{Probabilistic Time-Frequency Source-Filter Decomposition of Non-Stationary Signals}},
  url = {https://hal.inria.fr/hal-00945277},
  year = {2013}
}
```
Probabilistic modelling of non-stationary signals in the time-frequency (TF) domain has been an active research topic recently. Various models have been proposed, notably in the nonnegative matrix factorization (NMF) literature. In this paper, we propose a new TF probabilistic model that can represent a variety of stationary and non-stationary signals, such as autoregressive moving average (ARMA) processes, uncorrelated noise, damped sinusoids, and transient signals. This model also generalizes and improves both the Itakura-Saito (IS)-NMF and high resolution (HR)-NMF models.
Multichannel HR-NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain
Roland Badeau, Mark D. Plumbley
WASPAA, New Paltz, New York, United States, 2013.
```
@inproceedings{badeau:hal-00945278,
  address = {New Paltz, New York, United States},
  author = {Badeau, Roland and Plumbley, Mark D.},
  booktitle = {{WASPAA}},
  hal_id = {hal-00945278},
  hal_version = {v1},
  keywords = {Variational EM algorithm ; Non-stationary signal modelling ; Time-frequency analysis ; Separation of convolutive mixtures ; Multichannel signal analysis},
  pdf = {https://hal.inria.fr/hal-00945278/file/WASPAA2013.pdf},
  publisher = {{IEEE}},
  title = {{Multichannel HR-NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain}},
  url = {https://hal.inria.fr/hal-00945278},
  year = {2013}
}
```
In the literature, several probabilistic models involving latent components have been proposed for modelling time-frequency (TF) representations of audio signals (such as spectrograms), notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high resolution (HR)-NMF model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to a stereophonic piano signal, and proves capable of accurately modelling reverberation and restoring missing observations.

Fast multilinear SVD for structured tensors and applications to Harmonic analysis and Volterra series
Remy Boyer, Roland Badeau, Gérard Favier
Assemblée Générale du GdR ISIS 2013, France, 2013.

@inproceedings{boyer:hal-01006183,
  address = {France},
  author = {Boyer, Remy and Badeau, Roland and Favier, G{\'e}rard},
  booktitle = {{Assembl{\'e}e G{\'e}n{\'e}rale du GdR ISIS 2013}},
  hal_id = {hal-01006183},
  hal_version = {v1},
  title = {{Fast multilinear SVD for structured tensors and applications to Harmonic analysis and Volterra series}},
  url = {https://hal-supelec.archives-ouvertes.fr/hal-01006183},
  year = {2013}
}

Outil d’analyse temps-fréquence multi-résolution appliqué aux signaux audio
Thomas Fillon, Jacques Prado, Roland Badeau
Colloque GRETSI 2013, Brest, France, 2013.
```
@inproceedings{fillon:hal-00945239,
  address = {Brest, France},
  author = {Fillon, Thomas and Prado, Jacques and Badeau, Roland},
  booktitle = {{Colloque GRETSI 2013}},
  hal_id = {hal-00945239},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945239/file/Fillon-GRETSI2013.pdf},
  title = {{Outil d'analyse temps-fr{\'e}quence multi-r{\'e}solution appliqu{\'e} aux signaux audio}},
  url = {https://hal.inria.fr/hal-00945239},
  year = {2013}
}
```
This article presents a multi-resolution time-frequency analysis tool that generalizes the constant-Q transform. With this approach, the center frequencies and the frequency resolutions can be specified in an ad hoc manner.
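As a point of reference for the constant-Q transform that this tool generalizes, the following Python sketch computes the geometrically spaced center frequencies and the matching bandwidths of a standard constant-Q analysis. The function name and the parameters `f_min`, `bins_per_octave` and `n_bins` are illustrative assumptions, not the authors' implementation; the generalized tool would instead let each center frequency and resolution be chosen freely.

```python
def constant_q_grid(f_min, bins_per_octave, n_bins):
    """Return (center_frequency, bandwidth) pairs of a standard constant-Q analysis."""
    # Quality factor shared by all bins: Q = f_k / bandwidth_k.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    grid = []
    for k in range(n_bins):
        # Center frequencies are geometrically spaced.
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        grid.append((f_k, f_k / q))
    return grid


# Hypothetical setting: semitone resolution over two octaves starting at 55 Hz.
grid = constant_q_grid(f_min=55.0, bins_per_octave=12, n_bins=25)
```

In this standard grid the ratio of center frequency to bandwidth is the same for every bin, which is precisely the constraint that a multi-resolution tool relaxes.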
Low bitrate informed source separation of realistic mixtures
Antoine Liutkus, Roland Badeau, Gael Richard
ICASSP, Vancouver, Canada, 2013.
```
@inproceedings{liutkus:hal-00945299,
  address = {Vancouver, Canada},
  author = {Liutkus, Antoine and Badeau, Roland and Richard, Gael},
  booktitle = {{ICASSP}},
  doi = {10.1109/ICASSP.2013.6637610},
  hal_id = {hal-00945299},
  hal_version = {v1},
  keywords = {audio upmixing ; Wiener filtering ; spatial audio object coding ; informed source separation},
  pages = {66--70},
  pdf = {https://hal.inria.fr/hal-00945299/file/ICASSP-demixing.pdf},
  publisher = {{IEEE}},
  title = {{Low bitrate informed source separation of realistic mixtures}},
  url = {https://hal.inria.fr/hal-00945299},
  year = {2013}
}
```
Demixing consists in recovering the sounds that compose a multichannel mix. Important applications include karaoke and respatialization. Several approaches to this problem have been proposed in a coding/decoding framework, denoted either as spatial audio object coding or informed source separation. They assume that the constituent sounds are available at an encoding stage and are used to compute side information that is transmitted to the end user. At a decoding stage, only the mixtures and the side information are used to recover the sources. Here, we propose an advanced model, which encompasses many practical scenarios and makes it possible to reach bitrates as low as 0.5 kbps/source. First, the sources may be mono or multichannel. Second, the mixing process is not assumed to be linear instantaneous or convolutive as is usual, but rather diffuse, permitting professional mixes to be processed. Third, the signals to be recovered may either be the original sources or their images.

Débruitage Aveugle par Décompositions Parcimonieuses et Aléatoires
Manuel Moussallam, Alexandre Gramfort, Gael Richard, Laurent Daudet
GRETSI, Brest, France, 2013.

@inproceedings{moussallam:hal-02713678,
  address = {Brest, France},
  author = {Moussallam, Manuel and Gramfort, Alexandre and Richard, Gael and Daudet, Laurent},
  booktitle = {{GRETSI}},
  hal_id = {hal-02713678},
  hal_version = {v1},
  title = {{D{\'e}bruitage Aveugle par D{\'e}compositions Parcimonieuses et Al{\'e}atoires}},
  url = {https://hal.inria.fr/hal-02713678},
  year = {2013}
}

Multimodal Classification of Dance Movements using Body Joint Trajectories and Step Sounds
A. Masurelle, S. Essid, G. Richard
Proceedings of workshop on Image and Audio Analysis for Multimedia Interactive Services, 2013.

@inproceedings{Masurelle13,
  author = {Masurelle, A. and Essid, S. and Richard, G.},
  title = {Multimodal Classification of Dance Movements using Body Joint Trajectories and Step Sounds},
  booktitle = {Proceedings of workshop on Image and Audio Analysis for Multimedia Interactive Services},
  year = {2013},
  series = {WIAMIS'13},
  pages = {1--4}
}

Technical Reports

Estimating an AR Model with Exogenous Driver
Yves Grenier
October 2013.
```
@techreport{grenier:hal-00875064,
  author = {Grenier, Yves},
  hal_id = {hal-00875064},
  hal_version = {v1},
  month = oct,
  pdf = {https://hal-imt.archives-ouvertes.fr/hal-00875064/file/Driven-AR-Estimation.pdf},
  title = {{Estimating an AR Model with Exogenous Driver}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00875064},
  year = {2013}
}
```
In this paper, we introduce an autoregressive model whose evolution is driven by an exogenous pilot signal. This model shares some properties with TAR (Threshold Auto Regressive) models and STAR (Smooth Transition Auto Regressive) models. This text defines the model and presents an estimator for it, together with an estimator for the variance of the innovation, which is not constant in this model. An exact computation of the likelihood of this driven autoregressive model is then presented. Two appendices present a state-space realization of this model and the expression of a Kalman filter for such a model.
Multichannel high resolution NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain
Roland Badeau, Mark D. Plumbley
2013.
```
@techreport{badeau:hal-00945249,
  author = {Badeau, Roland and Plumbley, Mark D.},
  hal_id = {hal-00945249},
  hal_version = {v1},
  keywords = {Non-stationary signal modelling ; Time-frequency analysis ; Nonnegative matrix factorisation ; Multichannel signal analysis ; Variational EM algorithm},
  number = {EECSRR-13-03},
  pdf = {https://hal.inria.fr/hal-00945249/file/Badeau-Plumbley2013.pdf},
  title = {{Multichannel high resolution NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain}},
  type = {Research Report},
  url = {https://hal.inria.fr/hal-00945249},
  year = {2013}
}
```
Several probabilistic models involving latent components have been proposed for modelling time-frequency (TF) representations of audio signals such as spectrograms, notably in the nonnegative matrix factorization (NMF) literature. Among them, the recent high resolution NMF (HR-NMF) model is able to take both phases and local correlations in each frequency band into account, and its potential has been illustrated in applications such as source separation and audio inpainting. In this paper, HR-NMF is extended to multichannel signals and to convolutive mixtures. The new model can represent a variety of stationary and non-stationary signals, including autoregressive moving average (ARMA) processes and mixtures of damped sinusoids. A fast variational expectation-maximization (EM) algorithm is proposed to estimate the enhanced model. This algorithm is applied to a stereophonic piano signal, and proves capable of accurately modelling reverberation and restoring missing observations.

Journal Articles

Learning Optimal Features for Polyphonic Audio-to-Score Alignment
Cyril Joder, Slim Essid, Gael Richard
IEEE Transactions on Audio, Speech and Language Processing, October 2013.
```
@article{joder:hal-02704714,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  hal_id = {hal-02704714},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  month = oct,
  pdf = {https://hal.archives-ouvertes.fr/hal-02704714/file/2013-Joder-TSALP.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Learning Optimal Features for Polyphonic Audio-to-Score Alignment}},
  url = {https://hal.archives-ouvertes.fr/hal-02704714},
  year = {2013}
}
```
This paper addresses the design of feature functions for the matching of a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first one uses a best-fit criterion (minimum divergence), while the second one exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that with several types of models, using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions, are compared in this alignment task. Our experiments favor the symmetric Kullback-Leibler divergence. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets with a precision of 100 ms with our most complex system.
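The symmetric Kullback-Leibler divergence singled out in these experiments can be illustrated with a minimal Python sketch. It assumes that the audio observation and the score template are nonnegative vectors that are normalized to sum to one; the function name, the `eps` smoothing constant and the example vectors are illustrative assumptions and do not come from the paper. The sketch uses the plain sum of the two directed divergences; some authors use half of that sum.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence KL(p||q) + KL(q||p)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Smooth and renormalize so that zero entries do not produce log(0).
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sum_p, sum_q = sum(p), sum(q)
    p = [x / sum_p for x in p]
    q = [x / sum_q for x in q]
    # KL(p||q) + KL(q||p) simplifies to sum((p_i - q_i) * log(p_i / q_i)).
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


observation = [0.7, 0.2, 0.1]  # hypothetical normalized spectrum frame
template = [0.6, 0.3, 0.1]     # hypothetical score template
d = symmetric_kl(observation, template)
```

The value is zero when the two vectors coincide and does not depend on the order of the arguments, which is what makes it usable as a dissimilarity measure between an observation and a template.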

A Multimodal Approach to Speaker Diarization on TV Talk-Shows
Félicien Vallet, Slim Essid, Jean Carrive
IEEE Transactions on Multimedia, April 2013.

@article{vallet:hal-02943545,
  author = {Vallet, F{\'e}licien and Essid, Slim and Carrive, Jean},
  doi = {10.1109/TMM.2012.2233724},
  hal_id = {hal-02943545},
  hal_version = {v1},
  journal = {{IEEE Transactions on Multimedia}},
  month = apr,
  number = {3},
  pages = {509-520},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{A Multimodal Approach to Speaker Diarization on TV Talk-Shows}},
  url = {https://hal.archives-ouvertes.fr/hal-02943545},
  volume = {15},
  year = {2013}
}

Smooth Nonnegative Matrix Factorization for Unsupervised Audiovisual Document Structuring
Slim Essid, Cédric Févotte
IEEE Transactions on Multimedia, February 2013.

@article{essid:hal-02943541,
  author = {Essid, Slim and F{\'e}votte, C{\'e}dric},
  doi = {10.1109/TMM.2012.2228474},
  hal_id = {hal-02943541},
  hal_version = {v1},
  journal = {{IEEE Transactions on Multimedia}},
  month = feb,
  number = {2},
  pages = {415-425},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Smooth Nonnegative Matrix Factorization for Unsupervised Audiovisual Document Structuring}},
  url = {https://hal.archives-ouvertes.fr/hal-02943541},
  volume = {15},
  year = {2013}
}

Harmonic Adaptive Latent Component Analysis of Audio and Application to Music Transcription
Benoît Fuentes, Roland Badeau, Gael Richard
IEEE Transactions on Audio, Speech and Language Processing, 2013.
```
@article{fuentes:hal-00945197,
  author = {Fuentes, Beno{\^i}t and Badeau, Roland and Richard, Gael},
  hal_id = {hal-00945197},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  number = {9},
  pages = {1854--1866},
  pdf = {https://hal.inria.fr/hal-00945197/file/article-2013-13579-5.pdf},
  publisher = {{IEEE}},
  title = {{Harmonic Adaptive Latent Component Analysis of Audio and Application to Music Transcription}},
  url = {https://hal.inria.fr/hal-00945197},
  volume = {21},
  year = {2013}
}
```
Recently, new methods for smart decomposition of time-frequency representations of audio have been proposed in order to address the problem of blind automatic music transcription. However, those techniques are not necessarily suitable for notes having variations of both pitch and spectral envelope over time. The HALCA (Harmonic Adaptive Latent Component Analysis) model presented in this article allows considering those two kinds of variations simultaneously. Each note in a constant-Q transform is locally modeled as a weighted sum of fixed narrowband harmonic spectra, spectrally convolved with some impulse that defines the pitch. All parameters are estimated by means of the expectation-maximization (EM) algorithm, in the framework of Probabilistic Latent Component Analysis. Interesting priors over the parameters are also introduced in order to help the EM algorithm converge towards a meaningful solution. We applied this model to automatic music transcription: the onset time, duration and pitch of each note in an audio file are inferred from the estimated parameters. The system has been evaluated on two different databases and obtains very promising results.

Patents

Génération d’une Signature d’un Signal Audio Musical
Sébastien Fenet, Yves Grenier, Gael Richard
France, February 2013.

@patent{fenet:hal-02412531,
  address = {France},
  author = {Fenet, S{\'e}bastien and Grenier, Yves and Richard, Gael},
  hal_id = {hal-02412531},
  hal_local_reference = {SF:PATENT-2013},
  hal_version = {v1},
  month = feb,
  number = {1351752},
  title = {{G{\'e}n{\'e}ration d'une Signature d'un Signal Audio Musical}},
  url = {https://hal.telecom-paris.fr/hal-02412531},
  year = {2013}
}

2012

Conference Articles

Analysis of dance movements using gaussian processes
Antoine Liutkus, Angélique Drémeau, Dimitrios Alexiadis, Slim Essid, Petros Daras
the 20th ACM international conference, Nara, Japan, October 2012.

@inproceedings{liutkus:hal-02943555,
  address = {Nara, Japan},
  author = {Liutkus, Antoine and Dr{\'e}meau, Ang{\'e}lique and Alexiadis, Dimitrios and Essid, Slim and Daras, Petros},
  booktitle = {{the 20th ACM international conference}},
  doi = {10.1145/2393347.2396492},
  hal_id = {hal-02943555},
  hal_version = {v1},
  month = oct,
  pages = {1375},
  publisher = {{ACM Press}},
  title = {{Analysis of dance movements using gaussian processes}},
  url = {https://hal.archives-ouvertes.fr/hal-02943555},
  year = {2012}
}

Decomposing the video editing structure of a talk-show using nonnegative matrix factorization
Slim Essid, C. Fevotte
2012 19th IEEE International Conference on Image Processing (ICIP 2012), Orlando, United States, September 2012.

@inproceedings{essid:hal-02943553,
  address = {Orlando, United States},
  author = {Essid, Slim and Fevotte, C.},
  booktitle = {{2012 19th IEEE International Conference on Image Processing (ICIP 2012)}},
  doi = {10.1109/ICIP.2012.6467557},
  hal_id = {hal-02943553},
  hal_version = {v1},
  month = sep,
  pages = {3105-3108},
  publisher = {{IEEE}},
  title = {{Decomposing the video editing structure of a talk-show using nonnegative matrix factorization}},
  url = {https://hal.archives-ouvertes.fr/hal-02943553},
  year = {2012}
}

Low variance blind estimation of the reverberation time
Nicolás López, Yves Grenier, Gael Richard, Ivan Bourmeyster
13th International Workshop on Acoustic Signal Enhancement (IWAENC 2012), Aachen, Germany, September 2012.
```
@inproceedings{lopez:hal-02288328,
  address = {Aachen, Germany},
  author = {L{\'o}pez, Nicol{\'a}s and Grenier, Yves and Richard, Gael and Bourmeyster, Ivan},
  booktitle = {{13th International Workshop on Acoustic Signal Enhancement (IWAENC 2012)}},
  hal_id = {hal-02288328},
  hal_local_reference = {NL:IWAENC-12},
  hal_version = {v1},
  keywords = {reverberation time ; blind estimation ; decay rate distribution ; low variance},
  month = sep,
  title = {{Low variance blind estimation of the reverberation time}},
  url = {https://hal.telecom-paris.fr/hal-02288328},
  year = {2012}
}
```
The reverberation time is a key feature for describing the acoustic properties of a reverberant room. It can be computed from a measured Room Impulse Response but in many applications it has to be estimated blindly. Existing blind methods give accurate estimates but they often exhibit high variance across different speakers. In this paper, a low variance blind estimator of the reverberation time is derived from the decay rate distribution of the signal. The influence of the reverberation time on the statistical moments of the distribution is analyzed and one relevant moment is taken as an estimator. The variance of the estimator is reduced thanks to a prewhitening filter and a modification of the decay rate distribution. Experimental results confirm the accuracy of the method when the observed signal is sufficiently long.
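For the non-blind case mentioned above, where a Room Impulse Response has been measured, a common way to obtain the reverberation time is Schroeder backward integration followed by a line fit on the decay curve. The Python sketch below illustrates that classical approach only, not the blind estimator proposed in the paper. The function name, the -5 dB to -25 dB fitting range and the synthetic exponential response are assumptions made for illustration.

```python
import math


def rt60_from_rir(rir, fs, fit_start_db=-5.0, fit_end_db=-25.0):
    """Estimate RT60 (seconds) from an impulse response by Schroeder integration."""
    # Backward-integrated energy decay curve (EDC).
    edc = []
    acc = 0.0
    for sample in reversed(rir):
        acc += sample * sample
        edc.append(acc)
    edc.reverse()
    total = edc[0]
    # Collect the EDC points (in dB) that lie inside the fitting range.
    times, levels = [], []
    for n, energy in enumerate(edc):
        if energy <= 0.0:
            break
        level = 10.0 * math.log10(energy / total)
        if fit_end_db <= level <= fit_start_db:
            times.append(n / fs)
            levels.append(level)
    if len(times) < 2:
        raise ValueError("not enough decay range to fit a slope")
    # Least-squares slope of level versus time, in dB per second.
    mean_t = sum(times) / len(times)
    mean_l = sum(levels) / len(levels)
    num = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, levels))
    den = sum((t - mean_t) ** 2 for t in times)
    # RT60 is the time needed for the level to drop by 60 dB.
    return -60.0 / (num / den)


# Synthetic test response (assumption): a pure exponential decay with RT60 = 0.5 s.
fs = 8000
true_rt60 = 0.5
decay = 3.0 * math.log(10.0) / true_rt60  # amplitude decay rate giving -60 dB at true_rt60
rir = [math.exp(-decay * n / fs) for n in range(2 * fs)]
estimate = rt60_from_rir(rir, fs)
```

On a measured response the fitting range would have to stay above the noise floor, which is one reason why the choice of range is left as a parameter here.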
A Framework for Fingerprint-Based Detection of Repeating Objects in Multimedia Streams
Sébastien Fenet, Manuel Moussallam, Yves Grenier, Gael Richard, Laurent Daudet
EUSIPCO, Bucharest, Romania, August 2012.
```
@inproceedings{fenet:hal-00731828,
  address = {Bucharest, Romania},
  author = {Fenet, S{\'e}bastien and Moussallam, Manuel and Grenier, Yves and Richard, Gael and Daudet, Laurent},
  booktitle = {{EUSIPCO}},
  hal_id = {hal-00731828},
  hal_local_reference = {SF:EUSIPCO-12},
  hal_version = {v1},
  month = aug,
  pages = {1464-1468},
  title = {{A Framework for Fingerprint-Based Detection of Repeating Objects in Multimedia Streams}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00731828},
  year = {2012}
}
```
We present an original framework for the detection of repeating objects in multimedia streams. This framework is designed so that it can work with any fingerprint model. A fingerprint is extracted for each incoming frame of the multimedia stream. The framework then manages this fingerprint so that if a similar frame comes later in the stream, it will be identified as a repetition. The framework has been tested with two distinct fingerprint models on simulated and 'real-world' data. The results show that the framework performs well with both presented models and that it is suitable for industrial use cases.
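The processing loop described above can be sketched in a few lines of Python. The sketch assumes a deliberately simple, hypothetical fingerprint (coarse quantization of the frame values) and exact matching through a dictionary. The framework in the paper is designed to accept any fingerprint model, and its actual storage and matching strategy is not reproduced here.

```python
def toy_fingerprint(frame, step=0.5):
    """Hypothetical fingerprint: coarse quantization of the frame values."""
    return tuple(round(value / step) for value in frame)


def detect_repetitions(frames, fingerprint=toy_fingerprint):
    """Return (current_index, earlier_index) pairs for frames seen before."""
    seen = {}  # fingerprint -> index of its first occurrence
    repetitions = []
    for index, frame in enumerate(frames):
        key = fingerprint(frame)
        if key in seen:
            repetitions.append((index, seen[key]))
        else:
            seen[key] = index
    return repetitions


# Hypothetical stream of feature frames.
stream = [
    [0.1, 0.9, 0.4],
    [0.8, 0.2, 0.6],
    [0.12, 0.88, 0.41],  # close to frame 0
    [0.5, 0.5, 0.5],
    [0.79, 0.21, 0.62],  # close to frame 1
]
repetitions = detect_repetitions(stream)  # [(2, 0), (4, 1)]
```

Passing the fingerprint function as an argument mirrors the stated design goal: the detection loop stays the same when the fingerprint model is swapped.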
Adaptive blind source separation with HRTFs beamforming preprocessing
Mounira Maazaoui, Karim Abed-Meraim, Yves Grenier
The seventh IEEE Sensor Array and Multichannel Signal Processing Workshop, United States, June 2012.
```
@inproceedings{maazaoui:hal-00683567,
  address = {United States},
  author = {Maazaoui, Mounira and Abed-Meraim, Karim and Grenier, Yves},
  booktitle = {{The seventh IEEE Sensor Array and Multichannel Signal Processing Workshop}},
  hal_id = {hal-00683567},
  hal_version = {v1},
  keywords = {Adaptive blind source separation ; beamforming ; HRTF},
  month = jun,
  pdf = {https://hal.archives-ouvertes.fr/hal-00683567/file/1569549665_FINAL.pdf},
  title = {{Adaptive blind source separation with HRTFs beamforming preprocessing}},
  url = {https://hal.archives-ouvertes.fr/hal-00683567},
  year = {2012}
}
```
We propose an adaptive blind source separation algorithm in the context of robot audition using a microphone array. Our algorithm comprises two steps: a fixed beamforming step to reduce the reverberation and the background noise, and a source separation step. In the fixed beamforming preprocessing, we build the beamforming filters using the Head Related Transfer Functions (HRTFs), which allows us to take into account the effect of the robot’s head on the near acoustic field. In the source separation step, we use a separation algorithm based on l1-norm minimization. We evaluate the performance of the proposed algorithm in a fully adaptive way with real data and a varying number of sources, and show good separation and source number estimation results.
Adaptive blind source separation with HRTFs beamforming preprocessing and varying number of sources
Mounira Maazaoui, Karim Abed-Meraim, Yves Grenier
The seventh IEEE Sensor Array and Multichannel Signal Processing Workshop, New Jersey, United States, June 2012.
```
@inproceedings{maazaoui:hal-02286271,
  address = {New Jersey, United States},
  author = {Maazaoui, Mounira and Abed-Meraim, Karim and Grenier, Yves},
  booktitle = {{The seventh IEEE Sensor Array and Multichannel Signal Processing Workshop}},
  hal_id = {hal-02286271},
  hal_local_reference = {Maazaoui-Meraim-Grenier-2012a},
  hal_version = {v1},
  keywords = {adaptive blind source separation ; fixed beamforming ; head related transfer functions},
  month = jun,
  title = {{Adaptive blind source separation with HRTFs beamforming preprocessing and varying number of sources}},
  url = {https://hal.telecom-paris.fr/hal-02286271},
  year = {2012}
}
```
We propose an adaptive blind source separation algorithm in the context of robot audition using a microphone array. Our algorithm comprises two steps: a fixed beamforming step to reduce the reverberation and the background noise, and a source separation step. In the fixed beamforming preprocessing, we build the beamforming filters using the Head Related Transfer Functions (HRTFs), which allows us to take into account the effect of the robot’s head on the near acoustic field. In the source separation step, we use a separation algorithm based on l1-norm minimization. We evaluate the performance of the proposed algorithm in a fully adaptive way with real data and a varying number of sources, and show good separation and source number estimation results.
From Binaural to Multichannel Blind Source Separation using Fixed Beamforming with HRTFs
Mounira Maazaoui, Yves Grenier, Karim Abed-Meraim
The 19th International Conference on Systems, Signals and Image Processing, IWSSIP 2012, Austria, April 2012.
```
@inproceedings{maazaoui:hal-00683570,
  address = {Austria},
  author = {Maazaoui, Mounira and Grenier, Yves and Abed-Meraim, Karim},
  booktitle = {{The 19th International Conference on Systems, Signals and Image Processing, IWSSIP 2012}},
  hal_id = {hal-00683570},
  hal_version = {v1},
  keywords = {blind source separation ; binaural ; microphone array ; HRTF},
  month = apr,
  pdf = {https://hal.archives-ouvertes.fr/hal-00683570/file/IWSSIP_2012_Binaural_Multisensors_final.pdf},
  title = {{From Binaural to Multichannel Blind Source Separation using Fixed Beamforming with HRTFs}},
  url = {https://hal.archives-ouvertes.fr/hal-00683570},
  year = {2012}
}
```
In this article, we address the problem of blind source separation (BSS) for robot audition. We study the performance of blind source separation with a varying number of sensors in a microphone array placed in the head of an infant-size dummy. We propose a two-stage blind source separation algorithm based on a fixed beamforming preprocessing using the head related transfer functions (HRTFs) of the dummy and a separation algorithm using a sparsity criterion. We show that in the case of robot audition, the use of a multisensor array significantly improves the performance of the source separation algorithm, as compared to the binaural case, up to a limiting number of microphones studied in this paper.

An Advanced Virtual Dance Performance Evaluator
Slim Essid, Dimitrios Alexiadis, Robin Tournemenne, Marc Gowing, Philip Kelly, David Monhagan, Petros Daras, Angelique Dremeau, N. E. O’Connor
IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 2012.

@inproceedings{essid:hal-02288313,
  address = {Kyoto, Japan},
  author = {Essid, Slim and Alexiadis, Dimitrios and Tournemenne, Robin and Gowing, Marc and Kelly, Philip and Monhagan, David and Daras, Petros and Dremeau, Angelique and O'Connor, N. E.},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing}},
  hal_id = {hal-02288313},
  hal_local_reference = {SE:ICASSP-12b},
  hal_version = {v1},
  month = mar,
  title = {{AN ADVANCED VIRTUAL DANCE PERFORMANCE EVALUATOR}},
  url = {https://hal.telecom-paris.fr/hal-02288313},
  year = {2012}
}

A probabilistic approach to simultaneous extraction of beats and downbeats
Maksim Khadkevich, Thomas Fillon, Gael Richard, Maurizio Omologo
ICASSP 2012 - 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 2012.

@inproceedings{khadkevich:hal-02713776,
  address = {Kyoto, Japan},
  author = {Khadkevich, Maksim and Fillon, Thomas and Richard, Gael and Omologo, Maurizio},
  booktitle = {{ICASSP 2012 - 2012 IEEE International Conference on Acoustics, Speech and Signal Processing}},
  doi = {10.1109/ICASSP.2012.6287912},
  hal_id = {hal-02713776},
  hal_version = {v1},
  month = mar,
  pages = {445-448},
  publisher = {{IEEE}},
  title = {{A probabilistic approach to simultaneous extraction of beats and downbeats}},
  url = {https://hal.inria.fr/hal-02713776},
  year = {2012}
}

Blind Harmonic Adaptive Decomposition Applied to Supervised Source Separation
Benoît Fuentes, Roland Badeau, Gael Richard
20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 2012.
```
@inproceedings{fuentes:hal-00945288,
  address = {Bucharest, Romania},
  author = {Fuentes, Beno{\^i}t and Badeau, Roland and Richard, Gael},
  booktitle = {{20th European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-00945288},
  hal_version = {v1},
  keywords = {Audio Source Separation ; Harmonic Decomposition ; PLCA ; NTF},
  pages = {2654--2658},
  pdf = {https://hal.inria.fr/hal-00945288/file/Eusipco_2012.pdf},
  publisher = {{EURASIP}},
  title = {{Blind Harmonic Adaptive Decomposition Applied to Supervised Source Separation}},
  url = {https://hal.inria.fr/hal-00945288},
  year = {2012}
}
```
In this paper, a new supervised source separation system is introduced. The Constant-Q Transform (CQT) of an audio signal is first analyzed by an algorithm called Blind Harmonic Adaptive Decomposition (BHAD). This algorithm provides an estimate of the polyphonic pitch content of the input signal, from which the user can select the notes to be extracted. The system then automatically separates the corresponding source from the audio mixture by time-frequency masking of the CQT. The system has been evaluated both on a multipitch estimation task, to measure the quality of the decomposition, and on a user-guided melody extraction task, to assess the quality of the separation. The very promising results highlight the reliability of the proposed model.
Probabilistic model for main melody extraction using constant-Q transform
Benoît Fuentes, Antoine Liutkus, Roland Badeau, Gael Richard
37th International Conference on Acoustics, Speech, and Signal Processing ICASSP’12, Kyoto, Japan, 2012.
```
@inproceedings{fuentes:hal-00945290,
  address = {Kyoto, Japan},
  author = {Fuentes, Beno{\^i}t and Liutkus, Antoine and Badeau, Roland and Richard, Gael},
  booktitle = {{37th International Conference on Acoustics, Speech, and Signal Processing ICASSP'12}},
  hal_id = {hal-00945290},
  hal_version = {v1},
  keywords = {Audio source separation ; NTF ; PLCA ; CQT},
  pages = {5357--5360},
  pdf = {https://hal.inria.fr/hal-00945290/file/fuentes_ICASSP-2012.pdf},
  publisher = {{IEEE}},
  title = {{Probabilistic model for main melody extraction using constant-Q transform}},
  url = {https://hal.inria.fr/hal-00945290},
  year = {2012}
}
```
Dimension reduction techniques such as Nonnegative Tensor Factorization are now classical for both source separation and estimation of multiple fundamental frequencies in audio mixtures. Still, few studies have jointly addressed these tasks so far, mainly because separation is often based on the Short-Time Fourier Transform (STFT), whereas recent music analysis algorithms tend to rely on the Constant-Q Transform (CQT). The CQT is practical for pitch estimation because a pitch shift amounts to a translation of the CQT representation, whereas it produces a scaling of the STFT. Conversely, no simple inversion of the CQT was available until recently, which prevented its use for source separation. Benefiting from advances both in the inversion of the CQT and in statistical modeling, we show how recent techniques designed for music analysis can also be used for source separation with encouraging results, thus opening the path to many crossovers between separation and analysis.
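
The translation-versus-scaling property mentioned in the abstract above can be checked numerically. The following is a minimal sketch, not code from the paper: the function names, the 55 Hz reference frequency, the 12 bins per octave, and the STFT settings are all illustrative assumptions.

```python
import math

def cqt_bin(freq_hz, f_min=55.0, bins_per_octave=12):
    """Fractional constant-Q bin index of a frequency (log-frequency axis).
    f_min and bins_per_octave are assumed values, chosen for illustration."""
    return bins_per_octave * math.log2(freq_hz / f_min)

def stft_bin(freq_hz, sample_rate=44100, n_fft=4096):
    """Fractional STFT bin index of a frequency (linear-frequency axis).
    sample_rate and n_fft are assumed values, chosen for illustration."""
    return freq_hz * n_fft / sample_rate

# Three partials of a harmonic note, then the same note shifted up by 3 semitones.
harmonics = [220.0 * h for h in (1, 2, 3)]
shift = 2 ** (3 / 12)
shifted = [f * shift for f in harmonics]

# On the CQT axis every partial moves by the same number of bins (a translation).
cqt_offsets = [cqt_bin(s) - cqt_bin(f) for f, s in zip(harmonics, shifted)]

# On the STFT axis every partial is multiplied by the same factor (a scaling),
# so the displacement in bins grows with frequency.
stft_ratios = [stft_bin(s) / stft_bin(f) for f, s in zip(harmonics, shifted)]
stft_offsets = [stft_bin(s) - stft_bin(f) for f, s in zip(harmonics, shifted)]
```

With 12 bins per octave, each entry of `cqt_offsets` equals the shift in semitones, while the entries of `stft_offsets` differ from one partial to the next. This is why shift-invariant models are convenient on a CQT.
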
Adaptive filtering for music/voice separation exploiting the repeating musical structure
Antoine Liutkus, Zafar Rafii, Roland Badeau, Bryan Pardo, Gael Richard
37th International Conference on Acoustics, Speech, and Signal Processing ICASSP’12, Kyoto, Japan, 2012.
```
@inproceedings{liutkus:hal-00945300,
  address = {Kyoto, Japan},
  author = {Liutkus, Antoine and Rafii, Zafar and Badeau, Roland and Pardo, Bryan and Richard, Gael},
  booktitle = {{37th International Conference on Acoustics, Speech, and Signal Processing ICASSP'12}},
  doi = {10.1109/ICASSP.2012.6287815},
  hal_id = {hal-00945300},
  hal_version = {v1},
  keywords = {Music/voice separation ; repeating pattern ; time-frequency masking ; adaptive algorithms},
  pages = {53--56},
  pdf = {https://hal.inria.fr/hal-00945300/file/adaptive_repet.pdf},
  publisher = {{IEEE}},
  title = {{Adaptive filtering for music/voice separation exploiting the repeating musical structure}},
  url = {https://hal.inria.fr/hal-00945300},
  year = {2012}
}
```
The separation of the lead vocals from the background accompaniment in audio recordings is a challenging task. Recently, an efficient method called REPET (REpeating Pattern Extraction Technique) has been proposed to extract the repeating background from the non-repeating foreground. While effective on individual sections of a song, REPET does not allow for variations in the background (e.g. verse vs. chorus), and is thus limited to short excerpts only. We overcome this limitation and generalize REPET to permit the processing of complete musical tracks. The proposed algorithm tracks the period of the repeating structure and computes local estimates of the background pattern. Separation is performed by soft time-frequency masking, based on the deviation between the current observation and the estimated background pattern. Evaluation on a dataset of 14 complete tracks shows that this method can perform at least as well as a recent competitive music/voice separation method, while being computationally efficient.

Journal Articles

A multi-modal dance corpus for research into interaction between humans in virtual environments
Slim Essid, Marc Gowing, Georgios Kordelas, Anil Aksay, P. Kelly, Thomas Fillon, Qianqian Zhang, Alfred Dielmann, Gael Richard
Journal on Multimodal User Interfaces, August 2012.

@article{essid:hal-02286487,
  author = {Essid, Slim and Gowing, Marc and Kordelas, Georgios and Aksay, Anil and Kelly, P. and Fillon, Thomas and Zhang, Qianqian and Dielmann, Alfred and Richard, Gael},
  doi = {10.1007/s12193-012-0109-5},
  hal_id = {hal-02286487},
  hal_local_reference = {SE:JMUI-12},
  hal_version = {v1},
  journal = {{Journal on Multimodal User Interfaces}},
  month = aug,
  pages = {1-14},
  publisher = {{Springer}},
  title = {{A multi-modal dance corpus for research into interaction between humans in virtual environments}},
  url = {https://hal.telecom-paris.fr/hal-02286487},
  year = {2012}
}

Blind Source Separation for Robot Audition using fixed HRTF beamforming
Mounira Maazaoui, Karim Abed-Meraim, Yves Grenier
EURASIP Journal on Advances in Signal Processing, March 2012.
```
@article{maazaoui:hal-00683198,
  author = {Maazaoui, Mounira and Abed-Meraim, Karim and Grenier, Yves},
  doi = {10.1186/1687-6180-2012-58},
  hal_id = {hal-00683198},
  hal_version = {v1},
  journal = {{EURASIP Journal on Advances in Signal Processing}},
  keywords = {blind source separation ; robot audition ; head related transfer function ; beamforming},
  month = mar,
  number = {58},
  pages = {1687-6180},
  pdf = {https://hal.archives-ouvertes.fr/hal-00683198/file/1687-6180-2012-58.pdf},
  publisher = {{SpringerOpen}},
  title = {{Blind Source Separation for Robot Audition using fixed HRTF beamforming}},
  url = {https://hal.archives-ouvertes.fr/hal-00683198},
  year = {2012}
}
```
In this article, we present a two-stage blind source separation (BSS) algorithm for robot audition. The first stage consists of a fixed beamforming preprocessing step that reduces the reverberation and the environmental noise. Since we are in a robot audition context, the manifold of the sensor array is hard to model due to the presence of the head of the robot, so we use pre-measured head related transfer functions (HRTFs) to estimate the beamforming filters. Using the HRTFs to estimate the beamformers makes it possible to capture the effect of the head on the manifold of the microphone array. The second stage is a BSS algorithm based on a sparsity criterion, namely the minimization of the l1 norm of the sources. We present different configurations of our algorithm and show that it yields promising results and that the fixed beamforming preprocessing improves the separation results.
Blind Source Separation for Robot Audition using fixed HRTF beamforming
Mounira Maazaoui, Yves Grenier, Karim Abed-Meraim
EURASIP Journal on Advances in Signal Processing, March 2012.
```
@article{maazaoui-grenier-abedmeraim-2011c,
  author = {Maazaoui, Mounira and Grenier, Yves and Abed-Meraim, Karim},
  title = {Blind Source Separation for Robot Audition using fixed HRTF beamforming},
  journal = {EURASIP Journal on Advances in Signal Processing},
  year = {2012},
  month = mar,
  number = {58},
  keywords = {blind source separation, robot audition, beamforming}
}
```

2011

Conference Articles

An audio-driven virtual dance-teaching assistant
Slim Essid, Yves Grenier, Mounira Maazaoui, Gael Richard, Robin Tournemenne
the 19th ACM international conference, Scottsdale, United States, November 2011.

@inproceedings{essid:hal-02713825,
  address = {Scottsdale, United States},
  author = {Essid, Slim and Grenier, Yves and Maazaoui, Mounira and Richard, Gael and Tournemenne, Robin},
  booktitle = {{the 19th ACM international conference}},
  doi = {10.1145/2072298.2072416},
  hal_id = {hal-02713825},
  hal_version = {v1},
  month = nov,
  pages = {675},
  publisher = {{ACM Press}},
  title = {{An audio-driven virtual dance-teaching assistant}},
  url = {https://hal.inria.fr/hal-02713825},
  year = {2011}
}

Enhanced visualisation of dance performance from automatically synchronised multimodal recordings
Marc Gowing, Xinyu Lin, Qianni Zhang, Philip Kell, Noel O’Connor, Cyril Concolato, Slim Essid, Jean Lefeuvre, Robin Tournemenne, Ebroul Izquierdo, Vlado Kitanovski
The 19th ACM international conference, Scottsdale, United States, November 2011.

@inproceedings{gowing:hal-02943617,
  address = {Scottsdale, United States},
  author = {Gowing, Marc and Lin, Xinyu and Zhang, Qianni and Kell, Philip and O'Connor, Noel and Concolato, Cyril and Essid, Slim and Lefeuvre, Jean and Tournemenne, Robin and Izquierdo, Ebroul and Kitanovski, Vlado},
  booktitle = {{The 19th ACM international conference}},
  doi = {10.1145/2072298.2072414},
  hal_id = {hal-02943617},
  hal_version = {v1},
  month = nov,
  pages = {667},
  publisher = {{ACM Press}},
  title = {{Enhanced visualisation of dance performance from automatically synchronised multimodal recordings}},
  url = {https://hal.telecom-paris.fr/hal-02943617},
  year = {2011}
}

An audio-driven virtual dance-teaching assistant
Slim Essid, Yves Grenier, Mounira Maazaoui, Gaël Richard, Robin Tournemenne
ACM Multimedia, Scottsdale, Arizona, USA, November 2011.

@inproceedings{SE:ACM-MM-GC-2011,
  author = {Essid, Slim and Grenier, Yves and Maazaoui, Mounira and Richard, Ga{\"e}l and Tournemenne, Robin},
  title = {An audio-driven virtual dance-teaching assistant},
  booktitle = {ACM Multimedia},
  address = {Scottsdale, Arizona, USA},
  year = {2011},
  month = nov
}

A Scalable Audio Fingerprint Method with Robustness to Pitch-Shifting
Sébastien Fenet, Gael Richard, Yves Grenier
ISMIR, Miami, United States, October 2011.
```
@inproceedings{fenet:hal-00657657,
  address = {Miami, United States},
  author = {Fenet, S{\'e}bastien and Richard, Gael and Grenier, Yves},
  booktitle = {{ISMIR}},
  hal_id = {hal-00657657},
  hal_local_reference = {SF:ISMIR-11},
  hal_version = {v1},
  month = oct,
  pages = {121-126},
  title = {{A Scalable Audio Fingerprint Method with Robustness to Pitch-Shifting}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00657657},
  year = {2011}
}
```
Audio fingerprint techniques should be robust to a variety of distortions due to noisy transmission channels or specific sound processing. Although most current techniques are robust to the majority of these distortions, their quasi-systematic use of a spectral representation makes them potentially sensitive to pitch-shifting, since this distortion modifies the spectral content of the signal. In this paper, we propose a novel fingerprint technique, relying on a hashing technique coupled with a CQT-based fingerprint, with strong robustness to pitch-shifting. Furthermore, we have combined this method with an efficient post-processing step for the removal of false alarms. We also present the adaptation of a database pruning technique to our specific context. We have evaluated our approach on a real-life broadcast monitoring scenario. The analyzed data consisted of 120 hours of real radio broadcast (thus containing all the distortions that would be found in an industrial context). The reference database consisted of 30,000 songs. Our method, thanks to its increased robustness to pitch-shifting, shows an excellent detection score.

Optimizing the mapping from a symbolic to an audio representation for music-to-score alignment
Cyril Joder, Slim Essid, Gael Richard
2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, October 2011.

@inproceedings{joder:hal-02943613,
  address = {New Paltz, New York, United States},
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  booktitle = {{2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  doi = {10.1109/ASPAA.2011.6082330},
  hal_id = {hal-02943613},
  hal_version = {v1},
  month = oct,
  pages = {121-124},
  publisher = {{IEEE}},
  title = {{Optimizing the mapping from a symbolic to an audio representation for music-to-score alignment}},
  url = {https://hal.telecom-paris.fr/hal-02943613},
  year = {2011}
}

A Scalable Audio Fingerprint Method with Robustness to Pitch-Shifting
Sébastien Fenet, Gaël Richard, Yves Grenier
ISMIR, Miami, USA, October 2011.
```
@inproceedings{SF:ISMIR-11,
  author = {Fenet, S{\'e}bastien and Richard, Ga{\"e}l and Grenier, Yves},
  title = {A Scalable Audio Fingerprint Method with Robustness to Pitch-Shifting},
  booktitle = {ISMIR},
  address = {Miami, USA},
  year = {2011},
  month = oct,
  pages = {121--126}
}
```
Une empreinte audio à base de CQT appliquée à la surveillance de flux radiophoniques
Sébastien Fenet, Yves Grenier, Gael Richard
GRETSI, Bordeaux, France, September 2011.
```
@inproceedings{fenet:hal-00657659,
  address = {Bordeaux, France},
  author = {Fenet, S{\'e}bastien and Grenier, Yves and Richard, Gael},
  booktitle = {{GRETSI}},
  hal_id = {hal-00657659},
  hal_local_reference = {SF:GRESTI-11},
  hal_version = {v1},
  month = sep,
  pages = {NA},
  title = {{Une empreinte audio {\`a} base de CQT appliqu{\'e}e {\`a} la surveillance de flux radiophoniques}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00657659},
  year = {2011}
}
```
Audio fingerprint extraction is part of the broader problem of audio identification, which consists of retrieving metadata from an audio excerpt. In this article, we present our audio fingerprint extraction algorithm, applied to the detection of referenced events in a stream. While building on the method of Wang [8], our approach offers increased robustness to frequency shifting, thanks in particular to the use of a "constant-Q transform" [2]. Finally, we show that the proposed change of representation yields much better detection rates in a concrete radio broadcast monitoring scenario.
Blind Source Separation for Robot Audition using Fixed Beamforming with HRTFs
Mounira Maazaoui, Yves Grenier, Karim Abed-Meraim
12th Annual Conference of the International Speech Communication Association (Interspeech-2011), Florence, Italy, September 2011.
```
@inproceedings{maazaoui:hal-00683452,
  address = {Florence, Italy},
  author = {Maazaoui, Mounira and Grenier, Yves and Abed-Meraim, Karim},
  booktitle = {{12th Annual Conference of the International Speech Communication Association (Interspeech-2011)}},
  hal_id = {hal-00683452},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.archives-ouvertes.fr/hal-00683452/file/Interspeech_2011cameraReady.pdf},
  title = {{Blind Source Separation for Robot Audition using Fixed Beamforming with HRTFs}},
  url = {https://hal.archives-ouvertes.fr/hal-00683452},
  year = {2011}
}
```
We present a two-stage blind source separation (BSS) algorithm for robot audition. The algorithm is based on a beamforming preprocessing step and a BSS algorithm using a sparsity separation criterion. Before the BSS step, we filter the sensor outputs with beamforming filters to reduce the reverberation and the environmental noise. As we are in a robot audition context, the manifold of the sensor array is hard to model, so we use pre-measured Head Related Transfer Functions (HRTFs) to estimate the beamforming filters. In this article, we show the good performance of this method as compared to a single-stage, BSS-only method.
Une empreinte audio à base de CQT appliquée à la surveillance de flux radiophoniques
Sébastien Fenet, Yves Grenier, Gaël Richard
GRETSI, Bordeaux, France, September 2011.
```
@inproceedings{SF:GRESTI-11,
  author = {Fenet, S{\'e}bastien and Grenier, Yves and Richard, Ga{\"e}l},
  title = {Une empreinte audio {\`a} base de CQT appliqu{\'e}e {\`a} la surveillance de flux radiophoniques},
  booktitle = {GRETSI},
  address = {Bordeaux, France},
  year = {2011},
  month = sep,
  pages = {NA}
}
```
Frequency Domain Blind Source Separation for Robot Audition Using a Parameterized Sparsity Criterion
Mounira Maazaoui, Yves Grenier, Karim Abed-Meraim
The European Signal Processing Conference (EUSIPCO-2011), Barcelona, Spain, September 2011.
```
@inproceedings{maazaoui-grenier-abedmeraim-2011,
  author = {Maazaoui, Mounira and Grenier, Yves and Abed-Meraim, Karim},
  title = {Frequency Domain Blind Source Separation for Robot Audition Using a Parameterized Sparsity Criterion},
  booktitle = {The European Signal Processing Conference (EUSIPCO-2011)},
  address = {Barcelona, Spain},
  year = {2011},
  month = sep,
  pages = {1869--1873},
  keywords = {Blind source separation, sparsity criterion, lp norm minimization}
}
```
In this paper, we introduce a modified lp norm blind source separation criterion based on the source sparsity in the time-frequency domain. We study the effect of hardening the sparsity constraint during the optimization process, by making the parameter p of the lp norm vary from 1 to nearly 0 according to a sigmoid function. The sigmoid introduces a smooth lp norm variation, which avoids the divergence of the algorithm. We compared this algorithm to the regular l1 norm minimization and to an ICA-based one, and we obtained promising results.
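
The idea of letting p decrease smoothly along a sigmoid can be illustrated with a short sketch. This is not the schedule used in the paper: the function names, the end value `p_end`, the `steepness` parameter, and the centering of the sigmoid at mid-optimization are hypothetical choices made only for illustration.

```python
import math

def p_schedule(iteration, n_iterations, p_start=1.0, p_end=0.1, steepness=10.0):
    """Sigmoid-shaped interpolation of p from about p_start down to about p_end.
    All parameter values here are assumptions, not taken from the paper."""
    t = iteration / max(n_iterations - 1, 1)                # progress in [0, 1]
    s = 1.0 / (1.0 + math.exp(-steepness * (t - 0.5)))      # sigmoid in (0, 1)
    return p_start + (p_end - p_start) * s

def lp_cost(coefficients, p):
    """Sparsity cost sum(|x|^p) over (possibly complex) time-frequency coefficients."""
    return sum(abs(x) ** p for x in coefficients)

# Value of p at each iteration of a hypothetical 50-iteration optimization.
schedule = [p_schedule(i, 50) for i in range(50)]

# Two vectors with the same energy: the sparse one has the lower cost for p < 1.
sparse_cost = lp_cost([1.0, 0.0, 0.0, 0.0], 0.5)
dense_cost = lp_cost([0.5, 0.5, 0.5, 0.5], 0.5)
```

Starting near p = 1 keeps the early iterations close to the usual l1 criterion. The later, smaller values of p penalize non-sparse solutions more strongly.
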
Blind Source Separation for Robot Audition using Fixed Beamforming with HRTFs
Mounira Maazaoui, Yves Grenier, Karim Abed-Meraim
12th Annual Conference of the International Speech Communication Association (Interspeech-2011), Florence, Italy, September 2011.
```
@inproceedings{maazaoui-grenier-abedmeraim-2011a,
  author = {Maazaoui, Mounira and Grenier, Yves and Abed-Meraim, Karim},
  title = {Blind Source Separation for Robot Audition using Fixed Beamforming with HRTFs},
  booktitle = {12th Annual Conference of the International Speech Communication Association (Interspeech-2011)},
  address = {Florence, Italy},
  year = {2011},
  month = sep,
  keywords = {Blind source separation, fixed beamforming, multi-sensors HRTFs}
}
```
Frequency Domain Blind Source Separation for Robot Audition Using a Parameterized Sparsity Criterion
Mounira Maazaoui, Yves Grenier, Karim Abed-Meraim
The European Signal Processing Conference (EUSIPCO-2011), Spain, August 2011.
```
@inproceedings{maazaoui:hal-00683450,
  address = {Spain},
  author = {Maazaoui, Mounira and Grenier, Yves and Abed-Meraim, Karim},
  booktitle = {{The European Signal Processing Conference (EUSIPCO-2011)}},
  hal_id = {hal-00683450},
  hal_version = {v1},
  keywords = {blind source separation ; microphone array ; sparsity criterion ; l1 ; lp ; sigmoid},
  month = aug,
  pdf = {https://hal.archives-ouvertes.fr/hal-00683450/file/FINAL_1569427321-5.pdf},
  title = {{Frequency Domain Blind Source Separation for Robot Audition Using a Parameterized Sparsity Criterion}},
  url = {https://hal.archives-ouvertes.fr/hal-00683450},
  year = {2011}
}
```

Hidden Discrete Tempo Model: A tempo-aware timing model for audio-to-score alignment
Cyril Joder, Slim Essid, Gael Richard
ICASSP 2011 - 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011.

@inproceedings{joder:hal-02714059,
  address = {Prague, Czech Republic},
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  booktitle = {{ICASSP 2011 - 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2011.5946424},
  hal_id = {hal-02714059},
  hal_version = {v1},
  month = may,
  pages = {397-400},
  publisher = {{IEEE}},
  title = {{Hidden Discrete Tempo Model: A tempo-aware timing model for audio-to-score alignment}},
  url = {https://hal.inria.fr/hal-02714059},
  year = {2011}
}

Combining monaural source separation with Long Short-Term Memory for increased robustness in vocalist gender recognition
Felix Weninger, Jean-Louis Durrieu, Florian Eyben, Gael Richard, Björn Schuller
ICASSP 2011 - 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011.

@inproceedings{weninger:hal-02714095,
  address = {Prague, Czech Republic},
  author = {Weninger, Felix and Durrieu, Jean-Louis and Eyben, Florian and Richard, Gael and Schuller, Bj{\"o}rn},
  booktitle = {{ICASSP 2011 - 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  doi = {10.1109/ICASSP.2011.5946764},
  hal_id = {hal-02714095},
  hal_version = {v1},
  month = may,
  pages = {2196-2199},
  publisher = {{IEEE}},
  title = {{Combining monaural source separation with Long Short-Term Memory for increased robustness in vocalist gender recognition}},
  url = {https://hal.inria.fr/hal-02714095},
  year = {2011}
}

Gaussian modeling of mixtures of non-stationary signals in the time-frequency domain (HR-NMF)
Roland Badeau
Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, 2011.
```
@inproceedings{badeau:hal-00945270,
  address = {New Paltz, New York, United States},
  author = {Badeau, Roland},
  booktitle = {{Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-00945270},
  hal_version = {v1},
  pages = {253--256},
  pdf = {https://hal.inria.fr/hal-00945270/file/badeau2.pdf},
  title = {{Gaussian modeling of mixtures of non-stationary signals in the time-frequency domain (HR-NMF)}},
  url = {https://hal.inria.fr/hal-00945270},
  year = {2011}
}
```
Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of non-stationary signals in the Time-Frequency (TF) domain. However, unlike the High Resolution (HR) methods dedicated to mixtures of exponentials, its spectral resolution is limited by that of the underlying TF representation. In this paper, we propose a unified probabilistic model called HR-NMF, which overcomes this limit by taking both phases and local correlations in each frequency band into account. This model is estimated with a recursive implementation of the EM algorithm, which is successfully applied to source separation and audio inpainting.
Analyse des structures harmoniques dans les signaux audio : modéliser les variations de hauteur et d’enveloppe spectrale
Benoit Fuentes, Roland Badeau, Gael Richard
Actes du XXIIIème Colloque GRETSI, Bordeaux, France, 2011.
```
@inproceedings{fuentes:hal-00945240,
  address = {Bordeaux, France},
  author = {Fuentes, Benoit and Badeau, Roland and Richard, Gael},
  booktitle = {{Actes du XXIII{\`e}me Colloque GRETSI}},
  hal_id = {hal-00945240},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945240/file/gretsi2011.pdf},
  title = {{Analyse des structures harmoniques dans les signaux audio : mod{\'e}liser les variations de hauteur et d'enveloppe spectrale}},
  url = {https://hal.inria.fr/hal-00945240},
  year = {2011}
}
```
Many methods for the analysis and intelligent decomposition of time-frequency representations of musical signals have been developed recently. However, the tools used are not necessarily suited to polyphonic signals whose notes exhibit continuous variations of fundamental frequency and spectral envelope. We propose a new model for the analysis of harmonic structures that jointly accounts for these two types of variation. Each note in a constant-Q transform is locally modeled as a weighted sum of narrowband harmonic spectra, and the model parameters are estimated by means of probabilistic latent component analysis. The algorithm was tested on a single-pitch estimation task, and the very good results obtained highlight the reliability and robustness of the proposed model.
Adaptive harmonic decomposition using shift-invariant PLCA
Benoit Fuentes, Roland Badeau, Gael Richard
Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic, 2011.
```
@inproceedings{fuentes:hal-00945289,
  address = {Prague, Czech Republic},
  author = {Fuentes, Benoit and Badeau, Roland and Richard, Gael},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-00945289},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945289/file/fuentes_icassp2011.pdf},
  title = {{Adaptive harmonic decomposition using shift-invariant PLCA}},
  url = {https://hal.inria.fr/hal-00945289},
  year = {2011}
}
```
This paper presents a new algorithm based on shift-invariant probabilistic latent component analysis that analyzes harmonic structures in an audio signal. Each note in a constant-Q transform is modeled as a weighted sum of narrowband parametric spectra, and a positive deconvolution is performed to obtain both pitch and timbre signature. The algorithm has been tested in a task of monopitch and multipitch estimation and shows very promising results.
AN INTERACTIVE SYSTEM FOR ELECTRO-ACOUSTIC MUSIC ANALYSIS
Sébastien Gulluni, Slim Essid, Olivier Buisson, Gael Richard
ISMIR, Miami, United States, 2011.
```
@inproceedings{gulluni:hal-02713906,
  address = {Miami, United States},
  author = {Gulluni, S{\'e}bastien and Essid, Slim and Buisson, Olivier and Richard, Gael},
  booktitle = {{ISMIR}},
  hal_id = {hal-02713906},
  hal_version = {v1},
  title = {{AN INTERACTIVE SYSTEM FOR ELECTRO-ACOUSTIC MUSIC ANALYSIS}},
  url = {https://hal.inria.fr/hal-02713906},
  year = {2011}
}
```
This paper presents an interactive approach for the analysis of electro-acoustic music. An original classification scheme is devised using relevance feedback and active-learning segment selection in an interactive loop. Validation and correction information given by the user is injected in the learning process at each iteration to achieve more accurate classification. An experimental study is conducted to evaluate and compare the different classification and relevance feedback approaches that are envisaged, using a database of polyphonic pieces (with a varying degree of polyphony). The results show that the different approaches are suited to different applications and achieve satisfactory performance in a reasonable number of iterations.
Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation
Sébastien Gulluni, Slim Essid, Olivier Buisson, Gael Richard
AES Conference, Ilmenau, Germany, 2011.
```
@inproceedings{gulluni:hal-02713989,
  address = {Ilmenau, Germany},
  author = {Gulluni, S{\'e}bastien and Essid, Slim and Buisson, Olivier and Richard, Gael},
  booktitle = {{AES Conference}},
  hal_id = {hal-02713989},
  hal_version = {v1},
  title = {{Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation}},
  url = {https://hal.inria.fr/hal-02713989},
  year = {2011}
}
```
In this paper, we present an interactive approach for the classification of sound objects in electro-acoustic music. For this purpose, we use relevance feedback combined with active-learning segment selection in an interactive loop. Validation and correction information given by the user is injected in the learning process at each iteration to achieve more accurate classification. Three active learning criteria are compared in the evaluation of a system classifying polyphonic pieces (with a varying degree of polyphony). The results show that the interactive approach achieves satisfying performance in a reasonable number of iterations.
Scale-invariant probabilistic latent component analysis
Romain Hennequin, Roland Badeau, Bertrand David
Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, 2011.
```
@inproceedings{hennequin:hal-00945291,
  address = {New Paltz, New York, United States},
  author = {Hennequin, Romain and Badeau, Roland and David, Bertrand},
  booktitle = {{Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-00945291},
  hal_version = {v1},
  pages = {129--132},
  pdf = {https://hal.inria.fr/hal-00945291/file/articleWASPAA.pdf},
  title = {{Scale-invariant probabilistic latent component analysis}},
  url = {https://hal.inria.fr/hal-00945291},
  year = {2011}
}
```
In this paper, we present a new method to decompose musical spectrograms. This method transposes shift-invariant probabilistic latent component analysis (PLCA), which decomposes constant-Q spectrograms (with a logarithmic frequency resolution), to standard short-time Fourier transform spectrograms (with a linear frequency resolution). This makes it easy to use the method to reconstruct the latent signals (which can be useful for source separation).
Score informed audio source separation using a parametric model of non-negative spectrogram
Romain Hennequin, Bertrand David, Roland Badeau
Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic, 2011.
```
@inproceedings{hennequin:hal-00945294,
  address = {Prague, Czech Republic},
  author = {Hennequin, Romain and David, Bertrand and Badeau, Roland},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-00945294},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945294/file/hennequin_icassp2011.pdf},
  title = {{Score informed audio source separation using a parametric model of non-negative spectrogram}},
  url = {https://hal.inria.fr/hal-00945294},
  year = {2011}
}
```
In this paper we present a new technique for monaural source separation in musical mixtures, which uses the knowledge of the musical score. This information is used to initialize an algorithm which computes a parametric decomposition of the spectrogram based on non-negative matrix factorization (NMF). This algorithm provides time-frequency masks which are used to separate the sources with Wiener filtering.
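The pipeline described in this abstract (score-based initialization, NMF, time-frequency masks) can be illustrated with a minimal sketch. It assumes plain Euclidean NMF with multiplicative updates rather than the parametric model of the paper, and the names `nmf`, `soft_masks` and the toy data are hypothetical. The point it shows is that activations set to zero at initialization (a stand-in for "the score says this note is silent here") remain zero under multiplicative updates, and that Wiener-like soft masks are obtained as ratios of per-source models.

```python
import numpy as np


def nmf(V, W0, H0, n_iter=200, eps=1e-12):
    """Euclidean NMF with multiplicative updates (not the paper's parametric model).

    Entries that are zero in W0 or H0 stay zero, which is how a score-like
    initialization constrains the decomposition in this sketch.
    """
    W, H = W0.copy(), H0.copy()
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def soft_masks(W, H, groups, eps=1e-12):
    """Return one time-frequency mask per source; `groups` lists the atom indices of each source."""
    total = W @ H + eps
    return [(W[:, idx] @ H[idx, :]) / total for idx in groups]


# Toy example: two sources with disjoint spectra, active one after the other.
W_true = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)               # 6 bins x 2 atoms
H_true = np.array([[1.0] * 4 + [0.0] * 4, [0.0] * 4 + [1.0] * 4])    # 2 atoms x 8 frames
V = W_true @ H_true                                                  # magnitude "spectrogram"

rng = np.random.default_rng(0)
W0 = rng.random(W_true.shape) + 0.1      # spectra are unknown: random positive init
H0 = (H_true > 0).astype(float)          # hypothetical score information: where each note is active

W, H = nmf(V, W0, H0)
masks = soft_masks(W, H, groups=[[0], [1]])
source_estimates = [m * V for m in masks]  # masked spectrograms, one per source
```

In a real system the masks would be applied to the complex STFT of the mixture before inversion; that step is omitted here.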

Journal Articles

A Conditional Random Field Framework for Robust and Scalable Audio-to-Score Matching
Cyril Joder, Slim Essid, Gael Richard
IEEE Transactions on Audio, Speech and Language Processing, November 2011.
```
@article{joder:hal-02653026,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  hal_id = {hal-02653026},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  month = nov,
  pdf = {https://hal.archives-ouvertes.fr/hal-02653026/file/Joder_TASLP2011.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{A Conditional Random Field Framework for Robust and Scalable Audio-to-Score Matching}},
  url = {https://hal.archives-ouvertes.fr/hal-02653026},
  year = {2011}
}
```
In the present work, we introduce the use of Conditional Random Fields (CRFs) for the audio-to-score alignment task. This framework encompasses the statistical models which are used in the literature and allows for more flexible dependency structures. In particular, it allows observation functions to be computed from several analysis frames. Three different CRF models are proposed for our task, for different choices of tradeoff between accuracy and complexity. Three types of features are used, characterizing the local harmony, note attacks and tempo. We also propose a novel hierarchical approach, which takes advantage of the score structure for an approximate decoding of the statistical model. This strategy reduces the complexity, yielding a better overall efficiency than the classic beam search method used in HMM-based models. Experiments run on a large database of classical piano and popular music exhibit very accurate alignments. Indeed, with the best performing system, more than 95% of the note onsets are detected with a precision finer than 100 ms. We additionally show how the proposed framework can be modified in order to be robust to possible structural differences between the score and the musical performance.
Probabilistic template-based chord recognition
Laurent Oudre, Cédric Févotte, Yves Grenier
IEEE Transactions on Audio, Speech and Language Processing, November 2011.
```
@article{OFG:IEEE-11,
  author = {Oudre, Laurent and F{\'e}votte, C{\'e}dric and Grenier, Yves},
  title = {Probabilistic template-based chord recognition},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  year = {2011},
  month = nov,
  volume = {19},
  number = {8},
  pages = {2249--2259},
  keywords = {music signal processing}
}
```
This paper describes a probabilistic approach to template-based chord recognition in music signals. The algorithm only takes chromagram data and a user-defined dictionary of chord templates as input data. No training or musical information such as key, rhythm or chord transition models is required. The chord occurrences are treated as probabilistic events, whose probabilities are learned from the song using an Expectation-Maximization (EM) algorithm. The adaptive estimation of these probabilities (together with an ad-hoc post-processing filtering) has the desirable effect of smoothing out spurious chords that would occur in our previous baseline work. Our algorithm is compared to various methods that entered the Music Information Retrieval Evaluation eXchange (MIREX) in 2008 and 2009, using a diverse set of evaluation metrics, some of which are new. The systems are tested on two evaluation corpora: the first is composed of the Beatles catalog (180 pop-rock songs) and the second consists of 20 songs from various artists and music genres. Results show that our method outperforms state-of-the-art chord recognition systems.
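To make the two inputs named in this abstract concrete (a chromagram and a dictionary of chord templates), here is a minimal sketch of plain frame-wise template matching. It is a simple baseline, not the probabilistic EM method of the paper; the binary major/minor templates, the cosine-similarity criterion and the names `chord_templates` and `recognize` are assumptions made for illustration.

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chord_templates():
    """Build 24 unit-norm binary templates (12 major and 12 minor triads)."""
    templates, names = [], []
    for root in range(12):
        for quality, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates.append(t / np.linalg.norm(t))
            names.append(f"{NOTES[root]}:{quality}")
    return np.array(templates), names


def recognize(chroma):
    """Label each frame of a 12 x n_frames chromagram with the best-matching template."""
    T, names = chord_templates()
    c = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-12)
    return [names[k] for k in np.argmax(T @ c, axis=0)]


# Two idealized frames: a C major triad (C, E, G) and an A minor triad (A, C, E).
c_major = np.zeros(12)
c_major[[0, 4, 7]] = 1.0
a_minor = np.zeros(12)
a_minor[[9, 0, 4]] = 1.0
chroma = np.stack([c_major, a_minor], axis=1)

labels = recognize(chroma)  # ['C:maj', 'A:min']
```

On real chromagrams this frame-wise decision tends to produce spurious short chords, which is the kind of behaviour the abstract says the probabilistic estimation and post-processing filtering are meant to smooth out.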
A musically motivated mid-level representation for pitch estimation and musical audio source separation
Jean-Louis Durrieu, Bertrand David, Gael Richard
IEEE Journal on Selected Topics in Signal Processing, October 2011.
```
@article{durrieu:hal-02653051,
  author = {Durrieu, Jean-Louis and David, Bertrand and Richard, Gael},
  hal_id = {hal-02653051},
  hal_version = {v1},
  journal = {{IEEE Journal on Selected Topics in Signal Processing}},
  keywords = {Non-negative Matrix Factorization (NMF) ; audio signal representation ; pitch estimation ; audio melody extraction ; musical audio source separation},
  month = oct,
  pdf = {https://hal.archives-ouvertes.fr/hal-02653051/file/JSTSP11_Durrieu.pdf},
  title = {{A musically motivated mid-level representation for pitch estimation and musical audio source separation}},
  url = {https://hal.archives-ouvertes.fr/hal-02653051},
  year = {2011}
}
```
When designing an audio processing system, the target tasks often influence the choice of a data representation or transformation. Low-level time-frequency representations such as the short-time Fourier transform (STFT) are popular, because they offer a meaningful insight on sound properties for a low computational cost. Conversely, when higher level semantics, such as pitch, timbre or phoneme, are sought after, representations usually tend to enhance their discriminative characteristics, at the expense of their invertibility. They become so-called mid-level representations. In this paper, a source/filter signal model which provides a mid-level representation is proposed. This representation makes the pitch content of the signal as well as some timbre information available, hence keeping as much information from the raw data as possible. This model is successfully used within a main melody extraction system and a lead instrument/accompaniment separation system. Both frameworks obtained top results at several international evaluation campaigns.
Décompositions en éléments sonores et applications musicales
Mathieu Lagrange, Roland Badeau, Bertrand David, Nancy Bertin, Olivier Derrien, Sylvain Marchand, Laurent Daudet
Traitement du Signal, October 2011.
```
@article{lagrange:hal-00809496,
  author = {Lagrange, Mathieu and Badeau, Roland and David, Bertrand and Bertin, Nancy and Derrien, Olivier and Marchand, Sylvain and Daudet, Laurent},
  hal_id = {hal-00809496},
  hal_version = {v1},
  journal = {{Traitement du Signal}},
  month = oct,
  number = {6},
  pages = {665-689},
  pdf = {https://hal.archives-ouvertes.fr/hal-00809496/file/lagrangeTs11.pdf},
  publisher = {{Lavoisier}},
  title = {{D{\'e}compositions en {\'e}l{\'e}ments sonores et applications musicales}},
  url = {https://hal.archives-ouvertes.fr/hal-00809496},
  volume = {28},
  year = {2011}
}
```
This article summarizes the results of the ANR project DESAM (Décompositions en Éléments Sonores et Applications Musicales). The project comprised two parts: the first concerned theoretical advances in decomposition techniques for digital audio signals, and the second dealt with musical applications of these decompositions. Most of the aspects addressed in the project led to new methods and algorithms, which are gathered in a toolbox, the DESAM Toolbox. It brings together a set of Matlab® functions dedicated to the estimation of spectral models that are widely used for musical signals. The methods studied in this project can of course be useful for automatic information retrieval in musical signals, but above all they constitute a collection of recent tools for decomposing signals according to different models, yielding varied mid-level representations that may be useful in other application domains.
Signal Processing for Music Analysis
Meinard Müller, Daniel P.W. Ellis, Anssi Klapuri, Gael Richard
IEEE Journal of Selected Topics in Signal Processing, October 2011.
```
@article{muller:hal-02653036,
  author = {M{\"u}ller, Meinard and Ellis, Daniel P.W. and Klapuri, Anssi and Richard, Gael},
  doi = {10.1109/JSTSP.2011.2112333},
  hal_id = {hal-02653036},
  hal_version = {v1},
  journal = {{IEEE Journal of Selected Topics in Signal Processing}},
  keywords = {Beat ; digital signal processing ; harmony ; melody ; music analysis ; music information retrieval ; music signals ; pitch ; rhythm ; source separation ; timbre ; voice separation},
  month = oct,
  publisher = {{IEEE}},
  title = {{Signal Processing for Music Analysis}},
  url = {https://hal.inria.fr/hal-02653036},
  volume = {5},
  year = {2011}
}
```
Music signal processing may appear to be the junior relation of the large and mature field of speech signal processing, not least because many techniques and representations originally developed for speech have been applied to music, often with good results. However, music signals possess specific acoustic and structural characteristics that distinguish them from spoken language or other nonmusical signals. This paper provides an overview of some signal analysis techniques that specifically address musical dimensions such as melody, harmony, rhythm, and timbre. We will examine how particular characteristics of music signals impact and determine these techniques, and we highlight a number of novel music analysis and retrieval tasks that such processing makes possible. Our goal is to demonstrate that, to be successful, music audio signal processing techniques must be informed by a deep and thorough insight into the nature of music itself.

Chord recognition by fitting rescaled chroma vectors to chord templates
Laurent Oudre, Yves Grenier, Cédric Févotte
IEEE Transactions on Audio, Speech and Language Processing, September 2011.
```
@article{OGF:IEEE-11,
  author = {Oudre, Laurent and Grenier, Yves and F{\'e}votte, C{\'e}dric},
  title = {Chord recognition by fitting rescaled chroma vectors to chord templates},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  year = {2011},
  month = sep,
  volume = {19},
  number = {7},
  pages = {2222--2233}
}
```
Greedy sparse decompositions: a comparative study
Przemyslaw Dymarski, Nicolas Moreau, Gael Richard
EURASIP Journal on Advances in Signal Processing, 2011.
```
@article{dymarski:hal-02653068,
  author = {Dymarski, Przemyslaw and Moreau, Nicolas and Richard, Gael},
  hal_id = {hal-02653068},
  hal_version = {v1},
  journal = {{EURASIP Journal on Advances in Signal Processing}},
  keywords = {speech and audio coding ; greedy sparse decomposition ; matching pursuit ; orthogonal matching pursuit},
  pdf = {https://hal.archives-ouvertes.fr/hal-02653068/file/EURASIP_JASP2011.pdf},
  publisher = {{SpringerOpen}},
  title = {{Greedy sparse decompositions: a comparative study}},
  url = {https://hal.archives-ouvertes.fr/hal-02653068},
  year = {2011}
}
```
The purpose of this article is to present a comparative study of sparse greedy algorithms that were separately introduced in the speech and audio research communities. In particular, it is shown that the Matching Pursuit (MP) family of algorithms (MP, OMP, and OOMP) is equivalent to multi-stage gain-shape vector quantization algorithms previously designed for speech signal coding. These algorithms are comparatively evaluated, and their merits in terms of the trade-off between complexity and performance are discussed. The article concludes by introducing novel methods inspired by this unified view and by recent studies in sparse audio decomposition.
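As an illustration of the greedy principle discussed in this abstract, the following is a minimal sketch of basic Matching Pursuit, assuming a dictionary with unit-norm columns; the function name and the toy data are hypothetical. In the gain-shape reading mentioned above, the selected atom index plays the role of the shape and the correlation value that of the gain at each stage.

```python
import numpy as np


def matching_pursuit(x, D, n_atoms):
    """Basic MP: at each stage pick the atom most correlated with the residual.

    D is assumed to have unit-norm columns. Returns the coefficient vector
    and the final residual, so that D @ coeffs + residual == x.
    """
    residual = np.asarray(x, dtype=float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual                 # correlation of every atom with the residual
        k = int(np.argmax(np.abs(corr)))      # "shape": index of the best atom
        coeffs[k] += corr[k]                  # "gain": its projection coefficient
        residual -= corr[k] * D[:, k]         # remove the selected contribution
    return coeffs, residual


# Toy example with an orthonormal dictionary: two stages recover the signal exactly.
D = np.eye(4)
x = np.array([0.0, 3.0, 0.0, -2.0])
coeffs, residual = matching_pursuit(x, D, n_atoms=2)
```

OMP and OOMP differ from this sketch in how the gains of the already selected atoms are re-estimated at each stage; they are not implemented here.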
NMF with time-frequency activations to model non-stationary audio events
Romain Hennequin, Roland Badeau, Bertrand David
IEEE Transactions on Audio, Speech and Language Processing, 2011.
```
@article{hennequin:hal-00945201,
  author = {Hennequin, Romain and Badeau, Roland and David, Bertrand},
  hal_id = {hal-00945201},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  number = {4},
  pages = {744--753},
  pdf = {https://hal.inria.fr/hal-00945201/file/Hennequin-TASLP2011.pdf},
  publisher = {{IEEE}},
  title = {{NMF with time-frequency activations to model non-stationary audio events}},
  url = {https://hal.inria.fr/hal-00945201},
  volume = {19},
  year = {2011}
}
```
Real-world sounds often exhibit time-varying spectral shapes, as observed in the spectrogram of a harpsichord tone or that of a transition between two pronounced vowels. Whereas the standard Non-negative Matrix Factorization (NMF) assumes fixed spectral atoms, an extension is proposed where the temporal activations (coefficients of the decomposition on the spectral atom basis) become frequency dependent and follow a time-varying ARMA model. This extension can thus be interpreted with the help of a source/filter paradigm and is referred to as source/filter factorization. This factorization leads to an efficient single-atom decomposition for a single audio event with strong spectral variation (but with constant pitch). The new algorithm is tested on real audio data and shows promising results.
Beta-divergence as a subclass of Bregman divergence
Romain Hennequin, Bertrand David, Roland Badeau
IEEE Signal Processing Letters, 2011.
```
@article{hennequin:hal-00945202,
  author = {Hennequin, Romain and David, Bertrand and Badeau, Roland},
  hal_id = {hal-00945202},
  hal_version = {v1},
  journal = {{IEEE Signal Processing Letters}},
  number = {2},
  pages = {83--86},
  pdf = {https://hal.inria.fr/hal-00945202/file/Hennequin-SPL2011.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Beta-divergence as a subclass of Bregman divergence}},
  url = {https://hal.inria.fr/hal-00945202},
  volume = {18},
  year = {2011}
}
```
In this paper, we present a complete proof that the beta-divergence is a particular case of Bregman divergence. This little-known result makes it possible to straightforwardly apply theorems about Bregman divergences to beta-divergences. This is of interest for numerous applications since these divergences are widely used, for instance in non-negative matrix factorization (NMF).
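The result stated in this abstract can be checked in a few lines for the generic case. The sketch below uses the usual scalar definition of the beta-divergence for positive arguments and one possible generating function; the paper's own choice of generating function may differ by affine terms (which do not change the Bregman divergence), and the limit cases are not treated here.

```latex
% Beta-divergence for x, y > 0 and \beta \in \mathbb{R} \setminus \{0, 1\}
\[
d_\beta(x \mid y) = \frac{1}{\beta(\beta-1)}
  \left( x^{\beta} + (\beta-1)\, y^{\beta} - \beta\, x\, y^{\beta-1} \right)
\]

% Bregman divergence generated by a differentiable, strictly convex function \phi
\[
D_\phi(x \mid y) = \phi(x) - \phi(y) - \phi'(y)\,(x - y)
\]

% Assumed generating function: \phi(x) = x^{\beta} / (\beta(\beta-1)),
% so that \phi'(y) = y^{\beta-1} / (\beta-1) and \phi''(x) = x^{\beta-2} > 0 for x > 0.
\begin{align*}
D_\phi(x \mid y)
  &= \frac{x^{\beta} - y^{\beta}}{\beta(\beta-1)} - \frac{y^{\beta-1}\,(x - y)}{\beta-1} \\
  &= \frac{x^{\beta} - y^{\beta} - \beta\, x\, y^{\beta-1} + \beta\, y^{\beta}}{\beta(\beta-1)} \\
  &= \frac{x^{\beta} + (\beta-1)\, y^{\beta} - \beta\, x\, y^{\beta-1}}{\beta(\beta-1)}
   = d_\beta(x \mid y)
\end{align*}
```

The cases β = 0 (Itakura-Saito) and β = 1 (Kullback-Leibler) are usually obtained as limits and need a separate argument.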

2010

Conference Articles

Descripteurs visuels robustes pour l’identification de locuteurs dans des émissions televisées de talk-shows
Félicien Vallet, Slim Essid, Jean Carrive, Gaël Richard
Compression et Représentation des Signaux Audiovisuels (CORESA), Lyon, France, October 2010.
```
@inproceedings{felicien:hal-02943621,
  address = {Lyon, France},
  author = {Vallet, F{\'e}licien and Essid, Slim and Carrive, Jean and Richard, Ga{\"e}l},
  booktitle = {{Compression et Repr{\'e}sentation des Signaux Audiovisuels (CORESA)}},
  hal_id = {hal-02943621},
  hal_version = {v1},
  month = oct,
  title = {{Descripteurs visuels robustes pour l'identification de locuteurs dans des {\'e}missions televis{\'e}es de talk-shows}},
  url = {https://hal.telecom-paris.fr/hal-02943621},
  year = {2010}
}
```
A conditional random field viewpoint of symbolic audio-to-score matching
Cyril Joder, Slim Essid, Gael Richard
the international conference, Firenze, Italy, October 2010.
```
@inproceedings{joder:hal-02747590,
  address = {Firenze, Italy},
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  booktitle = {{the international conference}},
  doi = {10.1145/1873951.1874100},
  hal_id = {hal-02747590},
  hal_version = {v1},
  month = oct,
  pages = {871},
  publisher = {{ACM Press}},
  title = {{A conditional random field viewpoint of symbolic audio-to-score matching}},
  url = {https://hal.archives-ouvertes.fr/hal-02747590},
  year = {2010}
}
```
Approche hiérarchique pour un alignement musique-sur-partition efficace
Cyril Joder, Slim Essid, Gael Richard
Compression et Représentation des Signaux Audiovisuels (CORESA), Lyon, France, October 2010. Prix du meil....
```
@inproceedings{joder:hal-02943620,
  address = {Lyon, France},
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  booktitle = {{Compression et Repr{\'e}sentation des Signaux Audiovisuels (CORESA)}},
  hal_id = {hal-02943620},
  hal_version = {v1},
  month = oct,
  note = {Prix du meilleur papier de doctorant},
  title = {{Approche hi{\'e}rarchique pour un alignement musique-sur-partition efficace}},
  url = {https://hal.telecom-paris.fr/hal-02943620},
  year = {2010}
}
```
How sparsely can a signal be approximated while keeping its class identity?
Manuel Moussallam, Thomas Fillon, Gael Richard, Laurent Daudet
3rd international workshop, Firenze, Italy, October 2010.
```
@inproceedings{moussallam:hal-02747610,
  address = {Firenze, Italy},
  author = {Moussallam, Manuel and Fillon, Thomas and Richard, Gael and Daudet, Laurent},
  booktitle = {{3rd international workshop}},
  doi = {10.1145/1878003.1878012},
  hal_id = {hal-02747610},
  hal_version = {v1},
  month = oct,
  pages = {25},
  publisher = {{ACM Press}},
  title = {{How sparsely can a signal be approximated while keeping its class identity?}},
  url = {https://hal.archives-ouvertes.fr/hal-02747610},
  year = {2010}
}
```
Probabilistic framework for template-based chord recognition
Laurent Oudre, Cédric Févotte, Yves Grenier
IEEE International Workshop on Multimedia Signal Processing (MMSP), St Malo, France, October 2010.
```
@inproceedings{OFG:MMSP-10,
  author = {Oudre, Laurent and F{\'e}votte, C{\'e}dric and Grenier, Yves},
  title = {Probabilistic framework for template-based chord recognition},
  booktitle = {IEEE International Workshop on Multimedia Signal Processing (MMSP)},
  address = {St Malo, France},
  year = {2010},
  month = oct,
  pages = {183--187}
}
```
Robust visual features for the multimodal identification of unregistered speakers in TV talk-shows
Félicien Vallet, Slim Essid, Jean Carrive, Gael Richard
2010 17th IEEE International Conference on Image Processing (ICIP 2010), Hong Kong, China, September 2010.
```
@inproceedings{vallet:hal-02747558,
  address = {Hong Kong, China},
  author = {Vallet, F{\'e}licien and Essid, Slim and Carrive, Jean and Richard, Gael},
  booktitle = {{2010 17th IEEE International Conference on Image Processing (ICIP 2010)}},
  doi = {10.1109/ICIP.2010.5653393},
  hal_id = {hal-02747558},
  hal_version = {v1},
  month = sep,
  pages = {1469-1472},
  publisher = {{IEEE}},
  title = {{Robust visual features for the multimodal identification of unregistered speakers in TV talk-shows}},
  url = {https://hal.archives-ouvertes.fr/hal-02747558},
  year = {2010}
}
```
Robust frequency-based Audio Fingerprinting
Elsa Dupraz, Gael Richard
2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, Dallas, Texas, United States, March 2010.
```
@inproceedings{dupraz:hal-02747744,
  address = {Dallas, Texas, United States},
  author = {Dupraz, Elsa and Richard, Gael},
  booktitle = {{2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010}},
  doi = {10.1109/ICASSP.2010.5495944},
  hal_id = {hal-02747744},
  hal_version = {v1},
  month = mar,
  pages = {281-284},
  publisher = {{IEEE}},
  title = {{Robust frequency-based Audio Fingerprinting}},
  url = {https://hal.archives-ouvertes.fr/hal-02747744},
  year = {2010}
}
```
A comparative study of tonal acoustic features for a symbolic level music-to-score alignment
Cyril Joder, Slim Essid, Gael Richard
2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, Dallas, Texas, United States, March 2010.
```
@inproceedings{joder:hal-02747785,
  address = {Dallas, Texas, United States},
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  booktitle = {{2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010}},
  doi = {10.1109/ICASSP.2010.5495784},
  hal_id = {hal-02747785},
  hal_version = {v1},
  month = mar,
  pages = {409-412},
  publisher = {{IEEE}},
  title = {{A comparative study of tonal acoustic features for a symbolic level music-to-score alignment}},
  url = {https://hal.archives-ouvertes.fr/hal-02747785},
  year = {2010}
}
```
A MULTIMODAL APPROACH TO INITIALISATION FOR TOP-DOWN SPEAKER DIARIZATION OF TELEVISION SHOWS
Simon Bozonnet, Félicien Vallet, Nicholas Evans, Slim Essid, Gael Richard, Jean Carrive
Eusipco, Aalborg, Denmark, 2010.
```
@inproceedings{bozonnet:hal-02747730,
  address = {Aalborg, Denmark},
  author = {Bozonnet, Simon and Vallet, F{\'e}licien and Evans, Nicholas and Essid, Slim and Richard, Gael and Carrive, Jean},
  booktitle = {{Eusipco}},
  hal_id = {hal-02747730},
  hal_version = {v1},
  title = {{A MULTIMODAL APPROACH TO INITIALISATION FOR TOP-DOWN SPEAKER DIARIZATION OF TELEVISION SHOWS}},
  url = {https://hal.archives-ouvertes.fr/hal-02747730},
  year = {2010}
}
```
This paper presents a new multimodal approach to speaker diarization of TV show data. We hypothesize that the intra-speaker variation in visual information might be less than that in the corresponding acoustic information and therefore might be better suited to the task of speaker model initialisation. This is an acknowledged weakness of the computationally efficient top-down approach to speaker diarization that is used here. Experimental results show that a recently proposed approach to purification and the new multimodal approach to initialisation together deliver 22% and 17% relative improvements in diarization performance over the baseline system on independent development and evaluation datasets respectively.
Time-dependent parametric and harmonic templates in non-negative matrix factorization
Romain Hennequin, Roland Badeau, Bertrand David
Proc. of the 13th International Conference on Digital Audio Effects (DAFx), Graz, Austria, 2010.
```
@inproceedings{hennequin:hal-00945292,
  address = {Graz, Austria},
  author = {Hennequin, Romain and Badeau, Roland and David, Bertrand},
  booktitle = {{Proc. of the 13th International Conference on Digital Audio Effects (DAFx)}},
  hal_id = {hal-00945292},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945292/file/hennequin_dafx10.pdf},
  title = {{Time-dependent parametric and harmonic templates in non-negative matrix factorization}},
  url = {https://hal.inria.fr/hal-00945292},
  year = {2010}
}
```
This paper presents a new method, derived from Non-negative Matrix Factorization (NMF), to decompose musical spectrograms. This method uses time-varying harmonic templates (atoms) which are parametric: these atoms correspond to musical notes. Templates are synthesized from the values of the parameters, which are learnt in an NMF framework. This parameterization makes it possible to model accurately some musical effects (such as vibrato) which are inaccurately modeled by NMF.
NMF with time-frequency activations to model non-stationary audio events
Romain Hennequin, Roland Badeau, Bertrand David
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, United States, 2010.
```
@inproceedings{hennequin:hal-00945293,
  address = {Dallas, Texas, United States},
  author = {Hennequin, Romain and Badeau, Roland and David, Bertrand},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945293},
  hal_version = {v1},
  pages = {445--448},
  pdf = {https://hal.inria.fr/hal-00945293/file/Hennequin-ICASSP2010.pdf},
  title = {{NMF with time-frequency activations to model non-stationary audio events}},
  url = {https://hal.inria.fr/hal-00945293},
  year = {2010}
}
```
In this article we propose an extension of non-negative matrix factorization (NMF), based on an underlying source/filter model, which yields a good decomposition of audio signals that contain redundant sounds with strong spectral variations. In particular, our algorithm makes it possible to use only one atom per musical event (such as a note) for such non-stationary sounds, whereas NMF needs several atoms to represent a non-stationary sound. We have tested our algorithm on real audio data and shown how the proposed decomposition provides a smarter representation of audio spectrograms than standard NMF.
AN IMPROVED HIERARCHICAL APPROACH FOR MUSIC-TO-SYMBOLIC SCORE ALIGNMENT
Cyril Joder, Slim Essid, Gael Richard
ISMIR, Utrecht, Netherlands, 2010.
```
@inproceedings{joder:hal-02747659,
  address = {Utrecht, Netherlands},
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  booktitle = {{ISMIR}},
  hal_id = {hal-02747659},
  hal_version = {v1},
  title = {{AN IMPROVED HIERARCHICAL APPROACH FOR MUSIC-TO-SYMBOLIC SCORE ALIGNMENT}},
  url = {https://hal.archives-ouvertes.fr/hal-02747659},
  year = {2010}
}
```
We present an efficient approach for an off-line alignment of a symbolic score to a recording of the same piece, using a statistical model. A hidden state model is built from the score, which allows for the use of two different kinds of features, namely chroma vectors and an onset detection function (spectral flux) with specific production models, in a simple manner. We propose a hierarchical pruning method for an approximate decoding of this statistical model. This strategy reduces the search space in an adaptive way, yielding a better overall efficiency than the tested state-of-the-art method. Experiments run on a large database of 94 pop songs show that the resulting system obtains higher recognition rates than the dynamic programming algorithm (DTW), with a significantly lower complexity, even though the rhythmic information is not used for the alignment.
Robust similarity metrics between audio signals based on asymmetrical spectral envelope matching
Mathieu Lagrange, Roland Badeau, Gael Richard
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, United States, 2010.
```
@inproceedings{lagrange:hal-00945296,
  address = {Dallas, Texas, United States},
  author = {Lagrange, Mathieu and Badeau, Roland and Richard, Gael},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945296},
  hal_version = {v1},
  pages = {405--408},
  pdf = {https://hal.inria.fr/hal-00945296/file/Lagrange-ICASSP2010.pdf},
  title = {{Robust similarity metrics between audio signals based on asymmetrical spectral envelope matching}},
  url = {https://hal.inria.fr/hal-00945296},
  year = {2010}
}
```
In this paper, a new type of metric that defines the similarity between musical audio signals is proposed. Based on the spectral flatness criterion, these metrics achieve low computational cost and low sensitivity to acoustical degradations. Validation is performed by studying the ability of the proposed metrics to determine whether two audio signals have been played by the same musical instrument. For this task, the proposed metrics are shown to outperform metrics based on the comparison of standard spectral features, especially when the query and the recordings in the database have different acoustical properties.
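For readers unfamiliar with the spectral flatness criterion mentioned in this abstract, the sketch below computes the standard flatness measure (geometric mean over arithmetic mean of a power spectrum). The second function, which applies flatness to the ratio of two spectra, is only an assumed illustration of how flatness can act as an asymmetric similarity score; it is not the metric defined in the paper, and both function names are hypothetical.

```python
import numpy as np


def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean divided by arithmetic mean: close to 1 for flat spectra, close to 0 for peaky ones."""
    p = np.asarray(power_spectrum, dtype=float) + eps
    return float(np.exp(np.mean(np.log(p))) / np.mean(p))


def ratio_flatness(p1, p2, eps=1e-12):
    """Hypothetical similarity score: flatness of the bin-wise ratio p1 / p2.

    If p1 is a scaled copy of p2 the ratio is constant and the score is close to 1.
    The score is asymmetric: swapping p1 and p2 generally changes it.
    """
    p1 = np.asarray(p1, dtype=float) + eps
    p2 = np.asarray(p2, dtype=float) + eps
    return spectral_flatness(p1 / p2, eps)


flat = spectral_flatness(np.ones(8))                   # flat spectrum
peaky = spectral_flatness([1.0] + [0.0] * 7)           # single spectral peak
same_shape = ratio_flatness([1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 9.0, 12.0])
```

A score based on a ratio of spectra is insensitive to an overall gain difference between the two signals, which is one simple way a flatness-based comparison can tolerate some acoustical variation.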
YAAFE, AN EASY TO USE AND EFFICIENT AUDIO FEATURE EXTRACTION SOFTWARE
Benoît Mathieu, Slim Essid, Thomas Fillon, Jacques Prado, Gael Richard
ISMIR, Utrecht, Netherlands, 2010.
```
@inproceedings{mathieu:hal-02747689,
  address = {Utrecht, Netherlands},
  author = {Mathieu, Beno{\^i}t and Essid, Slim and Fillon, Thomas and Prado, Jacques and Richard, Gael},
  booktitle = {{ISMIR}},
  hal_id = {hal-02747689},
  hal_version = {v1},
  title = {{YAAFE, AN EASY TO USE AND EFFICIENT AUDIO FEATURE EXTRACTION SOFTWARE}},
  url = {https://hal.archives-ouvertes.fr/hal-02747689},
  year = {2010}
}
```
Music Information Retrieval systems are commonly built on a feature extraction stage. For applications involving automatic classification (e.g. speech/music discrimination, music genre or mood recognition, ...), traditional approaches will consider a large set of audio features to be extracted on a large dataset. In some cases, this will lead to computationally intensive systems and there is, therefore, a strong need for efficient feature extraction. In this paper, a new audio feature extraction software, YAAFE, is presented and compared to widely used libraries. The main advantage of YAAFE is a significantly lower complexity due to the appropriate exploitation of redundancy in the feature calculation. YAAFE remains easy to configure and each feature can be parameterized independently. Finally, the YAAFE framework and most of its core feature library are released in source code under the GNU Lesser General Public License.

Patents

Method and device for forming a digital audio mixed signal, method and device for separating signals, and corresponding signal
Laurent Girin, Antoine Liutkus, Gael Richard, Roland Badeau
France, October 2010.

@patent{girin:hal-02651076,
  address = {France},
  author = {Girin, Laurent and Liutkus, Antoine and Richard, Gael and Badeau, Roland},
  hal_id = {hal-02651076},
  hal_version = {v1},
  month = oct,
  number = {US20140037110A1},
  title = {{Method and device for forming a digital audio mixed signal, method and device for separating signals, and corresponding signal}},
  url = {https://hal.telecom-paris.fr/hal-02651076},
  year = {2010}
}

Journal Articles

Source/Filter Model for Unsupervised Main Melody Extraction From Polyphonic Audio Signals
Jean-Louis Durrieu, Gael Richard, Bertrand David, Cédric Févotte
IEEE Transactions on Audio, Speech and Language Processing, March 2010.
```
@article{durrieu:hal-02652995,
  author = {Durrieu, Jean-Louis and Richard, Gael and David, Bertrand and F{\'e}votte, C{\'e}dric},
  hal_id = {hal-02652995},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  keywords = {Music ; Source/Filter Model ; Main Melody Extraction ; Blind Audio Source Separation ; Spectral Analysis ; Maximum Likelihood ; Expectation-Maximization (EM) algorithm ; Gaussian Scaled Mixture Model (GSMM) ; Non-negative Matrix Factorization (NMF)},
  month = mar,
  pdf = {https://hal.archives-ouvertes.fr/hal-02652995/file/TSALP_Durrieu10.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Source/Filter Model for Unsupervised Main Melody Extraction From Polyphonic Audio Signals}},
  url = {https://hal.archives-ouvertes.fr/hal-02652995},
  year = {2010}
}
```
Extracting the main melody from a polyphonic music recording seems natural even to untrained human listeners. To a certain extent it is related to the concept of source separation, with the human ability of focusing on a specific source in order to extract relevant information. In this article, we propose a new approach for the estimation and extraction of the main melody (and in particular the leading vocal part) from polyphonic audio signals. To that aim, we propose a new signal model where the leading vocal part is explicitly represented by a specific source/filter model. The proposed representation is investigated in the framework of two statistical models: a Gaussian Scaled Mixture Model (GSMM) and an extended Instantaneous Mixture Model (IMM). For both models, the estimation of the different parameters is done within a maximum likelihood framework adapted from single-channel source separation techniques. The desired sequence of fundamental frequencies is then inferred from the estimated parameters. The results obtained in a recent evaluation campaign (MIREX08) show that the proposed approaches are very promising and reach state-of-the-art performances on all test sets.
Audio signal representations for indexing in the transform domain
Emmanuel Ravelli, Gael Richard, Laurent Daudet
IEEE Transactions on Audio, Speech and Language Processing, March 2010.
```
@article{ravelli:hal-02652798,
  author = {Ravelli, Emmanuel and Richard, Gael and Daudet, Laurent},
  hal_id = {hal-02652798},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  month = mar,
  pdf = {https://hal.archives-ouvertes.fr/hal-02652798/file/TSALP_ravelli10.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Audio signal representations for indexing in the transform domain}},
  url = {https://hal.archives-ouvertes.fr/hal-02652798},
  year = {2010}
}
```
Indexing audio signals directly in the transform domain can potentially save a significant amount of computation when working on a large database of signals stored in a lossy compression format, without having to fully decode the signals. Here, we show that the representations used in standard transform-based audio codecs (e.g. MDCT for AAC, or hybrid PQF/MDCT for MP3) have a sufficient time resolution for some rhythmic features, but a poor frequency resolution, which prevents their use in tonality-related applications. Alternatively, a recently developed audio codec based on a sparse multi-scale MDCT transform has a good resolution both for time- and frequency-domain features. We show that this new audio codec allows efficient transform-domain audio indexing for 3 different applications, namely beat tracking, chord recognition and musical genre classification. We compare results obtained with this new audio codec and the two standard MP3 and AAC codecs, in terms of performance and computation time.
Explicit Modeling of Temporal Dynamics within Musical Signals for Acoustical Unit Formation and Similarity
Mathieu Lagrange, Martin Raspaud, Roland Badeau, Gael Richard
Pattern Recognition Letters, 2010.
```
@article{lagrange:hal-00945198,
  author = {Lagrange, Mathieu and Raspaud, Martin and Badeau, Roland and Richard, Gael},
  hal_id = {hal-00945198},
  hal_version = {v1},
  journal = {{Pattern Recognition Letters}},
  number = {12},
  pages = {1498--1506},
  pdf = {https://hal.inria.fr/hal-00945198/file/prnsa-10.pdf},
  publisher = {{Elsevier}},
  title = {{Explicit Modeling of Temporal Dynamics within Musical Signals for Acoustical Unit Formation and Similarity}},
  url = {https://hal.inria.fr/hal-00945198},
  volume = {31},
  year = {2010}
}
```
Timbre is a major cue for the human auditory system to recognize musical sounds. Timbral patterns can be decomposed into the widely used spectral envelope and the less considered temporal dynamics. In this paper, we present new temporal dynamics similarity measures, which prove valuable for the recognition of timbral patterns. These similarity measures are evaluated, first alone, then in conjunction with spectral envelope similarity measures, for both single tones and solo recordings. The results show that the new temporal dynamics features significantly improve timbral pattern recognition.

2005 - 2009 [87 publications]

2009

Conference Articles

Fast Bayesian constrained NMF for polyphonic pitch transcription
Nancy Bertin, Emmanuel Vincent, Roland Badeau
Music Information Retrieval Evaluation eXchange (MIREX). International Society for Music Information Retrieval., Kobe, Japan, October 2009. Article acco....

@inproceedings{bertin:inria-00601353,
  address = {Kobe, Japan},
  author = {Bertin, Nancy and Vincent, Emmanuel and Badeau, Roland},
  booktitle = {{Music Information Retrieval Evaluation eXchange (MIREX). International Society for Music Information Retrieval.}},
  hal_id = {inria-00601353},
  hal_version = {v1},
  month = oct,
  note = {Article accompanying the participation in an international evaluation.},
  title = {{Fast Bayesian constrained NMF for polyphonic pitch transcription}},
  url = {https://hal.inria.fr/inria-00601353},
  year = {2009}
}

Fast Bayesian NMF algorithms enforcing harmonicity and temporal continuity in polyphonic music transcription
Nancy Bertin, Emmanuel Vincent, Roland Badeau
WASPAA, New Paltz, United States, October 2009.

@inproceedings{bertin:inria-00601355,
  address = {New Paltz, United States},
  author = {Bertin, Nancy and Vincent, Emmanuel and Badeau, Roland},
  booktitle = {{WASPAA}},
  hal_id = {inria-00601355},
  hal_version = {v1},
  month = oct,
  pages = {29--32},
  title = {{Fast Bayesian NMF algorithms enforcing harmonicity and temporal continuity in polyphonic music transcription}},
  url = {https://hal.inria.fr/inria-00601355},
  year = {2009}
}

Template-based chord recognition : influence of the chord types
Laurent Oudre, Yves Grenier, Cédric Févotte
International Symposium on Music Information Retrieval (ISMIR), Kobe, Japan, October 2009.

@inproceedings{OGF:ISMIR-09,
  author = {Oudre, Laurent and Grenier, Yves and F{\'e}votte, C{\'e}dric},
  title = {Template-based chord recognition : influence of the chord types},
  booktitle = {International Symposium on Music Information Retrieval (ISMIR)},
  address = {Kobe, Japan},
  year = {2009},
  month = oct,
  pages = {153--158}
}

Chord recognition using measures of fit, chord templates and filtering methods
Laurent Oudre, Yves Grenier, Cédric Févotte
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New York, USA, October 2009.

@inproceedings{OGF:WASPAA-09,
  author = {Oudre, Laurent and Grenier, Yves and F{\'e}votte, C{\'e}dric},
  title = {Chord recognition using measures of fit, chord templates and filtering methods},
  booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  address = {New York, USA},
  year = {2009},
  month = oct,
  pages = {9--12}
}

Interactive Segmentation of Electro-Acoustic Music
Sébastien Gulluni, Slim Essid, Olivier Buisson, Gael Richard
2nd International Workshop on Machine Learning and Music (MML - ECML - PKDD), Bled, Slovenia, September 2009.

@inproceedings{gulluni:hal-02943665,
  address = {Bled, Slovenia},
  author = {Gulluni, S{\'e}bastien and Essid, Slim and Buisson, Olivier and Richard, Gael},
  booktitle = {{2nd International Workshop on Machine Learning and Music (MML - ECML - PKDD)}},
  hal_id = {hal-02943665},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.telecom-paris.fr/hal-02943665/file/SG_MML-09.pdf},
  title = {{Interactive Segmentation of Electro-Acoustic Music}},
  url = {https://hal.telecom-paris.fr/hal-02943665},
  year = {2009}
}

Étude des descripteurs acoustiques pour l’alignement temporel audio-sur-partition musicale
Cyril Joder, Slim Essid, Gaël Richard
GRETSI, Dijon, France, September 2009.
```
@inproceedings{joder:hal-02943624,
  address = {Dijon, France},
  author = {Joder, Cyril and Essid, Slim and Richard, Ga{\"e}l},
  booktitle = {{GRETSI}},
  hal_id = {hal-02943624},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.telecom-paris.fr/hal-02943624/file/CJ_GRETSI-09.pdf},
  title = {{{\'E}tude des descripteurs acoustiques pour l'alignement temporel audio-sur-partition musicale}},
  url = {https://hal.telecom-paris.fr/hal-02943624},
  year = {2009}
}
```
In this paper, we review the acoustic features used for music-to-score alignment and study their influence on performance in a challenging alignment task, where the audio data can be polyphonic and may contain percussion. Different state-of-the-art features are tested in a formalized framework, with two alignment strategies. Results show that the most efficient features are those which take into account a perceptual aspect (logarithmic frequency scale). These are found to be more robust, especially in the polyphonic case. Moreover, the alignment strategy which uses a synthesis of the musical score obtains better results than theoretical models, though it is computationally more expensive.

Incorporating prior knowledge on the digital media creation process into audio classifiers
M. Lardeur, Slim Essid, G. Richard, M. Haller, T. Sikora
ICASSP 2009 - 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, April 2009.

@inproceedings{lardeur:hal-02943669,
  address = {Taipei, Taiwan},
  author = {Lardeur, M. and Essid, Slim and Richard, G. and Haller, M. and Sikora, T.},
  booktitle = {{ICASSP 2009 - 2009 IEEE International Conference on Acoustics, Speech and Signal Processing}},
  doi = {10.1109/ICASSP.2009.4959918},
  hal_id = {hal-02943669},
  hal_version = {v1},
  month = apr,
  pages = {1653--1656},
  publisher = {{IEEE}},
  title = {{Incorporating prior knowledge on the digital media creation process into audio classifiers}},
  url = {https://hal.telecom-paris.fr/hal-02943669},
  year = {2009}
}

A tempering approach for Itakura-Saito non-negative matrix factorization. With application to music transcription
Nancy Bertin, Cédric Févotte, Roland Badeau
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, 2009.
```
@inproceedings{bertin:hal-00945283,
  address = {Taipei, Taiwan},
  author = {Bertin, Nancy and F{\'e}votte, C{\'e}dric and Badeau, Roland},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945283},
  hal_version = {v1},
  pages = {1545--1548},
  pdf = {https://hal.inria.fr/hal-00945283/file/Bertin-ICASSP2009.pdf},
  title = {{A tempering approach for Itakura-Saito non-negative matrix factorization. With application to music transcription}},
  url = {https://hal.inria.fr/hal-00945283},
  year = {2009}
}
```
In this paper we are interested in non-negative matrix factorization (NMF) with the Itakura-Saito (IS) divergence. Previous work has demonstrated the relevance of this cost function for the decomposition of audio power spectrograms. This is in particular due to its scale invariance, which makes it more robust to the wide dynamics of audio, a property which is not shared by other popular costs such as the Euclidean distance or the generalized Kullback-Leibler (KL) divergence. However, while the latter two cost functions are convex, the IS divergence is not, which makes it more prone to convergence to irrelevant local minima, as observed empirically. Thus, the aim of this paper is to propose a tempering scheme that favors convergence of IS-NMF to global minima. Our algorithm is based on NMF with the beta-divergence, where the shape parameter beta acts as a temperature parameter. Results on both synthetic and music data (in a transcription context) show the relevance of our approach.
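The scale invariance mentioned in the abstract above can be checked numerically with the usual scalar definition of the beta-divergence, which reduces to the IS divergence for beta = 0, the generalized KL divergence for beta = 1, and half the squared Euclidean distance for beta = 2. This is a minimal sketch under that textbook definition; the function name `beta_divergence` is an assumption, and the tempering scheme of the paper is not implemented here.

```python
import math


def beta_divergence(x, y, beta):
    """Scalar beta-divergence d_beta(x | y) for x, y > 0."""
    if beta == 0:
        # Itakura-Saito divergence.
        ratio = x / y
        return ratio - math.log(ratio) - 1.0
    if beta == 1:
        # Generalized Kullback-Leibler divergence.
        return x * math.log(x / y) - x + y
    # General case; beta = 2 gives 0.5 * (x - y) ** 2.
    return (x ** beta + (beta - 1) * y ** beta
            - beta * x * y ** (beta - 1)) / (beta * (beta - 1))


# Rescaling both arguments by the same factor multiplies the divergence
# by scale ** beta, so only beta = 0 (IS) is unaffected.
scale = 100.0
is_small = beta_divergence(2.0, 3.0, 0)
is_large = beta_divergence(scale * 2.0, scale * 3.0, 0)
kl_small = beta_divergence(2.0, 3.0, 1)
kl_large = beta_divergence(scale * 2.0, scale * 3.0, 1)
```

Here `is_small` and `is_large` agree, whereas `kl_large` is `scale` times `kl_small`, which illustrates why low-energy spectrogram coefficients weigh as much as high-energy ones under the IS cost.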

Journal Articles

Temporal Integration for Audio Classification With Application to Musical Instrument Classification
Cyril Joder, Slim Essid, Gael Richard
IEEE Transactions on Audio, Speech and Language Processing, January 2009.
```
@article{joder:hal-02652782,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  doi = {10.1109/TASL.2008.2007613},
  hal_id = {hal-02652782},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  month = jan,
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Temporal Integration for Audio Classification With Application to Musical Instrument Classification}},
  url = {https://hal.inria.fr/hal-02652782},
  volume = {17},
  year = {2009}
}
```
Nowadays, it appears essential to design automatic indexing tools which provide meaningful and efficient means to describe the musical audio content. There is in fact a growing interest in music information retrieval (MIR) applications, amongst which the most popular are related to music similarity retrieval, artist identification, and musical genre or instrument recognition. Current MIR-related classification systems usually do not take into account the mid-term temporal properties of the signal (over several frames) and rely on the assumption that the observations of the features in different frames are statistically independent. The aim of this paper is to demonstrate the usefulness of the information carried by the evolution of these characteristics over time. To that end, we propose a number of methods for early and late temporal integration and provide an in-depth experimental study of their value for the task of musical instrument recognition on solo musical phrases. In particular, the impact of the time horizon over which the temporal integration is performed is assessed both for fixed and variable frame length analysis. Also, a number of recently proposed alignment kernels are used for late temporal integration. For all experiments, the results are compared to a state-of-the-art musical instrument recognition system. Index Terms: alignment kernels, audio classification, music information retrieval (MIR), musical instrument recognition, support vector machine (SVM), temporal feature integration.
Sympathetic string modes in the concert harp
Jean-Loic Le Carrou, François Gautier, Roland Badeau
Acta Acustica united with Acustica, 2009.
```
@article{lecarrou:hal-00945199,
  author = {Le Carrou, Jean-Loic and Gautier, Fran{\c c}ois and Badeau, Roland},
  hal_id = {hal-00945199},
  hal_version = {v1},
  journal = {{Acta Acustica united with Acustica}},
  number = {4},
  pages = {744--752},
  pdf = {https://hal.inria.fr/hal-00945199/file/Sympathie_Modes_LeCarrou_Gautier_Badeau.pdf},
  publisher = {{Hirzel Verlag}},
  title = {{Sympathetic string modes in the concert harp}},
  url = {https://hal.inria.fr/hal-00945199},
  volume = {95},
  year = {2009}
}
```
The concert harp is composed of a soundboard, a cavity with sound holes and 47 strings. When one string is plucked, other strings are excited and induce a characteristic ‘halo of sound’. This phenomenon, called sympathetic vibration, is due to a coupling between strings via the instrument’s body. These sympathetic modes generate multiple spectral components in each partial of the tone. The resolution of Fourier analysis does not permit their identification. A high-resolution method, called ESPRIT, is used to separate the spectral components which are very close to one another. Some of the measured spectral components in the analysed partials correspond to the response of sympathetic modes. The eigenfrequencies and mode shapes of these modes are investigated using a suitable model of the instrument: this model is based on a waveguide approach in which bending and longitudinal motions of 35 strings connected to an equivalent beam representing the soundboard are described. Identified experimental sympathetic modes are very well captured by the model.

Technical Reports

Supporting document for the paper “Stability analysis of multiplicative update algorithms and application to non-negative matrix factorization”
Roland Badeau, Nancy Bertin, Emmanuel Vincent
2009.

@techreport{badeau:inria-00601349,
  author = {Badeau, Roland and Bertin, Nancy and Vincent, Emmanuel},
  hal_id = {inria-00601349},
  hal_version = {v1},
  title = {{Supporting document for the paper ''Stability analysis of multiplicative update algorithms and application to non-negative matrix factorization''}},
  type = {Technical Report},
  url = {https://hal.inria.fr/inria-00601349},
  year = {2009}
}

Adaptive harmonic spectral decomposition for multiple pitch estimation
Emmanuel Vincent, Nancy Bertin, Roland Badeau
2009. This technic....
```
@techreport{vincent:inria-00350163,
  author = {Vincent, Emmanuel and Bertin, Nancy and Badeau, Roland},
  hal_id = {inria-00350163},
  hal_version = {v3},
  keywords = {Multiple pitch estimation ; adaptive representation ; nonnegative matrix factorization ; harmonicity ; spectral smoothness},
  note = {This technical report is deprecated. Please refer to the following article instead: http://hal.inria.fr/inria-00544094/},
  number = {PI 1919},
  pages = {15},
  pdf = {https://hal.inria.fr/inria-00350163v3/file/techreport_warning.pdf},
  title = {{Adaptive harmonic spectral decomposition for multiple pitch estimation}},
  type = {Research Report},
  url = {https://hal.inria.fr/inria-00350163},
  year = {2009}
}
```
Multiple pitch estimation consists of inferring the fundamental frequencies and the salience of the notes forming a music signal over short time frames. This mid-level representation can be exploited as a front-end for higher-level applications, such as music-to-score transcription or chord detection. One approach is to decompose the short-term magnitude spectrum of the signal into a sum of basis spectra representing individual pitches scaled by time-varying amplitudes, using algorithms such as nonnegative matrix factorization (NMF). Prior training of the basis spectra is often infeasible due to the wide range of possible instruments. Appropriate spectra must then be estimated from the observed data, which may result in limited performance due to inaccurately estimated spectra. In this article, we model each basis spectrum as a weighted sum of narrowband spectra representing a few adjacent harmonic partials, thus enforcing harmonicity and spectral smoothness while adapting the spectral envelope to each instrument. We derive an NMF-like algorithm to estimate the model parameters and evaluate it on a database of piano recordings, considering several choices for the narrowband spectra. Performance appears superior to unconstrained adaptive NMF and competitive with supervised NMF based on pre-trained piano spectra. We also apply our approach to woodwind data.

2008

Journal Articles

Union of MDCT Bases for Audio Coding
Emmanuel Ravelli, Gael Richard, Laurent Daudet
IEEE Transactions on Audio, Speech and Language Processing, November 2008.
```
@article{ravelli:hal-02652697,
  author = {Ravelli, Emmanuel and Richard, Gael and Daudet, Laurent},
  doi = {10.1109/TASL.2008.2004290},
  hal_id = {hal-02652697},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  keywords = {Audio coding ; matching pursuit ; scalable coding ; signal representations ; sparse representations},
  month = nov,
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Union of MDCT Bases for Audio Coding}},
  url = {https://hal.inria.fr/hal-02652697},
  volume = {16},
  year = {2008}
}
```
This paper investigates the use of sparse overcomplete decompositions for audio coding. Audio signals are decomposed over a redundant union of modified discrete cosine transform (MDCT) bases having eight different scales. This approach produces a sparser decomposition than the traditional MDCT-based orthogonal transform and allows better coding efficiency at low bitrates. Contrary to state-of-the-art low bitrate coders, which are based on pure parametric or hybrid representations, our approach is able to provide transparency. Moreover, we use a bitplane encoding approach, which provides a fine-grain scalable coder that can seamlessly operate from very low bitrates up to transparency. Objective evaluation, as well as listening tests, show that the performance of our coder is significantly better than a state-of-the-art transform coder at very low bitrates and has similar performance at high bitrates. We provide a link to test soundfiles and source code to allow better evaluation and reproducibility of the results.
A general framework for second order blind separation of stationary colored sources
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier, Yingbo Hua
Signal Processing, September 2008.
```
@article{aissaelbey:hal-01771209,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves and Hua, Yingbo},
  doi = {10.1016/j.sigpro.2008.03.017},
  hal_id = {hal-01771209},
  hal_local_reference = {3517},
  hal_version = {v1},
  journal = {{Signal Processing}},
  keywords = {Blind source separation},
  month = sep,
  number = {9},
  pages = {2123--2137},
  pdf = {https://hal.archives-ouvertes.fr/hal-01771209/file/TrSP-SOS-Identifiability_C1.pdf},
  publisher = {{Elsevier}},
  title = {{A general framework for second order blind separation of stationary colored sources}},
  url = {https://hal.archives-ouvertes.fr/hal-01771209},
  volume = {88},
  year = {2008}
}
```
This paper focuses on the blind separation of stationary colored sources using the second order statistics of their instantaneous mixtures. We start by presenting a brief overview of existing contributions in that field. Then, we present necessary and sufficient conditions for identifiability and partial identifiability using a finite set of correlation matrices. These conditions depend on the autocorrelation function of the unknown sources. However, it is shown here that they can be tested directly from the observations through the decorrelator output. This issue is of prime importance to decide whether the sources have been well separated or whether further processing is needed. We then propose an identifiability test based on a resampling (jackknife) technique, which is validated by simulation results. Secondly, we present an iterative blind source separation method using second order statistics (SOS) and a natural gradient technique. This algorithm has a number of attractive properties, including its simplicity and ‘easy’ generalization to adaptive or convolutive schemes. An asymptotic performance analysis of this method is performed. Several numerical simulations are presented to assess the theoretical results with respect to the ‘separability’ testing, to demonstrate the effectiveness of the gradient-type decorrelation method and to validate the theoretical expression of the asymptotic performance index.

Fear-type emotion recognition for future audio-based surveillance systems
C. Clavel, I. Vasilescu, L. Devillers, Gael Richard, T. Ehrette
Speech Communication, May 2008.

@article{clavel:hal-00499211,
  author = {Clavel, C. and Vasilescu, I. and Devillers, L. and Richard, Gael and Ehrette, T.},
  doi = {10.1016/j.specom.2008.03.012},
  hal_id = {hal-00499211},
  hal_version = {v1},
  journal = {{Speech Communication}},
  keywords = {Physical Sciences},
  month = may,
  number = {6},
  pages = {487},
  pdf = {https://hal.archives-ouvertes.fr/hal-00499211/file/PEER_stage2_10.1016%252Fj.specom.2008.03.012.pdf},
  publisher = {{Elsevier : North-Holland}},
  title = {{Fear-type emotion recognition for future audio-based surveillance systems}},
  url = {https://hal.archives-ouvertes.fr/hal-00499211},
  volume = {50},
  year = {2008}
}

Transcription and Separation of Drum Signals From Polyphonic Music
Olivier Gillet, Gael Richard
IEEE Transactions on Audio, Speech and Language Processing, March 2008.
```
@article{gillet:hal-02652666,
  author = {Gillet, Olivier and Richard, Gael},
  doi = {10.1109/TASL.2007.914120},
  hal_id = {hal-02652666},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  keywords = {Drum signals ; feature selection ; harmonic/noise decomposition ; music transcription ; source separation ; support vector machine (SVM) ; Wiener filtering},
  month = mar,
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Transcription and Separation of Drum Signals From Polyphonic Music}},
  url = {https://hal.inria.fr/hal-02652666},
  year = {2008}
}
```
The purpose of this article is to present new advances in music transcription and source separation with a focus on drum signals. A complete drum transcription system is described, which combines information from the original music signal and a drum track enhanced version obtained by source separation. In addition to efficient fusion strategies to take into account these two complementary sources of information, the transcription system integrates a large set of features, optimally selected by feature selection. Concurrently, the problem of drum track extraction from polyphonic music is tackled both by proposing a novel approach based on harmonic/noise decomposition and time/frequency masking and by improving an existing Wiener filtering-based separation method. The separation and transcription techniques presented are thoroughly evaluated on a large public database of music signals. A transcription accuracy between 64.5% and 80.3% is obtained, depending on the drum instrument, for well-balanced mixes, and the efficiency of our drum separation algorithms is illustrated in a comprehensive benchmark.
Estimation of Frequency for AM/FM Models Using the Phase Vocoder Framework
Michaël Betser, Patrice Collen, Gael Richard, Bertrand T. David
IEEE Transactions on Signal Processing, February 2008.
```
@article{betser:hal-02652652,
  author = {Betser, Micha{\"e}l and Collen, Patrice and Richard, Gael and David, Bertrand T.},
  doi = {10.1109/TSP.2007.906768},
  hal_id = {hal-02652652},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  month = feb,
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Estimation of Frequency for AM/FM Models Using the Phase Vocoder Framework}},
  url = {https://hal.inria.fr/hal-02652652},
  volume = {56},
  year = {2008}
}
```
This paper proposes an extension of the applicability of phase-vocoder-based frequency estimators for generalized sinusoidal models, which include phase and amplitude modulations. A first approach, called phase corrected vocoder (PCV), takes into account the modification of the Fourier phases resulting from these modulations. Another approach is based on an adaptation of the principles of the time-frequency reassignment and is referred to as the reassigned vocoder (RV). The robustness of the estimation against noise is studied, both theoretically and experimentally, and the performance is assessed in comparison with two state-of-the-art algorithms: an unmodified version of the reassignment method and a quadratically interpolated fast Fourier transform method (QIFFT). Index Terms: AM/FM model, frequency estimation, phase vocoder.
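For context, the classical phase vocoder frequency estimator that this line of work starts from can be sketched as follows, assuming a stationary sinusoid (no amplitude or frequency modulation). It compares the phase of one DFT bin across two frames separated by a known hop. The function names, the rectangular window and the complex test signal are assumptions chosen to keep the example short; this is not the PCV or RV estimator of the paper.

```python
import cmath
import math


def dft_bin(frame, k):
    """Single DFT coefficient of `frame` at bin index k (rectangular window)."""
    n = len(frame)
    return sum(s * cmath.exp(-2j * math.pi * k * i / n)
               for i, s in enumerate(frame))


def phase_vocoder_frequency(signal, start, n, hop, k, sample_rate):
    """Estimate the frequency (Hz) near bin k from the phase advance over `hop` samples."""
    phase0 = cmath.phase(dft_bin(signal[start:start + n], k))
    phase1 = cmath.phase(dft_bin(signal[start + hop:start + hop + n], k))
    # Phase advance expected if the sinusoid sat exactly on bin k.
    expected = 2.0 * math.pi * k * hop / n
    # Deviation from that advance, wrapped to [-pi, pi).
    deviation = (phase1 - phase0 - expected + math.pi) % (2.0 * math.pi) - math.pi
    return (expected + deviation) * sample_rate / (2.0 * math.pi * hop)


# Hypothetical test signal: a complex exponential at 440 Hz sampled at 8 kHz.
sample_rate = 8000.0
true_frequency = 440.0
signal = [cmath.exp(2j * math.pi * true_frequency * i / sample_rate)
          for i in range(512)]
n, hop = 256, 64
k = round(true_frequency * n / sample_rate)  # nearest DFT bin
estimate = phase_vocoder_frequency(signal, 0, n, hop, k, sample_rate)
```

The estimate is only unambiguous when the true frequency lies within `sample_rate / (2 * hop)` of the bin frequency, which is why the hop is kept short relative to the frame length.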
MULTILINEAR SINGULAR VALUE DECOMPOSITION FOR STRUCTURED TENSORS
Roland Badeau, Remy Boyer
SIAM Journal on Matrix Analysis and Applications, 2008.
```
@article{badeau:hal-00575978,
  author = {Badeau, Roland and Boyer, Remy},
  hal_id = {hal-00575978},
  hal_version = {v1},
  journal = {{SIAM Journal on Matrix Analysis and Applications}},
  number = {3},
  pdf = {https://hal.archives-ouvertes.fr/hal-00575978/file/UmVteSBCT1lFUg_SVD_for_structured_tensors.pdf},
  publisher = {{Society for Industrial and Applied Mathematics}},
  title = {{MULTILINEAR SINGULAR VALUE DECOMPOSITION FOR STRUCTURED TENSORS}},
  url = {https://hal.archives-ouvertes.fr/hal-00575978},
  volume = {30},
  year = {2008}
}
```
The Higher-Order SVD (HOSVD) is a generalization of the Singular Value Decomposition (SVD) to higher-order tensors (i.e. arrays with more than two indices) and plays an important role in various domains. Unfortunately, this decomposition is computationally demanding. Indeed, the HOSVD of a third-order tensor involves the computation of the SVD of three matrices, which are referred to as "modes", or "matrix unfoldings". In this paper, we present fast algorithms for computing the full and the rank-truncated HOSVD of third-order structured (symmetric, Toeplitz and Hankel) tensors. These algorithms are derived by considering two specific ways to unfold a structured tensor, leading to structured matrix unfoldings whose SVD can be efficiently computed.
Cramér-Rao bounds for multiple poles and coefficients of quasipolynomials in colored noise
Roland Badeau, Bertrand David, Gael Richard
IEEE Transactions on Signal Processing, 2008.
```
@article{badeau:hal-00945193,
  author = {Badeau, Roland and David, Bertrand and Richard, Gael},
  hal_id = {hal-00945193},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  number = {8},
  pages = {3458--3467},
  pdf = {https://hal.inria.fr/hal-00945193/file/ieee-tsp-08a.pdf},
  publisher = {{IEEE}},
  title = {{Cram{\'e}r-Rao bounds for multiple poles and coefficients of quasipolynomials in colored noise}},
  url = {https://hal.inria.fr/hal-00945193},
  volume = {56},
  year = {2008}
}
```
In this paper, we provide analytical expressions of the Cramér-Rao bounds for the frequencies, damping factors, amplitudes and phases of complex exponentials in colored noise. These expressions show the explicit dependence of the bounds of each distinct parameter with respect to the amplitudes and phases, leading to readily interpretable formulae, which are then simplified in an asymptotic context. The results are presented in the general framework of the Polynomial Amplitude Complex Exponentials (PACE) model, also referred to as the quasipolynomial model in the literature, which accounts for systems involving multiple poles, and represents a signal as a mixture of complex exponentials modulated by polynomials. This work extends and generalizes the studies previously undertaken on the exponential and the quasipolynomial models.
Fast and stable YAST algorithm for principal and minor subspace tracking
Roland Badeau, Gael Richard, Bertrand David
IEEE Transactions on Signal Processing, 2008.
```
@article{badeau:hal-00945194,
  author = {Badeau, Roland and Richard, Gael and David, Bertrand},
  hal_id = {hal-00945194},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  number = {8},
  pages = {3437--3446},
  pdf = {https://hal.inria.fr/hal-00945194/file/ieee-tsp-08b.pdf},
  publisher = {{IEEE}},
  title = {{Fast and stable YAST algorithm for principal and minor subspace tracking}},
  url = {https://hal.inria.fr/hal-00945194},
  volume = {56},
  year = {2008}
}
```
This paper presents a new implementation of the YAST algorithm for principal and minor subspace tracking. YAST was initially derived from the Subspace Projection (SP) algorithm by C.E. Davila, which was known for its exceptional convergence rate, compared to other classical principal subspace trackers. The novelty in the YAST algorithm was the lower computational cost (linear if the data correlation matrix satisfies a so-called shift-invariance property), and the extension to minor subspace tracking. However, the original implementation of the YAST algorithm suffered from a numerical stability problem (the subspace weighting matrix slowly loses its orthonormality). We thus propose in this paper a new implementation of YAST, whose stability is established theoretically and tested via numerical simulations. This algorithm combines all the desired properties for a subspace tracker: remarkably high convergence rate, lowest steady state error, linear complexity, and numerical stability regarding the orthonormality of the subspace weighting matrix.
Performance of ESPRIT for estimating mixtures of complex exponentials modulated by polynomials
Roland Badeau, Gael Richard, Bertrand David
IEEE Transactions on Signal Processing, 2008.
```
@article{badeau:hal-00945195,
  author = {Badeau, Roland and Richard, Gael and David, Bertrand},
  hal_id = {hal-00945195},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  number = {2},
  pages = {492--504},
  pdf = {https://hal.inria.fr/hal-00945195/file/ieee-tsp-07.pdf},
  publisher = {{IEEE}},
  title = {{Performance of ESPRIT for estimating mixtures of complex exponentials modulated by polynomials}},
  url = {https://hal.inria.fr/hal-00945195},
  volume = {56},
  year = {2008}
}
```
High Resolution (HR) methods are known to provide accurate frequency estimates for discrete spectra. The Polynomial Amplitude Complex Exponentials (PACE) model, also called the quasipolynomial model in the literature, was presented as the most general model tractable by HR methods. A subspace-based estimation scheme was recently proposed, derived from the classical ESPRIT algorithm. In this paper, we focus on the performance of this estimator. We first present some asymptotic expansions of the estimated parameters, obtained at the first order under the assumption of a high signal-to-noise ratio. Then the performance of the generalized ESPRIT algorithm for estimating the parameters of this model is analyzed in terms of bias and variance, and compared to the Cramér-Rao bounds. This performance is studied in an asymptotic context, and it is proved that the efficiency of undamped single-pole estimators is close to optimal. Moreover, our results show that the best performance is obtained for a proper dimensioning of the data. To illustrate the practical capabilities of the generalized ESPRIT algorithm, we finally propose an application to ARMA filter synthesis, in the context of system conversion from continuous time to discrete time.
Instrument-specific harmonic atoms for mid-level music representation
Pierre Leveau, Emmanuel Vincent, Gael Richard, Laurent Daudet
IEEE Transactions on Audio, Speech and Language Processing, 2008.
```
@article{leveau:inria-00544175,
  author = {Leveau, Pierre and Vincent, Emmanuel and Richard, Gael and Daudet, Laurent},
  hal_id = {inria-00544175},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  number = {1},
  pages = {116--128},
  pdf = {https://hal.inria.fr/inria-00544175/file/leveau_TASLP08.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Instrument-specific harmonic atoms for mid-level music representation}},
  url = {https://hal.inria.fr/inria-00544175},
  volume = {16},
  year = {2008}
}
```
Several studies have pointed out the need for accurate mid-level representations of music signals for information retrieval and signal processing purposes. In this article, we propose a new mid-level representation based on the decomposition of a signal into a small number of sound atoms or molecules bearing explicit musical instrument labels. Each atom is a sum of windowed harmonic sinusoidal partials whose relative amplitudes are specific to one instrument, and each molecule consists of several atoms from the same instrument spanning successive time windows. We design efficient algorithms to extract the most prominent atoms or molecules and investigate several applications of this representation, including polyphonic instrument recognition and music visualization.

Audio Indexing
Gael Richard
Encyclopedia of Data Warehousing and Mining, 2008.

@article{richard:hal-02652687,
  author = {Richard, Gael},
  hal_id = {hal-02652687},
  hal_version = {v1},
  journal = {{Encyclopedia of Data Warehousing and Mining}},
  title = {{Audio Indexing}},
  url = {https://hal.inria.fr/hal-02652687},
  year = {2008}
}

Conference Articles

Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches
Valentin Emiya, Roland Badeau, Bertrand David
2008 Music Information Retrieval Evaluation eXchange (MIREX), Philadelphia, PA, United States, September 2008.

@inproceedings{emiya:inria-00545769,
  address = {Philadelphia, PA, United States},
  author = {Emiya, Valentin and Badeau, Roland and David, Bertrand},
  booktitle = {{2008 Music Information Retrieval Evaluation eXchange (MIREX)}},
  hal_id = {inria-00545769},
  hal_version = {v1},
  month = sep,
  title = {{Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches}},
  url = {https://hal.inria.fr/inria-00545769},
  year = {2008}
}

ALIGNMENT KERNELS FOR AUDIO CLASSIFICATION WITH APPLICATION TO MUSIC INSTRUMENT RECOGNITION
Cyril Joder, Slim Essid, Gaël Richard
16th European Signal Processing Conference, Lausanne, Switzerland, August 2008.
```
@inproceedings{joder:hal-02943674,
  address = {Lausanne, Switzerland},
  author = {Joder, Cyril and Essid, Slim and Richard, Ga{\"e}l},
  booktitle = {{16th European Signal Processing Conference}},
  hal_id = {hal-02943674},
  hal_version = {v1},
  month = aug,
  pdf = {https://hal.telecom-paris.fr/hal-02943674/file/CJ_EUSIPCO-08.pdf},
  title = {{ALIGNMENT KERNELS FOR AUDIO CLASSIFICATION WITH APPLICATION TO MUSIC INSTRUMENT RECOGNITION}},
  url = {https://hal.telecom-paris.fr/hal-02943674},
  year = {2008}
}
```
In this paper we study the efficiency of support vector machines (SVM) with alignment kernels in audio classification. The classification task chosen is music instrument recognition. The alignment kernels have the advantage of handling sequential data, without assuming a model for the probability density of the features as in the case of Gaussian Mixture Model-based Hidden Markov Models (HMM). These classifiers are compared to several reference systems, namely Gaussian Mixture Model, HMM classifiers and SVMs with "static" kernels. Using a higher-level representation of the feature sequence, which we call summary sequence, we show that the use of alignment kernels can significantly improve the classification scores in comparison to the reference systems.
ON THE ROBUSTNESS OF AUDIO FEATURES FOR MUSICAL INSTRUMENT CLASSIFICATION
S Wegener, M Haller, J J Burred, T Sikora, Slim Essid, Gael Richard
16th European Signal Processing Conference, Lausanne, Switzerland, August 2008.
```
@inproceedings{wegener:hal-02943672,
  address = {Lausanne, Switzerland},
  author = {Wegener, S and Haller, M and Burred, J J and Sikora, T and Essid, Slim and Richard, Gael},
  booktitle = {{16th European Signal Processing Conference}},
  hal_id = {hal-02943672},
  hal_version = {v1},
  month = aug,
  pdf = {https://hal.telecom-paris.fr/hal-02943672/file/SW_EUSIPCO-08.pdf},
  title = {{ON THE ROBUSTNESS OF AUDIO FEATURES FOR MUSICAL INSTRUMENT CLASSIFICATION}},
  url = {https://hal.telecom-paris.fr/hal-02943672},
  year = {2008}
}
```
We examine the robustness of several audio features applied exemplarily to musical instrument classification. For this purpose we study the robustness of 15 MPEG-7 Audio Low-Level Descriptors and 13 further spectral, temporal, and perceptual features against four types of signal modifications: low-pass filtering, coding artifacts, white noise, and reverberation. The robustness of the 120 feature coefficients obtained is evaluated with three different methods: comparison of rankings obtained by feature selection techniques, qualitative evaluation of changes in statistical parameters, and classification experiments using Gaussian Mixture Models (GMMs). These experiments are performed on isolated notes of 14 musical instrument classes.
Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription
Emmanuel Vincent, Nancy Bertin, Roland Badeau
2008 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, United States, March 2008.
```
@inproceedings{vincent:inria-00544183,
  address = {Las Vegas, United States},
  author = {Vincent, Emmanuel and Bertin, Nancy and Badeau, Roland},
  booktitle = {{2008 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {inria-00544183},
  hal_version = {v1},
  month = mar,
  pages = {109--112},
  pdf = {https://hal.inria.fr/inria-00544183/file/vincent_ICASSP08.pdf},
  title = {{Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription}},
  url = {https://hal.inria.fr/inria-00544183},
  year = {2008}
}
```
Polyphonic pitch transcription consists of estimating the onset time, duration and pitch of each note in a music signal. This task is difficult in general, due to the wide range of possible instruments. This issue has been studied using adaptive models such as Nonnegative Matrix Factorization (NMF), which describe the signal as a weighted sum of basis spectra. However, basis spectra representing multiple pitches result in inaccurate transcription. To avoid this, we propose a family of constrained NMF models, where each basis spectrum is expressed as a weighted sum of narrowband spectra consisting of a few adjacent partials at harmonic or inharmonic frequencies. The model parameters are adapted via combined multiplicative and Newton updates. The proposed method is shown to outperform standard NMF on a database of piano excerpts.
Weighted maximum likelihood autoregressive and moving average spectrum modeling
Roland Badeau, Bertrand David
Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, Nevada, United States, 2008.
```
@inproceedings{badeau:hal-00945273,
  address = {Las Vegas, Nevada, United States},
  author = {Badeau, Roland and David, Bertrand},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-00945273},
  hal_version = {v1},
  pages = {3761--3764},
  pdf = {https://hal.inria.fr/hal-00945273/file/Badeau-ICASSP2008.pdf},
  title = {{Weighted maximum likelihood autoregressive and moving average spectrum modeling}},
  url = {https://hal.inria.fr/hal-00945273},
  year = {2008}
}
```
We propose new algorithms for estimating autoregressive (AR), moving average (MA), and ARMA models in the spectral domain. These algorithms are derived from a maximum likelihood approach, where spectral weights are introduced in order to selectively enhance the accuracy on a predefined set of frequencies, while ignoring the other ones. This is of particular interest for modeling the spectral envelope of harmonic signals, whose spectrum only contains a discrete set of relevant coefficients. In the context of speech processing, our simulation results show that the proposed method provides a more accurate ARMA modeling of nasal vowels than the Durbin method.

A Collaborative Approach to Video Summarization
Emilie Dumont, Bernard Mérialdo, Slim Essid, Werner Bailer, Daragh Byrne, Hervé Bredin, Noel O’Connor, Gareth JF Jones, Martin Haller, Andreas Krutz, Thomas Sikora, Tomas Piatrik
SAMT 2008, 3rd International Conference on Semantic and Digital Media Technologies, Koblenz, Germany, 2008.

@inproceedings{dumont:hal-01987822,
  address = {Koblenz, Germany},
  author = {Dumont, Emilie and M{\'e}rialdo, Bernard and Essid, Slim and Bailer, Werner and Byrne, Daragh and Bredin, Herv{\'e} and O'Connor, Noel and Jones, Gareth JF and Haller, Martin and Krutz, Andreas and Sikora, Thomas and Piatrik, Tomas},
  booktitle = {{SAMT 2008, 3rd International Conference on Semantic and Digital Media Technologies}},
  hal_id = {hal-01987822},
  hal_version = {v1},
  title = {{A Collaborative Approach to Video Summarization}},
  url = {https://hal.archives-ouvertes.fr/hal-01987822},
  year = {2008}
}

Rushes Video Summarization using a Collaborative Approach
Emilie Dumont, Bernard Mérialdo, Slim Essid, Werner Bailer, Herwig Rehatschek, Daragh Byrne, Hervé Bredin, Noel O’Connor, Gareth JF Jones, Alan F Smeaton, Martin Haller, Andreas Krutz, Thomas Sikora, Tomas Piatrik
TRECVID 2008, ACM International Conference on Multimedia Information Retrieval, Vancouver, Canada, 2008.

@inproceedings{dumont:hal-01987824,
  address = {Vancouver, Canada},
  author = {Dumont, Emilie and M{\'e}rialdo, Bernard and Essid, Slim and Bailer, Werner and Rehatschek, Herwig and Byrne, Daragh and Bredin, Herv{\'e} and O'Connor, Noel and Jones, Gareth JF and Smeaton, Alan F and Haller, Martin and Krutz, Andreas and Sikora, Thomas and Piatrik, Tomas},
  booktitle = {{TRECVID 2008, ACM International Conference on Multimedia Information Retrieval}},
  hal_id = {hal-01987824},
  hal_version = {v1},
  title = {{Rushes Video Summarization using a Collaborative Approach}},
  url = {https://hal.archives-ouvertes.fr/hal-01987824},
  year = {2008}
}

2007

Conference Articles

Multipitch detection for piano music: Benchmarking a few approaches
Bertrand David, Roland Badeau, Nancy Bertin, Valentin Emiya, Gaël Richard
154th Meeting of the Acoustical Society of America, New Orleans, United States, November 2007.
```
@inproceedings{david:inria-00545065,
  address = {New Orleans, United States},
  author = {David, Bertrand and Badeau, Roland and Bertin, Nancy and Emiya, Valentin and Richard, Ga{\"e}l},
  booktitle = {{154th Meeting of the Acoustical Society of America}},
  doi = {10.1121/1.2942555},
  hal_id = {inria-00545065},
  hal_version = {v1},
  keywords = {Automatic music recognition, classification, and information retrieval},
  month = nov,
  title = {{Multipitch detection for piano music: Benchmarking a few approaches}},
  url = {https://hal.inria.fr/inria-00545065},
  year = {2007}
}
```
When trying to find a solution to the critical and sometimes ill-posed problem of multipitch estimation, it is common to have to choose between several approaches: using a processing that resembles that of human auditory perception, using a decomposition adapted to the spectral content of the targeted sound category, taking into account a priori knowledge of the spectral properties of musical notes, trying to learn some of their characteristics, or even running an algorithm that blindly tends to separate the sound into multiple elementary entities. This work involves some recently published techniques, such as non-negative matrix factorization with sparsity constraints, a likelihood approach based on a smooth spectral envelope model for both the background noise and the partials, and a deterministic method combining spectral and temporal criteria. Their performance is comparatively assessed on a common multipitch database restricted to piano music, drawn both from home-made recordings and soft-synthesized sounds. The results are discussed with respect to the complexity, the versatility, the sensitivity to fine tuning, and the precision reached by each approach.

Listening tests of the localization performance of Stereodipole and Ambisonic systems
Andrea Capra, Simone Fontana, Fons Adriaensen, Angelo Farina, Yves Grenier
123rd Convention of the Audio Engineering Society, New York, USA, October 2007.

@inproceedings{SF:AES-2007,
  author = {Capra, Andrea and Fontana, Simone and Adriaensen, Fons and Farina, Angelo and Grenier, Yves},
  title = {Listening tests of the localization performance of Stereodipole and Ambisonic systems},
  booktitle = {123rd Convention of the Audio Engineering Society},
  address = {New York, USA},
  year = {2007},
  month = oct
}

Multipitch estimation of quasi-harmonic sounds in colored noise
Valentin Emiya, Roland Badeau, Bertrand David
10th Int. Conf. on Digital Audio Effects (DAFx-07), Bordeaux, France, September 2007.
```
@inproceedings{emiya:inria-00545615,
  address = {Bordeaux, France},
  author = {Emiya, Valentin and Badeau, Roland and David, Bertrand},
  booktitle = {{10th Int. Conf. on Digital Audio Effects (DAFx-07)}},
  hal_id = {inria-00545615},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.inria.fr/inria-00545615/file/DAFx07_emiya.pdf},
  title = {{Multipitch estimation of quasi-harmonic sounds in colored noise}},
  url = {https://hal.inria.fr/inria-00545615},
  year = {2007}
}
```
This paper proposes a new multipitch estimator based on a likelihood maximization principle. For each tone, a sinusoidal model is assumed with a colored, Moving-Average, background noise and an autoregressive spectral envelope for the overtones. A monopitch estimator is derived following a Weighted Maximum Likelihood principle and leads to finding the fundamental frequency (F0) which jointly maximally flattens the noise spectrum and the sinusoidal spectrum. The multipitch estimator is obtained by extending the method for jointly estimating multiple F0's. An application to piano tones is presented, which takes into account the inharmonicity of the overtone series for this instrument.
TOWARDS POLYPHONIC MUSICAL INSTRUMENTS RECOGNITION
Gael Richard, Pierre Leveau, Laurent Daudet, Slim Essid, Bertrand David
19th INTERNATIONAL CONGRESS ON ACOUSTICS, Madrid, Spain, September 2007.
```
@inproceedings{richard:hal-02943678,
  address = {Madrid, Spain},
  author = {Richard, Gael and Leveau, Pierre and Daudet, Laurent and Essid, Slim and David, Bertrand},
  booktitle = {{19th INTERNATIONAL CONGRESS ON ACOUSTICS}},
  hal_id = {hal-02943678},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.telecom-paris.fr/hal-02943678/file/GR_ICA-07.pdf},
  title = {{TOWARDS POLYPHONIC MUSICAL INSTRUMENTS RECOGNITION}},
  url = {https://hal.telecom-paris.fr/hal-02943678},
  year = {2007}
}
```
Automatic musical instrument recognition is a relatively new topic in the growing field of Music Information Retrieval. Early studies mostly focused on instrument recognition from recordings of isolated notes. More recently, some studies tackled the problem of musical phrases played in solo (i.e. without accompaniment) which better covers the timbre variability of a given instrument. However, the current trend is now to deal with true polyphonic music (i.e. involving multiple instruments), which appears to be a far more difficult problem but with more practical applications. The aim of this paper is to provide an overview of the state-of-the-art in automatic musical instrument recognition with a focus on recent and innovative approaches applied to true polyphonic music. It will be shown that the traditional "bag of frames" approaches can obtain interesting results by building efficient automatic taxonomies or by using complementary information to enhance the relevant signal. We however argue that it is important to consider new directions to overcome the limitations of these traditional approaches. One of these promising directions that will be detailed concerns mid-level representations, which are based on the decomposition of the signal into a small number of sound atoms or molecules bearing explicit musical instrument labels.

Séparation aveugle sous-déterminée de sources en utilisant la décomposition en paquet d’ondelettes
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
21e Colloque GRETSI sur le traitement du signal et des images, Troyes, France, September 2007.

@inproceedings{AA:GRETSI-07,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {S{\'e}paration aveugle sous-d{\'e}termin{\'e}e de sources en utilisant la d{\'e}composition en paquet d'ondelettes},
  booktitle = {21e Colloque {GRETSI} sur le traitement du signal et des images},
  address = {Troyes, France},
  year = {2007},
  month = sep
}

Blind audio source separation using sparsity based criterion for convolutive mixture case
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
7th International Conference on Independent Component Analysis and Blind Source Separation, London, UK, September 2007.

@inproceedings{AA:ICA-07,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Blind audio source separation using sparsity based criterion for convolutive mixture case},
  booktitle = {7th International Conference on Independent Component Analysis and Blind Source Separation},
  address = {London, UK},
  year = {2007},
  month = sep
}

Binaural for popular music: a case of study
Simone Fontana, Angelo Farina, Yves Grenier
13th International Conference on Auditory Display, Montréal, Canada, June 2007.

@inproceedings{SF:ICAD-2007,
  author = {Fontana, Simone and Farina, Angelo and Grenier, Yves},
  title = {Binaural for popular music: a case of study},
  booktitle = {13th International Conference on Auditory Display},
  address = {Montr{\'e}al, Canada},
  year = {2007},
  month = jun
}

A Parametric Method for Pitch Estimation of Piano Tones
Valentin Emiya, Bertrand David, Roland Badeau
IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, United States, April 2007.
```
@inproceedings{emiya:inria-00544147,
  address = {Honolulu, United States},
  author = {Emiya, Valentin and David, Bertrand and Badeau, Roland},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing}},
  doi = {10.1109/ICASSP.2007.366663},
  hal_id = {inria-00544147},
  hal_version = {v1},
  month = apr,
  pages = {249--252},
  pdf = {https://hal.inria.fr/inria-00544147/file/ICASSP07emiya.pdf},
  title = {{A Parametric Method for Pitch Estimation of Piano Tones}},
  url = {https://hal.inria.fr/inria-00544147},
  year = {2007}
}
```
The efficiency of most pitch estimation methods declines when the analyzed frame is shortened and/or when a wide fundamental frequency (F0) range is targeted. The technique proposed herein jointly uses a periodicity analysis and a spectral matching process to improve the F0 estimation performance in such an adverse context: a 60 ms-long data frame together with the whole 7 1/4-octave piano tessitura. The enhancements are obtained thanks to a parametric approach which, among other things, models the inharmonicity of piano tones. The performance of the algorithm is assessed, compared to the results obtained from other estimators, and discussed in order to characterize their behavior and typical misestimations.

Combined Supervised and Unsupervised Approaches for Automatic Segmentation of Radiophonic Audio Streams
Gael Richard, Mathieu Ramona, Slim Essid
2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, Honolulu, United States, April 2007.

@inproceedings{richard:hal-02943676,
  address = {Honolulu, United States},
  author = {Richard, Gael and Ramona, Mathieu and Essid, Slim},
  booktitle = {{2007 IEEE International Conference on Acoustics, Speech, and Signal Processing}},
  doi = {10.1109/ICASSP.2007.366272},
  hal_id = {hal-02943676},
  hal_version = {v1},
  month = apr,
  pages = {II-461--II-464},
  publisher = {{IEEE}},
  title = {{Combined Supervised and Unsupervised Approaches for Automatic Segmentation of Radiophonic Audio Streams}},
  url = {https://hal.telecom-paris.fr/hal-02943676},
  year = {2007}
}

Underdetermined blind separation of audio sources from the time-frequency representation of their convolutive mixtures
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
ICASSP’07, Hawaii (USA), April 2007.

@inproceedings{AA:ICASSP-07,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Underdetermined blind separation of audio sources from the time-frequency representation of their convolutive mixtures},
  booktitle = {ICASSP'07},
  address = {Hawaii (USA)},
  year = {2007},
  month = apr,
  volume = {1},
  pages = {153--156}
}

Underdetermined audio source separation using fast parametric decomposition
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
ISSPA’07, Sharjah (United Arab Emirates), February 2007.

@inproceedings{AA:ISSPA-07,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Underdetermined audio source separation using fast parametric decomposition},
  booktitle = {ISSPA'07},
  address = {Sharjah (United Arab Emirates)},
  year = {2007},
  month = feb
}

Conjugate gradient algorithms for minor subspace analysis
Roland Badeau, Bertrand David, Gael Richard
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii, United States, 2007.
```
@inproceedings{badeau:hal-00945274,
  address = {Honolulu, Hawaii, United States},
  author = {Badeau, Roland and David, Bertrand and Richard, Gael},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945274},
  hal_version = {v1},
  pages = {1013--1016},
  pdf = {https://hal.inria.fr/hal-00945274/file/icassp-07.pdf},
  title = {{Conjugate gradient algorithms for minor subspace analysis}},
  url = {https://hal.inria.fr/hal-00945274},
  volume = {3},
  year = {2007}
}
```
We introduce a conjugate gradient method for estimating and tracking the minor eigenvector of a data correlation matrix. This new algorithm is less computationally demanding and converges faster than other methods derived from the conjugate gradient approach. It can also be applied in the context of minor subspace tracking, as a pre-processing step for the YAST algorithm, in order to enhance its performance. Simulations show that the resulting algorithm converges much faster than existing minor subspace trackers.
Blind signal decompositions for automatic transcription of polyphonic music: NMF and K-SVD on the benchmark
Nancy Bertin, Roland Badeau, Gael Richard
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii, United States, 2007.
```
@inproceedings{bertin:hal-00945282,
  address = {Honolulu, Hawaii, United States},
  author = {Bertin, Nancy and Badeau, Roland and Richard, Gael},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945282},
  hal_version = {v1},
  pages = {65--68},
  pdf = {https://hal.inria.fr/hal-00945282/file/icassp-07-bis.pdf},
  title = {{Blind signal decompositions for automatic transcription of polyphonic music: NMF and K-SVD on the benchmark}},
  url = {https://hal.inria.fr/hal-00945282},
  volume = {1},
  year = {2007}
}
```
This paper investigates the behavior of two blind signal decomposition algorithms, non-negative matrix factorization (NMF) and non-negative K-SVD (NKSVD), in a polyphonic music transcription task. State-of-the-art transcription systems are based on a frame-by-frame, low-level approach; blind systems could be an alternative to them. Two raw but effective audio-to-MIDI systems are proposed and evaluated. Performances are similar, but in favor of NMF, which is more robust to initialization and to the choice of the order, and is computationally less costly.

Fast sequential LS estimation for sinusoidal modeling and decomposition of audio signals
Bertrand David, Roland Badeau
Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, 2007.

@inproceedings{david:hal-00945284,
  address = {New Paltz, New York, United States},
  author = {David, Bertrand and Badeau, Roland},
  booktitle = {{Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-00945284},
  hal_version = {v1},
  pages = {211--214},
  title = {{Fast sequential LS estimation for sinusoidal modeling and decomposition of audio signals}},
  url = {https://hal.inria.fr/hal-00945284},
  year = {2007}
}

Theoretical and experimental investigations of harp’s sympathetic modes
Jean-Loic Le Carrou, François Gautier, Roland Badeau
Proc. of International Congress on Acoustics (ICA), Madrid, Spain, 2007.
```
@inproceedings{lecarrou:hal-00945297,
  address = {Madrid, Spain},
  author = {Le Carrou, Jean-Loic and Gautier, Fran{\c c}ois and Badeau, Roland},
  booktitle = {{Proc. of International Congress on Acoustics (ICA)}},
  hal_id = {hal-00945297},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945297/file/ica-07.pdf},
  title = {{Theoretical and experimental investigations of harp's sympathetic modes}},
  url = {https://hal.inria.fr/hal-00945297},
  year = {2007}
}
```
The harp is composed of a soundboard, a cavity with soundholes and 47 strings. When one string is plucked, other strings are excited via the soundboard. This phenomenon, called sympathetic vibrations, was investigated in a recent paper [1], both theoretically and experimentally, using a simplified string instrument composed of 2 strings connected to a beam clamped at both ends. In order to extend the analysis to the case of the harp, an analytical model of a set of 35 strings coupled to a beam, whose characteristics are designed to be equivalent to the harp's soundboard, has been developed. The identification of sympathetic modes from the analytical modal basis of the beam-strings assembly is performed using a criterion, the Kinetic Energy Ratio. For some particular tuning conditions, these modes are present in the sound measured on a Camac concert harp. Within a single partial, several components are present whose frequencies can be very close to one another and cannot be separated using the classical Fourier transform. The identification of these components associated with sympathetic modes is achieved using specific signal processing techniques, called High Resolution methods. The identified experimental sympathetic modes are found to be in good agreement with those obtained theoretically.

Journal Articles

Sound field analysis based on analytical beamforming
Mathieu Guillaume, Yves Grenier
EURASIP Journal on Advances in Signal Processing, August 2007.

@article{MG:ASP-06,
  author = {Guillaume, Mathieu and Grenier, Yves},
  title = {Sound field analysis based on analytical beamforming},
  journal = {EURASIP Journal on Advances in Signal Processing},
  year = {2007},
  month = aug,
  volume = {2007}
}

Blind separation of underdetermined convolutive mixtures using their time-frequency representation
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
IEEE Transactions on Audio, Speech & Language Processing, July 2007.

@article{AA:IEEE-06,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Blind separation of underdetermined convolutive mixtures using their time-frequency representation},
  journal = {IEEE Transactions on Audio, Speech \& Language Processing},
  year = {2007},
  month = jul,
  volume = {15},
  number = {5},
  pages = {1540--1550}
}

Tempo estimation for audio recordings
Miguel A Alonso-Arévalo, Gael Richard, Bertrand David
Interface, Journal of New Music Research, March 2007.

@article{alonsoarevalo:hal-02652625,
  author = {Alonso-Ar{\'e}valo, Miguel A and Richard, Gael and David, Bertrand},
  hal_id = {hal-02652625},
  hal_version = {v1},
  journal = {{Interface, Journal of New Music Research}},
  month = mar,
  title = {{Tempo estimation for audio recordings}},
  url = {https://hal.inria.fr/hal-02652625},
  year = {2007}
}

On the Correlation of Automatic Audio and Visual Segmentations of Music Videos
Olivier Gillet, Slim Essid, Gael Richard
IEEE Transactions on Circuits and Systems for Video Technology, March 2007.
```
@article{gillet:hal-02652635,
  author = {Gillet, Olivier and Essid, Slim and Richard, Gael},
  doi = {10.1109/TCSVT.2007.890831},
  hal_id = {hal-02652635},
  hal_version = {v1},
  journal = {{IEEE Transactions on Circuits and Systems for Video Technology}},
  keywords = {Audio segmentation ; cross-modal queries ; information retrieval ; multimedia indexing ; multimodal processing ; music videos ; novelty detection},
  month = mar,
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{On the Correlation of Automatic Audio and Visual Segmentations of Music Videos}},
  url = {https://hal.inria.fr/hal-02652635},
  volume = {17},
  year = {2007}
}
```
The study of the associations between audio and video content has numerous important applications in the fields of information retrieval and multimedia content authoring. In this work, we focus on music videos, which exhibit a broad range of structural and semantic relationships between the music and the video content. To identify such relationships, a two-level automatic structuring of the music and the video is achieved separately. Note onsets are detected from the music signal, along with section changes. The latter is achieved by a novel algorithm which makes use of feature selection and statistical novelty detection approaches based on kernel methods. The video stream is independently segmented to detect changes in motion activity, as well as shot boundaries. Based on this two-level segmentation of both streams, four audiovisual correlation measures are computed. The usefulness of these correlation measures is illustrated by a query-by-video experiment on a database of 100 music videos, which also exhibits interesting genre dependencies.

Underdetermined blind separation of nondisjoint sources in the time-frequency domain
Abdeldjalil Aissa El Bey, Nguyen Linh-Trung, Karim Abed-Meraim, Adel Belouchrani, Yves Grenier
IEEE Transactions on Signal Processing, March 2007.

@article{AA:IEEE-05,
  author = {Aissa El Bey, Abdeldjalil and Linh-Trung, Nguyen and Abed-Meraim, Karim and Belouchrani, Adel and Grenier, Yves},
  title = {Underdetermined blind separation of nondisjoint sources in the time-frequency domain},
  journal = {IEEE Transactions on Signal Processing},
  year = {2007},
  month = mar,
  volume = {55},
  number = {3},
  pages = {897--907}
}

Underdetermined blind audio source separation using modal decomposition
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
EURASIP Journal on Audio, Speech & Music Processing, March 2007.

@article{AA:EURASIP-06,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Underdetermined blind audio source separation using modal decomposition},
  journal = {EURASIP Journal on Audio, Speech \& Music Processing},
  year = {2007},
  month = mar,
  volume = {2007},
  pages = {1--15}
}

Accurate tempo estimation based on harmonic+noise decomposition
Miguel Alonso, Gael Richard, Bertrand David
EURASIP Journal on Advances in Signal Processing, 2007.
```
@article{alonso:hal-02652614,
  author = {Alonso, Miguel and Richard, Gael and David, Bertrand},
  hal_id = {hal-02652614},
  hal_version = {v1},
  journal = {{EURASIP Journal on Advances in Signal Processing}},
  pdf = {https://hal.archives-ouvertes.fr/hal-02652614/file/JASP06_Alonso.pdf},
  publisher = {{SpringerOpen}},
  title = {{Accurate tempo estimation based on harmonic+noise decomposition}},
  url = {https://hal.archives-ouvertes.fr/hal-02652614},
  year = {2007}
}
```
In this paper we present an innovative tempo estimation system that processes acoustic audio signals and does not use any high-level musical knowledge. Our proposal relies on a harmonic plus noise decomposition of the audio signal by means of a subspace analysis method. Then, a technique to measure the degree of musical accentuation as a function of time is developed and separately applied to the harmonic and noise parts of the input signal. This is followed by a periodicity estimation block that calculates the salience of musical accents for a large number of potential periods. Next, a multi-path dynamic programming step searches among all the potential periodicities for the most consistent candidates through time, and finally the most energetic candidate is selected as the tempo. Our proposal is validated using a manually annotated test database containing 961 music signals from various musical genres. In addition, the performance of the algorithm under different configurations is compared. The robustness of the algorithm when processing signals of degraded quality is also measured.

2006

Conference Articles

Mid-level sparse representations for timbre identification: design of an instrument-specific harmonic dictionary
Pierre Leveau, Emmanuel Vincent, Gael Richard, Laurent Daudet
1st Workshop on Learning the Semantics of Audio Signals (LSAS), Athens, Greece, December 2006.
```
@inproceedings{leveau:inria-00544284,
  address = {Athens, Greece},
  author = {Leveau, Pierre and Vincent, Emmanuel and Richard, Gael and Daudet, Laurent},
  booktitle = {{1st Workshop on Learning the Semantics of Audio Signals (LSAS)}},
  hal_id = {inria-00544284},
  hal_version = {v1},
  month = dec,
  pdf = {https://hal.inria.fr/inria-00544284/file/leveau_LSAS06.pdf},
  title = {{Mid-level sparse representations for timbre identification: design of an instrument-specific harmonic dictionary}},
  url = {https://hal.inria.fr/inria-00544284},
  year = {2006}
}
```
Several studies have pointed out the need for mid-level representations of music signals for information retrieval and signal processing applications. In this paper, we investigate a new representation based on sparse decomposition of the signal into a collection of instrument-specific harmonic atoms modelling notes of various pitches played by different instruments. Each atom is composed of windowed harmonic sinusoidal partials whose amplitudes are learned on a training database. An efficient Matching Pursuit algorithm was designed to extract the best atoms and to estimate the phases of their partials. We then explain how the resulting representation can be exploited for automatic instrument recognition. Preliminary experiments on a test database of solo excerpts show promising results.
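
The greedy atom selection mentioned in this abstract can be illustrated with a generic Matching Pursuit sketch. It assumes a dictionary matrix whose columns are unit-norm atoms; it is not the instrument-specific harmonic dictionary nor the efficient variant designed in the paper, and the name `matching_pursuit` and its parameters are hypothetical.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Generic Matching Pursuit (illustrative only).

    dictionary: 2-D array whose columns are assumed to be unit-norm atoms.
    Returns the selected (atom index, weight) pairs and the final residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    picks = []
    for _ in range(n_atoms):
        # Correlate every atom with the current residual.
        correlations = dictionary.T @ residual
        # Keep the atom that explains the most residual energy.
        k = int(np.argmax(np.abs(correlations)))
        weight = float(correlations[k])
        picks.append((k, weight))
        # Subtract the contribution of the selected atom.
        residual = residual - weight * dictionary[:, k]
    return picks, residual

# Toy usage: an orthonormal dictionary and a signal made of two atoms.
dictionary = np.eye(8)
signal = 3.0 * dictionary[:, 2] - 2.0 * dictionary[:, 5]
picks, residual = matching_pursuit(signal, dictionary, n_atoms=2)
```

With an orthonormal toy dictionary the two atoms are recovered in order of decreasing magnitude; with a redundant dictionary, such as the harmonic atoms the abstract describes, the selection is only greedy.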

A System for Head Related Impulse Responses Rapid Measurement and Direct Customization
Simone Fontana, Yves Grenier, Angelo Farina
120th Convention AES, Paris, France, October 2006.

@inproceedings{SF-CAES-06,
  author = {Fontana, Simone and Grenier, Yves and Farina, Angelo},
  title = {A System for Head Related Impulse Responses Rapid Measurement and Direct Customization},
  booktitle = {120th Convention AES},
  address = {Paris, France},
  year = {2006},
  month = oct
}

Iterative blind source separation by decorrelation: algorithm and performance analysis
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
14th European signal processing conference (EUSIPCO), Florence, Italy, September 2006.

@inproceedings{AA:EUSIPCO-06,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Iterative blind source separation by decorrelation: algorithm and performance analysis},
  booktitle = {14th European signal processing conference (EUSIPCO)},
  address = {Florence, Italy},
  year = {2006},
  month = sep
}

Audio source separation using sparsity
Abdeldjalil Aissa El Bey, H. Bousbia-Salah, Karim Abed-Meraim, Yves Grenier
IWAENC’06, Paris, France, September 2006.

@inproceedings{AA:IWAENC-06,
  author = {Aissa El Bey, Abdeldjalil and {Bousbia-Salah}, H. and Abed-Meraim, Karim and Grenier, Yves},
  title = {Audio source separation using sparsity},
  booktitle = {IWAENC'06},
  address = {Paris, France},
  year = {2006},
  month = sep
}

Experimental 3D Sound field analysis with a microphone array
Mathieu Guillaume, Yves Grenier
28th International Conference of the AES, Pitea, Sweden, July 2006.

@inproceedings{MG:CAES-06,
  author = {Guillaume, Mathieu and Grenier, Yves},
  title = {Experimental 3D Sound field analysis with a microphone array},
  booktitle = {28th International Conference of the AES},
  address = {Pitea, Sweden},
  year = {2006},
  month = jul
}

Harmonic Plus Noise Decomposition: Time-frequency Reassignment Versus a Subspace Based Method
Bertrand David, Valentin Emiya, Roland Badeau, Yves Grenier
120th AES Convention, Paris, France, May 2006.

@inproceedings{david:inria-00545041,
  address = {Paris, France},
  author = {David, Bertrand and Emiya, Valentin and Badeau, Roland and Grenier, Yves},
  booktitle = {{120th AES Convention}},
  hal_id = {inria-00545041},
  hal_version = {v1},
  month = may,
  title = {{Harmonic Plus Noise Decomposition: Time-frequency Reassignment Versus a Subspace Based Method}},
  url = {https://hal.inria.fr/inria-00545041},
  year = {2006}
}

Sound field analysis based on generalized prolate spheroidal wave sequences
Mathieu Guillaume, Yves Grenier
120th Convention of the Audio Engineering Society, Paris, France, May 2006.

@inproceedings{MG:AES120-06,
  author = {Guillaume, Mathieu and Grenier, Yves},
  title = {Sound field analysis based on generalized prolate spheroidal wave sequences},
  booktitle = {120th Convention of the Audio Engineering Society},
  address = {Paris, France},
  year = {2006},
  month = may
}

Harmonic Plus Noise Decomposition: Time-Frequency Reassignment Versus a Subspace-Based Method
Bertrand David, Valentin Emiya, Roland Badeau, Yves Grenier
120th Convention of the Audio Engineering Society, Paris, France, May 2006.

@inproceedings{BD:AES120,
  author = {David, Bertrand and Emiya, Valentin and Badeau, Roland and Grenier, Yves},
  title = {Harmonic Plus Noise Decomposition: Time-Frequency Reassignment Versus a Subspace-Based Method},
  booktitle = {120th Convention of the Audio Engineering Society},
  address = {Paris, France},
  year = {2006},
  month = may
}

Sound field analysis with a two-dimensional microphone array
Mathieu Guillaume, Yves Grenier
ICASSP, Toulouse, France, May 2006.

@inproceedings{MG-YG:ICASSP-06,
  author = {Guillaume, Mathieu and Grenier, Yves},
  title = {Sound field analysis with a two-dimensional microphone array},
  booktitle = {ICASSP},
  address = {Toulouse, France},
  year = {2006},
  month = may,
  volume = {V},
  pages = {321--324}
}

On the identifiability testing in blind source separation using resampling technique
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, USA, March 2006.

@inproceedings{AA:ICA-06,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {On the identifiability testing in blind source separation using resampling technique},
  booktitle = {6th International Conference on Independent Component Analysis and Blind Source Separation},
  address = {Charleston, SC, USA},
  year = {2006},
  month = mar,
  number = {LNCS 3889},
  pages = {755--764}
}

A Fast Adaptive Method for Subspace Based Blind Channel Estimation
Jon Altuna, Bernie Mulgrew, Roland Badeau, Vicente Atxa
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, 2006.
```
@inproceedings{altuna:hal-00945268,
  address = {Toulouse, France},
  author = {Altuna, Jon and Mulgrew, Bernie and Badeau, Roland and Atxa, Vicente},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945268},
  hal_version = {v1},
  pages = {1121--1124},
  pdf = {https://hal.inria.fr/hal-00945268/file/icassp-06-qua.pdf},
  title = {{A Fast Adaptive Method for Subspace Based Blind Channel Estimation}},
  url = {https://hal.inria.fr/hal-00945268},
  volume = {4},
  year = {2006}
}
```
In this paper, a new fast adaptive blind channel estimation method is proposed using the subspace information from the correlation matrix. The algorithm is fully adaptive in the sense that both the subspace information and the optimization which leads to the channel estimation are computed adaptively. It is based on the recently proposed YAST subspace tracker, which has been shown to outperform other methods both in terms of speed of convergence and computational complexity. A discussion on the convergence properties of the proposed algorithm is presented. We also propose a hybrid method which uses the YAST subspace tracker for fast initial convergence, after which the subspace information is updated using the numerically stable OPAST subspace tracker.

YAST Algorithm for Minor Subspace Tracking
Roland Badeau, Bertrand David, Gael Richard
International Conference on Acoustics, Speech, and Signal Processing ICASSP’06, Toulouse, France, 2006.

@inproceedings{badeau:hal-00479782,
  address = {Toulouse, France},
  author = {Badeau, Roland and David, Bertrand and Richard, Gael},
  booktitle = {{International Conference on Acoustics, Speech, and Signal Processing ICASSP'06}},
  hal_id = {hal-00479782},
  hal_local_reference = {RB:ICASSP-06},
  hal_version = {v1},
  pages = {552-555},
  pdf = {https://hal-imt.archives-ouvertes.fr/hal-00479782/file/ICASSP-06.pdf},
  title = {{YAST Algorithm for Minor Subspace Tracking}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00479782},
  volume = {III},
  year = {2006}
}

ADAPTIVE MULTILINEAR SVD FOR STRUCTURED TENSORS
Remy Boyer, Roland Badeau
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’06), Toulouse, France, 2006.
```
@inproceedings{boyer:hal-00577274,
  address = {Toulouse, France},
  author = {Boyer, Remy and Badeau, Roland},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06)}},
  hal_id = {hal-00577274},
  hal_version = {v1},
  pdf = {https://hal.archives-ouvertes.fr/hal-00577274/file/HOSVD-ICASSP-06-ter.pdf},
  title = {{ADAPTIVE MULTILINEAR SVD FOR STRUCTURED TENSORS}},
  url = {https://hal.archives-ouvertes.fr/hal-00577274},
  year = {2006}
}
```
The Higher-Order SVD (HOSVD) is a generalization of the SVD to higher-order tensors (i.e., arrays with more than two indexes) and plays an important role in various domains. Unfortunately, the computational cost of this decomposition is very high since the basic HOSVD algorithm involves the computation of the SVD of three highly redundant block-Hankel matrices, called modes. In this paper, we present an ultra-fast way of computing the HOSVD of a third-order structured tensor. The key result of this work lies in the fact that it is possible to reduce the basic HOSVD algorithm to the computation of the SVD of three non-redundant Hankel matrices whose columns are multiplied by a given weighting function. Next, we exploit an FFT-based implementation of the orthogonal iteration algorithm in an adaptive way. Even though for a square ($I \times I \times I$) tensor the complexity of the basic full-HOSVD is $O(I^4)$ and $O(rI^3)$ for its $r$-truncated version, our approach reaches a linear complexity of $O(rI \log_2(I))$.
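
For orientation, the basic HOSVD that this abstract takes as its starting point (one SVD per mode unfolding) can be sketched as follows. This is the plain, costly baseline for a generic real third-order tensor, not the fast structured algorithm of the paper; the helper names `unfold`, `hosvd` and `reconstruct` are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: rows are indexed by the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor):
    """Basic HOSVD of a real third-order tensor (illustrative baseline)."""
    # One SVD per mode gives the orthonormal factor of that mode.
    factors = [np.linalg.svd(unfold(tensor, m), full_matrices=False)[0]
               for m in range(3)]
    # Core tensor: project every mode onto its factor.
    core = np.einsum('ijk,ia,jb,kc->abc', tensor, *factors)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the tensor from its core and mode factors."""
    return np.einsum('abc,ia,jb,kc->ijk', core, *factors)

# Toy usage on a small random tensor.
rng = np.random.default_rng(0)
T = rng.standard_normal((3, 4, 5))
core, factors = hosvd(T)
T_rebuilt = reconstruct(core, factors)
```

Truncating each factor to its first r columns would give the r-truncated version mentioned in the abstract; the sketch keeps all columns, so the reconstruction is exact up to rounding.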

HRHATRAC Algorithm for Spectral Line Tracking of Musical Signals
Bertrand David, Roland Badeau, Gael Richard
International Conference on Acoustics, Speech, and Signal Processing ICASSP’06, Toulouse, France, 2006.
```
@inproceedings{david:hal-00479785,
  address = {Toulouse, France},
  author = {David, Bertrand and Badeau, Roland and Richard, Gael},
  booktitle = {{International Conference on Acoustics, Speech, and Signal Processing ICASSP'06}},
  hal_id = {hal-00479785},
  hal_local_reference = {BD:ICASSP-06},
  hal_version = {v1},
  pages = {45-48},
  pdf = {https://hal-imt.archives-ouvertes.fr/hal-00479785/file/ICASSP-06-bis.pdf},
  title = {{HRHATRAC Algorithm for Spectral Line Tracking of Musical Signals}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00479785},
  volume = {III},
  year = {2006}
}
```
HRHATRAC combines the latest improvements in fast subspace tracking algorithms with a gradient update for adapting the signal pole estimates. It leads to a spectral line tracker which is able to robustly estimate the frequencies, even in a noisy context, when the lines are close to each other and when a modulation occurs. HRHATRAC is also successfully applied in this paper to a piano note recording.

Analyse des modes de cordes couplées d’une harpe par une méthode à haute résolution
Jean-Loic Le Carrou, François Gautier, Roland Badeau
Actes du 8ème Congrès Français d’Acoustique (CFA), Tours, France, 2006.
```
@inproceedings{lecarrou:hal-00945241,
  address = {Tours, France},
  author = {Le Carrou, Jean-Loic and Gautier, Fran{\c c}ois and Badeau, Roland},
  booktitle = {{Actes du 8{\`e}me Congr{\`e}s Fran{\c c}ais d'Acoustique (CFA)}},
  hal_id = {hal-00945241},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945241/file/cfa-06.pdf},
  title = {{Analyse des modes de cordes coupl{\'e}es d'une harpe par une m{\'e}thode {\`a} haute r{\'e}solution}},
  url = {https://hal.inria.fr/hal-00945241},
  year = {2006}
}
```
The concert harp is a musical instrument composed of a soundboard, a cavity with soundholes and 47 strings. When a string is played, multiple couplings between strings occur, giving rise to vibratory and acoustic signals with a large number of spectral components, whose frequencies are sometimes very close. These components can be separated using a high resolution method: the ESPRIT method. This method makes it possible to identify the frequencies and damping factors of the various modes of the instrument's coupled strings. Applying it to vibratory signals measured on the soundboard of a concert harp, some of whose strings have been damped, allows the analysis of the sympathetic vibration phenomena characteristic of the instrument.

Proceedings

10th International Workshop on Acoustic Echo and Noise Control
Yves Grenier, Gaël Richard
IWAENC 2006, September 2006.

@proceedings{YGGR:iwaenc2006,
  author = {Grenier, Yves and Richard, Ga{\"e}l},
  title = {10th International Workshop on Acoustic Echo and Noise Control},
  booktitle = {IWAENC 2006},
  year = {2006},
  month = sep
}

Journal Articles

A new perturbation analysis for signal enumeration in rotational invariance techniques
Roland Badeau, Bertrand David, Gael Richard
IEEE Transactions on Signal Processing, 2006.
```
@article{badeau:hal-00479779,
  author = {Badeau, Roland and David, Bertrand and Richard, Gael},
  doi = {10.1109/TSP.2005.861899},
  hal_id = {hal-00479779},
  hal_local_reference = {RB:ITSP-06},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  keywords = {ESPRIT ; Rotational invariance ; Model order selection ; Signal enumeration ; Perturbation theory},
  number = {2},
  pages = {450-458},
  pdf = {https://hal-imt.archives-ouvertes.fr/hal-00479779/file/IEEE-TSP-06.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{A new perturbation analysis for signal enumeration in rotational invariance techniques}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00479779},
  volume = {54},
  year = {2006}
}
```
The ESPRIT algorithm is a subspace-based high resolution method used in source localization and spectral analysis, which provides very accurate estimates of the signal parameters. However, the underlying theory assumes a known model order, which is usually not the case in many applications. In particular, it is well known that under-evaluating the model order biases the estimation. In this paper, we analyze the perturbation induced by an erroneous model order, and we present an error bound for the estimated parameters. Based on this theoretical framework, we propose a new method for selecting an appropriate modeling order, which consists in minimizing the error bound. This approach is applied to both synthetic and musical signals and its performance is compared to that of existing methods, such as the Information Theoretic Criteria.
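
As background for this abstract, a minimal sketch of the standard ESPRIT estimator with a known model order `r` is given below. It is a textbook formulation under the assumption of a noise-free sum of complex exponentials, not the error-bound-based order selection method proposed in the paper; the function name `esprit` and its parameters are hypothetical.

```python
import numpy as np

def esprit(x, r, n=None):
    """Standard ESPRIT with a known model order r (illustrative sketch).

    Returns normalized frequencies (cycles/sample) and damping factors.
    """
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if n is None:
        n = N // 2
    cols = N - n + 1
    # Hankel data matrix: entry (i, j) equals x[i + j].
    H = np.array([x[i:i + cols] for i in range(n)])
    # The signal subspace is spanned by the r dominant left singular vectors.
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    W = U[:, :r]
    # Rotational invariance: W without its first row = W without its last row, times Phi.
    Phi = np.linalg.pinv(W[:-1]) @ W[1:]
    poles = np.linalg.eigvals(Phi)
    freqs = np.angle(poles) / (2 * np.pi)
    damping = -np.log(np.abs(poles))
    return freqs, damping

# Toy usage: two undamped exponentials with closely spaced frequencies.
t = np.arange(64)
x = np.exp(2j * np.pi * 0.10 * t) + 0.5 * np.exp(2j * np.pi * 0.11 * t)
freqs, damping = esprit(x, r=2)
```

Here the model order `r` is supplied by hand, which is the assumption the abstract points out is rarely met in practice.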

High resolution spectral analysis of mixtures of complex exponentials modulated by polynomials
Roland Badeau, Bertrand David, Gael Richard
IEEE Transactions on Signal Processing, 2006.
```
@article{badeau:hal-00479781,
  author = {Badeau, Roland and David, Bertrand and Richard, Gael},
  doi = {10.1109/TSP.2006.870556},
  hal_id = {hal-00479781},
  hal_local_reference = {RB:ITSP-06b},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  keywords = {High resolution ; Rotational invariance property ; ESPRIT ; Polynomial modulation ; Multiple eigenvalues},
  number = {4},
  pages = {1341-1350},
  pdf = {https://hal-imt.archives-ouvertes.fr/hal-00479781/file/IEEE-TSP-06b.pdf},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{High resolution spectral analysis of mixtures of complex exponentials modulated by polynomials}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00479781},
  volume = {54},
  year = {2006}
}
```
High resolution methods such as the ESPRIT algorithm are of major interest for estimating discrete spectra, since they overcome the resolution limit of the Fourier transform and provide very accurate estimates of the signal parameters. In the signal processing literature, most contributions focus on the estimation of exponentially modulated sinusoids in a noisy signal. In this paper, we introduce a more general class of signals, involving both amplitude and frequency modulations. We show that this Polynomial Amplitude Complex Exponentials (PACE) model is the most general model tractable by high resolution methods. We develop a generalized ESPRIT algorithm for estimating the signal parameters, and we show that this model can be characterized by means of a geometrical criterion.

Du corpus émotionnel au système de détection : le point de vue applicatif de la surveillance dans les lieux publics
Chloé Clavel, Ioana Vasilescu, Gael Richard, Laurence Devillers
Revue des Sciences et Technologies de l’Information - Série RIA : Revue d’Intelligence Artificielle, 2006.

@article{clavel:hal-02652597,
  author = {Clavel, Chlo{\'e} and Vasilescu, Ioana and Richard, Gael and Devillers, Laurence},
  hal_id = {hal-02652597},
  hal_version = {v1},
  journal = {{Revue des Sciences et Technologies de l'Information - S{\'e}rie RIA : Revue d'Intelligence Artificielle}},
  publisher = {{Lavoisier}},
  title = {{Du corpus {\'e}motionnel au syst{\`e}me de d{\'e}tection : le point de vue applicatif de la surveillance dans les lieux publics}},
  url = {https://hal.inria.fr/hal-02652597},
  year = {2006}
}

Instrument recognition in polyphonic music based on automatic taxonomies
Slim Essid, Gael Richard, Bertrand David
IEEE Transactions on Audio, Speech and Language Processing, 2006.

@article{essid:hal-00477670,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  doi = {10.1109/TSA.2005.860351},
  hal_id = {hal-00477670},
  hal_local_reference = {SE:COD-06},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  number = {1},
  pages = {68-80},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Instrument recognition in polyphonic music based on automatic taxonomies}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00477670},
  volume = {14},
  year = {2006}
}

Musical instrument recognition by pairwise classification strategies
Gael Richard, Slim Essid, Bertrand David
IEEE Transactions on Audio, Speech and Language Processing, 2006.

@article{richard:hal-00477671,
  author = {Richard, Gael and Essid, Slim and David, Bertrand},
  doi = {10.1109/TSA.2005.860842},
  hal_id = {hal-00477671},
  hal_local_reference = {SE:COD-06-2},
  hal_version = {v1},
  journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
  number = {4},
  pages = {1401-1412},
  publisher = {{Institute of Electrical and Electronics Engineers}},
  title = {{Musical instrument recognition by pairwise classification strategies}},
  url = {https://hal-imt.archives-ouvertes.fr/hal-00477671},
  volume = {14},
  year = {2006}
}

Technical Reports

Technical Report on Audio and Speech Processing
Roland Badeau
2006. Livrable du ....

@techreport{badeau:hal-00945247,
  author = {Badeau, Roland},
  hal_id = {hal-00945247},
  hal_version = {v1},
  note = {Livrable du r{\'e}seau d'excellence europ{\'e}en K-Space.},
  title = {{Technical Report on Audio and Speech Processing}},
  type = {Research Report},
  url = {https://hal.inria.fr/hal-00945247},
  year = {2006}
}

2005

Theses

Classification automatique des signaux audio-fréquences : reconnaissance des instruments de musique
Slim Essid
December 2005.
```
@phdthesis{essid:pastel-00002738,
  author = {Essid, Slim},
  hal_id = {pastel-00002738},
  hal_version = {v1},
  keywords = {Apprentissage statistique ; Extraction d'informations ; Indexation automatique ; Audio ; Musique ; Instruments de musique ; Bases de donn{\'e}es sonores ; Classification automatique ; M{\'e}thodes {\`a} noyau ; Machines {\`a} vecteurs supports ; Svm ; S{\'e}lection d'attributs ; Features ; Clustering ; Taxonomies hi{\'e}rarchiques ; Descripteurs ; Timbre ; Distances probabilistes ; Rkhs ; Divergence ; Bhattacharyya},
  month = dec,
  pdf = {https://pastel.archives-ouvertes.fr/pastel-00002738/file/these_essid.pdf},
  school = {{Universit{\'e} Pierre et Marie Curie - Paris VI}},
  title = {{Classification automatique des signaux audio-fr{\'e}quences : reconnaissance des instruments de musique}},
  type = {Theses},
  url = {https://pastel.archives-ouvertes.fr/pastel-00002738},
  year = {2005}
}
```
The aim of this thesis is to contribute to improving the automatic identification of musical instruments in realistic contexts (on solo music, but also on multi-instrument pieces). We approach the problem as an automatic classification task, striving to find effective implementations of the various modules that make up the system we propose. We adopt a hierarchical classification scheme based on taxonomies of instruments and of instrument mixtures. These taxonomies are inferred by means of a hierarchical clustering algorithm that exploits robust probabilistic distances computed with a kernel method. The system uses a new automatic feature selection algorithm to produce an efficient description of the audio signals which, combined with support vector machines, achieves high recognition rates on sound pieces reflecting the diversity of musical practice and of the recording conditions encountered in the real world. Our architecture is thus able to identify up to four instruments played simultaneously, from jazz excerpts that include percussion.

Conference Articles

Underdetermined blind source separation of audio sources in time-frequency domain
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
SPARS’05, Rennes, France, November 2005.

@inproceedings{AA:SPARS-05,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Underdetermined blind source separation of audio sources in time-frequency domain},
  booktitle = {SPARS'05},
  address = {Rennes, France},
  year = {2005},
  month = nov,
  volume = {1},
  pages = {115--118}
}

Séparation aveugle sous-déterminée de sources audio par la méthode EMD (Empirical Mode Decomposition)
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
20e Colloque GRETSI sur le traitement du signal et des images, Louvain-La-Neuve, Belgium, September 2005.

@inproceedings{AA:GRETSI-05,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {S{\'e}paration aveugle sous-d{\'e}termin{\'e}e de sources audio par la m{\'e}thode EMD (Empirical Mode Decomposition)},
  booktitle = {20e Colloque GRETSI sur le traitement du signal et des images},
  address = {Louvain-La-Neuve, Belgium},
  year = {2005},
  month = sep,
  volume = {2},
  pages = {1233--1236}
}

Blind separation of audio sources convolutive mixtures using parametric decomposition
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
IWAENC’05, Eindhoven (Netherlands), September 2005.

@inproceedings{AA:IWAENC-05,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Blind separation of audio sources convolutive mixtures using parametric decomposition},
  booktitle = {IWAENC'05},
  address = {Eindhoven (Netherlands)},
  year = {2005},
  month = sep,
  volume = {1},
  pages = {161--164}
}

Blind separation of audio sources using modal decomposition
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
ISSPA’05, Sydney (Australia), August 2005.

@inproceedings{AA:ISSPA2-05,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {Blind separation of audio sources using modal decomposition},
  booktitle = {ISSPA'05},
  address = {Sydney (Australia)},
  year = {2005},
  month = aug,
  volume = {2},
  pages = {451--454}
}

On the usefulness of differentiated transient/steady-state processing in machine recognition of musical instruments
Slim Essid, Pierre Leveau, Gael Richard, Laurent Daudet, Bertrand David
AES 118th convention, Barcelona, Spain, May 2005.
```
@inproceedings{essid:hal-02946881,
  address = {Barcelona, Spain},
  author = {Essid, Slim and Leveau, Pierre and Richard, Gael and Daudet, Laurent and David, Bertrand},
  booktitle = {{AES 118th convention}},
  hal_id = {hal-02946881},
  hal_version = {v1},
  month = may,
  pdf = {https://hal.telecom-paris.fr/hal-02946881/file/SE_AES-05.pdf},
  title = {{On the usefulness of differentiated transient/steady-state processing in machine recognition of musical instruments}},
  url = {https://hal.telecom-paris.fr/hal-02946881},
  year = {2005}
}
```
This paper addresses the usefulness of segmenting musical sounds into transient and non-transient parts for the task of machine recognition of musical instruments. We highlight the discriminative power of the attack-transient segments on the basis of objective criteria, consistent with well-known psychoacoustic findings. The sound database used is composed of real-world mono-instrument phrases. Moreover, we show that, paradoxically, it is not always optimal to consider such a segmentation of the audio signal in a machine recognition system for a given decision window. Our evaluation exploits efficient automatic segmentation techniques, a wide variety of signal processing features as well as feature selection algorithms and support vector machine classification.

Yet Another Subspace Tracker
Roland Badeau, Bertrand David, Gael Richard
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, Pennsylvania, United States, March 2005.
```
@inproceedings{badeau:hal-00960792,
  address = {Philadelphia, Pennsylvania, United States},
  author = {Badeau, Roland and David, Bertrand and Richard, Gael},
  booktitle = {{International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-00960792},
  hal_version = {v1},
  month = mar,
  pages = {329--332},
  pdf = {https://hal.archives-ouvertes.fr/hal-00960792/file/icassp-05.pdf},
  title = {{Yet Another Subspace Tracker}},
  url = {https://hal.archives-ouvertes.fr/hal-00960792},
  volume = {4},
  year = {2005}
}
```
This paper introduces a new algorithm for tracking the major subspace of the correlation matrix associated with time series. This algorithm greatly outperforms many well-known subspace trackers in terms of subspace estimation. Moreover, it guarantees the orthonormality of the subspace weighting matrix at each iteration, and reaches the lowest complexity found in the literature.

Instrument recognition in polyphonic music
Slim Essid, G. Richard, B. David
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05), Philadelphia, United States, March 2005.

```
@inproceedings{essid:hal-02946873,
  address = {Philadelphia, United States},
  author = {Essid, Slim and Richard, G. and David, B.},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05)}},
  doi = {10.1109/ICASSP.2005.1415692},
  hal_id = {hal-02946873},
  hal_version = {v1},
  month = mar,
  pages = {245--248},
  publisher = {{IEEE}},
  title = {{Instrument recognition in polyphonic music}},
  url = {https://hal.telecom-paris.fr/hal-02946873},
  year = {2005}
}
```

Iterative Algorithms for Multichannel Equalization in Sound Reproduction Systems
Mathieu Guillaume, Yves Grenier, Gaël Richard
ICASSP’05, Philadelphia, US, March 2005.

```
@inproceedings{MG-ICASSP-05,
  author = {Guillaume, Mathieu and Grenier, Yves and Richard, Ga{\"e}l},
  title = {Iterative Algorithms for Multichannel Equalization in Sound Reproduction Systems},
  booktitle = {ICASSP'05},
  address = {Philadelphia, US},
  year = {2005},
  month = mar
}
```

Séparation aveugle sous-déterminée de sources audio par la méthode EMD (Empirical Mode Decomposition)
Abdeldjalil Aissa El Bey, Karim Abed-Meraim, Yves Grenier
Journées Jeunes Chercheurs en Audition, Acoustique musicale et Signal audio, Marseille, France, March 2005.

```
@inproceedings{AA:JJCAAS-05,
  author = {Aissa El Bey, Abdeldjalil and Abed-Meraim, Karim and Grenier, Yves},
  title = {S{\'e}paration aveugle sous-d{\'e}termin{\'e}e de sources audio par la m{\'e}thode EMD (Empirical Mode Decomposition)},
  booktitle = {Journ{\'e}es Jeunes Chercheurs en Audition, Acoustique musicale et Signal audio},
  address = {Marseille, France},
  year = {2005},
  month = mar
}
```

Fast adaptive ESPRIT algorithm
Roland Badeau, Gael Richard, Bertrand David
Proc. of IEEE Workshop on Statistical Signal Processing (SSP), Bordeaux, France, 2005.
```
@inproceedings{badeau:hal-00945279,
  address = {Bordeaux, France},
  author = {Badeau, Roland and Richard, Gael and David, Bertrand},
  booktitle = {{Proc. of IEEE Workshop on Statistical Signal Processing (SSP)}},
  hal_id = {hal-00945279},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945279/file/ssp-05.pdf},
  title = {{Fast adaptive ESPRIT algorithm}},
  url = {https://hal.inria.fr/hal-00945279},
  year = {2005}
}
```
The ESPRIT algorithm is a subspace-based high resolution method used in source localization and spectral analysis. It relies on the rotational invariance property of the signal subspace, and provides accurate estimates of the signal parameters. However, its main drawback is a high computational cost. In an adaptive context, some very fast algorithms were proposed to robustly track the variations of the signal subspace. Based on these subspace trackers, we propose a new adaptive implementation of the ESPRIT algorithm, faster than the existing methods.
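
The rotational invariance idea mentioned in this abstract can be illustrated with a minimal batch ESPRIT sketch in NumPy. This is the textbook block version, not the adaptive implementation proposed in the paper; the function name `esprit_frequencies`, the Hankel window choice, and the test signal are assumptions made for illustration only.

```python
import numpy as np


def esprit_frequencies(x, order, window=None):
    """Estimate `order` normalized frequencies (cycles/sample) with batch ESPRIT."""
    x = np.asarray(x, dtype=complex)
    n = window if window is not None else len(x) // 2
    # Hankel data matrix: column j holds the samples x[j], ..., x[j + n - 1].
    cols = len(x) - n + 1
    data = np.array([x[j:j + n] for j in range(cols)]).T
    # Signal subspace: the `order` dominant left singular vectors.
    u, _, _ = np.linalg.svd(data, full_matrices=False)
    w = u[:, :order]
    # Rotational invariance: w[1:] is (approximately) w[:-1] @ phi,
    # and the eigenvalues of phi are the signal poles.
    phi = np.linalg.lstsq(w[:-1], w[1:], rcond=None)[0]
    poles = np.linalg.eigvals(phi)
    return np.sort(np.angle(poles) / (2 * np.pi))


# Hypothetical noise-free test signal: two complex exponentials
# at 0.10 and 0.13 cycles/sample.
t = np.arange(64)
signal = np.exp(2j * np.pi * 0.10 * t) + 0.5 * np.exp(2j * np.pi * 0.13 * t)
estimated = esprit_frequencies(signal, order=2)
```

The full SVD in this sketch is the costly step that motivates replacing it with a subspace tracker in an adaptive setting.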

Patents

Procédé de poursuite d’un sous-espace de dimension inférieure à celle des vecteurs de données, notamment audio
Roland Badeau, Gael Richard, Bertrand David
France, March 2005.

```
@patent{badeau:hal-00945253,
  address = {France},
  author = {Badeau, Roland and Richard, Gael and David, Bertrand},
  hal_id = {hal-00945253},
  hal_version = {v1},
  month = mar,
  number = {FR2888963},
  title = {{Proc{\'e}d{\'e} de poursuite d'un sous-espace de dimension inf{\'e}rieure {\`a} celle des vecteurs de donn{\'e}es, notamment audio}},
  url = {https://hal.inria.fr/hal-00945253},
  year = {2005}
}
```

Book Chapters

Interfaces audio AES/EBU
Yves Grenier
Editions Techniques de l’Ingénieur, 2005.

```
@incollection{YG:Teching-05,
  author = {Grenier, Yves},
  title = {Interfaces audio AES/EBU},
  booktitle = {Editions Techniques de l'Ing{\'e}nieur},
  publisher = {Editions Techniques de l'Ing{\'e}nieur},
  year = {2005}
}
```

2000 - 2004 [27 publications]

2004

Conference Articles

Multichannel Equalization Analysis in Sound Reproduction Systems
Mathieu Guillaume, Yves Grenier, Gaël Richard
23rd VDT International Audio Convention, Leipzig, Germany, November 2004.

```
@inproceedings{MG:vdt-2004,
  author = {Guillaume, Mathieu and Grenier, Yves and Richard, Ga{\"e}l},
  title = {Multichannel Equalization Analysis in Sound Reproduction Systems},
  booktitle = {23rd VDT International Audio Convention},
  address = {Leipzig, Germany},
  year = {2004},
  month = nov
}
```

Musical Instrument Recognition Based on Class Pairwise Feature Selection
Slim Essid, Gael Richard, Bertrand David
International Conference on Music Information Retrieval (ISMIR), Barcelona, Spain, October 2004.
```
@inproceedings{essid:hal-02946907,
  address = {Barcelona, Spain},
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  booktitle = {{International Conference on Music Information Retrieval (ISMIR)}},
  hal_id = {hal-02946907},
  hal_version = {v1},
  month = oct,
  pdf = {https://hal.telecom-paris.fr/hal-02946907/file/SE_ISMIR-04.pdf},
  title = {{Musical Instrument Recognition Based on Class Pairwise Feature Selection}},
  url = {https://hal.telecom-paris.fr/hal-02946907},
  year = {2004}
}
```
In this work, musical instrument recognition is considered on solo music from real-world performance. A large sound database is used that consists of musical phrases excerpted from commercial recordings with different instrument instances, different players, and varying recording conditions. The proposed recognition scheme exploits class pairwise feature selection based on inertia ratio maximization. Moreover, new signal processing features based on octave band energy measures are introduced that prove to be useful. Classification is performed using Gaussian Mixture Models in a one-vs-one fashion in association with a data rescaling procedure as pre-processing. Experimental results show that substantial improvement in recognition success is thus achieved.

Musical Instrument Recognition on Solo Performances
Slim Essid, Gaël Richard, Bertrand David
European Signal Processing Conference (EUSIPCO), Vienna, Austria, September 2004.
```
@inproceedings{essid:hal-02946903,
  address = {Vienna, Austria},
  author = {Essid, Slim and Richard, Ga{\"e}l and David, Bertrand},
  booktitle = {{European Signal Processing Conference (EUSIPCO)}},
  hal_id = {hal-02946903},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.telecom-paris.fr/hal-02946903/file/SE_EUSIPCO-04.pdf},
  title = {{Musical Instrument Recognition on Solo Performances}},
  url = {https://hal.telecom-paris.fr/hal-02946903},
  year = {2004}
}
```
Musical instrument recognition is one of the important goals of musical signal indexing. Although much effort has already been dedicated to the automatic recognition of musical instruments, most studies were based on limited amounts of data, which often included only isolated notes. In this paper, two statistical approaches, namely the Gaussian Mixture Model (GMM) and the Support Vector Machine (SVM), are studied for the recognition of woodwind instruments using a large database of isolated notes and solo excerpts extracted from many different sources. Furthermore, it is shown that the use of Principal Component Analysis (PCA) to transform the feature data significantly increases the recognition accuracy. The recognition rates obtained range from 52.0 % for Bb Clarinet up to 96.0 % for Oboe.

Efficient musical instrument recognition on solo performance music using basic features
Slim Essid, Gael Richard, Bertrand David
AES 25th conference, London, United Kingdom, June 2004.
```
@inproceedings{essid:hal-02946911,
  address = {London, United Kingdom},
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  booktitle = {{AES 25th conference}},
  hal_id = {hal-02946911},
  hal_version = {v1},
  month = jun,
  pdf = {https://hal.telecom-paris.fr/hal-02946911/file/SE_AES-04.pdf},
  title = {{Efficient musical instrument recognition on solo performance music using basic features}},
  url = {https://hal.telecom-paris.fr/hal-02946911},
  year = {2004}
}
```
Musical instrument recognition has attracted growing interest for the promise it holds for advances in musical content description. The present study aims to show the efficiency of some basic features for such a recognition task in the realistic situation where solo musical phrases are played. A large and varied database of sounds assembled from different commercial recordings is used to ensure better training and testing conditions, in terms of statistical efficiency. It is found that, when combining cepstral features with others describing the spectral shape of the audio signal, high recognition accuracy can be achieved with Support Vector Machine classification using a Radial Basis Function kernel.

Unsupervised Classification Techniques for Multipitch Estimation
Julie Rosier, Yves Grenier
116th Convention of the Audio Engineering Society, Berlin, Germany, May 2004.

```
@inproceedings{JR:AES-04,
  author = {Rosier, Julie and Grenier, Yves},
  title = {Unsupervised Classification Techniques for Multipitch Estimation},
  booktitle = {116th Convention of the Audio Engineering Society},
  address = {Berlin, Germany},
  year = {2004},
  month = may
}
```

Selecting the modeling order for the ESPRIT high resolution method: an alternative approach
Roland Badeau, Bertrand David, Gael Richard
Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Montréal, Québec, Canada, 2004.
```
@inproceedings{badeau:hal-00945275,
  address = {Montr{\'e}al, Qu{\'e}bec, Canada},
  author = {Badeau, Roland and David, Bertrand and Richard, Gael},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  hal_id = {hal-00945275},
  hal_version = {v1},
  pages = {1025--1028},
  pdf = {https://hal.inria.fr/hal-00945275/file/icassp-04.pdf},
  title = {{Selecting the modeling order for the ESPRIT high resolution method: an alternative approach}},
  url = {https://hal.inria.fr/hal-00945275},
  volume = {2},
  year = {2004}
}
```
High-resolution methods, such as the ESPRIT algorithm, provide an accurate representation of a harmonic signal as a sum of exponentially damped sinusoids. However, in coding applications, the signal must be represented with a minimum number of parameters. Unfortunately, it is well known that applying the ESPRIT algorithm with an underestimated model order generates biased frequency estimates. In this paper, we propose a new method for selecting an appropriate modeling order, which minimizes this bias. This approach was applied to both synthetic and musical signals and outperformed the classical information-theoretic criteria.

Journal Articles

Sliding window adaptive SVD algorithms
Roland Badeau, Gael Richard, Bertrand David
IEEE Transactions on Signal Processing, 2004.
```
@article{badeau:hal-00945196,
  author = {Badeau, Roland and Richard, Gael and David, Bertrand},
  hal_id = {hal-00945196},
  hal_version = {v1},
  journal = {{IEEE Transactions on Signal Processing}},
  number = {1},
  pages = {1--10},
  pdf = {https://hal.inria.fr/hal-00945196/file/ieee-tsp-04.pdf},
  publisher = {{IEEE}},
  title = {{Sliding window adaptive SVD algorithms}},
  url = {https://hal.inria.fr/hal-00945196},
  volume = {52},
  year = {2004}
}
```
The singular value decomposition (SVD) is an important tool for subspace estimation. In adaptive signal processing, we are especially interested in tracking the SVD of a recursively updated data matrix. This paper introduces a new tracking technique, designed for rectangular sliding window data matrices. This approach, derived from the classical bi-orthogonal iteration SVD algorithm, shows excellent performance in the context of frequency estimation. It proves to be very robust to abrupt signal changes, due to the use of a sliding window. Finally, an ultra-fast tracking algorithm with comparable performance is proposed.

2003

Conference Articles

Modèles Sinusoïdaux Étendus pour le Codage Audio
Remy Boyer, Slim Essid, Karim Abed-Meraim, Nicolas Moreau
Dix-neuvième colloque sur le Traitement du Signal et des Images, Paris, France, September 2003.
```
@inproceedings{boyer:hal-02946917,
  address = {Paris, France},
  author = {Boyer, Remy and Essid, Slim and Abed-Meraim, Karim and Moreau, Nicolas},
  booktitle = {{Dix-neuvi{\`e}me colloque sur le Traitement du Signal et des Images}},
  hal_id = {hal-02946917},
  hal_version = {v1},
  month = sep,
  pdf = {https://hal.telecom-paris.fr/hal-02946917/file/RB_GRETSI-03.pdf},
  title = {{Mod{\`e}les Sinuso{\"i}daux {\'E}tendus pour le Codage Audio}},
  url = {https://hal.telecom-paris.fr/hal-02946917},
  year = {2003}
}
```
In this article, we begin with a brief overview of several extensions of the sinusoidal model. Then, with audio signal coding in mind, we retain two representations, called the exponentially damped sinusoidal model and the damped and delayed sinusoidal model. We show their usefulness for identified audio phenomena (transients, pseudo-stationary segments, ...). In addition, we propose an algorithm for estimating the model parameters that combines a high-resolution approach with a deflation scheme. Finally, we show how these two models are viable solutions as "building blocks" in a sinusoidal audio coding architecture.

Extraction of weak background transients from audio signals
Yves Grenier, Bertrand David
114th Convention of the Audio Engineering Society, Amsterdam, NL, March 2003.

```
@inproceedings{YG:AES114,
  author = {Grenier, Yves and David, Bertrand},
  title = {Extraction of weak background transients from audio signals},
  booktitle = {114th Convention of the Audio Engineering Society},
  address = {Amsterdam, NL},
  year = {2003},
  month = mar
}
```

Musical tempo estimation using noise subspace projections
Miguel Alonso Arevalo, Roland Badeau, Bertrand David, Gael Richard
Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, United States, 2003.
```
@inproceedings{alonsoarevalo:hal-00945267,
  address = {New Paltz, New York, United States},
  author = {Alonso Arevalo, Miguel and Badeau, Roland and David, Bertrand and Richard, Gael},
  booktitle = {{Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}},
  hal_id = {hal-00945267},
  hal_version = {v1},
  pages = {95--98},
  pdf = {https://hal.inria.fr/hal-00945267/file/waspaa-03.pdf},
  title = {{Musical tempo estimation using noise subspace projections}},
  url = {https://hal.inria.fr/hal-00945267},
  year = {2003}
}
```
Tempo estimation plays a fundamental role in music analysis, especially for the automatic processing of large amounts of musical data. In this paper, a novel idea to enhance the estimation of the tempo in musical pieces is described, based on a harmonic/noise decomposition. This separation of the signal into a deterministic and a stochastic part is performed by projecting the signal onto its noise subspace. In addition, the proposed algorithm shares various elements with other tempo estimation methods. On a database of 54 excerpts from many musical genres, our algorithm achieved a success rate of 96%.

Suivi d’espace dominant par la méthode des puissances itérées
Roland Badeau, Gael Richard, Bertrand David
Actes du colloque GRETSI, Paris, France, 2003.
```
@inproceedings{badeau:hal-00945238,
  address = {Paris, France},
  author = {Badeau, Roland and Richard, Gael and David, Bertrand},
  booktitle = {{Actes du colloque GRETSI}},
  hal_id = {hal-00945238},
  hal_version = {v1},
  pages = {137--140},
  pdf = {https://hal.inria.fr/hal-00945238/file/gretsi-03.pdf},
  title = {{Suivi d'espace dominant par la m{\'e}thode des puissances it{\'e}r{\'e}es}},
  url = {https://hal.inria.fr/hal-00945238},
  volume = {1},
  year = {2003}
}
```
This paper introduces a sliding-window version of the API algorithm, which tracks the dominant subspace of a sequence of vectors. This algorithm is derived from the power iteration method and relies on an approximation less restrictive than the one known as the projection approximation. It guarantees the orthonormality of the matrix generated at each iteration and satisfies a global and exponential convergence property. Moreover, it outperforms most dominant-subspace trackers related to the power iteration method, such as PAST, NIC, NP3 and OPAST, while having the same algorithmic complexity. Our numerical simulations showed the benefit of using a sliding window: the algorithm reacts much faster to abrupt signal variations.

Sliding Window Orthonormal PAST Algorithm
Roland Badeau, Karim Abed-Meraim, Gael Richard, Bertrand David
Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, China, 2003.
```
@inproceedings{badeau:hal-00945271,
  address = {Hong Kong, China},
  author = {Badeau, Roland and Abed-Meraim, Karim and Richard, Gael and David, Bertrand},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-00945271},
  hal_version = {v1},
  pages = {261--264},
  pdf = {https://hal.inria.fr/hal-00945271/file/icassp-03-bis.pdf},
  title = {{Sliding Window Orthonormal PAST Algorithm}},
  url = {https://hal.inria.fr/hal-00945271},
  volume = {5},
  year = {2003}
}
```
This paper introduces an orthonormal version of the sliding-window Projection Approximation Subspace Tracker (PAST). The new algorithm guarantees the orthonormality of the signal subspace basis at each iteration. Moreover, it has the same complexity as the original PAST algorithm and, like the more computationally demanding natural power (NP) method, it satisfies a global convergence property and achieves excellent tracking performance.

Adaptive ESPRIT algorithm based on the PAST subspace tracker
Roland Badeau, Gael Richard, Bertrand David
Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, China, 2003.
```
@inproceedings{badeau:hal-00945280,
  address = {Hong Kong, China},
  author = {Badeau, Roland and Richard, Gael and David, Bertrand},
  booktitle = {{Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}},
  hal_id = {hal-00945280},
  hal_version = {v1},
  pages = {229--232},
  pdf = {https://hal.inria.fr/hal-00945280/file/icassp-03.pdf},
  title = {{Adaptive ESPRIT algorithm based on the PAST subspace tracker}},
  url = {https://hal.inria.fr/hal-00945280},
  volume = {6},
  year = {2003}
}
```
The Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) algorithm is a subspace-based analysis method used in source localization or frequency estimation, originally designed in a block signal processing context. Separately, the Projection Approximation Subspace Tracker (PAST) is a fast and robust subspace tracking method. This paper introduces a new frequency estimation and tracking algorithm, which relies on the PAST subspace tracker and a fast adaptive implementation of the ESPRIT algorithm.

Approximated power iterations for fast subspace tracking
Roland Badeau, Gael Richard, Bertrand David, Karim Abed-Meraim
Proc. of the 7th International Symposium on Signal Processing and its Applications (ISSPA), Paris, France, 2003.
```
@inproceedings{badeau:hal-00945281,
  address = {Paris, France},
  author = {Badeau, Roland and Richard, Gael and David, Bertrand and Abed-Meraim, Karim},
  booktitle = {{Proc. of the 7th International Symposium on Signal Processing and its Applications (ISSPA)}},
  hal_id = {hal-00945281},
  hal_version = {v1},
  pages = {583--586},
  pdf = {https://hal.inria.fr/hal-00945281/file/isspa-03.pdf},
  title = {{Approximated power iterations for fast subspace tracking}},
  url = {https://hal.inria.fr/hal-00945281},
  volume = {2},
  year = {2003}
}
```
This paper introduces a fast implementation of the power iterations method for subspace tracking, based on an approximation less restrictive than the well known projection approximation. This algorithm guarantees the orthonormality of the estimated subspace weighting matrix at each iteration, and satisfies a global and exponential convergence property. Moreover, it outperforms many subspace trackers related to the power method, such as PAST, NIC, NP3 and OPAST, while keeping the same computational complexity.
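
For orientation, the plain (non-approximated) power iteration that this abstract takes as its starting point can be sketched as follows. This is a minimal sketch that runs a full QR factorization at every step on a fixed covariance matrix; it is not the fast approximated implementation introduced in the paper. The function name `orthogonal_iteration` and the synthetic covariance are assumptions made for illustration.

```python
import numpy as np


def orthogonal_iteration(cov, rank, n_iter=200, seed=0):
    """Estimate an orthonormal basis of the dominant `rank`-dimensional subspace."""
    rng = np.random.default_rng(seed)
    w, _ = np.linalg.qr(rng.standard_normal((cov.shape[0], rank)))
    for _ in range(n_iter):
        # Power step followed by re-orthonormalization (QR), so that w
        # keeps orthonormal columns at every iteration.
        w, _ = np.linalg.qr(cov @ w)
    return w


# Hypothetical covariance with a known spectrum: eigenvalues 5, 4, 1, 0.5, 0.1.
rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.standard_normal((5, 5)))
cov = basis @ np.diag([5.0, 4.0, 1.0, 0.5, 0.1]) @ basis.T
w = orthogonal_iteration(cov, rank=2)

# Compare projectors, since a subspace basis is only defined up to rotation.
projector_est = w @ w.T
projector_true = basis[:, :2] @ basis[:, :2].T
```

In a tracking setting the covariance would be updated with each new data vector, and the per-step QR is the cost that fast trackers aim to avoid.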

An EDS modelling tool for tracking and modifying musical signals
Bertrand David, Gael Richard, Roland Badeau
Proc. of Stockholm Music Acoustics Conference (SMAC), Stockholm, Sweden, 2003.
```
@inproceedings{david:hal-00945287,
  address = {Stockholm, Sweden},
  author = {David, Bertrand and Richard, Gael and Badeau, Roland},
  booktitle = {{Proc. of Stockholm Music Acoustics Conference (SMAC)}},
  hal_id = {hal-00945287},
  hal_version = {v1},
  pages = {715--718},
  pdf = {https://hal.inria.fr/hal-00945287/file/smac-03.pdf},
  title = {{An EDS modelling tool for tracking and modifying musical signals}},
  url = {https://hal.inria.fr/hal-00945287},
  volume = {2},
  year = {2003}
}
```
When one intends to achieve good sound quality in applying classical effects like time stretching or pitch shifting to a piano recording, the problem of separating the stable sinusoidal components from the mechanical noise has to be addressed. This decomposition allows for specific processing of each part. In the case of the piano, the effect must indeed be applied only to the sinusoidal part to preserve the naturalness of the sound, especially if the amount of stretching or shifting is large. In this paper, an adaptive parametric method is presented, based on an Exponentially Damped Sinusoidal (EDS) model for the signal. This technique is not limited by the well-known trade-off between the spectral resolution and the length of the analysis window, as in classical Fourier-based approaches (for instance the phase vocoder). The signal subspace is directly tracked and the poles related to sinusoidal components are selected according to the stability of their time trajectory. Some recent enhancements of the algorithms, designed to achieve a lower complexity for audio signal processing, are demonstrated. The musical applications include modifications of piano sounds and the singing voice.

Journal Articles

Synthèse de la parole à partir du texte
Gael Richard, Olivier Cappé
Techniques de l’Ingénieur, 2003.

```
@article{richard:hal-02652561,
  author = {Richard, Gael and Capp{\'e}, Olivier},
  hal_id = {hal-02652561},
  hal_version = {v1},
  journal = {{Techniques de l'Ing{\'e}nieur}},
  publisher = {{Techniques de l'ing{\'e}nieur}},
  title = {{Synth{\`e}se de la parole {\`a} partir du texte}},
  url = {https://hal.inria.fr/hal-02652561},
  year = {2003}
}
```

2002

Conference Articles

Non-stationary modeling techniques adapted to low bitrate audio coding
Remy Boyer, Slim Essid, Nicolas Moreau
Int. Conf. on Signal Processing (ICSP), Beijing, China, August 2002.

```
@inproceedings{boyer:hal-01251615,
  address = {Beijing, China},
  author = {Boyer, Remy and Essid, Slim and Moreau, Nicolas},
  booktitle = {{Int. Conf. on Signal Processing (ICSP)}},
  hal_id = {hal-01251615},
  hal_version = {v1},
  month = aug,
  title = {{Non-stationary modeling techniques adapted to low bitrate audio coding}},
  url = {https://hal-supelec.archives-ouvertes.fr/hal-01251615},
  year = {2002}
}
```

Dynamic temporal segmentation in parametric non-stationary modeling for percussive musical signals
Remy Boyer, Slim Essid, Nicolas Moreau
IEEE International Conference on Multimedia and Expo (ICME), Lausanne, Switzerland, August 2002.
```
@inproceedings{boyer:hal-01251622,
  address = {Lausanne, Switzerland},
  author = {Boyer, Remy and Essid, Slim and Moreau, Nicolas},
  booktitle = {{IEEE International Conference on Multimedia and Expo (ICME)}},
  hal_id = {hal-01251622},
  hal_version = {v1},
  month = aug,
  title = {{Dynamic temporal segmentation in parametric non-stationary modeling for percussive musical signals}},
  url = {https://hal-supelec.archives-ouvertes.fr/hal-01251622},
  year = {2002}
}
```
An audio signal parametric modeling scheme is proposed that improves the representation of strong sound transients. The exponentially damped sinusoids (EDS) model is considered in association with a high-resolution parameter estimation approach. Such a technique is well adapted to almost every audio signal but is unfortunately not efficient when dealing with signals presenting strong temporal variations, such as percussive music signals: it causes pre-echo artifacts and a weak reproduction of onset dynamics, both of which are detrimental to listening. A system based on the EDS model has been developed, with a transient detector and dynamic time segmentation and modeling, that overcomes such artifacts.

Transient modeling with a Frequency-Transform Subspace Algorithm and “Transient + Sinusoidal” scheme
Remy Boyer, Slim Essid
IEEE Conference on Digital Signal Processing (DSP), Santorini, Greece, July 2002.
```
@inproceedings{boyer:hal-01251630,
  address = {Santorini, Greece},
  author = {Boyer, Remy and Essid, Slim},
  booktitle = {{IEEE Conference on Digital Signal Processing (DSP)}},
  hal_id = {hal-01251630},
  hal_version = {v1},
  month = jul,
  title = {{Transient modeling with a Frequency-Transform Subspace Algorithm and ``Transient + Sinusoidal'' scheme}},
  url = {https://hal-supelec.archives-ouvertes.fr/hal-01251630},
  year = {2002}
}
```
We present an efficient modeling method for audio signals with a strong transient character. It is shown that the parametric non-stationary exponentially damped sinusoids (EDS) model permits good performance for time-domain modeling of quasi-stationary signals or "weak" transients. However, modeling performance degrades when dealing with highly non-stationary signals, as found in a variety of musical sounds (various percussion instruments, castanets, triangle, ...). The idea is then to process the signal in a well-chosen frequency-transform domain in which the transient temporal characteristics are better modeled by EDS. As a result, better representations of the transient signal class are obtained, with no pre-echo artifacts (energy before the attack) and a very good reproduction of signal onset dynamics. Finally, an original "transient+sinusoidal" modeling scheme is proposed.

Two-Pitch Estimation For Co-channel Speakers Separation
Julie Rosier, Yves Grenier
ICASSP, Orlando, USA, May 2002.

```
@inproceedings{JR:ICASSP-02,
  author = {Rosier, Julie and Grenier, Yves},
  title = {Two-Pitch Estimation For Co-channel Speakers Separation},
  booktitle = {ICASSP},
  address = {Orlando, USA},
  year = {2002},
  month = may
}
```

Pitch Estimation for the Separation of Musical Sounds
Julie Rosier, Yves Grenier
112th AES Convention, Munich, Germany, May 2002.

```
@inproceedings{JR:AES-02,
  author = {Rosier, Julie and Grenier, Yves},
  title = {Pitch Estimation for the Separation of Musical Sounds},
  booktitle = {112th AES Convention},
  address = {Munich, Germany},
  year = {2002},
  month = may
}
```

EDS parametric modeling and tracking of audio signals
Roland Badeau, Rémy Boyer, Bertrand David
Proc. of the 5th International Conference on Digital Audio Effects (DAFx), Hamburg, Germany, 2002.
```
@inproceedings{badeau:hal-00945272,
  address = {Hamburg, Germany},
  author = {Badeau, Roland and Boyer, R{\'e}my and David, Bertrand},
  booktitle = {{Proc. of the 5th International Conference on Digital Audio Effects (DAFx)}},
  hal_id = {hal-00945272},
  hal_version = {v1},
  pages = {139--144},
  pdf = {https://hal.inria.fr/hal-00945272/file/dafx-02.pdf},
  title = {{EDS parametric modeling and tracking of audio signals}},
  url = {https://hal.inria.fr/hal-00945272},
  year = {2002}
}
```
Despite the success of parametric modeling in various fields of digital signal processing, the Fourier analysis remains a prominent tool for many audio applications. This paper aims at demonstrating the usefulness of the Exponentially Damped Sinusoidal (EDS) model both for analysis/synthesis and tracking purposes.

Sintrack analysis for tracking components of musical signals
Bertrand David, Roland Badeau, Gael Richard
Proc. of the Forum Acusticum, Seville, Spain, 2002.
```
@inproceedings{david:hal-00945285,
  address = {Seville, Spain},
  author = {David, Bertrand and Badeau, Roland and Richard, Gael},
  booktitle = {{Proc. of the Forum Acusticum}},
  hal_id = {hal-00945285},
  hal_version = {v1},
  pdf = {https://hal.inria.fr/hal-00945285/file/fas-02.pdf},
  title = {{Sintrack analysis for tracking components of musical signals}},
  url = {https://hal.inria.fr/hal-00945285},
  year = {2002}
}
```
Musical signals often contain closely spaced frequencies that produce beats, and the classical Fourier transform does not achieve sufficient resolution to estimate these components separately with accuracy. This paper presents a spectral analysis technique for estimating and tracking the frequencies, damping factors and amplitudes of each component. SINTRACK combines a Matrix Pencil high-resolution method with an LMS adaptive algorithm. It detects changes in the underlying signal model (for example, a change in the number of sinusoids) and can indicate the time variations of the parameters. Results are shown for a signal recorded from a guitar.

2001

Conference Articles

Exploration de techniques modernes de modélisation adaptées à du codage audio bas-débit
Remy Boyer, Slim Essid, Nicolas Moreau
7èmes Journées d’Etudes et d’Echanges : Compression et Représentation des Signaux Audiovisuels (CORESA), Dijon, France, October 2001.

```
@inproceedings{boyer:hal-02946929,
  address = {Dijon, France},
  author = {Boyer, Remy and Essid, Slim and Moreau, Nicolas},
  booktitle = {{7{\`e}mes Journ{\'e}es d'Etudes et d'Echanges : Compression et Repr{\'e}sentation des Signaux Audiovisuels (CORESA)}},
  hal_id = {hal-02946929},
  hal_version = {v1},
  month = oct,
  title = {{Exploration de techniques modernes de mod{\'e}lisation adapt{\'e}es {\`a} du codage audio bas-d{\'e}bit}},
  url = {https://hal.telecom-paris.fr/hal-02946929},
  year = {2001}
}
```

Patents

Procédé et système de restitution sonore à effet spatial, et terminal de téléphone incorporant un tel système
Gael Richard, Philip Lockwood, Francois Capman, Jérôme Boudy
France, February 2001.

```
@patent{richard:hal-02651102,
  address = {France},
  author = {Richard, Gael and Lockwood, Philip and Capman, Francois and Boudy, J{\'e}r{\^o}me},
  hal_id = {hal-02651102},
  hal_version = {v1},
  month = feb,
  number = {FR2797132},
  title = {{Proc{\'e}d{\'e} et syst{\`e}me de restitution sonore {\`a} effet spatial, et terminal de t{\'e}l{\'e}phone incorporant un tel syst{\`e}me}},
  url = {https://hal.telecom-paris.fr/hal-02651102},
  year = {2001}
}
```

Dispositif de saisie et restitution du son utilisant plusieurs capteurs
Yves Grenier
February 2001.

```
@patent{YG:Brevet-01,
  author = {Grenier, Yves},
  title = {Dispositif de saisie et restitution du son utilisant plusieurs capteurs},
  year = {2001},
  month = feb,
  number = {01/13896}
}
```

Journal Articles

Annotation in the SpeechDat Projects
Henk Van Den Heuvel, Louis Boves, Asuncion Moreno, Maurizio Omologo, Gael Richard, E Sanders
International Journal of Speech Technology, 2001.
```
@article{vandenheuvel:hal-02652545,
  author = {Van Den Heuvel, Henk and Boves, Louis and Moreno, Asuncion and Omologo, Maurizio and Richard, Gael and Sanders, E},
  hal_id = {hal-02652545},
  hal_version = {v1},
  journal = {{International Journal of Speech Technology}},
  keywords = {validation ; annotation ; transcription ; SpeechDat ; annotation tools},
  pages = {127 - 143},
  publisher = {{Springer Verlag}},
  title = {{Annotation in the SpeechDat Projects}},
  url = {https://hal.archives-ouvertes.fr/hal-02652545},
  volume = {4},
  year = {2001}
}
```
A large set of spoken language resources (SLR) for various European languages is being compiled in several SpeechDat projects with the aim of training and testing speech recognizers for voice-driven services, mainly over telephone lines. This paper focuses on the annotation conventions applied to the SpeechDat SLR. These SLR contain typical examples of short monologue speech utterances with simple orthographic transcriptions in a hierarchically simple annotation structure. The annotation conventions and their underlying principles are described and compared to approaches used for related SLR. The synchronization of the orthographic transcriptions with the corresponding speech files is addressed, and the impact of the selected approach on capturing specific phonological and phonetic phenomena is discussed. In the SpeechDat projects, a number of tools have been developed to carry out the transcription of the speech. In this paper, a short description of these tools and their properties is provided. For all SpeechDat projects, an internal validity check of the databases and their annotations is carried out. The procedure of this validation campaign, the performed evaluations, and some of the results are presented.
