Introduction

URGENT 2024 Challenge

URGENT 2024 (Universality, Robustness, and Generalizability for EnhancemeNT) is a speech enhancement challenge accepted by the NeurIPS 2024 Competition Track. We aim to build universal speech enhancement models that unify speech processing across a wide variety of conditions.

Goal

Motivated by the increasing interest in the generalizability of speech enhancement models, we propose the URGENT Challenge, which aims to:

  1. Bring more attention to constructing universal speech enhancement models with strong generalizability.
  2. Push forward the progress of speech enhancement research towards more realistic scenarios with a comprehensive evaluation.
  3. Provide an insightful and systematic comparison between SOTA discriminative and generative methods in a wide range of conditions, including different distortions and input formats (sampling frequencies and numbers of microphones).
  4. Provide a benchmark for this direction so that researchers can easily compare different methods.
  5. Enable conclusive comparisons between methods by providing a set of training data that is exclusive and mandatory for all models.

Task Introduction

The task of this challenge is to build a single speech enhancement system to adaptively handle input speech with different distortions (corresponding to different SE subtasks) and different input formats (e.g., sampling frequencies) in different acoustic environments (e.g., noise and reverberation).

The training data will consist of several public corpora of speech, noise, and RIRs. Only the specified set of data can be used during the challenge. We encourage participants to apply data augmentation techniques such as dynamic mixing to achieve the best generalizability. The data preparation scripts are released in our GitHub repository (https://github.com/urgent-challenge/urgent2024_challenge/). Check the Data tab for more information.
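Dynamic mixing, for instance, regenerates noisy mixtures on the fly during training instead of reusing a fixed pre-simulated set. The sketch below illustrates the idea under simplifying assumptions (single channel, matched sampling frequencies); the function and its parameters are illustrative, not part of the official data preparation scripts:

```python
import numpy as np
import scipy.signal

def dynamic_mix(speech, noise, rir=None, snr_db_range=(-5.0, 20.0), rng=None):
    """Simulate a noisy (and optionally reverberant) mixture on the fly.

    speech, noise : 1-D float arrays at the same sampling frequency.
    rir           : optional room impulse response for reverberation.
    """
    rng = rng or np.random.default_rng()

    # Optionally reverberate the clean speech with a room impulse response.
    if rir is not None:
        speech = scipy.signal.fftconvolve(speech, rir)[: len(speech)]

    # Trim or tile the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]

    # Scale the noise to a randomly drawn signal-to-noise ratio.
    snr_db = rng.uniform(*snr_db_range)
    speech_power = np.mean(speech**2) + 1e-10
    noise_power = np.mean(noise**2) + 1e-10
    noise = noise * np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    return speech + noise, speech  # (noisy input, clean target)
```

Drawing a fresh SNR (and, optionally, a fresh noise clip and RIR) for every training example exposes the model to far more distortion combinations than a static pre-mixed training set.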

We also provide baselines in the ESPnet toolkit to facilitate system development. Check the Baseline tab for more information.

We will evaluate the enhanced audio with a variety of metrics to comprehensively understand the capacity of existing generative and discriminative methods. They include four different categories of metrics (an additional category, subjective SE metrics, will be added in the final blind test phase to evaluate MOS scores); a minimal scoring sketch follows the list:

  1. non-intrusive metrics (e.g., DNSMOS, NISQA) for reference-free speech quality evaluation.
  2. intrusive metrics (e.g., PESQ, STOI, SDR, MCD) for objective speech quality evaluation.
  3. downstream-task-independent metrics (e.g., Levenshtein phone similarity) for language-independent, speaker-independent, and task-independent evaluation.
  4. downstream-task-dependent metrics (e.g., speaker similarity, word accuracy or WAcc) for evaluation of compatibility with different downstream tasks.
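
To make the intrusive and downstream-task-independent categories concrete, here is a minimal sketch that scores one reference/estimate pair. It assumes the third-party pesq and pystoi packages and a toy phone-similarity function; it is not the official scoring code:

```python
import numpy as np
import soundfile as sf
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def intrusive_metrics(ref_path, est_path):
    """Score one enhanced file against its clean reference."""
    ref, fs = sf.read(ref_path)
    est, fs_est = sf.read(est_path)
    assert fs == fs_est, "reference and estimate must share a sampling frequency"
    n = min(len(ref), len(est))
    ref, est = ref[:n], est[:n]

    # Plain (non-scale-invariant) signal-to-distortion ratio in dB.
    sdr = 10 * np.log10(np.sum(ref**2) / (np.sum((ref - est) ** 2) + 1e-10))

    # PESQ only accepts 8 kHz ("nb") or 16 kHz ("wb"/"nb") input, so audio at
    # other sampling frequencies would have to be resampled first.
    mode = "wb" if fs == 16000 else "nb"
    return {
        "PESQ": pesq(fs, ref, est, mode),
        "STOI": stoi(ref, est, fs, extended=False),
        "SDR": sdr,
    }

def levenshtein_phone_similarity(ref_phones, est_phones):
    """Toy Levenshtein phone similarity: 1 - normalized edit distance between
    phone sequences (e.g., produced by a multilingual phone recognizer)."""
    m, n = len(ref_phones), len(est_phones)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(m + 1), np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_phones[i - 1] == est_phones[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return 1.0 - d[m, n] / max(m, n, 1)
```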

More details about the evaluation plan can be found in the Rules tab.

Communication

Join our Slack workspace for real-time communication.

Workshop

Top-ranking teams will be invited to a dedicated workshop at the NeurIPS 2024 conference (December 14 or 15, 2024). More information will be provided after the challenge is completed.

Motivation

Recent decades have witnessed the rapid development of deep learning-based speech enhancement (SE) techniques, which achieve impressive performance in matched conditions. However, most conventional SE approaches focus on only a limited range of conditions, such as single-channel, multi-channel, or anechoic settings. In many existing works, researchers train SE models on only one or two common datasets, such as VoiceBank+DEMAND (https://datashare.ed.ac.uk/handle/10283/2791) and the Deep Noise Suppression (DNS) Challenge datasets.

Evaluation is often done only on simulated conditions similar to the training setting. Meanwhile, in earlier SE challenges such as the DNS series, the choice of training data was often left to the participants, with the result that models trained on huge amounts of private data were compared with models trained on a small public dataset. This greatly impedes a comprehensive understanding of the generalizability and robustness of SE methods. In addition, the model design may be biased towards a specific, limited condition if only a small amount of data is used, and the resultant SE model may have limited capacity to handle more complicated scenarios.

Apart from conventional discriminative methods, generative methods have also attracted much attention in recent years. They are good at handling different distortions with a single model and tend to generalize better than discriminative methods. However, their capability and universality have not yet been fully assessed on a comprehensive benchmark.

Meanwhile, recent efforts have shown the possibility of building a single system to handle various input formats, such as different sampling frequencies and numbers of microphones. However, a well-established benchmark covering a wide range of conditions is still missing, and no systematic comparison has been made between state-of-the-art (SOTA) discriminative and generative methods regarding their generalizability.
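One common design for handling different sampling frequencies with a single model, loosely in the spirit of the input-condition-invariant systems cited above (a hedged sketch, not the challenge baseline), is to fix the STFT window and hop in milliseconds rather than in samples, so the frame rate stays constant across sampling frequencies:

```python
import torch

def sfi_stft(wav: torch.Tensor, fs: int, win_ms: float = 32.0, hop_ms: float = 8.0):
    """Sampling-frequency-independent STFT front-end (illustrative only).

    The window and hop are fixed in milliseconds, so their sample counts
    scale with the sampling frequency and the time resolution of the
    resulting spectrogram stays comparable across inputs.
    """
    win_length = int(fs * win_ms / 1000)
    hop_length = int(fs * hop_ms / 1000)
    n_fft = 2 ** (win_length - 1).bit_length()  # next power of two >= window
    window = torch.hann_window(win_length, device=wav.device)
    return torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window,
                      return_complex=True)

# Usage: one second of 16 kHz and 48 kHz audio yields the same frame rate
# (125 frames/s here); only the number of frequency bins grows with fs.
spec16 = sfi_stft(torch.randn(16000), fs=16000)
spec48 = sfi_stft(torch.randn(48000), fs=48000)
```

With a constant frame rate, the same temporal model can process inputs at any sampling frequency, and only the frequency dimension changes with the input format.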

Existing speech enhancement challenges have fostered the development of speech enhancement models for specific conditions, such as denoising and dereverberation, speech restoration, packet loss concealment, acoustic echo cancellation, hearing aids, 3D speech enhancement, far-field multi-channel speech enhancement for video conferencing, unsupervised domain adaptation for denoising, and audio-visual speech enhancement. These challenges have greatly enriched the corpora in speech enhancement studies. However, a challenge that benchmarks the generalizability of speech enhancement systems across a wide range of conditions is still missing.

Similar issues can also be observed in other speech tasks such as automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), and spoken language understanding (SLU). Among them, speech enhancement is particularly vulnerable to mismatches, since it relies heavily on paired clean/noisy speech data to achieve strong performance. Unsupervised speech enhancement that does not require ground-truth clean speech has been proposed to address this issue, but it often only brings benefits in a final fine-tuning stage. Therefore, we focus on speech enhancement in this challenge to address the aforementioned problems.


References

  1. Universal Speech Enhancement with Score-Based Diffusion
    Joan Serrà, Santiago Pascual, Jordi Pons, R Oguz Araz and Davide Scaini, 2022. arXiv preprint arXiv:2206.03065.
  2. VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration
    Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang and Yuxuan Wang, 2022. Proc. Interspeech, pp. 4232--4236. DOI: 10.21437/Interspeech.2022-11026
  3. Conditional Diffusion Probabilistic Model for Speech Enhancement
    Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu and Yu Tsao, 2022. Proc. ICASSP, pp. 7402--7406. DOI: 10.1109/ICASSP43922.2022.9746901
  4. Toward Universal Speech Enhancement for Diverse Input Conditions
    Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe and Yanmin Qian, 2023. Proc. ASRU. DOI: 10.1109/ASRU57964.2023.10389733
  5. Improving Design of Input Condition Invariant Speech Enhancement
    Wangyou Zhang, Jee-weon Jung and Yanmin Qian, 2024. Proc. ICASSP, pp. 10696--10700. DOI: 10.1109/ICASSP48485.2024.10448155
  6. The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
    Chandan K.A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan and Johannes Gehrke, 2020. Proc. Interspeech, pp. 2492--2496. DOI: 10.21437/Interspeech.2020-3038
  7. ICASSP 2021 Deep Noise Suppression Challenge
    Chandan KA Reddy, Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner and Sriram Srinivasan, 2021. Proc. ICASSP, pp. 6623--6627. DOI: 10.1109/ICASSP39728.2021.9415105
  8. INTERSPEECH 2021 Deep Noise Suppression Challenge
    Chandan K.A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner and Sriram Srinivasan, 2021. Proc. Interspeech, pp. 2796--2800. DOI: 10.21437/Interspeech.2021-1609
  9. ICASSP 2022 Deep Noise Suppression Challenge
    Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper and Robert Aichner, 2022. Proc. ICASSP, pp. 9271--9275. DOI: 10.1109/ICASSP43922.2022.9747230
  10. ICASSP 2023 Deep Noise Suppression Challenge
    Harishchandra Dubey, Ashkan Aazami, Vishak Gopal, Babak Naderi, Sebastian Braun, Ross Cutler, Alex Ju, Mehdi Zohourian, Min Tang, Mehrsa Golestaneh and Robert Aichner, 2024. IEEE Open Journal of Signal Processing, pp. 1--13. DOI: 10.1109/OJSP.2024.3378602
  11. ICASSP 2023 Speech Signal Improvement Challenge
    Ross Cutler, Ando Saabas, Babak Naderi, Nicolae-Cătălin Ristea, Sebastian Braun and Solomiya Branets, 2024. IEEE Open Journal of Signal Processing, pp. 1--12. DOI: 10.1109/OJSP.2024.3376293
  12. ICASSP 2024 Speech Signal Improvement Challenge
    Nicolae Catalin Ristea, Ando Saabas, Ross Cutler, Babak Naderi, Sebastian Braun and Solomiya Branets, 2024. arXiv preprint arXiv:2401.14444.
  13. INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
    Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner and Ross Cutler, 2022. Proc. Interspeech, pp. 580--584. DOI: 10.21437/Interspeech.2022-10829
  14. The ICASSP 2024 Audio Deep Packet Loss Concealment Challenge
    Lorenz Diener, Solomiya Branets, Ando Saabas and Ross Cutler, 2024. arXiv preprint arXiv:2402.16927.
  15. ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, Testing Framework, and Results
    Kusha Sridhar, Ross Cutler, Ando Saabas, Tanel Parnamaa, Markus Loide, Hannes Gamper, Sebastian Braun, Robert Aichner and Sriram Srinivasan, 2021. Proc. ICASSP, pp. 151--155. DOI: 10.1109/ICASSP39728.2021.9413457
  16. INTERSPEECH 2021 Acoustic Echo Cancellation Challenge
    Ross Cutler, Ando Saabas, Tanel Parnamaa, Markus Loide, Sten Sootla, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sorensen, Robert Aichner and Sriram Srinivasan, 2021. Proc. Interspeech, pp. 4748--4752. DOI: 10.21437/Interspeech.2021-1870
  17. ICASSP 2022 Acoustic Echo Cancellation Challenge
    Ross Cutler, Ando Saabas, Tanel Parnamaa, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sørensen and Robert Aichner, 2022. Proc. ICASSP, pp. 9107--9111. DOI: 10.1109/ICASSP43922.2022.9747215
  18. ICASSP 2023 Acoustic Echo Cancellation Challenge
    Ross Cutler, Ando Saabas, Tanel Pärnamaa, Marju Purin, Evgenii Indenbom, Nicolae-Cătălin Ristea, Jegor Gužvin, Hannes Gamper, Sebastian Braun and Robert Aichner, 2024. IEEE Open Journal of Signal Processing, pp. 1--10. DOI: 10.1109/OJSP.2024.3376289
  19. Clarity-2021 Challenges: Machine Learning Challenges for Advancing Hearing Aid Processing
    Simone Graetzer, Jon Barker, Trevor J Cox, Michael Akeroyd, John F Culling, Graham Naylor, Eszter Porter and Rhoddy Viveros Muñoz, 2021. Proc. Interspeech, pp. 686--690. DOI: 10.21437/Interspeech.2021-1574
  20. The 2nd Clarity Enhancement Challenge for Hearing Aid Speech Intelligibility Enhancement: Overview and Outcomes
    Michael A Akeroyd, Will Bailey, Jon Barker, Trevor J Cox, John F Culling, Simone Graetzer, Graham Naylor, Zuzanna Podwińska and Zehai Tu, 2023. Proc. ICASSP. DOI: 10.1109/ICASSP49357.2023.10094918
  21. Overview of the 2023 ICASSP SP Clarity Challenge: Speech Enhancement for Hearing Aids
    Trevor J Cox, Jon Barker, Will Bailey, Simone Graetzer, Michael A Akeroyd, John F Culling and Graham Naylor, 2023. Proc. ICASSP, pp. 1--2. DOI: 10.1109/ICASSP49357.2023.10433922
  22. L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing
    Eric Guizzo, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia Medaglia, Giuseppe Nachira, Leonardo Nucciarelli, Ludovica Paglialunga, Marco Pennese, Sveva Pepe, Enrico Rocchi, Aurelio Uncini and Danilo Comminiello, 2021. Proc. MLSP, pp. 1--6. DOI: 10.1109/MLSP52302.2021.9596248
  23. L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment
    Eric Guizzo, Christian Marinoni, Marco Pennese, Xinlei Ren, Xiguang Zheng, Chen Zhang, Bruno Masiero, Aurelio Uncini and Danilo Comminiello, 2022. Proc. ICASSP, pp. 9186--9190. DOI: 10.1109/ICASSP43922.2022.9746872
  24. Overview of the L3DAS23 Challenge on Audio-Visual Extended Reality
    Christian Marinoni, Riccardo F Gramaccioni, Changan Chen, Aurelio Uncini and Danilo Comminiello, 2023. Proc. ICASSP. DOI: 10.1109/ICASSP49357.2023.10433925
  25. L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality
    Riccardo F Gramaccioni, Christian Marinoni, Changan Chen, Aurelio Uncini and Danilo Comminiello, 2024. IEEE Open Journal of Signal Processing, pp. 1--9. DOI: 10.1109/OJSP.2024.3376297
  26. ConferencingSpeech Challenge: Towards Far-Field Multi-Channel Speech Enhancement for Video Conferencing
    Wei Rao, Yihui Fu, Yanxin Hu, Xin Xu, Yvkai Jv, Jiangyu Han, Zhongjie Jiang, Lei Xie, Yannan Wang, Shinji Watanabe, Zheng-Hua Tan, Hui Bu, Tao Yu and Shidong Shang, 2021. Proc. ASRU, pp. 679--686. DOI: 10.1109/ASRU51503.2021.9688126
  27. The CHiME-7 UDASE Task: Unsupervised Domain Adaptation for Conversational Speech Enhancement
    Simon Leglaive, Léonie Borne, Efthymios Tzinis, Mostafa Sadeghi, Matthieu Fraticelli, Scott Wisdom, Manuel Pariente, Daniel Pressnitzer and John R Hershey, 2023. Proc. 7th International Workshop on Speech Processing in Everyday Environments (CHiME). DOI: 10.21437/CHiME.2023-2
  28. AVSE Challenge: Audio-Visual Speech Enhancement Challenge
    Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Ondrej Klejch, Mandar Gogate, Kia Dashtipour, Amir Hussain and Peter Bell, 2023. Proc. SLT, pp. 465--471. DOI: 10.1109/SLT54892.2023.10023284
  29. Employing Real Training Data for Deep Noise Suppression
    Ziyi Xu, Marvin Sach, Jan Pirklbauer and Tim Fingscheidt, 2024. Proc. ICASSP, pp. 10731--10735. DOI: 10.1109/ICASSP48485.2024.10448333