Track1: Universal Speech Enhancement

Contents:

Datasets

Notice: ⚠️ A simulation-related issue has been corrected in the baseline code. Because of this bug, some noisy files generated by the earlier simulation version contain complete silence. We recommend pulling the latest code from the baseline repository to ensure proper dynamic simulation behavior. However, the official validation set that we released remains unchanged to preserve leaderboard consistency. The official validation set is for dry-run use only and does not affect the final ranking.

🆕 [New Update] Non-Blind Test for Leaderboard Now Available

The non-blind test is now available at urgent26_track1_nonblind_test

  • Check the submission guide, and submit the enhanced speech to the leaderboard for the non-blind test (opens after Oct. 14).
  • The non-blind test was simulated in the same way as the official validation set, but it includes a richer variety of speech.

After the non-blind test phase ends, clean speech and metadata will be available.

Official Validation Set for Leaderboard Testing

🆕 [Updated Oct. 20] The official validation set with reference audio and labels is now available here.

To help participants get familiar with the leaderboard submission process, we provide an official validation set:

  • Download it here: urgent26_track1_leaderboard_validation
  • For dry-run submissions before non-blind/blind test sets
  • Ranked using the same objective metrics as final evaluation (stage 1).
  • Check the submission guide first, and the leaderboard for validation will open after Sept. 21.

This set is not for training. Use it to verify your submission format and preview leaderboard ranking.

Brief data description:

The training and validation data are both simulated by using several public speech/noise/rir corpora (see the table below for more details). We provide the data preparation pipeline with the official baseline, which automatically downloads and pre-processes the data.

The non-blind/blind test sets will be released later; check the timeline page for more details.

Detailed data description:

Starting from the dataset of the 2nd URGENT challenge, we selected a subset using the data filtering method proposed in a recent paper.

Note that we encourage you to explore better ways of selecting and using data in this challenge. In addition to the data and filtering methods provided by our baseline, you can make use of larger-scale datasets, such as the track1/track2 data from the 2nd URGENT challenge, or other allowed data (please check the rules section).

The training and validation data are both simulated based on the following source data.

Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | Duration in 3rd URGENT | License
Speech | LibriVox data from DNS5 challenge | Audiobook | 8~48 | ~350 h | ~150 h | CC BY 4.0
 | LibriTTS reading speech | Audiobook | 8~24 | ~200 h | ~109 h | CC BY 4.0
 | VCTK reading speech | Newspaper, etc. | 48 | ~80 h | ~44 h | ODC-BY
 | EARS speech | Studio recording | 48 | ~100 h | ~16 h | CC-NC 4.0
 | Multilingual Librispeech (de, en, es, fr) (note 1) | Audiobook | 8~48 | ~450 (48600) h | ~129 h | CC0
 | CommonVoice 19.0 (de, en, es, fr, zh-CN) | Crowd-sourced voices | 8~48 | ~1300 (9500) h | ~250 h | CC0
 | NNCES | Children speech | 44.1 | - | ~20 h | CC0
 | SeniorTalk | Elderly speech | 16 | - | ~50 h | CC BY-NC-SA 4.0
 | VocalSet | Singing voice | 44.1 | - | ~10 h | CC BY 4.0
 | ESD | Emotional speech | 16 | - | ~30 h | non-commercial, custom (note 2)

Note 1: We collected less compressed MLS from LibriVox, which has higher audio quality than the original MLS for ASR.
Note 2: You need to sign a license to obtain this dataset.

For the noise sources and RIRs, we follow the same configuration as in the 2nd URGENT challenge.

Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | License
Noise | Audioset+FreeSound noise in DNS5 challenge | Crowd-sourced + Youtube | 8~48 | ~180 h | CC BY 4.0
 | WHAM! noise | 4 Urban environments | 48 | ~70 h | CC BY-NC 4.0
 | FSD50K (human voice filtered) | Crowd-sourced | 8~48 | ~100 h | CC0, CC-BY, CC-BY-NC, CC Sampling+
 | Free Music Archive (medium) | Free Music Archive (directed by WFMU) | 8~44.1 | ~200 h | CC
 | Wind noise simulated by participants | - | any | - | N/A
RIR | Simulated RIRs from DNS5 challenge | SLR28 | 48 | ~60k samples | CC BY 4.0
 | RIRs simulated by participants | - | any | - | N/A

Participants can also simulate their own RIRs using existing tools for generating the training data. Participants can also propose publicly available, real recorded RIRs to be included in the above data list during the grace period. See rules section for more details.

Data Selection and Simulation. We apply data selection to the track1 data of the 2nd URGENT challenge using the data filtering method proposed in the recent paper. The selected data list is available here. The speech sources from NNCES, SeniorTalk, VocalSet, and ESD are not filtered.

Note that the data filtering method from the paper is clearly not the best way to utilize large-scale datasets for SE. The goal of this challenge is to encourage participants to explore how to better leverage large-scale data to improve SE performance.

The simulation data can be generated as follows:

  1. In the first step, a manifest meta.tsv is generated by simulation/generate_data_param.py from the given list of speech, noise, and room impulse response (RIR) samples. It specifies how each sample will be simulated, including the type of distortion to be applied, the speech/noise/RIR sample to be used, the signal-to-noise ratio (SNR), the random seed, and so on.

  2. In the second step, the simulation can be done in parallel via simulation/simulate_data_from_param.py for different samples according to the manifest while ensuring reproducibility. This procedure can be used to generate training and validation datasets.

  3. By default, we apply a high-pass filter to the speech signals because some speech sources contain high-energy noise in the infrasound frequency band. You can turn it off by setting highpass=False in your simulation.
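
The manifest-driven design above can be illustrated with a minimal sketch. It is not the baseline implementation: the row fields `snr_db` and `seed` are hypothetical stand-ins for the columns of meta.tsv, only additive noise is covered, and toy signals replace real audio. The point it shows is that, once every random choice is driven by the seed stored in the manifest, simulating the same row twice yields the same sample, which is what makes parallel simulation reproducible.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def simulate_row(row, speech, noise_pool):
    """Simulate one manifest row; all randomness comes from the row's seed."""
    rng = random.Random(row["seed"])
    # Pick a random noise segment with the same length as the speech.
    start = rng.randrange(len(noise_pool) - len(speech) + 1)
    noise = noise_pool[start:start + len(speech)]
    return mix_at_snr(speech, noise, row["snr_db"])


# Hypothetical manifest row and toy signals (not the real meta.tsv format).
row = {"id": "sample_0001", "snr_db": 5.0, "seed": 1234}
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
noise_rng = random.Random(0)
noise_pool = [noise_rng.uniform(-1.0, 1.0) for _ in range(16000)]

# Two independent runs of the same row give identical output.
noisy_a = simulate_row(row, speech, noise_pool)
noisy_b = simulate_row(row, speech, noise_pool)
```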

A pre-simulated training and validation dataset is available online at Hugging Face. Participants can download and use it directly without running the simulation scripts to get started quickly. Note that the simulated speech derived from the ESD dataset is not included due to license restrictions; you may apply for the license and run the simulation script yourself to generate the complete dataset. For your final submitted system, we recommend dynamically generating degraded speech samples during training to increase data diversity.

[Figure: framework]

Distortion Model. As depicted in the figure above, we design a distortion model (simulation stage) $\mathcal{F}(\cdot)$ to unify the data format for different distortion types, so that different speech enhancement (SE) sub-tasks can share consistent input/output processing. In particular, we ensure that the sampling frequency (SF) at the output of the distortion model (degraded speech) is always equal to that of its input.

During training and inference, the processing of different SFs is supported for both conventional SE models (lower-right) that usually only operate at one SF and adaptive STFT-based sampling-frequency-independent (SFI) SE models (upper-right) that can directly handle different SFs.

  • For conventional SE models (e.g., Conv-TasNet), we always upsample their input (degraded speech) to the highest SF (48 kHz) so that the model only needs to operate at 48 kHz. The model output (48 kHz) is then downsampled to the same SF as the degraded speech.
  • For adaptive STFT-based SFI SE models (e.g., BSRNN, TF-GridNet), we directly feed the degraded speech of different SFs into the model, which can adaptively adjust their STFT/iSTFT configuration according to the SF and generate the enhanced signal with the same SF.
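
The two strategies above can be sketched as follows. This is an illustration under stated assumptions, not the baseline code: the 32 ms window and 16 ms hop are placeholder values, `model_48k` is a hypothetical callable that maps a list of samples to a list of samples, and the linear-interpolation resampler is a crude stand-in for a proper band-limited resampler.

```python
def stft_config(sf_hz, win_ms=32.0, hop_ms=16.0):
    """SFI idea: keep the window/hop durations fixed and derive their
    lengths in samples from the sampling frequency."""
    win = round(sf_hz * win_ms / 1000)
    hop = round(sf_hz * hop_ms / 1000)
    return win, hop


def resample_linear(signal, src_sf, dst_sf):
    """Crude linear-interpolation resampler (illustration only)."""
    n_out = round(len(signal) * dst_sf / src_sf)
    out = []
    for i in range(n_out):
        pos = i * src_sf / dst_sf  # position in the source signal
        lo = min(int(pos), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


def enhance_with_fixed_rate_model(model_48k, degraded, sf_hz, model_sf=48000):
    """Conventional model: upsample to 48 kHz, enhance, return to the input SF."""
    upsampled = resample_linear(degraded, sf_hz, model_sf)
    enhanced = model_48k(upsampled)
    restored = resample_linear(enhanced, model_sf, sf_hz)
    # Match the input length exactly: trim, or zero-pad if rounding fell short.
    restored = restored[:len(degraded)]
    restored += [0.0] * (len(degraded) - len(restored))
    return restored


# The same time resolution maps to different sample counts per SF.
configs = {sf: stft_config(sf) for sf in (8000, 16000, 48000)}
```

With these placeholder durations, `configs` maps 8 kHz to a 256-sample window and 48 kHz to a 1536-sample window, so the time resolution is the same at every SF.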

In the challenge, the SE system has to address the following seven distortions:

  1. Additive noise
  2. Reverberation
  3. Clipping
  4. Bandwidth limitation
  5. Codec distortion
  6. Packet loss
  7. Wind noise

We provide an example simulation script in simulation/simulate_data_from_param.py.
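
To give a feel for what two of the listed distortions do to a waveform, here is a minimal sketch of clipping and packet loss. The function names, parameters, and the zero-fill behavior for lost packets are illustrative assumptions; the official script may choose thresholds, packet sizes, and loss patterns differently.

```python
import random


def apply_clipping(signal, min_value, max_value):
    """Hard-clip every sample to the range [min_value, max_value]."""
    return [min(max(s, min_value), max_value) for s in signal]


def apply_packet_loss(signal, sf_hz, packet_ms, loss_rate, seed):
    """Zero out randomly chosen packets of `packet_ms` milliseconds each.

    Each packet is dropped independently with probability `loss_rate`
    (an assumed loss model, chosen for simplicity).
    """
    rng = random.Random(seed)
    packet_len = max(1, round(sf_hz * packet_ms / 1000))
    out = list(signal)
    for start in range(0, len(out), packet_len):
        if rng.random() < loss_rate:
            end = min(start + packet_len, len(out))
            out[start:end] = [0.0] * (end - start)
    return out


clipped = apply_clipping([-1.0, -0.2, 0.3, 1.0], -0.5, 0.5)
lossy = apply_packet_loss([0.1] * 1600, sf_hz=16000, packet_ms=20,
                          loss_rate=0.2, seed=0)
```

Both functions keep the signal length and sampling frequency unchanged, in line with the requirement that the degraded speech has the same SF as its input.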

Baseline

Please refer to the official GitHub repository for more details.

Discriminative Baseline. We provide an adaptive STFT-based SFI BSRNN as the discriminative baseline. The model is available here. The details of the baseline training can be found in our recent ASRU paper.

Generative Baseline. We follow a recent work named FlowSE to build generative SE models. It extends the flow matching method to a conditional flow matching model that generates clean speech conditioned on the noisy speech. We reimplement an improved BSRNN to estimate the conditional vector field. The model is available here, and the training details of the generative baseline model can be found in the paper.


Rules

  1. When generating the training and validation datasets, ONLY the speech, noise, and room impulse response (RIR) corpora listed in the Datasets section shall be used to ensure a fair comparison and proper understanding of various SE approaches.

    • The first month of the challenge will be a grace period when participants can propose additional public datasets to be included in the list. We (organizers) will reply to the requests and may update the list. Updates will be recorded in the Notice tab.

    • It is NOT allowed to use pre-trained speech enhancement models trained on data other than the official challenge data.
    • Although the speech enhancement model should only be trained on the listed data, we allow the use of pre-trained foundation models such as HuBERT, WavLM, EnCodec, Llama, and so on. We also allow the use of a pretrained speech enhancement/restoration model to improve the quality of clean speech for simulation. The use of all pre-trained models must meet the following requirements:
      • They are publicly available before the challenge begins
      • They are explicitly mentioned in the submitted system description.
      • Note:
        • Their parameters can be fine-tuned on the listed data.
        • It is not allowed to fine-tune any model, be it pre-trained or not, on any extra data other than the listed data.

    • If you are unsure whether the pre-trained model you would like to use is allowed, please reach out to the organizers.
  2. We allow participants to simulate their own RIRs using existing tools (for example, RIR-Generator, pyroomacoustics, gpuRIR, and so on) for generating the training data. Participants can also propose publicly available, real recorded RIRs to be included in the above data list during the grace period (see Timeline).

    Note: If participants used additional RIRs to train their model, the related information should be provided in the README.yaml file in the submission. Check the template for more information.

  3. We allow participants to simulate wind noise using tools such as SC-Wind-Noise-Generator. By default, the simulation script in our repository simulates 200 and 100 wind noises for training and validation, respectively, for each sampling frequency. The configuration can be easily changed in wind_noise_simulation_train.yaml and wind_noise_simulation_validation.yaml.

  4. The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are NOT allowed.

  5. There is no constraint on the latency or causality of the developed system in this challenge. Any type of model can be used as long as it conforms to the other rules as listed on this page.

  6. Registration is required to submit results to the challenge (Check How to Participate for more information). Note that the team information (including affiliation, team name, and team members) should be provided when submitting the results. For detailed submission requirements, please check the Submission part.
    • Only the team name will be shown in the leaderboard, while the affiliation and team members will be kept confidential.

Submission

  • Please read the following guidelines first, and then submit your results here.
  • Each submission should be a zip file containing two parts:
    1. enhanced audios corresponding to the subset to be tested;
    2. a YAML (README.yaml) file containing the basic information about the submission (as listed below). The template can be found here.
      • team information (team name, affiliation, team members)
      • description of the training & validation data used for the submission
      • description of pre-trained models used for the submission (if applicable)
  • The zip file should only contain a single YAML (README.yaml) file and a folder named enhanced that contains all the enhanced audio files. That is, the directory structure after executing unzip {your_teamname}.zip should be as follows:
./
├── README.yaml
└── enhanced/
    ├── fileid_1.flac
    ├── fileid_2.flac
    ├── ...
    └── fileid_N.flac
  • Note that a submission without README.yaml will be rejected by the leaderboard system.
  • Please encode all audio files in the 16-bit FLAC format to reduce the file size (< 300 MB).
    • The audio files should be encoded as mono-channel audio at their original sampling frequency.
    • All audio files should have the same name and length as the original audio files in the provided subset to be tested.
  • Be careful not to include hidden directories in the zip file such as __MACOSX/ which may cause evaluation failure.
  • The submission should be done via our official leaderboard website:
    • A registration (please sign up at https://urgent-challenge.com) is required to participate in our challenge.
    • Each team shall only register once. Multiple registrations from different members in one team are not allowed.
  • Each team can submit up to 2 submissions per day during the challenge.
    • The third and later submissions will be ignored. The quota will be reset at 00:00 (UTC timezone) every day.
    • Failed submissions are not taken into account when counting the submissions per day.
    • No submission will be accepted after the deadline (January 15th, 2025).
  • For each team, only the submission with the best leaderboard performance will be used for the final evaluation.

Submissions that fail to conform to the above requirements may be rejected.
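
The packaging requirements above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; the function names and example paths are hypothetical, and it does not verify audio properties such as bit depth, channel count, length, or total size. It copies only README.yaml and the FLAC files into the archive, so hidden entries such as __MACOSX/ never end up in the zip.

```python
import zipfile
from pathlib import Path


def build_submission(src_dir, zip_path):
    """Zip README.yaml and enhanced/*.flac from `src_dir`; skip everything else."""
    src = Path(src_dir)
    readme = src / "README.yaml"
    # Ignore hidden files such as macOS "._name.flac" resource forks.
    flacs = sorted(p for p in (src / "enhanced").glob("*.flac")
                   if not p.name.startswith("."))
    if not readme.is_file():
        raise FileNotFoundError("README.yaml is required")
    if not flacs:
        raise FileNotFoundError("no .flac files found in enhanced/")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(readme, "README.yaml")
        for path in flacs:
            zf.write(path, f"enhanced/{path.name}")
    return len(flacs)


def check_submission(zip_path):
    """Return a list of layout problems found in the archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    if "README.yaml" not in names:
        problems.append("missing README.yaml at the top level")
    for name in names:
        if name == "README.yaml":
            continue
        folder, _, filename = name.partition("/")
        ok = (folder == "enhanced" and "/" not in filename
              and filename.endswith(".flac") and not filename.startswith("."))
        if not ok:
            problems.append(f"unexpected entry: {name}")
    return problems
```

For example, `build_submission("my_outputs", "my_team.zip")` followed by `check_submission("my_team.zip")` should return an empty list when the archive matches the directory structure shown above.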

Should you encounter any problem during the submission, please feel free to contact the organizers.

By submitting the results, the participants agree to the following conditions:

As a condition of submission, entrants grant the organizers a perpetual,
irrevocable, worldwide, royalty-free, and non-exclusive license to use,
reproduce, adapt, modify, publish, distribute, publicly perform, create a
derivative work from, and publicly display the submitted audio signals and
CSV files (containing the performance scores).

The main motivation for this condition is to donate the submitted audio
signals and corresponding performance scores to the community as a
voice quality dataset.

Ranking

The blind test ranking will be carried out in two stages. In stage 1, we will evaluate participants' submissions with multiple objective metrics. The top-6 systems in the objective overall ranking will advance to stage 2 and be evaluated by multiple subjective tests; the final ranking will then be determined by the subjective overall ranking. For the validation set and non-blind test set, the leaderboard will only rank submissions using the multiple objective metrics.

  1. The following objective evaluation metrics will be calculated for evaluation in stage 1. For real recorded test samples that do not have a strictly matched reference signal, part of the following metrics will be used. The overall ranking will be determined by the algorithm introduced in the subsequent section.

    Category | Metric | Need Reference Signals?
    Non-intrusive SE metrics | DNSMOS ↑ | ❌
     | NISQA ↑ | ❌
     | UTMOS ↑ | ❌
     | SCOREQ ↑ | ❌
    Intrusive SE metrics | POLQA (note 1) ↑ | ✔
     | PESQ ↑ | ✔
     | ESTOI ↑ | ✔
    Downstream-task-independent metrics | SpeechBERTScore (note 2) ↑ | ✔
     | LPS ↑ | ✔
    Downstream-task-dependent metrics | Speaker Similarity ↑ | ✔
     | Emotion Similarity ↑ | ✔
     | Language identification accuracy ↑ | ❌
     | Character accuracy (1 - CER) ↑ | ❌

    Note 1: This metric will only be used for evaluation of the final blind test set.
    Note 2: To evaluate multilingual speech, we adopt the MHuBERT-147 backend for calculating the SpeechBERTScore, which differs from its default backend (WavLM-Large).


  2. In stage 2, we will conduct ITU-T P.808 Absolute Category Rating (ACR) and Comparison Category Rating (CCR) tests for the top-6 submissions, and the Mean Opinion Score (MOS) and Comparison Mean Opinion Score (CMOS) will be used for subjective metric ranking. The overall ranking in the subjective evaluation will then become the final ranking of the challenge. Note that if two teams have the same overall ranking in the subjective tests, we will refer to their overall rankings in the objective tests to determine the final ranking.

    Category | Metric | Evaluation method
    Subjective metrics | MOS ↑ | ITU-T P.808 ACR
     | CMOS ↑ | ITU-T P.808 CCR


Overall ranking method

The overall ranking will be determined via the following procedure:

  1. Calculate the average score of each metric for each submission.
  2. Calculate the per-metric ranking based on the average score.
  3. Calculate the per-category ranking by averaging the rankings within each category.
  4. Calculate the overall ranking by averaging the per-category rankings.

# Step 1: Calculate the average score of each metric
scores = {}
for submission in all_submissions:
  scores[submission] = {}
  for category in metric_categories:
    for metric in category:
      scores[submission][metric] = mean([metric(each_sample) for each_sample in submission])

# Step 2: Calculate the per-metric ranking based on the average score
rank_per_metric = {}
rank_per_category = {}
for category in metric_categories:
  for metric in category:
    rank_per_metric[metric] = get_ranking([scores[submission][metric] for submission in all_submissions])

  # Step 3: Calculate the `per-category ranking` by averaging the rankings within each category
  rank_per_category[category] = get_ranking([rank_per_metric[metric] for metric in category])

# Step 4: Calculate the overall ranking by averaging the `per-category rankings`
rank_overall = get_ranking([rank_per_category[category] for category in metric_categories])
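
The pseudocode above can be turned into a small runnable sketch. It rests on several assumptions that the source does not confirm: step 1 is taken as already done (the input holds per-metric average scores), all metrics are higher-is-better as in the metric table, tied values share the smallest rank, and the averaged ranks are re-ranked at the category and overall levels. The team names and scores are invented for illustration.

```python
from statistics import mean


def rank_values(values, higher_is_better=True):
    """Map {submission: value} to {submission: rank}, where rank 1 is best.

    Tied values share the same (smallest) rank; this tie rule is an assumption.
    """
    ordered = sorted(values.values(), reverse=higher_is_better)
    return {name: ordered.index(v) + 1 for name, v in values.items()}


def overall_ranking(scores, categories):
    """scores: {submission: {metric: average score}};
    categories: {category name: [metric names]}."""
    submissions = list(scores)
    category_ranks = {}
    for category, metrics in categories.items():
        # Step 2: rank the submissions on each metric of this category.
        metric_ranks = [
            rank_values({s: scores[s][m] for s in submissions}) for m in metrics
        ]
        # Step 3: average the per-metric ranks, then rank the averages
        # (a lower average rank is better).
        avg = {s: mean(r[s] for r in metric_ranks) for s in submissions}
        category_ranks[category] = rank_values(avg, higher_is_better=False)
    # Step 4: average the per-category ranks and rank once more.
    avg = {s: mean(category_ranks[c][s] for c in categories) for s in submissions}
    return rank_values(avg, higher_is_better=False)


# Invented example with three teams, two categories, and three metrics.
scores = {
    "team_a": {"PESQ": 3.1, "ESTOI": 0.90, "DNSMOS": 3.2},
    "team_b": {"PESQ": 2.8, "ESTOI": 0.92, "DNSMOS": 3.4},
    "team_c": {"PESQ": 2.5, "ESTOI": 0.85, "DNSMOS": 3.0},
}
categories = {"intrusive": ["PESQ", "ESTOI"], "non_intrusive": ["DNSMOS"]}
result = overall_ranking(scores, categories)
```

Because ranks are averaged per category first, a category with many metrics does not outweigh a category with a single metric in the overall ranking.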