Data

Brief data description: Track 1

Training/validation data

The training and validation data are both simulated using several public speech/noise/RIR corpora (see the tables below for more details). We provide the data preparation pipeline with the official baseline, which automatically downloads and pre-processes the data.

The data preparation script will generate two types of training data:

The first is the pre-simulated data, which has the following form of directory structure:

data/train_simulation/
├── speech_length.scp # Speech duration in number of sample points.
├── spk1.scp          # Clean speech file list of utterance ID and audio path.
├── utt2fs            # Utterance ID to sampling frequency mapping.
├── utt2spk           # Utterance ID to speaker ID mapping.
└── wav.scp           # Noisy speech file list of utterance ID and audio path.

The pre-simulated dataset can be loaded by the PreSimulatedDataset class in the baseline code.
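For orientation, here is a minimal sketch of reading these lists directly. This is not the baseline's PreSimulatedDataset (whose exact interface is not reproduced here); it only assumes that each .scp file contains one "id path" pair per line, as shown in the listing above.

```python
# A minimal sketch (not the baseline's PreSimulatedDataset) of pairing
# noisy/clean audio from the lists above; assumes one "<id> <value>" per line.
import soundfile as sf

def load_scp(path):
    """Read lines of the form '<id> <value>' into a dict."""
    with open(path) as f:
        return dict(line.strip().split(maxsplit=1) for line in f if line.strip())

noisy = load_scp("data/train_simulation/wav.scp")
clean = load_scp("data/train_simulation/spk1.scp")
utt2fs = load_scp("data/train_simulation/utt2fs")

for uid, noisy_path in noisy.items():
    x, fs = sf.read(noisy_path)    # degraded input
    y, _ = sf.read(clean[uid])     # paired clean target
    assert fs == int(utt2fs[uid])  # sampling rates are tracked in utt2fs
```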

The other is the dynamic mixing dataset; we also provide a DynamicMixingDataset class in the baseline code for loading data in a dynamic mixing manner (a toy sketch of on-the-fly mixing follows the listing below). The dataset has the following directory structure:

data/train_sources
├── noise_sources.scp      # Noise audio ID and audio path.
├── rirs.scp               # Room impulse response ID and audio path.
├── source_length.scp      # Speech duration in number of sample points.
├── speech_sources.scp     # Clean speech ID and audio path.
└── wind_noise_sources.scp # Wind noise audio ID and audio path.
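As referenced above, the sketch below illustrates on-the-fly mixing from these source lists. It is illustrative only, not the baseline's DynamicMixingDataset: it assumes single-channel sources sharing one sampling frequency, and the SNR range is an arbitrary choice.

```python
# A toy on-the-fly mixing sketch (illustrative only, not the baseline's
# DynamicMixingDataset). Assumes single-channel sources sharing one sampling
# frequency; the SNR range below is an arbitrary choice for illustration.
import random
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def load_scp(path):
    with open(path) as f:
        return dict(line.strip().split(maxsplit=1) for line in f if line.strip())

speech = load_scp("data/train_sources/speech_sources.scp")
noise = load_scp("data/train_sources/noise_sources.scp")
rirs = load_scp("data/train_sources/rirs.scp")

def mix_once(snr_db_range=(-5.0, 20.0)):
    s, _ = sf.read(random.choice(list(speech.values())))
    n, _ = sf.read(random.choice(list(noise.values())))
    h, _ = sf.read(random.choice(list(rirs.values())))
    s = fftconvolve(s, h)[: len(s)]  # reverberate the clean speech
    n = np.resize(n, len(s))         # tile/trim noise to match length
    snr_db = random.uniform(*snr_db_range)
    # scale the noise so that 10*log10(Es / (g^2 * En)) == snr_db
    g = np.sqrt(np.sum(s**2) / (np.sum(n**2) * 10 ** (snr_db / 10) + 1e-12))
    return s + g * n, s              # (noisy mixture, clean target)
```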

By default, we only generate the pre-simulated data for validation.

Non-blind test set

  • Available online after November 3rd, 2025.

Blind test set

  • Available online after November 18th, 2025.


Detailed data description

The training and validation data are both simulated from the following source data. Starting from the dataset of the 2nd URGENT challenge, we conducted data selection using the data filtering method proposed in a recent paper.

Note that we encourage you to explore better ways of data selection and utilization in this challenge. In addition to the data and filtering methods provided with our baseline, you can make use of larger-scale datasets, such as the Track 2 data from the 2nd URGENT challenge, or other allowed data (please check the rules page).

| Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | Duration in 3rd URGENT | License |
|------|--------|-----------|--------------------------|------------------------|------------------------|---------|
| Speech | LibriVox data from DNS5 challenge | Audiobook | 8~48 | ~350 h | ~150 h | CC BY 4.0 |
| | LibriTTS reading speech | Audiobook | 8~24 | ~200 h | ~109 h | CC BY 4.0 |
| | VCTK reading speech | Newspaper, etc. | 48 | ~80 h | ~44 h | ODC-BY |
| | EARS speech | Studio recording | 48 | ~100 h | ~16 h | CC-NC 4.0 |
| | Multilingual Librispeech (de, en, es, fr) [1] | Audiobook | 8~48 | ~450 (48600) h | ~129 h | CC0 |
| | CommonVoice 19.0 (de, en, es, fr, zh-CN) | Crowd-sourced voices | 8~48 | ~1300 (9500) h | ~250 h | CC0 |
| | NNCES | Children speech | 44.1 | - | ~20 h | CC0 |
| | SeniorTalk | Elderly speech | 16 | - | ~50 h | CC BY-NC-SA 4.0 |
| | VocalSet | Singing voice | 44.1 | - | ~10 h | CC BY 4.0 |
| | ESD | Emotional speech | 16 | - | ~30 h | non-commercial, custom [2] |

[1] We collected less compressed MLS data from LibriVox, which have higher audio quality than the original MLS prepared for ASR.
[2] You need to sign a license to obtain this dataset.

For the noise source and RIRs, we follow the same configuration as in the 2nd URGENT challenge.

| Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | License |
|------|--------|-----------|--------------------------|------------------------|---------|
| Noise | Audioset+FreeSound noise in DNS5 challenge | Crowd-sourced + YouTube | 8~48 | ~180 h | CC BY 4.0 |
| | WHAM! noise | Urban environments | 48 | ~70 h | CC BY-NC 4.0 |
| | FSD50K (human voice filtered) | Crowd-sourced | 8~48 | ~100 h | CC0, CC-BY, CC-BY-NC, CC Sampling+ |
| | Free Music Archive (medium) | Free Music Archive (directed by WFMU) | 8~44.1 | ~200 h | CC |
| | Wind noise simulated by participants | - | any | - | - |
| RIR | Simulated RIRs from DNS5 challenge | SLR28 | 48 | ~60k samples | CC BY 4.0 |
| | RIRs simulated by participants | - | any | - | - |

We allow participants to simulate their own RIRs using existing tools (e.g., RIR-Generator, pyroomacoustics, gpuRIR) for generating the training data; a minimal sketch is given below. Participants can also propose publicly available real recorded RIRs to be included in the above data list during the grace period (see Timeline). Note: if participants use additional RIRs to train their model, the related information should be provided in the README.yaml file in the submission. Check the template for more information.
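As an example with one of the tools above, the following is a minimal pyroomacoustics sketch. The room geometry, absorption, and source/microphone positions are illustrative values, not challenge-mandated settings.

```python
# A minimal RIR simulation sketch with pyroomacoustics; all geometry and
# absorption values below are illustrative, not challenge-mandated settings.
import numpy as np
import pyroomacoustics as pra
import soundfile as sf

fs = 48000  # target sampling frequency (Hz)

room = pra.ShoeBox(
    [5.0, 4.0, 3.0],              # room dimensions in meters
    fs=fs,
    materials=pra.Material(0.3),  # average energy absorption coefficient
    max_order=17,                 # image-source reflection order
)
room.add_source([2.5, 3.1, 1.6])      # speech source position
room.add_microphone([1.8, 1.2, 1.5])  # single microphone position

room.compute_rir()
rir = room.rir[0][0]  # impulse response for (mic 0, source 0)
sf.write("my_rir.wav", rir / np.max(np.abs(rir)), fs)
```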

We allow participants to simulate wind noise using tools such as SC-Wind-Noise-Generator. By default, the simulation script in our repository simulates 200 wind noise clips for training and 100 for validation at each sampling frequency. The configuration can easily be changed in wind_noise_simulation_train.yaml and wind_noise_simulation_validation.yaml. A crude illustrative stand-in is sketched below.
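The sketch below is a crude stand-in for wind noise, purely to illustrate its character; it does not reproduce the API of SC-Wind-Noise-Generator, which is the tool actually suggested above.

```python
# A crude stand-in for wind noise, for illustration only (not the
# SC-Wind-Noise-Generator API): low-pass-filtered Gaussian noise shaped
# by a slowly varying gust envelope.
import numpy as np
from scipy.signal import butter, sosfilt

def toy_wind_noise(duration_s=4.0, fs=16000, cutoff_hz=300.0, seed=0):
    rng = np.random.default_rng(seed)
    n = int(duration_s * fs)
    sos = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    noise = sosfilt(sos, rng.standard_normal(n))
    # gusts: very-low-frequency noise, rectified and normalized to [0, 1]
    env = sosfilt(butter(2, 1.0, btype="lowpass", fs=fs, output="sos"),
                  rng.standard_normal(n))
    env = np.abs(env) / (np.max(np.abs(env)) + 1e-12)
    return noise * env
```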


Data selection and simulation

We applied data selection to the Track 1 data of the 2nd URGENT challenge using the data filtering method proposed in a recent paper. The selected data list is available here. The speech sources from NNCES, SeniorTalk, VocalSet, and ESD are not filtered.

Note that the data filtering method of that paper is not necessarily the best way to utilize large-scale datasets for SE. The goal of this challenge is to encourage participants to develop ways to better leverage large-scale data to improve the final SE performance.

The simulation data can be generated as follows:

  1. In the first step, a manifest meta.tsv is generated by simulation/generate_data_param.py from the given lists of speech, noise, and room impulse response (RIR) samples. It specifies how each sample will be simulated, including the type of distortion to be applied, the speech/noise/RIR samples to be used, the signal-to-noise ratio (SNR), the random seed, and so on.

  2. In the second step, the simulation can be run in parallel via simulation/simulate_data_from_param.py for different samples according to the manifest, while ensuring reproducibility. This procedure can be used to generate both training and validation datasets.

  3. By default, we apply a high-pass filter to the speech signals, since we have noticed high-energy noise in the infrasound frequency band of some speech sources. You can turn it off by setting highpass=False in your simulation (a sketch of this kind of filtering follows this list).
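As mentioned in step 3, the following is a sketch of that kind of high-pass filtering, assuming a Butterworth filter; the cutoff, order, and file paths are illustrative, not necessarily those used by the official simulation script.

```python
# A sketch of the high-pass filtering mentioned in step 3, assuming a
# Butterworth filter; cutoff, order, and paths are illustrative values.
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def highpass(x, fs, cutoff_hz=40.0, order=5):
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=0)  # zero-phase filtering

x, fs = sf.read("clean.wav")  # hypothetical input path
sf.write("clean_hp.wav", highpass(x, fs), fs)
```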

For the training set, we recommend dynamically generating degraded speech samples during training to increase the data diversity.

Distortions

In this challenge, the SE system has to address the following seven distortions. In addition to the first four distortions considered in our first challenge, we added three more (items 5 to 7) that are often observed in real recordings. Furthermore, in this challenge, inputs may contain multiple distortions.

We provide an example simulation script as simulation/simulate_data_from_param.py; a toy sketch of two of these distortions follows the list below.

  1. Additive noise
  2. Reverberation
  3. Clipping
  4. Bandwidth limitation
  5. Codec distortion
  6. Packet loss
  7. Wind noise
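As referenced above, here are toy implementations of two of the listed distortions, clipping and bandwidth limitation. They are illustrative only; the parameter defaults are arbitrary, and the challenge's actual implementations live in simulation/simulate_data_from_param.py.

```python
# Toy implementations of two of the listed distortions (clipping and
# bandwidth limitation), for illustration only; see
# simulation/simulate_data_from_param.py for the challenge's actual code.
import numpy as np
import librosa

def clip(x, min_quantile=0.06, max_quantile=0.9):
    # clamp the waveform between two amplitude quantiles
    lo, hi = np.quantile(x, [min_quantile, max_quantile])
    return np.clip(x, lo, hi)

def bandwidth_limitation(x, fs, fs_new, res_type="kaiser_fast"):
    # downsample, then resample back to the original fs (cf. the
    # "bandwidth_limitation-<res_type>-><fs_new>" strings in the manifest)
    y = librosa.resample(x, orig_sr=fs, target_sr=fs_new, res_type=res_type)
    return librosa.resample(y, orig_sr=fs_new, target_sr=fs, res_type=res_type)
```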

Simulation metadata

The manifest mentioned above is a TSV file containing the following columns (separated by \t):

| Column | Description |
|--------|-------------|
| id | unique ID |
| noisy_path | path to the generated degraded speech |
| speech_uid | utterance ID of the clean speech |
| speech_sid | speaker ID of the clean speech |
| clean_path | path to the paired clean speech |
| noise_uid | utterance ID of the noise |
| snr_dB | SNR in decibels |
| rir_uid | utterance ID of the RIR |
| augmentation | augmentation type |
| fs | sampling frequency |
| length | sample length |
| text | raw transcript |

For example (tab-separated values shown with spaces):

fileid_1 simulation_validation/noisy/fileid_1.flac p226_001_mic1 vctk_p226 simulation_validation/clean/fileid_1.flac JC1gBY5vXHI 16.106643714525433 mediumroom-Room119-00056 bandwidth_limitation-kaiser_fast->24000 48000 134338 Please call Stella.
fileid_2 simulation_validation/noisy/fileid_2.flac p226_001_mic2 vctk_p226 simulation_validation/clean/fileid_2.flac file205_039840000_loc32_day1 2.438365163611807 none bandwidth_limitation-kaiser_best->22050 48000 134338 Please call Stella.
fileid_1561 simulation_validation/noisy/fileid_1561.flac p315_001_mic1 vctk_p315 simulation_validation/clean/fileid_1561.flac AvbnjyrHq8M 1.3502745341597029 mediumroom-Room076-00093 clipping(min=0.016037324066971528,max=0.9890219800761639) 48000 114829 <not-available>
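The manifest is straightforward to load programmatically; the sketch below uses pandas, assuming the file begins with a header row naming the columns above.

```python
# Reading the manifest with pandas; a sketch assuming the file starts with
# a header row naming the columns above.
import pandas as pd

meta = pd.read_csv("meta.tsv", sep="\t")
print(meta[["id", "snr_dB", "rir_uid", "augmentation"]].head())

# e.g., keep only the samples to which an RIR was applied
reverberant = meta[meta["rir_uid"] != "none"]
```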

Note

  • If the rir_uid value is not none, the specified RIR is applied to the clean speech sample.

  • If the augmentation value is not none, the specified augmentation is applied to the degraded speech sample.

  • <not-available> in the text column is a placeholder for transcripts that are not available.

  • The audios in noisy_path, clean_path, and (optional) noise_path are consistently scaled such that noisy_path = clean_path + noise_path.

  • The scale of the enhanced audio, however, is not critical for the objective evaluation in the challenge, as the evaluation metrics are made largely insensitive to scale. For the subjective listening test in the final phase, participants are recommended to properly scale the enhanced audio to facilitate a consistent evaluation.

  • For all distortion types, the original sampling frequency of each clean speech sample is always preserved, i.e., the degraded speech sample shares the same sampling frequency. For the bandwidth_limitation augmentation, this means that the generated speech sample is resampled back to the original sampling frequency fs.

Data description: Track 2

Training/Validation Data

We allow the use of any public datasets for training. Please describe your data usage in the technical report. See https://github.com/urgent-challenge/urgent2026_challenge_track2 for the training and validation data used in the official baseline.

Non-blind test set

  • The URGENT 2024 MOS dataset will be used as the non-blind test set. Available online after November 3rd, 2025.

Blind test set

  • Available online after November 18th, 2025.