Data
Brief data description: Track 1
Training/validation data
The training and validation data are both simulated using several public speech/noise/RIR corpora (see the tables below for more details). We provide the data preparation pipeline with the official baseline, which automatically downloads and pre-processes the data.
The data preparation script generates two types of training data:
The first is the pre-simulated data, which has the following directory structure:
data/train_simulation/
├── speech_length.scp # Speech duration in number of sample points.
├── spk1.scp # Clean speech file list: utterance ID and audio path.
├── utt2fs # Utterance ID to sampling rate mapping.
├── utt2spk # Utterance ID to speaker mapping.
└── wav.scp # Noisy speech file list: utterance ID and audio path.
The pre-simulated dataset can be loaded with the `PreSimulatedDataset` class in the baseline code.
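For illustration, each scp file follows the Kaldi convention of one `<utterance-id> <value>` pair per line, so the (noisy, clean) pairs can also be read directly. This is a minimal sketch independent of the baseline's `PreSimulatedDataset`, whose actual interface may differ:

```python
from pathlib import Path

def read_scp(path):
    """Parse a Kaldi-style scp file into an {utterance_id: value} dict."""
    entries = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            uid, value = line.strip().split(maxsplit=1)
            entries[uid] = value
    return entries

root = Path("data/train_simulation")
noisy = read_scp(root / "wav.scp")   # utterance ID -> noisy audio path
clean = read_scp(root / "spk1.scp")  # utterance ID -> clean audio path
pairs = [(noisy[uid], clean[uid]) for uid in noisy]  # (noisy, clean) training pairs
```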
The other is the dynamic mixing dataset; we also provide a `DynamicMixingDataset` class in the baseline code for loading data in a dynamic mixing manner (a toy sketch of the idea follows the listing below). The dataset has the following directory structure:
data/train_sources
├── noise_scoures.scp # Noise audio ID and audio path.
├── rirs.scp # Room impulse response ID and audio path.
├── source_length.scp # Speech duration in number of sample points.
├── speech_sources.scp # Clean speech ID and audio path.
└── wind_noise_scoures.scp # Wind noise audio ID and audio path.
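To illustrate the idea only (this is not the baseline's `DynamicMixingDataset`; the class name, sampling policy, and SNR range below are assumptions), here is a toy PyTorch dataset that mixes a clean utterance with a random noise clip at a random SNR on every access:

```python
import random
import numpy as np
import soundfile as sf
from torch.utils.data import Dataset

class ToyDynamicMixingDataset(Dataset):
    """Toy on-the-fly mixing: draw a clean utterance and a noise clip,
    then mix them at a freshly sampled SNR each time an item is requested.
    Assumes mono audio and matching sampling rates for simplicity."""

    def __init__(self, speech_paths, noise_paths, snr_range=(-5.0, 20.0)):
        self.speech_paths = speech_paths
        self.noise_paths = noise_paths
        self.snr_range = snr_range

    def __len__(self):
        return len(self.speech_paths)

    def __getitem__(self, idx):
        speech, fs = sf.read(self.speech_paths[idx], dtype="float32")
        noise, _ = sf.read(random.choice(self.noise_paths), dtype="float32")
        # Tile or trim the noise to match the speech length.
        if len(noise) < len(speech):
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        noise = noise[: len(speech)]
        # Scale the noise to reach the sampled SNR.
        snr_db = random.uniform(*self.snr_range)
        speech_power = np.mean(speech**2) + 1e-10
        noise_power = np.mean(noise**2) + 1e-10
        noise = noise * np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + noise, speech, fs
```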
By default, we only generate the pre-simulated data for validation.
Non-blind test set
- Available online after November 3rd, 2025.
Blind test set
- Available online after November 18th, 2025.
Detailed data description
The training and validation data are both simulated from the following source data. Starting from the dataset of the 2nd URGENT challenge, we conducted data selection using the data filtering method proposed in a recent paper.
Note that we encourage you to explore better ways of data selection and utilization in this challenge. In addition to the data and filtering methods provided with our baseline, you can make use of larger-scale datasets, such as the Track 2 data from the 2nd URGENT challenge, or other allowed data (please check the Rules page).
| Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | Duration in 3rd URGENT | License |
|---|---|---|---|---|---|---|
| Speech | LibriVox data from DNS5 challenge | Audiobook | 8~48 | ~350 h | ~150 h | CC BY 4.0 |
| | LibriTTS reading speech | Audiobook | 8~24 | ~200 h | ~109 h | CC BY 4.0 |
| | VCTK reading speech | Newspaper, etc. | 48 | ~80 h | ~44 h | ODC-BY |
| | EARS speech | Studio recording | 48 | ~100 h | ~16 h | CC-NC 4.0 |
| | Multilingual Librispeech (de, en, es, fr) | Audiobook | 8~48 | ~450 (48600) h | ~129 h | CC0 |
| | CommonVoice 19.0 (de, en, es, fr, zh-CN) | Crowd-sourced voices | 8~48 | ~1300 (9500) h | ~250 h | CC0 |
| | NNCES | Children speech | 44.1 | - | ~20 h | CC0 |
| | SeniorTalk | Elderly speech | 16 | - | ~50 h | CC BY-NC-SA 4.0 |
| | VocalSet | Singing voice | 44.1 | - | ~10 h | CC BY 4.0 |
| | ESD | Emotional speech | 16 | - | ~30 h | non-commercial, custom |
For the noise sources and RIRs, we follow the same configuration as in the 2nd URGENT challenge.
| Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | License |
|---|---|---|---|---|---|
| Noise | Audioset+FreeSound noise in DNS5 challenge | Crowd-sourced + Youtube | 8~48 | ~180 h | CC BY 4.0 |
| | WHAM! noise | 4 Urban environments | 48 | ~70 h | CC BY-NC 4.0 |
| | FSD50K (human voice filtered) | Crowd-sourced | 8~48 | ~100 h | CC0, CC-BY, CC-BY-NC, CC Sampling+ |
| | Free Music Archive (medium) | Free Music Archive (directed by WFMU) | 8~44.1 | ~200 h | CC |
| | Wind noise simulated by participants | - | any | - | - |
| RIR | Simulated RIRs from DNS5 challenge | SLR28 | 48 | ~60k samples | CC BY 4.0 |
| | RIRs simulated by participants | - | any | - | - |
We allow participants to simulate their own RIRs with existing tools (for example, RIR-Generator, pyroomacoustics, gpuRIR, and so on) for generating the training data, as sketched below. Participants can also propose publicly available real recorded RIRs to be included in the above data list during the grace period (see Timeline). Note: if participants use additional RIRs to train their model, the related information should be provided in the README.yaml file in the submission; check the template for more information.

We also allow participants to simulate wind noise using tools such as SC-Wind-Noise-Generator. By default, the simulation script in our repository simulates 200 wind noise samples for training and 100 for validation at each sampling frequency. The configuration can be easily changed in wind_noise_simulation_train.yaml and wind_noise_simulation_validation.yaml.
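As an illustration, here is a minimal RIR simulation with pyroomacoustics; the room geometry, RT60, and source/microphone positions below are arbitrary assumptions, not challenge settings:

```python
import numpy as np
import pyroomacoustics as pra
import soundfile as sf

fs = 48000                   # target sampling rate of the RIR
room_dim = [6.0, 5.0, 3.0]   # room size in meters (arbitrary example)

# Derive wall absorption and reflection order from a desired RT60 (Sabine's formula).
e_absorption, max_order = pra.inverse_sabine(rt60=0.4, room_dim=room_dim)
room = pra.ShoeBox(
    room_dim, fs=fs, materials=pra.Material(e_absorption), max_order=max_order
)
room.add_source([2.0, 3.0, 1.5])      # loudspeaker position
room.add_microphone([4.0, 2.5, 1.2])  # microphone position

room.compute_rir()
rir = np.asarray(room.rir[0][0])      # RIR from source 0 to microphone 0
sf.write("rir_example.wav", rir / np.max(np.abs(rir)), fs)
```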
Data selection and simulation
We applied data selection to the Track 1 data of the 2nd URGENT challenge using the data filtering method proposed in the recent paper mentioned above.
The simulation data can be generated as follows:

- In the first step, a manifest `meta.tsv` is generated by `simulation/generate_data_param.py` from the given lists of speech, noise, and room impulse response (RIR) samples. It specifies how each sample will be simulated, including the type of distortion to be applied, the speech/noise/RIR sample to be used, the signal-to-noise ratio (SNR), the random seed, and so on.
- In the second step, the simulation is run in parallel via `simulation/simulate_data_from_param.py` for different samples according to the manifest, ensuring reproducibility. This procedure can be used to generate both training and validation datasets.
- By default, we apply a high-pass filter to the speech signals, since we have noticed high-energy noise in the infrasound frequency band of some speech sources. You can turn it off by setting `highpass=False` in your simulation; a sketch of such a filter follows this list.
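For reference, a minimal high-pass filter along these lines (the cutoff frequency and filter order are illustrative assumptions; the baseline's actual settings may differ):

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def highpass(audio: np.ndarray, fs: int, cutoff: float = 50.0, order: int = 5) -> np.ndarray:
    """Remove infrasound/DC components below `cutoff` Hz (zero-phase filtering)."""
    sos = butter(order, cutoff, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, audio)

speech, fs = sf.read("clean.wav", dtype="float32")
speech = highpass(speech, fs)
```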
For the training set, we recommend dynamically generating degraded speech samples during training to increase the data diversity.
Distortions
In this challenge, the SE system has to address the following seven distortions. In addition to the first four distortions considered in our first challenge, we added three more distortions (the bold ones below) that are often observed in real recordings. Furthermore, in this challenge, inputs may contain multiple distortions.
We provide an example simulation script as `simulation/simulate_data_from_param.py`; two of the distortions are also sketched after the list.
- Additive noise
- Reverberation
- Clipping
- Bandwidth limitation
- **Codec distortion**
- **Packet loss**
- **Wind noise**
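To make the distortion types concrete, here is a sketch of two of them, clipping and bandwidth limitation. The parameter values are illustrative; the official script `simulation/simulate_data_from_param.py` defines the actual ranges:

```python
import numpy as np
import soundfile as sf
import librosa

def clip(speech: np.ndarray, min_quantile: float = 0.06, max_quantile: float = 0.9) -> np.ndarray:
    """Clipping: truncate the waveform at amplitude quantiles."""
    lo, hi = np.quantile(speech, [min_quantile, max_quantile])
    return np.clip(speech, lo, hi)

def bandwidth_limitation(speech: np.ndarray, fs: int, fs_new: int = 8000) -> np.ndarray:
    """Bandwidth limitation: downsample, then upsample back to the original
    sampling frequency, discarding content above fs_new / 2."""
    narrow = librosa.resample(speech, orig_sr=fs, target_sr=fs_new)
    return librosa.resample(narrow, orig_sr=fs_new, target_sr=fs)

speech, fs = sf.read("clean.wav", dtype="float32")
degraded = bandwidth_limitation(clip(speech), fs)
```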
Simulation metadata
The manifest mentioned above is a `tsv` file containing several columns (separated by `\t`). For example (a loading sketch follows the table):
id | noisy_path | speech_uid | speech_sid | clean_path | noise_uid | snr_dB | rir_uid | augmentation | fs | length | text |
---|---|---|---|---|---|---|---|---|---|---|---|
unique ID | path to the generated degraded speech | utterance ID of clean speech | speaker ID of clean speech | path to the paired clean speech | utterance ID of noise | SNR in decibel | utterance ID of the RIR | augmentation type | sampling frequency | sample length | raw transcript |
fileid_1 | simulation_validation/noisy/fileid_1.flac | p226_001_mic1 | vctk_p226 | simulation_validation/clean/fileid_1.flac | JC1gBY5vXHI | 16.106643714525433 | mediumroom-Room119-00056 | bandwidth_limitation-kaiser_fast->24000 | 48000 | 134338 | Please call Stella. |
fileid_2 | simulation_validation/noisy/fileid_2.flac | p226_001_mic2 | vctk_p226 | simulation_validation/clean/fileid_2.flac | file205_039840000_loc32_day1 | 2.438365163611807 | none | bandwidth_limitation-kaiser_best->22050 | 48000 | 134338 | Please call Stella. |
fileid_1561 | simulation_validation/noisy/fileid_1561.flac | p315_001_mic1 | vctk_p315 | simulation_validation/clean/fileid_1561.flac | AvbnjyrHq8M | 1.3502745341597029 | mediumroom-Room076-00093 | clipping(min=0.016037324066971528, | 48000 | 114829 | <not-available> |
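A minimal sketch of loading such a manifest with pandas (the manifest path below is a hypothetical example):

```python
import pandas as pd

# keep_default_na=False so literal strings such as "none" and
# "<not-available>" are kept as strings instead of becoming NaN.
meta = pd.read_csv("simulation_validation/meta.tsv", sep="\t", keep_default_na=False)

reverberant = meta[meta["rir_uid"] != "none"]     # rows with an RIR applied
augmented = meta[meta["augmentation"] != "none"]  # rows with an extra distortion
print(len(reverberant), len(augmented))
```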
Note
- If the `rir_uid` value is not `none`, the specified RIR is applied to the clean speech sample.
- If the `augmentation` value is not `none`, the specified augmentation is applied to the degraded speech sample.
- `<not-available>` in the `text` column is a placeholder for transcripts that are not available.
- The audios in `noisy_path`, `clean_path`, and (optionally) `noise_path` are consistently scaled such that `noisy_path = clean_path + noise_path`.
- The scale of the enhanced audio is not critical for the objective evaluation in the challenge, as the evaluation metrics are made largely insensitive to scale. For the subjective listening test in the final phase, however, participants are recommended to properly scale the enhanced audios to facilitate a consistent evaluation (see the sketch after these notes).
- For all distortion types, the original sampling frequency of each clean speech sample is always preserved, i.e., the degraded speech sample shares the same sampling frequency. For the `bandwidth_limitation` augmentation, this means that the generated speech sample is resampled to the original sampling frequency `fs`.
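As an illustration of the scaling note above, one common choice is to normalize each enhanced file to a fixed peak level before submission; the target level here is an assumption, not a challenge requirement:

```python
import numpy as np
import soundfile as sf

def normalize_peak(audio: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Scale the waveform so its maximum absolute amplitude equals `peak`."""
    max_amp = np.max(np.abs(audio))
    return audio if max_amp == 0 else audio * (peak / max_amp)

enhanced, fs = sf.read("enhanced/fileid_1.flac", dtype="float32")
sf.write("submission/fileid_1.flac", normalize_peak(enhanced), fs)
```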
Data description: Track 2
Training/validation data
We allow the use of any public datasets for training. Please describe your data usage in the technical report. See https://github.com/urgent-challenge/urgent2026_challenge_track2 for the training and validation data used by the official baseline.
Non-blind test set
- The URGENT 2024 MOS dataset will be used as the non-blind test set. Available online after November 3rd, 2025.
Blind test set
- Available online after November 18th, 2025.