Data

Brief data description: Track 1

Training/validation data

The training and validation data are both simulated using several public speech/noise/RIR corpora (see the tables below for more details). We provide the data preparation pipeline with the official baseline, which automatically downloads and pre-processes the data.

The data preparation script will generate two types of training data:

The first is the pre-simulated data, which has the following form of directory structure:

data/train_simulation/
├── speech_length.scp # Speech duration in number of sample points.
├── spk1.scp          # Clean speech file list of utterance ID and audio path.
├── utt2fs            # Utterance ID to sampling frequency mapping.
├── utt2spk           # Utterance ID to speaker ID mapping.
└── wav.scp           # Noisy speech file list of utterance ID and audio path.

The pre-simulated dataset can be loaded by the PreSimulatedDataset class in the baseline code.
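For orientation, here is a minimal sketch of reading these lists directly. This is not the baseline's PreSimulatedDataset (whose exact interface is not reproduced here); it only assumes that each .scp file contains one "id path" pair per line, as shown in the listing above.

```python
# A minimal sketch (not the baseline's PreSimulatedDataset) of pairing
# noisy/clean audio from the lists above; assumes one "<id> <value>" per line.
import soundfile as sf

def load_scp(path):
    """Read lines of the form '<id> <value>' into a dict."""
    with open(path) as f:
        return dict(line.strip().split(maxsplit=1) for line in f if line.strip())

noisy = load_scp("data/train_simulation/wav.scp")
clean = load_scp("data/train_simulation/spk1.scp")
utt2fs = load_scp("data/train_simulation/utt2fs")

for uid, noisy_path in noisy.items():
    x, fs = sf.read(noisy_path)    # degraded input
    y, _ = sf.read(clean[uid])     # paired clean target
    assert fs == int(utt2fs[uid])  # sampling rates are tracked in utt2fs
```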

The other is the dynamic mixing dataset; we also provide a DynamicMixingDataset class in the baseline code for loading data in a dynamic mixing manner (a toy sketch of on-the-fly mixing follows the listing below). The dataset has the following directory structure:

data/train_sources
├── noise_sources.scp      # Noise audio ID and audio path.
├── rirs.scp               # Room impulse response ID and audio path.
├── source_length.scp      # Speech duration in number of sample points.
├── speech_sources.scp     # Clean speech ID and audio path.
└── wind_noise_sources.scp # Wind noise audio ID and audio path.
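As referenced above, the sketch below illustrates on-the-fly mixing from these source lists. It is illustrative only, not the baseline's DynamicMixingDataset: it assumes single-channel sources sharing one sampling frequency, and the SNR range is an arbitrary choice.

```python
# A toy on-the-fly mixing sketch (illustrative only, not the baseline's
# DynamicMixingDataset). Assumes single-channel sources sharing one sampling
# frequency; the SNR range below is an arbitrary choice for illustration.
import random
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def load_scp(path):
    with open(path) as f:
        return dict(line.strip().split(maxsplit=1) for line in f if line.strip())

speech = load_scp("data/train_sources/speech_sources.scp")
noise = load_scp("data/train_sources/noise_sources.scp")
rirs = load_scp("data/train_sources/rirs.scp")

def mix_once(snr_db_range=(-5.0, 20.0)):
    s, _ = sf.read(random.choice(list(speech.values())))
    n, _ = sf.read(random.choice(list(noise.values())))
    h, _ = sf.read(random.choice(list(rirs.values())))
    s = fftconvolve(s, h)[: len(s)]  # reverberate the clean speech
    n = np.resize(n, len(s))         # tile/trim noise to match length
    snr_db = random.uniform(*snr_db_range)
    # scale the noise so that 10*log10(Es / (g^2 * En)) == snr_db
    g = np.sqrt(np.sum(s**2) / (np.sum(n**2) * 10 ** (snr_db / 10) + 1e-12))
    return s + g * n, s              # (noisy mixture, clean target)
```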

By default, we only generate the pre-simulated data for validation.

Non-blind test set

  • Available online after November 3rd, 2025.

Blind test set

  • Available online after November 18th, 2025.


Detailed data description

The training and validation data are both simulated from the following source data. Starting from the dataset of the 2nd URGENT challenge, we conducted data selection using the data filtering method proposed in a recent paper.

Note that we encourage you to explore better ways of data selection and utilization in this challenge. In addition to the data and filtering methods provided with our baseline, you can make use of larger-scale datasets, such as the Track 2 data from the 2nd URGENT challenge, or other allowed data (please check the rules page).

| Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | Duration in 3rd URGENT | License |
|------|--------|-----------|--------------------------|------------------------|------------------------|---------|
| Speech | LibriVox data from DNS5 challenge | Audiobook | 8~48 | ~350 h | ~150 h | CC BY 4.0 |
| | LibriTTS reading speech | Audiobook | 8~24 | ~200 h | ~109 h | CC BY 4.0 |
| | VCTK reading speech | Newspaper, etc. | 48 | ~80 h | ~44 h | ODC-BY |
| | EARS speech | Studio recording | 48 | ~100 h | ~16 h | CC-NC 4.0 |
| | Multilingual Librispeech (de, en, es, fr) [1] | Audiobook | 8~48 | ~450 (48600) h | ~129 h | CC0 |
| | CommonVoice 19.0 (de, en, es, fr, zh-CN) | Crowd-sourced voices | 8~48 | ~1300 (9500) h | ~250 h | CC0 |
| | NNCES | Children speech | 44.1 | - | ~20 h | CC0 |
| | SeniorTalk | Elderly speech | 16 | - | ~50 h | CC BY-NC-SA 4.0 |
| | VocalSet | Singing voice | 44.1 | - | ~10 h | CC BY 4.0 |
| | ESD | Emotional speech | 16 | - | ~30 h | non-commercial, custom [2] |

[1] We collected less compressed MLS data from LibriVox, which have higher audio quality than the original MLS prepared for ASR.
[2] You need to sign a license to obtain this dataset.

For the noise source and RIRs, we follow the same configuration as in the 2nd URGENT challenge.

| Type | Corpus | Condition | Sampling Frequency (kHz) | Duration in 2nd URGENT | License |
|------|--------|-----------|--------------------------|------------------------|---------|
| Noise | Audioset+FreeSound noise in DNS5 challenge | Crowd-sourced + YouTube | 8~48 | ~180 h | CC BY 4.0 |
| | WHAM! noise | Urban environments | 48 | ~70 h | CC BY-NC 4.0 |
| | FSD50K (human voice filtered) | Crowd-sourced | 8~48 | ~100 h | CC0, CC-BY, CC-BY-NC, CC Sampling+ |
| | Free Music Archive (medium) | Free Music Archive (directed by WFMU) | 8~44.1 | ~200 h | CC |
| | Wind noise simulated by participants | - | any | - | - |
| RIR | Simulated RIRs from DNS5 challenge | SLR28 | 48 | ~60k samples | CC BY 4.0 |
| | RIRs simulated by participants | - | any | - | - |

We allow participants to simulate their own RIRs using existing tools (e.g., RIR-Generator, pyroomacoustics, gpuRIR) for generating the training data; a minimal sketch is given below. Participants can also propose publicly available real recorded RIRs to be included in the above data list during the grace period (see Timeline). Note: if participants use additional RIRs to train their model, the related information should be provided in the README.yaml file in the submission. Check the template for more information.
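As an example with one of the tools above, the following is a minimal pyroomacoustics sketch. The room geometry, absorption, and source/microphone positions are illustrative values, not challenge-mandated settings.

```python
# A minimal RIR simulation sketch with pyroomacoustics; all geometry and
# absorption values below are illustrative, not challenge-mandated settings.
import numpy as np
import pyroomacoustics as pra
import soundfile as sf

fs = 48000  # target sampling frequency (Hz)

room = pra.ShoeBox(
    [5.0, 4.0, 3.0],              # room dimensions in meters
    fs=fs,
    materials=pra.Material(0.3),  # average energy absorption coefficient
    max_order=17,                 # image-source reflection order
)
room.add_source([2.5, 3.1, 1.6])      # speech source position
room.add_microphone([1.8, 1.2, 1.5])  # single microphone position

room.compute_rir()
rir = room.rir[0][0]  # impulse response for (mic 0, source 0)
sf.write("my_rir.wav", rir / np.max(np.abs(rir)), fs)
```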

We allow participants to simulate wind noise using tools such as SC-Wind-Noise-Generator. By default, the simulation script in our repository simulates 200 wind noise clips for training and 100 for validation at each sampling frequency. The configuration can easily be changed in wind_noise_simulation_train.yaml and wind_noise_simulation_validation.yaml. A crude illustrative stand-in is sketched below.
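The sketch below is a crude stand-in for wind noise, purely to illustrate its character; it does not reproduce the API of SC-Wind-Noise-Generator, which is the tool actually suggested above.

```python
# A crude stand-in for wind noise, for illustration only (not the
# SC-Wind-Noise-Generator API): low-pass-filtered Gaussian noise shaped
# by a slowly varying gust envelope.
import numpy as np
from scipy.signal import butter, sosfilt

def toy_wind_noise(duration_s=4.0, fs=16000, cutoff_hz=300.0, seed=0):
    rng = np.random.default_rng(seed)
    n = int(duration_s * fs)
    sos = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    noise = sosfilt(sos, rng.standard_normal(n))
    # gusts: very-low-frequency noise, rectified and normalized to [0, 1]
    env = sosfilt(butter(2, 1.0, btype="lowpass", fs=fs, output="sos"),
                  rng.standard_normal(n))
    env = np.abs(env) / (np.max(np.abs(env)) + 1e-12)
    return noise * env
```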


Data selection and simulation

We applied data selection to the Track 1 data of the 2nd URGENT challenge using the data filtering method proposed in a recent paper. The selected data list is available here. The speech sources from NNCES, SeniorTalk, VocalSet, and ESD are not filtered.

Note that the data filtering method of that paper is not necessarily the best way to utilize large-scale datasets for SE. The goal of this challenge is to encourage participants to develop ways to better leverage large-scale data to improve the final SE performance.

The simulation data can be generated as follows:

  1. In the first step, a manifest meta.tsv is generated by simulation/generate_data_param.py from the given lists of speech, noise, and room impulse response (RIR) samples. It specifies how each sample will be simulated, including the type of distortion to be applied, the speech/noise/RIR samples to be used, the signal-to-noise ratio (SNR), the random seed, and so on.

  2. In the second step, the simulation can be run in parallel via simulation/simulate_data_from_param.py for different samples according to the manifest, while ensuring reproducibility. This procedure can be used to generate both training and validation datasets.

  3. By default, we apply a high-pass filter to the speech signals, since we have noticed high-energy noise in the infrasound frequency band of some speech sources. You can turn it off by setting highpass=False in your simulation (a sketch of this kind of filtering follows this list).
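As mentioned in step 3, the following is a sketch of that kind of high-pass filtering, assuming a Butterworth filter; the cutoff, order, and file paths are illustrative, not necessarily those used by the official simulation script.

```python
# A sketch of the high-pass filtering mentioned in step 3, assuming a
# Butterworth filter; cutoff, order, and paths are illustrative values.
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def highpass(x, fs, cutoff_hz=40.0, order=5):
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=0)  # zero-phase filtering

x, fs = sf.read("clean.wav")  # hypothetical input path
sf.write("clean_hp.wav", highpass(x, fs), fs)
```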

For the training set, we recommend dynamically generating degraded speech samples during training to increase the data diversity.

Distortions

In this challenge, the SE system has to address the following seven distortions. In addition to the first four distortions considered in our first challenge, we added three more (items 5 to 7) that are often observed in real recordings. Furthermore, in this challenge, inputs may contain multiple distortions.

We provide an example simulation script as simulation/simulate_data_from_param.py; a toy sketch of two of these distortions follows the list below.

  1. Additive noise
  2. Reverberation
  3. Clipping
  4. Bandwidth limitation
  5. Codec distortion
  6. Packet loss
  7. Wind noise
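As referenced above, here are toy implementations of two of the listed distortions, clipping and bandwidth limitation. They are illustrative only; the parameter defaults are arbitrary, and the challenge's actual implementations live in simulation/simulate_data_from_param.py.

```python
# Toy implementations of two of the listed distortions (clipping and
# bandwidth limitation), for illustration only; see
# simulation/simulate_data_from_param.py for the challenge's actual code.
import numpy as np
import librosa

def clip(x, min_quantile=0.06, max_quantile=0.9):
    # clamp the waveform between two amplitude quantiles
    lo, hi = np.quantile(x, [min_quantile, max_quantile])
    return np.clip(x, lo, hi)

def bandwidth_limitation(x, fs, fs_new, res_type="kaiser_fast"):
    # downsample, then resample back to the original fs (cf. the
    # "bandwidth_limitation-<res_type>-><fs_new>" strings in the manifest)
    y = librosa.resample(x, orig_sr=fs, target_sr=fs_new, res_type=res_type)
    return librosa.resample(y, orig_sr=fs_new, target_sr=fs, res_type=res_type)
```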

Simulation metadata

The manifest mentioned above is a TSV file containing the following columns (separated by \t):

| Column | Description |
|--------|-------------|
| id | unique ID |
| noisy_path | path to the generated degraded speech |
| speech_uid | utterance ID of the clean speech |
| speech_sid | speaker ID of the clean speech |
| clean_path | path to the paired clean speech |
| noise_uid | utterance ID of the noise |
| snr_dB | SNR in decibels |
| rir_uid | utterance ID of the RIR |
| augmentation | augmentation type |
| fs | sampling frequency |
| length | sample length |
| text | raw transcript |

For example (tab-separated values shown with spaces):

fileid_1 simulation_validation/noisy/fileid_1.flac p226_001_mic1 vctk_p226 simulation_validation/clean/fileid_1.flac JC1gBY5vXHI 16.106643714525433 mediumroom-Room119-00056 bandwidth_limitation-kaiser_fast->24000 48000 134338 Please call Stella.
fileid_2 simulation_validation/noisy/fileid_2.flac p226_001_mic2 vctk_p226 simulation_validation/clean/fileid_2.flac file205_039840000_loc32_day1 2.438365163611807 none bandwidth_limitation-kaiser_best->22050 48000 134338 Please call Stella.
fileid_1561 simulation_validation/noisy/fileid_1561.flac p315_001_mic1 vctk_p315 simulation_validation/clean/fileid_1561.flac AvbnjyrHq8M 1.3502745341597029 mediumroom-Room076-00093 clipping(min=0.016037324066971528,max=0.9890219800761639) 48000 114829 <not-available>
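The manifest is straightforward to load programmatically; the sketch below uses pandas, assuming the file begins with a header row naming the columns above.

```python
# Reading the manifest with pandas; a sketch assuming the file starts with
# a header row naming the columns above.
import pandas as pd

meta = pd.read_csv("meta.tsv", sep="\t")
print(meta[["id", "snr_dB", "rir_uid", "augmentation"]].head())

# e.g., keep only the samples to which an RIR was applied
reverberant = meta[meta["rir_uid"] != "none"]
```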

Note

  • If the rir_uid value is not none, the specified RIR is applied to the clean speech sample.

  • If the augmentation value is not none, the specified augmentation is applied to the degraded speech sample.

  • <not-available> in the text column is a placeholder for transcripts that are not available.

  • The audios in noisy_path, clean_path, and (optional) noise_path are consistently scaled such that noisy_path = clean_path + noise_path.

  • The scale of the enhanced audio, however, is not critical for the objective evaluation in the challenge, as the evaluation metrics are made largely insensitive to scale. For the subjective listening test in the final phase, participants are recommended to properly scale the enhanced audio to facilitate a consistent evaluation.

  • For all distortion types, the original sampling frequency of each clean speech sample is always preserved, i.e., the degraded speech sample shares the same sampling frequency. For the bandwidth_limitation augmentation, this means that the generated speech sample is resampled back to the original sampling frequency fs.

Data description: Track 2

Training/Validation Data

We allow the use of any public datasets for training. Please describe your data usage in the technical report. See https://github.com/urgent-challenge/urgent2026_challenge_track2 for the training and validation data used in the official baseline.

Non-blind test set

  • The URGENT 2024 MOS dataset will be used as the non-blind test set. Available online after November 3rd, 2025.

Blind test set

  • Available online after November 18th, 2025.