Baseline
🚧 This website is currently under construction. All content is subject to change and may not reflect the final details of the competition. Please refer to official announcements once the contest begins for the most accurate and up-to-date information.
Baseline code and checkpoint
Please refer to the official GitHub repository for more details.
Distortion model
As depicted in the figure above, we design a distortion model (simulation stage) that generates degraded speech from clean speech by applying various distortions.
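As a rough illustration of such a simulation stage, the sketch below applies additive noise at a random SNR followed by optional bandwidth limitation. The distortion types, parameter ranges, and helper names here are assumptions for illustration only; the actual simulation pipeline is defined in the official repository.

```python
# Toy sketch of a distortion (simulation) stage. The specific distortions,
# parameter ranges, and function names are illustrative assumptions,
# not the official simulation pipeline.
import numpy as np
from scipy.signal import resample_poly

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at the requested SNR (in dB)."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def bandwidth_limit(speech: np.ndarray, orig_sf: int, target_sf: int) -> np.ndarray:
    """Down- then up-sample to remove content above target_sf / 2."""
    low = resample_poly(speech, target_sf, orig_sf)
    return resample_poly(low, orig_sf, target_sf)

def simulate(clean: np.ndarray, noise: np.ndarray, sf: int,
             rng: np.random.Generator) -> np.ndarray:
    degraded = add_noise(clean, noise, snr_db=rng.uniform(-5, 20))  # assumed SNR range
    if rng.random() < 0.5:                                          # assumed probability
        degraded = bandwidth_limit(degraded, sf, target_sf=int(rng.choice([8000, 16000])))
    return degraded
```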
During training and inference, processing of different SFs is supported both for conventional SE models (lower right of the figure), which usually operate at only one SF, and for adaptive STFT-based sampling-frequency-independent (SFI) SE models (upper right), which can directly handle different SFs.
- For conventional SE models (e.g., Conv-TasNet), we always upsample the input (degraded speech) to the highest SF (48 kHz) so that the model only needs to operate at 48 kHz. The model output (48 kHz) is then downsampled to the same SF as the degraded speech (a minimal sketch of this wrapper is given below).
- For adaptive STFT-based SFI SE models (e.g., BSRNN, TF-GridNet), we directly feed the degraded speech of different SFs into the model, which adaptively adjusts its STFT/iSTFT configuration according to the SF and generates the enhanced signal at the same SF.
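The wrapper for conventional single-SF models could look like the following minimal sketch; the `model` callable and the use of `torchaudio.functional.resample` are assumptions, not the baseline's actual implementation.

```python
# Minimal sketch of the upsample -> enhance -> downsample wrapper used for
# conventional (single-SF) SE models; `model` is a hypothetical SE network
# trained at 48 kHz, and torchaudio is only one possible resampling backend.
import torch
import torchaudio.functional as F

HIGHEST_SF = 48000  # the conventional model operates only at this SF

def enhance_at_any_sf(model: torch.nn.Module, degraded: torch.Tensor, sf: int) -> torch.Tensor:
    """degraded: (batch, samples) waveform sampled at `sf` Hz."""
    # 1) Upsample the degraded speech to the highest SF (48 kHz).
    x48 = F.resample(degraded, orig_freq=sf, new_freq=HIGHEST_SF)
    # 2) Enhance at 48 kHz with the single-SF model.
    with torch.no_grad():
        enhanced48 = model(x48)
    # 3) Downsample the enhanced output back to the SF of the degraded speech.
    return F.resample(enhanced48, orig_freq=HIGHEST_SF, new_freq=sf)
```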
Discriminative Baseline
We provide an adaptive STFT-based SFI SE model as the discriminative baseline. Please refer to the official GitHub repository for the code and pretrained checkpoint.
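One common way to realize an adaptive STFT for SFI models is to fix the window and hop durations in milliseconds and convert them to samples from the actual SF of each input. The sketch below illustrates that idea; the 32 ms / 16 ms values are assumptions, not necessarily the baseline's configuration.

```python
# Sketch of an adaptive STFT/iSTFT whose window and hop are fixed in
# milliseconds and converted to samples from the input SF; the 32 ms / 16 ms
# values are illustrative assumptions, not the official baseline settings.
import torch

WIN_MS, HOP_MS = 32.0, 16.0  # assumed physical window/hop durations

def adaptive_stft(wave: torch.Tensor, sf: int) -> torch.Tensor:
    """wave: (batch, samples) at `sf` Hz -> complex spectrogram (batch, freq, frames)."""
    win_length = int(round(WIN_MS * sf / 1000))  # e.g. 1536 samples at 48 kHz, 256 at 8 kHz
    hop_length = int(round(HOP_MS * sf / 1000))
    window = torch.hann_window(win_length, device=wave.device)
    # n_fft follows the window length, so the number of frequency bins
    # (n_fft // 2 + 1) grows with the SF while the physical resolution stays fixed.
    return torch.stft(wave, n_fft=win_length, hop_length=hop_length,
                      win_length=win_length, window=window, return_complex=True)

def adaptive_istft(spec: torch.Tensor, sf: int, length: int) -> torch.Tensor:
    """Invert a spectrogram produced by adaptive_stft back to a waveform of `length` samples."""
    win_length = int(round(WIN_MS * sf / 1000))
    hop_length = int(round(HOP_MS * sf / 1000))
    window = torch.hann_window(win_length, device=spec.device)
    return torch.istft(spec, n_fft=win_length, hop_length=hop_length,
                       win_length=win_length, window=window, length=length)
```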
Generative Baseline
We follow a recent work named FlowSE, a flow-matching-based generative SE model, as the generative baseline.
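As a rough, generic illustration of the idea behind flow-matching-based generative SE (not FlowSE's exact formulation, which differs in its choice of domain and conditioning), a conditional flow matching training step could look like the sketch below; `velocity_net` is a hypothetical network conditioned on the degraded speech.

```python
# Generic conditional flow matching (CFM) training step, shown only to
# illustrate the idea behind flow-matching-based generative SE;
# `velocity_net` and the waveform-domain setup are assumptions, not FlowSE.
import torch

def cfm_loss(velocity_net: torch.nn.Module,
             clean: torch.Tensor,      # (batch, ...) target clean speech
             degraded: torch.Tensor    # (batch, ...) conditioning degraded speech
             ) -> torch.Tensor:
    x0 = torch.randn_like(clean)                         # sample from the prior
    t = torch.rand(clean.shape[0], device=clean.device)  # random time in [0, 1)
    t_ = t.view(-1, *([1] * (clean.dim() - 1)))          # broadcast over signal dims
    xt = (1.0 - t_) * x0 + t_ * clean                    # linear interpolation path
    target_velocity = clean - x0                         # d x_t / d t along the path
    pred_velocity = velocity_net(xt, t, degraded)        # predict the velocity field
    return torch.mean((pred_velocity - target_velocity) ** 2)
```

At inference time, the enhanced signal would be obtained by integrating the learned velocity field from a prior sample to t = 1 (e.g., with a few Euler steps), conditioned on the degraded input.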