Rules
- When generating the training and validation datasets, only the speech, noise, and room impulse response (RIR) corpora listed in the Data tab shall be used.
  - This is to ensure a fair comparison and a proper understanding of various SE approaches.
  - The first month of the challenge will be a grace period during which participants can propose additional public datasets to be included in the list. We (the organizers) will reply to the requests and may update the list. Updates will be recorded in the Notices tab.
  - Although the speech enhancement model should only be trained on the listed data, we allow the use of pre-trained foundation models such as HuBERT, WavLM, EnCodec, Llama, and so on, as long as:
    - they are publicly available before the challenge begins, and
    - they are explicitly mentioned in the submitted system description.
  - Note:
    - The parameters of these pre-trained models can be fine-tuned on the listed data.
    - It is not allowed to fine-tune any model, be it pre-trained or not, on any data other than the listed data.
- The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are not allowed in this challenge.
- There is no constraint on the latency or causality of the developed system in this challenge. Any type of model can be used as long as it conforms to the other rules listed on this page.
- Registration is required to submit results to the challenge (check the Leaderboard tab for more information). Note that the team information (including affiliation, team name, and team members) should be provided when submitting the results. For detailed submission requirements, please check the Submission tab.
  - Only the team name will be shown on the leaderboard, while the affiliation and team members will be kept confidential.
- The following metrics will be calculated for evaluation.

Category | Metric | Need Reference Signals? | Supported Sampling Frequencies | Value Range
---|---|---|---|---
Non-intrusive SE metrics | DNSMOS ↑ | ❌ | 16 kHz | [1, 5]
Non-intrusive SE metrics | NISQA ↑ | ❌ | 48 kHz | [1, 5]
Intrusive SE metrics | POLQA (a) ↑ | ✔ | 8~48 kHz | [1, 5]
Intrusive SE metrics | PESQ ↑ | ✔ | {8, 16} kHz | [-0.5, 4.5]
Intrusive SE metrics | ESTOI ↑ | ✔ | 10 kHz | [0, 1]
Intrusive SE metrics | SDR ↑ | ✔ | Any | (-∞, +∞)
Intrusive SE metrics | MCD ↓ | ✔ | Any | [0, +∞)
Intrusive SE metrics | LSD ↓ | ✔ | Any | [0, +∞)
Downstream-task-independent metrics | SpeechBERTScore (b) ↑ | ✔ | 16 kHz | [-1, 1]
Downstream-task-independent metrics | LPS ↑ | ✔ | 16 kHz | (-∞, 1]
Downstream-task-dependent metrics | SpkSim ↑ | ✔ | 16 kHz | [-1, 1]
Downstream-task-dependent metrics | WAcc (=1-WER) ↑ | ❌ | 16 kHz | (-∞, 1]
Subjective SE metrics | MOS (a) ↑ | ❌ | Any | [1, 5]

(a) This metric will only be used for evaluation of the final blind test set.

(b) Based on our preliminary investigation, we adopt the HuBERT-Base backend for calculating the SpeechBERTScore, which differs from its default backend (WavLM-Large).

Note: For real recorded test samples that do not have a strictly matched reference signal, only a subset of the above metrics will be used.
- The overall ranking will be determined via the following procedure:
  - Calculate the average score of each metric for each submission.
  - Calculate the per-metric ranking based on the average score.
    - We adopt the dense ranking (“1223” ranking) strategy for handling ties: https://en.wikipedia.org/wiki/Ranking#Dense_ranking_("1223"_ranking)
  - Calculate the per-category ranking by averaging the rankings within each category.
  - Calculate the overall ranking by averaging the per-category rankings.

    # Step 1: Calculate the average score of each metric
    scores = {}
    for submission in all_submissions:
        scores[submission] = {}
        for category in metric_categories:
            for metric in category:
                scores[submission][metric] = mean([metric(each_sample) for each_sample in submission])

    # Step 2: Calculate the per-metric ranking based on the average score
    rank_per_metric = {}
    rank_per_category = {}
    for category in metric_categories:
        for metric in category:
            rank_per_metric[metric] = get_ranking([scores[submission][metric] for submission in all_submissions])

        # Step 3: Calculate the `per-category ranking` by averaging the rankings within each category
        rank_per_category[category] = get_ranking([rank_per_metric[metric] for metric in category])

    # Step 4: Calculate the overall ranking by averaging the `per-category rankings`
    rank_overall = get_ranking([rank_per_category[category] for category in metric_categories])

Note: Only the original test data, the best baseline system, and participant submissions are taken into account in the ranking procedure.
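
The dense (“1223”) tie-handling strategy named in the procedure can be sketched as follows. This is a minimal illustration, not the organizers' implementation: the function name `dense_rank` and its `higher_is_better` flag are assumptions introduced here, and the scores are made-up values.

```python
def dense_rank(scores, higher_is_better=True):
    """Return dense ("1223") ranks: tied scores share a rank and no rank is skipped."""
    # Sort the distinct score values from best to worst.
    distinct = sorted(set(scores), reverse=higher_is_better)
    # Map each distinct value to its 1-based position in that order.
    rank_of = {value: position for position, value in enumerate(distinct, start=1)}
    # Ties are detected by exact equality of the averaged scores.
    return [rank_of[score] for score in scores]


# Hypothetical average scores of four systems for a "higher is better" metric (↑).
pesq_ranks = dense_rank([2.5, 3.1, 3.1, 1.9])
# Hypothetical average scores of the same systems for a "lower is better" metric (↓).
mcd_ranks = dense_rank([4.0, 3.2, 5.5, 3.2], higher_is_better=False)

print(pesq_ranks)  # [2, 1, 1, 3]
print(mcd_ranks)   # [2, 1, 3, 1]
```

In both calls two systems tie for rank 1, and the next-best system receives rank 2 rather than 3, which is what distinguishes dense ranking from standard competition ranking.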
Below is an example of how we calculate the overall ranking.
Per-metric ranking (columns grouped by category: Non-intrusive SE metrics: DNSMOS, NISQA; Intrusive SE metrics: PESQ, ESTOI, SDR, MCD, LSD; Downstream-task-independent metrics: SpeechBERTScore, LPS; Downstream-task-dependent metrics: SpkSim, WAcc)

System | DNSMOS ↑ | NISQA ↑ | PESQ ↑ | ESTOI ↑ | SDR ↑ | MCD ↓ | LSD ↓ | SpeechBERTScore ↑ | LPS ↑ | SpkSim ↑ | WAcc ↑
---|---|---|---|---|---|---|---|---|---|---|---
Noisy input | 6 | 6 | 5 | 4 | 5 | 5 | 5 | 1 | 5 | 3 | 3
Baseline | 5 | 5 | 4 | 5 | 4 | 4 | 4 | 4 | 4 | 5 | 4
Submission 1 | 1 | 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6
Submission 2 | 4 | 4 | 3 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5
Submission 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2
Submission 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

⬇

Per-category ranking

System | Overall ranking | Non-intrusive SE metrics | Intrusive SE metrics | Downstream-task-independent metrics | Downstream-task-dependent metrics
---|---|---|---|---|---
Noisy input | 4.200 | 6.0 | 4.8 | 3.0 | 3.0
Baseline | 4.425 | 5.0 | 4.2 | 4.0 | 4.5
Submission 1 | 4.750 | 1.0 | 6.0 | 6.0 | 6.0
Submission 2 | 3.750 | 4.0 | 3.0 | 3.5 | 4.5
Submission 3 | 2.125 | 3.0 | 2.0 | 1.5 | 2.0
Submission 4 | 1.250 | 2.0 | 1.0 | 1.0 | 1.0
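
The two averaging steps in this example can be checked with a short script. The sketch below takes the per-metric rankings from the example as given and averages them first within each category and then across categories. The names `CATEGORIES`, `PER_METRIC_RANKS`, and `aggregate` are illustrative assumptions, not part of any official scoring tool.

```python
from statistics import fmean

# Metric names per category, as grouped in the example.
CATEGORIES = {
    "Non-intrusive SE metrics": ["DNSMOS", "NISQA"],
    "Intrusive SE metrics": ["PESQ", "ESTOI", "SDR", "MCD", "LSD"],
    "Downstream-task-independent metrics": ["SpeechBERTScore", "LPS"],
    "Downstream-task-dependent metrics": ["SpkSim", "WAcc"],
}
METRICS = [metric for names in CATEGORIES.values() for metric in names]

# Per-metric rankings copied from the example, in METRICS order.
PER_METRIC_RANKS = {
    "Noisy input":  [6, 6, 5, 4, 5, 5, 5, 1, 5, 3, 3],
    "Baseline":     [5, 5, 4, 5, 4, 4, 4, 4, 4, 5, 4],
    "Submission 1": [1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6],
    "Submission 2": [4, 4, 3, 3, 3, 3, 3, 4, 3, 4, 5],
    "Submission 3": [3, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2],
    "Submission 4": [2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def aggregate(ranks):
    """Average ranks within each category, then average the category means."""
    by_metric = dict(zip(METRICS, ranks))
    per_category = {
        category: fmean(by_metric[metric] for metric in names)
        for category, names in CATEGORIES.items()
    }
    overall = fmean(per_category.values())
    return per_category, overall


results = {system: aggregate(ranks) for system, ranks in PER_METRIC_RANKS.items()}
for system, (per_category, overall) in results.items():
    category_means = ", ".join(f"{value:.1f}" for value in per_category.values())
    print(f"{system}: overall={overall:.3f} (per-category: {category_means})")
```

Because every category contributes equally to the overall value, a category with only two metrics (for example, the non-intrusive one) carries the same weight as the intrusive category with five.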