Rules

  1. When generating the training and validation datasets, only the speech, noise, and room impulse response (RIR) corpora listed in the Data tab shall be used.
    • This is to ensure a fair comparison and proper understanding of various speech enhancement (SE) approaches.
    • The first month of the challenge will be a grace period during which participants can propose additional public datasets for inclusion in the list. We (the organizers) will reply to the requests and may update the list. Updates will be recorded in the Notices tab.
    • Although the speech enhancement model should only be trained on the listed data, we allow the use of pre-trained foundation models such as HuBERT, WavLM, EnCodec, Llama, and so on as long as:
      • they are publicly available before the challenge begins
      • and they are explicitly mentioned in the submitted system description.
      • Note:
        • Their parameters can be fine-tuned on the listed data.
        • Fine-tuning any model, pre-trained or not, on data other than the listed data is not allowed.

  2. The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are not allowed for this challenge.

  3. Registration is required to submit results to the challenge (see the Leaderboard tab for more information). Note that team information (including affiliation, team name, and team members) should be provided when submitting results. For detailed submission requirements, please check the Submission tab.
    • Only the team name will be shown on the leaderboard, while the affiliation and team members will be kept confidential.

  4. The following metrics will be calculated for evaluation.

    | Category | Metric | Need Reference Signals? | Supported Sampling Frequencies | Value Range |
    |---|---|---|---|---|
    | Non-intrusive SE metrics | DNSMOS | | 16 kHz | [1, 5] |
    | | NISQA | | 48 kHz | [1, 5] |
    | Intrusive SE metrics | PESQ | | {8, 16} kHz | [-0.5, 4.5] |
    | | ESTOI | | 10 kHz | [0, 1] |
    | | SDR | | Any | (-∞, +∞) |
    | | MCD | | Any | [0, +∞) |
    | | LSD | | Any | [0, +∞) |
    | Downstream-task-independent metrics | SpeechBERTScore | | 16 kHz | [-1, 1] |
    | | LPS | | 16 kHz | (-∞, 1] |
    | Downstream-task-dependent metrics | SpkSim | | 16 kHz | [-1, 1] |
    | | WAcc (=1-WER) ↑ | | 16 kHz | (-∞, 1] |

    Note: Based on our preliminary investigation, we adopt the HuBERT-Base backend for calculating the SpeechBERTScore, which differs from its default backend (WavLM-Large).
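
The value range (-∞, 1] listed for WAcc follows from its definition as 1-WER: a hypothesis with many inserted words can push WER above 1. The following is a minimal sketch of that arithmetic, assuming the standard word-level WER (edit distance divided by the number of reference words). The function names are hypothetical, and the ASR model and text normalization used by the challenge are not specified here and are not modeled.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """WAcc = 1 - WER, with WER = edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    wer = word_edit_distance(ref_words, hyp_words) / len(ref_words)
    return 1.0 - wer


print(word_accuracy("turn the light on", "turn the light on"))  # 1.0
print(word_accuracy("turn the light on", "turn light off"))     # 0.5
print(word_accuracy("yes", "yes yes it is so"))                 # -3.0
```

The last call shows the unbounded lower end: four inserted words against a one-word reference give WER = 4 and therefore WAcc = -3.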


    1. Note that for the final ranking on the blind test data, further metrics (such as POLQA (http://www.polqa.info) and MOS) might be used, operating at a sampling frequency of 8 kHz, 16 kHz, or 48 kHz.
    2. For real recorded test samples that do not have a strictly matched reference signal, only a subset of the above metrics will be used.
  5. The overall ranking will be determined via the following procedure:

    1. Calculate the average score of each metric for each submission.
    2. Calculate the per-metric ranking based on the average score.
    3. Calculate the per-category ranking by averaging the rankings within each category.
    4. Calculate the overall ranking by averaging the per-category rankings.

     # Step 1: Calculate the average score of each metric
     scores = {}
     for submission in all_submissions:
       scores[submission] = {}
       for category in metric_categories:
         for metric in category:
           scores[submission][metric] = mean([metric(each_sample) for each_sample in submission])
    
     # Step 2: Calculate the per-metric ranking based on the average score
     # get_ranking() maps a list of scores (one per submission) to a list of ranks (1 = best)
     rank_per_metric = {}
     rank_per_category = {}
     for category in metric_categories:
       for metric in category:
         rank_per_metric[metric] = get_ranking([scores[submission][metric] for submission in all_submissions])
    
       # Step 3: Calculate the `per-category ranking` by averaging the rankings within each category
       # zip(*...) regroups the per-metric rank lists so that each tuple holds one submission's ranks
       rank_per_category[category] = [mean(ranks) for ranks in zip(*[rank_per_metric[metric] for metric in category])]
    
     # Step 4: Calculate the overall ranking by averaging the `per-category rankings`
     rank_overall = [mean(ranks) for ranks in zip(*[rank_per_category[category] for category in metric_categories])]
    

    Note: Only the original test data, the best baseline system, and participant submissions are taken into account in the ranking procedure.

Below is an example of how we calculate the overall ranking.

Per-metric ranking (Non-intrusive SE metrics: DNSMOS, NISQA; Intrusive SE metrics: PESQ, ESTOI, SDR, MCD, LSD; Downstream-task-independent metrics: SpeechBERTScore, LPS; Downstream-task-dependent metrics: SpkSim, WAcc)

| System | DNSMOS ↑ | NISQA ↑ | PESQ ↑ | ESTOI ↑ | SDR ↑ | MCD ↓ | LSD ↓ | SpeechBERTScore ↑ | LPS ↑ | SpkSim ↑ | WAcc ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy input | 6 | 6 | 5 | 4 | 5 | 5 | 5 | 1 | 5 | 3 | 3 |
| Baseline | 5 | 5 | 4 | 5 | 4 | 4 | 4 | 4 | 4 | 5 | 4 |
| Submission 1 | 1 | 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| Submission 2 | 4 | 4 | 3 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
| Submission 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 |
| Submission 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

⬇

Per-category ranking

| System | Overall ranking | Non-intrusive SE metrics | Intrusive SE metrics | Downstream-task-independent metrics | Downstream-task-dependent metrics |
|---|---|---|---|---|---|
| Noisy input | 4.200 | 6.0 | 4.8 | 3.0 | 3.0 |
| Baseline | 4.425 | 5.0 | 4.2 | 4.0 | 4.5 |
| Submission 1 | 4.750 | 1.0 | 6.0 | 6.0 | 6.0 |
| Submission 2 | 3.750 | 4.0 | 3.0 | 3.5 | 4.5 |
| Submission 3 | 2.125 | 3.0 | 2.0 | 1.5 | 2.0 |
| Submission 4 | 1.250 | 2.0 | 1.0 | 1.0 | 1.0 |
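
The numbers in this example can be reproduced with a short script. The following is a minimal sketch, not the organizers' implementation. The names `get_ranking`, `aggregate`, `CATEGORIES`, and `PER_METRIC_RANKS` are illustrative, and the tie handling (tied systems share the best rank of the tie) is an assumption inferred from the SpeechBERTScore column above, where three systems share rank 1 and the next rank is 4. The scores passed to `get_ranking` at the end are made up for illustration only.

```python
from statistics import mean

# Metric grouping as in the example above.
CATEGORIES = {
    "Non-intrusive SE metrics": ["DNSMOS", "NISQA"],
    "Intrusive SE metrics": ["PESQ", "ESTOI", "SDR", "MCD", "LSD"],
    "Downstream-task-independent metrics": ["SpeechBERTScore", "LPS"],
    "Downstream-task-dependent metrics": ["SpkSim", "WAcc"],
}
METRICS = [metric for metrics in CATEGORIES.values() for metric in metrics]

# Per-metric ranks copied from the example, listed in METRICS order.
PER_METRIC_RANKS = {
    "Noisy input":  [6, 6, 5, 4, 5, 5, 5, 1, 5, 3, 3],
    "Baseline":     [5, 5, 4, 5, 4, 4, 4, 4, 4, 5, 4],
    "Submission 1": [1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6],
    "Submission 2": [4, 4, 3, 3, 3, 3, 3, 4, 3, 4, 5],
    "Submission 3": [3, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2],
    "Submission 4": [2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def get_ranking(scores, higher_is_better=True):
    """Rank scores (1 = best); tied scores share the best rank of the tie (assumed)."""
    ordered = sorted(scores, reverse=higher_is_better)
    return [ordered.index(score) + 1 for score in scores]


def aggregate(per_metric_ranks):
    """Return {system: (overall rank, {category: mean per-metric rank})}."""
    results = {}
    for system, ranks in per_metric_ranks.items():
        rank_of = dict(zip(METRICS, ranks))
        per_category = {
            category: mean(rank_of[metric] for metric in metrics)
            for category, metrics in CATEGORIES.items()
        }
        results[system] = (mean(per_category.values()), per_category)
    return results


results = aggregate(PER_METRIC_RANKS)
# Lower overall rank is better, so sort ascending.
for system, (overall, _) in sorted(results.items(), key=lambda item: item[1][0]):
    print(f"{system:<13} {overall:.3f}")

# Made-up scores that reproduce the SpeechBERTScore rank column (1, 4, 6, 4, 1, 1).
print(get_ranking([0.9, 0.8, 0.7, 0.8, 0.9, 0.9]))
```

Because each category is averaged first, a category with two metrics carries the same weight in the overall ranking as one with five.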