Rules

  1. When generating the training and validation datasets, ONLY the speech, noise, and room impulse response (RIR) corpora listed in the Data tab shall be used, to ensure a fair comparison and proper understanding of various SE approaches.
    • The first month of the challenge (until Dec. 15) will be a grace period when participants can propose additional public datasets to be included in the list. We (organizers) will reply to the requests and may update the list. Updates will be recorded in the Notice tab.
    • Although the speech enhancement model should only be trained on the listed data, we allow the use of pre-trained foundation models such as HuBERT, WavLM, EnCodec, Llama, and so on as long as:
      • they are publicly available before the challenge begins
      • and they are explicitly mentioned in the submitted system description.
      • Note:
        • Their parameters can be fine-tuned on the listed data.
        • Fine-tuning any model, pre-trained or not, on data outside the listed data is not allowed.

    • It is allowed to use a model trained on the “NeurIPS 2024 URGENT challenge” data, as those data are included in this challenge too.
  2. The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are NOT allowed for this challenge.

  3. There is no constraint on the latency or causality of the developed system in this challenge. Any type of model can be used as long as it conforms to the other rules listed on this page.

  4. Registration is required to submit results to the challenge (Check the Leaderboard tab for more information). Note that the team information (including affiliation, team name, and team members) should be provided when submitting the results. For detailed submission requirements, please check the Submission tab.
    • Only the team name will be shown in the leaderboard, while the affiliation and team members will be kept confidential.

  5. The following metrics will be calculated for evaluation.

    | Category | Metric | Need Reference Signals? | Supported Sampling Frequencies | Value Range |
    |---|---|---|---|---|
    | Non-intrusive SE metrics | DNSMOS | No | 16 kHz | [1, 5] |
    | | NISQA | No | 48 kHz | [1, 5] |
    | | UTMOS | No | 16 kHz | [1, 5] |
    | Intrusive SE metrics | POLQA [a] | Yes | 8 to 48 kHz | [1, 5] |
    | | PESQ | Yes | {8, 16} kHz | [-0.5, 4.5] |
    | | ESTOI | Yes | 10 kHz | [0, 1] |
    | | SDR | Yes | Any | (-∞, +∞) |
    | | MCD | Yes | Any | [0, +∞) |
    | | LSD | Yes | Any | [0, +∞) |
    | Downstream-task-independent metrics | SpeechBERTScore [b] | | 16 kHz | [-1, 1] |
    | | LPS | | 16 kHz | (-∞, 1] |
    | Downstream-task-dependent metrics | SpkSim | | 16 kHz | [-1, 1] |
    | | Character accuracy (1 - CER) | | 16 kHz | (-∞, 1] |
    | Subjective SE metrics | MOS [a] | | Any | [1, 5] |

    [a] This metric will only be used for evaluation of the final blind test set.
    [b] To evaluate multilingual speech, we adopt the MHuBERT-147 backend for calculating the SpeechBERTScore, which differs from its default backend (WavLM-Large).


    Note: For real recorded test samples that do not have a strictly matched reference signal, only a subset of the above metrics will be used.
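
The value range (-∞, 1] listed for character accuracy follows from its definition as 1 - CER: a hypothesis with more character errors than the reference has characters yields a negative value. The sketch below illustrates this with a plain Levenshtein distance. The function names `edit_distance` and `character_accuracy` are illustrative only; how the hypothesis transcript is obtained and normalized is not specified in this section, so this is not the challenge's evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """1 - CER, with CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

A perfect transcript scores 1, while a hypothesis much longer than the reference (for example `character_accuracy("ab", "abcdefg")`) scores below 0.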

  6. The overall ranking will be determined via the following procedure:

    1. Calculate the average score of each metric for each submission.
    2. Calculate the per-metric ranking based on the average score.
    3. Calculate the per-category ranking by averaging the rankings within each category.
    4. Calculate the overall ranking by averaging the per-category rankings.

     # Step 1: Calculate the average score of each metric for each submission
     scores = {}
     for submission in all_submissions:
       scores[submission] = {}
       for category in metric_categories:
         for metric in category:
           scores[submission][metric] = mean([metric(each_sample) for each_sample in submission])

     # Step 2: Calculate the per-metric ranking based on the average score
     # (get_ranking returns one rank per submission, in the order of all_submissions)
     rank_per_metric = {}
     rank_per_category = {}
     for category in metric_categories:
       for metric in category:
         rank_per_metric[metric] = get_ranking([scores[submission][metric] for submission in all_submissions])

       # Step 3: Calculate the per-category ranking by averaging the rankings within each category
       # (zip(*...) regroups the per-metric rank lists by submission)
       rank_per_category[category] = [mean(ranks) for ranks in zip(*[rank_per_metric[metric] for metric in category])]

     # Step 4: Calculate the overall ranking by averaging the per-category rankings (lower is better)
     rank_overall = [mean(ranks) for ranks in zip(*[rank_per_category[category] for category in metric_categories])]
    

    Note: Only the original test data, the best baseline system, and participant submissions are taken into account in the ranking procedure.
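
The procedure above leaves `get_ranking` undefined. A minimal sketch of one possible implementation follows, under two assumptions that are not stated in the rules: rank 1 is the best system, and tied scores share the best rank of their group (which appears consistent with the tied ranks in the worked example on this page). The `higher_is_better` flag is a hypothetical parameter added here to handle metrics such as MCD and LSD, where lower values are better.

```python
def get_ranking(scores: dict[str, float], higher_is_better: bool = True) -> dict[str, int]:
    """Rank systems by average score; rank 1 is best.

    Assumption: tied scores share the best rank of their tie group,
    and the next distinct score skips the ranks the tie occupied.
    """
    ordered = sorted(scores.values(), reverse=higher_is_better)
    # list.index returns the first position of a value, so ties share one rank.
    return {name: ordered.index(value) + 1 for name, value in scores.items()}
```

For instance, `get_ranking({"a": 3.0, "b": 3.0, "c": 1.0})` assigns rank 1 to both `a` and `b` and rank 3 to `c`.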

Below is an example of how we calculate the overall ranking.

Per-metric ranking

| System | Non-intrusive SE metrics | | Intrusive SE metrics | | | | | Downstream-task-independent metrics | | Downstream-task-dependent metrics | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | DNSMOS ↑ | NISQA ↑ | PESQ ↑ | ESTOI ↑ | SDR ↑ | MCD ↓ | LSD ↓ | SpeechBERTScore ↑ | LPS ↑ | SpkSim ↑ | WAcc ↑ |
| Noisy input | 6 | 6 | 5 | 4 | 5 | 5 | 5 | 1 | 5 | 3 | 3 |
| Baseline | 5 | 5 | 4 | 5 | 4 | 4 | 4 | 4 | 4 | 5 | 4 |
| Submission 1 | 1 | 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| Submission 2 | 4 | 4 | 3 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
| Submission 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 |
| Submission 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

⬇

Per-category ranking

| System | Overall ranking | Non-intrusive SE metrics | Intrusive SE metrics | Downstream-task-independent metrics | Downstream-task-dependent metrics |
|---|---|---|---|---|---|
| Noisy input | 4.200 | 6.0 | 4.8 | 3.0 | 3.0 |
| Baseline | 4.425 | 5.0 | 4.2 | 4.0 | 4.5 |
| Submission 1 | 4.750 | 1.0 | 6.0 | 6.0 | 6.0 |
| Submission 2 | 3.750 | 4.0 | 3.0 | 3.5 | 4.5 |
| Submission 3 | 2.125 | 3.0 | 2.0 | 1.5 | 2.0 |
| Submission 4 | 1.250 | 2.0 | 1.0 | 1.0 | 1.0 |
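
The aggregation in this example can be checked with a short script. The sketch below copies the per-metric rankings from the example and applies steps 3 and 4 of the procedure (average within each category, then average across categories). The names `CATEGORIES`, `PER_METRIC_RANKS`, `average`, and `aggregate` are illustrative and are not part of any official evaluation code.

```python
# Metric categories as grouped in the example.
CATEGORIES = {
    "Non-intrusive SE metrics": ["DNSMOS", "NISQA"],
    "Intrusive SE metrics": ["PESQ", "ESTOI", "SDR", "MCD", "LSD"],
    "Downstream-task-independent metrics": ["SpeechBERTScore", "LPS"],
    "Downstream-task-dependent metrics": ["SpkSim", "WAcc"],
}
METRICS = [m for metrics in CATEGORIES.values() for m in metrics]

# Per-metric rankings copied from the example, in METRICS order.
PER_METRIC_RANKS = {
    "Noisy input":  [6, 6, 5, 4, 5, 5, 5, 1, 5, 3, 3],
    "Baseline":     [5, 5, 4, 5, 4, 4, 4, 4, 4, 5, 4],
    "Submission 1": [1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6],
    "Submission 2": [4, 4, 3, 3, 3, 3, 3, 4, 3, 4, 5],
    "Submission 3": [3, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2],
    "Submission 4": [2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def average(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def aggregate(ranks):
    """Return (per-category averages, overall average) for one system."""
    by_metric = dict(zip(METRICS, ranks))
    # Step 3: average the per-metric ranks within each category.
    per_category = {
        cat: average(by_metric[m] for m in metrics)
        for cat, metrics in CATEGORIES.items()
    }
    # Step 4: average the per-category values; lower is better.
    overall = average(per_category.values())
    return per_category, overall


results = {system: aggregate(r) for system, r in PER_METRIC_RANKS.items()}

# Print systems from best to worst overall value.
for system, (per_category, overall) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{system:<13} overall={overall:.3f}", [round(v, 1) for v in per_category.values()])
```

Because each category contributes equally to the overall value regardless of how many metrics it contains, a system that excels only in a small category (such as Submission 1 in the non-intrusive metrics) can still rank poorly overall.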