Rules | URGENT Challenge (2024)

When generating the training and validation datasets, only the speech, noise, and room impulse response (RIR) corpora listed in the Data tab shall be used.
- This is to ensure a fair comparison and proper understanding of various SE approaches.
- The first month of the challenge will be a grace period when participants can propose additional public datasets to be included in the list. We (organizers) will reply to the requests and may update the list. Updates will be recorded in the Notices tab.
- Although the speech enhancement model should only be trained on the listed data, we allow the use of pre-trained foundation models such as HuBERT, WavLM, EnCodec, Llama, and so on as long as:
  - they are publicly available before the challenge begins
  - and they are explicitly mentioned in the submitted system description.
  - Note:
    - Their parameters can be fine-tuned on the listed data.
    - No model, pre-trained or not, may be fine-tuned on any data other than the listed data.
The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are not allowed for this challenge.
There is no constraint on the latency or causality of the developed system in this challenge. Any type of model can be used as long as it conforms to the other rules listed on this page.
Registration is required to submit results to the challenge (see the Leaderboard tab for more information). Team information (affiliation, team name, and team members) must be provided when submitting the results. For detailed submission requirements, please check the Submission tab.
- Only the team name will be shown on the leaderboard; the affiliation and team members will be kept confidential.

The following metrics will be calculated for evaluation.

Category	Metric	Need Reference Signals?	Supported Sampling Frequencies	Value Range
Non-intrusive SE metrics	DNSMOS ↑	❌	16 kHz	[1, 5]
Non-intrusive SE metrics	NISQA ↑	❌	48 kHz	[1, 5]
Intrusive SE metrics	POLQA [1] ↑	✔	8 to 48 kHz	[1, 5]
Intrusive SE metrics	PESQ ↑	✔	{8, 16} kHz	[-0.5, 4.5]
Intrusive SE metrics	ESTOI ↑	✔	10 kHz	[0, 1]
Intrusive SE metrics	SDR ↑	✔	Any	(-∞, +∞)
Intrusive SE metrics	MCD ↓	✔	Any	[0, +∞)
Intrusive SE metrics	LSD ↓	✔	Any	[0, +∞)
Downstream-task-independent metrics	SpeechBERTScore [2] ↑	✔	16 kHz	[-1, 1]
Downstream-task-independent metrics	LPS ↑	✔	16 kHz	(-∞, 1]
Downstream-task-dependent metrics	SpkSim ↑	✔	16 kHz	[-1, 1]
Downstream-task-dependent metrics	WAcc (=1-WER) ↑	❌	16 kHz	(-∞, 1]
Subjective SE metrics	MOS [1] ↑	❌	Any	[1, 5]

[1] This metric will only be used for evaluation of the final blind test set.
[2] Based on our preliminary investigation, we adopt the HuBERT-Base backend for calculating the SpeechBERTScore, which differs from its default backend (WavLM-Large).

Note: For real recorded test samples that do not have a strictly matched reference signal, part of the above metrics will be used.
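
The value range of WAcc, (-∞, 1], follows from its definition as 1 - WER: the word error rate exceeds 1 whenever the recognized text needs more edits than the reference has words. The sketch below illustrates only this arithmetic. The function names are hypothetical, and the ASR model and text normalization used by the challenge are not described on this page, so this is not the official scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """WAcc = 1 - WER; negative when WER is larger than 1."""
    return 1.0 - word_error_rate(reference, hypothesis)


perfect = word_accuracy("the cat sat", "the cat sat")
one_deletion = word_accuracy("the cat sat", "the cat")
many_insertions = word_accuracy("hello", "oh hello there my friend")
```

Here `perfect` is 1.0, `one_deletion` is about 0.667, and `many_insertions` is -3.0, because four inserted words are counted against a one-word reference.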

The overall ranking will be determined via the following procedure:

1. Calculate the average score of each metric for each submission.
2. Calculate the per-metric ranking based on the average score.
   - We adopt the dense ranking (“1223” ranking) strategy for handling ties; see https://en.wikipedia.org/wiki/Ranking#Dense_ranking_("1223"_ranking)
3. Calculate the per-category ranking by averaging the rankings within each category.
4. Calculate the overall ranking by averaging the per-category rankings.

 # Step 1: Calculate the average score of each metric
 scores = {}
 for submission in all_submissions:
   scores[submission] = {}
   for category in metric_categories:
     for metric in category:
       scores[submission][metric] = mean([metric(each_sample) for each_sample in submission])

 # Step 2: Calculate the per-metric ranking based on the average score
 rank_per_metric = {}
 rank_per_category = {}
 for category in metric_categories:
   for metric in category:
     rank_per_metric[metric] = get_ranking([scores[submission][metric] for submission in all_submissions])

   # Step 3: Calculate the `per-category ranking` by averaging the rankings within each category
   rank_per_category[category] = get_ranking([rank_per_metric[metric] for metric in category])

 # Step 4: Calculate the overall ranking by averaging the `per-category rankings`
 rank_overall = get_ranking([rank_per_category[category] for category in metric_categories])

Note: Only the original test data, the best baseline system, and participant submissions are taken into account in the ranking procedure.
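
The pseudocode above leaves `get_ranking` undefined. For the per-metric step, one possible reading is the dense ("1223") ranking sketched below. The function name `dense_ranking`, the `higher_is_better` flag (covering the ↑ and ↓ metrics), and the sample scores are all illustrative assumptions, and ties are detected by exact equality of the average scores.

```python
def dense_ranking(scores: dict[str, float], higher_is_better: bool = True) -> dict[str, int]:
    """Dense ranking: tied systems share a rank and no rank is skipped afterwards."""
    # Sort the distinct scores from best to worst; the position of a score
    # in that list (starting at 1) is the rank of every system holding it.
    distinct = sorted(set(scores.values()), reverse=higher_is_better)
    rank_of_score = {score: rank for rank, score in enumerate(distinct, start=1)}
    return {system: rank_of_score[score] for system, score in scores.items()}


# Made-up average scores for a metric where higher is better (e.g. PESQ).
pesq_ranks = dense_ranking({"A": 3.1, "B": 2.8, "C": 2.8, "D": 2.5})

# Made-up average scores for a metric where lower is better (e.g. MCD).
mcd_ranks = dense_ranking({"A": 4.0, "B": 3.2, "C": 5.1}, higher_is_better=False)
```

With these inputs, systems B and C tie at rank 2 and D follows at rank 3 rather than 4, which is what distinguishes dense ranking from standard competition ranking.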

Below is an example of how we calculate the overall ranking.

Per-metric ranking:

Category	Non-intrusive SE metrics	Non-intrusive SE metrics	Intrusive SE metrics	Intrusive SE metrics	Intrusive SE metrics	Intrusive SE metrics	Intrusive SE metrics	Downstream-task-independent metrics	Downstream-task-independent metrics	Downstream-task-dependent metrics	Downstream-task-dependent metrics
System	DNSMOS ↑	NISQA ↑	PESQ ↑	ESTOI ↑	SDR ↑	MCD ↓	LSD ↓	SpeechBERTScore ↑	LPS ↑	SpkSim ↑	WAcc ↑
Noisy input	6	6	5	4	5	5	5	1	5	3	3
Baseline	5	5	4	5	4	4	4	4	4	5	4
Submission 1	1	1	6	6	6	6	6	6	6	6	6
Submission 2	4	4	3	3	3	3	3	4	3	4	5
Submission 3	3	3	2	2	2	2	2	1	2	2	2
Submission 4	2	2	1	1	1	1	1	1	1	1	1

⬇ ⬇ ⬇ ⬇ ⬇

Per-category ranking:

System	Overall ranking	Non-intrusive SE metrics	Intrusive SE metrics	Downstream-task-independent metrics	Downstream-task-dependent metrics
Noisy input	4.200	6.0	4.8	3.0	3.0
Baseline	4.425	5.0	4.2	4.0	4.5
Submission 1	4.750	1.0	6.0	6.0	6.0
Submission 2	3.750	4.0	3.0	3.5	4.5
Submission 3	2.125	3.0	2.0	1.5	2.0
Submission 4	1.250	2.0	1.0	1.0	1.0
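
The aggregation in the example can be checked with a short script. The sketch below takes the per-metric rankings from the example as given, averages them within each category, and then averages the four category values. The names `CATEGORIES`, `PER_METRIC_RANKS`, and `aggregate` are illustrative and not part of any official challenge tooling.

```python
# Metric categories used in the example (POLQA and MOS do not appear in it).
CATEGORIES = {
    "Non-intrusive SE metrics": ["DNSMOS", "NISQA"],
    "Intrusive SE metrics": ["PESQ", "ESTOI", "SDR", "MCD", "LSD"],
    "Downstream-task-independent metrics": ["SpeechBERTScore", "LPS"],
    "Downstream-task-dependent metrics": ["SpkSim", "WAcc"],
}
METRICS = [metric for metrics in CATEGORIES.values() for metric in metrics]

# Per-metric rankings copied from the example, listed in METRICS order.
PER_METRIC_RANKS = {
    "Noisy input": [6, 6, 5, 4, 5, 5, 5, 1, 5, 3, 3],
    "Baseline": [5, 5, 4, 5, 4, 4, 4, 4, 4, 5, 4],
    "Submission 1": [1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6],
    "Submission 2": [4, 4, 3, 3, 3, 3, 3, 4, 3, 4, 5],
    "Submission 3": [3, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2],
    "Submission 4": [2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def aggregate(ranks: dict[str, int]) -> tuple[dict[str, float], float]:
    """Average ranks within each category, then average across categories."""
    per_category = {
        category: sum(ranks[metric] for metric in metrics) / len(metrics)
        for category, metrics in CATEGORIES.items()
    }
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall


results = {
    system: aggregate(dict(zip(METRICS, ranks)))
    for system, ranks in PER_METRIC_RANKS.items()
}

# Print systems from best (lowest) to worst (highest) overall value.
for system, (per_category, overall) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{system:<13} overall={overall:.3f}")
```

Because every category carries equal weight in the final average, the two non-intrusive metrics together count as much as the five intrusive metrics. This is why Submission 1, ranked first on DNSMOS and NISQA but last on every other metric, still ends up with the worst overall value.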