Track 2: Speech Quality Assessment

This track focuses on predicting the Mean Opinion Score (MOS) of speech processed by speech enhancement (SE) systems. This contrasts with existing challenges, which have primarily targeted MOS prediction for text-to-speech (TTS) and voice conversion (VC) systems.

Data

Training / Development Data

We provide scripts to prepare and train on the following datasets in our official baseline repository. For ease of access, datasets with redistributable licenses are mirrored on Hugging Face (a loading sketch is given after the table below). As most existing MOS datasets are designed for TTS or VC, participants are encouraged to use additional public datasets and to apply SE-specific data curation.

| Split | Corpus | #Samples | #Systems | Duration (hours) | Links | License |
|---|---|---|---|---|---|---|
| Training | BC19 | 136 | 21 | 0.32 | [Original] | Custom |
| Training | BVCC | 4973 | 175 | 5.56 | [Original] | Custom |
| Training | NISQA | 11020 | N/A | 27.21 | [Original] | Mixed |
| Training | PSTN | 58709 | N/A | 163.08 | [Original] | Unknown |
| Training | SOMOS | 14100 | 181 | 18.32 | [Original] [Huggingface] | CC BY-NC-SA 4.0 |
| Training | TCD-VoIP | 384 | 24 | 0.87 | [Original] [Huggingface] | CC BY-NC-SA 4.0 |
| Training | Tencent | 11563 | N/A | 23.51 | [Original] [Huggingface] | Apache |
| Training | TMHINT-QI | 12937 | 98 | 11.35 | [Original] [Huggingface] | MIT |
| Training | TTSDS2 | 460 | 80 | 0.96 | [Original] [Huggingface] | MIT |
| Training | urgent2024-sqa | 238000 | 238 | 429.34 | [Huggingface] | CC BY-NC-SA 4.0 |
| Training | urgent2025-sqa | 100000 | 100 | 261.31 | [Huggingface] | CC BY-NC-SA 4.0 |
| Development | CHiME-7 UDASE Eval | 640 | 5 | 0.84 | [Original] [Huggingface] | CC BY-SA 4.0 |
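
The Hugging Face mirrors can typically be loaded with the datasets library, as in the minimal sketch below. The repository ID and column names are placeholders rather than the official ones; consult the dataset card of each corpus for the actual values.

# Minimal sketch: loading one of the Hugging Face-mirrored MOS corpora.
# NOTE: the repository ID and column names below are placeholders.
from datasets import load_dataset

ds = load_dataset("urgent-challenge/urgent2025_sqa", split="train")  # placeholder repo ID

for example in ds.select(range(3)):
    audio = example["audio"]   # e.g. {"array": ..., "sampling_rate": ...}
    mos = example["mos"]       # ground-truth MOS label
    print(audio["sampling_rate"], mos)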

Validation Data

The validation data is available on Hugging Face, and the data preparation script can be found in the official GitHub repository.

Non-Blind Test Data

The non-blind test data will be available after the non-blind test phase opens; please stay tuned.

Blind Test Data

The blind test data will be available after the blind test phase opens; please stay tuned.


Baseline

Please refer to the official GitHub repository, which provides an implementation of Uni-VERSA-Ext, for more details.

You can also try the pretrained models on Colab.


Rules

  1. During the first month of the challenge, participants may propose additional public datasets for inclusion in the official dataset list. The organizers will review all requests and may update the list accordingly. Any updates will be announced in the Notice tab.

  2. Participants may use any publicly available dataset to develop their prediction systems. All datasets used must be clearly reported in the system description. The use of proprietary datasets (including collecting your own MOS ratings) is not permitted unless those resources are publicly released. These rules follow the precedent set by the AudioMOS Challenge 2025.

  3. Participants may incorporate publicly available models into their systems for tasks such as generating additional data, producing pseudo-MOS labels, initializing model parameters, or serving as system components. All such pre-trained resources must be explicitly cited and described in the system description.

  4. The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are NOT allowed.

  5. Registration is required to submit results to the challenge (check the How to Participate tab for more information). Note that team information (including affiliation, team name, and team members) must be provided when submitting the results. For detailed submission requirements, please check the Submission section.

    • Only the team name will be shown on the leaderboard; the affiliation and team members will be kept confidential.


Submission

  • Each submission should be a zip file containing two parts (a minimal packaging sketch is given after this list):
    1. a mos.scp file containing the mapping from audio name (without extension) to predicted MOS score, e.g.
        fileid_0001 4.031136
        fileid_0002 3.545564 
      
    2. a YAML (README.yaml) file containing the basic information about the submission (as listed below). The template can be found here.
      • team information (team name, affiliation, team members)
      • description of the training & validation data used for the submission
      • description of pre-trained models used for the submission (if applicable)

Submissions that fail to conform to the above requirements may be rejected.
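
For reference, below is a minimal sketch of how a submission archive could be packaged. The YAML keys and values are placeholders only; the official README.yaml template linked above defines the required fields.

# Minimal sketch: write mos.scp and README.yaml, then bundle them into a zip.
# NOTE: the YAML keys/values are placeholders; follow the official template.
import zipfile
import yaml  # pip install pyyaml

predictions = {"fileid_0001": 4.031136, "fileid_0002": 3.545564}

with open("mos.scp", "w") as f:
    for utt_id, mos in predictions.items():
        f.write(f"{utt_id} {mos:.6f}\n")

readme = {
    "team_name": "my_team",                       # placeholder team information
    "affiliation": "My University",
    "team_members": ["Member A", "Member B"],
    "training_data": "BVCC, NISQA, urgent2025-sqa",
    "pretrained_models": "none",
}
with open("README.yaml", "w") as f:
    yaml.safe_dump(readme, f, sort_keys=False)

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("mos.scp")
    zf.write("README.yaml")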

Should you encounter any problem during the submission, please feel free to contact the organizers.

Ranking

Systems will be evaluated based on both correlation and error metrics between the predicted Mean Opinion Scores (MOS) and the ground-truth MOS labels. The following metrics will be used:

  • Mean Squared Error (MSE) – Measures the average squared difference between predicted and true scores.

  • Linear Correlation Coefficient (LCC) – Assesses the strength of the linear relationship between predicted and true scores.

  • Spearman’s Rank Correlation Coefficient (SRCC) and Kendall’s Tau (KTAU) – Evaluate the rank correlation between predicted and true scores. (A sketch of how these metrics can be computed is given after the table below.)

| Category | Metric | Value Range |
|---|---|---|
| Error | System-level MSE ↓ | [0, ∞) |
| Error | Utterance-level MSE ↓ | [0, ∞) |
| Linear Correlation | System-level LCC ↑ | [-1, 1] |
| Linear Correlation | Utterance-level LCC ↑ | [-1, 1] |
| Rank Correlation | System-level SRCC ↑ | [-1, 1] |
| Rank Correlation | Utterance-level SRCC ↑ | [-1, 1] |
| Rank Correlation | System-level KTAU ↑ | [-1, 1] |
| Rank Correlation | Utterance-level KTAU ↑ | [-1, 1] |
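
A minimal sketch of how these metrics can be computed with numpy/scipy is shown below. The variable names and toy data are illustrative (they are not taken from the official scoring code); system-level metrics are obtained by first averaging the predicted and ground-truth MOS per system.

# Minimal sketch: utterance- and system-level metrics with numpy/scipy.
# `pred` and `true` are aligned arrays of per-utterance MOS; `system_ids`
# holds the system label of each utterance (illustrative toy data).
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def mos_metrics(pred, true):
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return {
        "MSE": float(np.mean((pred - true) ** 2)),
        "LCC": pearsonr(pred, true)[0],
        "SRCC": spearmanr(pred, true)[0],
        "KTAU": kendalltau(pred, true)[0],
    }

pred = np.array([3.9, 4.2, 2.8, 3.1])                 # predicted MOS (toy example)
true = np.array([4.0, 4.5, 2.5, 3.0])                 # ground-truth MOS
system_ids = np.array(["sysA", "sysA", "sysB", "sysB"])

# Utterance-level metrics
utt_metrics = mos_metrics(pred, true)

# System-level metrics: average per system before computing the metrics
systems = sorted(set(system_ids))
sys_pred = [np.mean(pred[system_ids == sid]) for sid in systems]
sys_true = [np.mean(true[system_ids == sid]) for sid in systems]
sys_metrics = mos_metrics(sys_pred, sys_true)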


Overall ranking method

The overall ranking will be determined via the following procedure:

  1. Calculate the average score of each metric for each submission.
  2. Calculate the per-metric ranking based on the average score.
  3. Calculate the per-category ranking by averaging the rankings within each category.
  4. Calculate the overall ranking by averaging the per-category rankings.

# Ranking pseudocode. all_submissions, metric_categories (a dict mapping each
# category name to its list of metric functions), and get_ranking (which maps
# scores to ranks with 1 = best, taking the metric direction into account:
# lower is better for MSE and for averaged ranks, higher for correlations) are
# assumed to be defined elsewhere; a sketch of get_ranking is given below.
import numpy as np

# Step 1: Calculate the average score of each metric for each submission
scores = {}
for submission in all_submissions:
  scores[submission] = {}
  for category, metrics in metric_categories.items():
    for metric in metrics:
      scores[submission][metric] = np.mean(
        [metric(each_sample) for each_sample in submission]
      )

# Step 2: Calculate the per-metric ranking based on the average score
rank_per_metric = {}
rank_per_category = {}
for category, metrics in metric_categories.items():
  for metric in metrics:
    rank_per_metric[metric] = get_ranking(
      [scores[submission][metric] for submission in all_submissions]
    )

  # Step 3: Calculate the `per-category ranking` by averaging the per-metric
  # rankings within each category (and ranking the averages)
  rank_per_category[category] = get_ranking(
    np.mean([rank_per_metric[metric] for metric in metrics], axis=0)
  )

# Step 4: Calculate the overall ranking by averaging the `per-category rankings`
rank_overall = get_ranking(
  np.mean([rank_per_category[category] for category in metric_categories], axis=0)
)
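
For concreteness, get_ranking could be implemented with scipy.stats.rankdata as sketched below. This helper is illustrative only (it is not taken from the official scoring code), and the lower_is_better flag must be set according to the direction of the metric being ranked.

import numpy as np
from scipy.stats import rankdata

def get_ranking(values, lower_is_better=True):
    """Return ranks (1 = best) for a list of values, averaging ties.

    Use lower_is_better=True for error metrics (MSE) and for averaged
    ranks, and lower_is_better=False for LCC, SRCC, and KTAU.
    """
    values = np.asarray(values, dtype=float)
    # rankdata assigns rank 1 to the smallest value, so negate the values
    # when larger values are better.
    return rankdata(values if lower_is_better else -values, method="average")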