Track 2: Speech Quality Assessment
This track focuses on predicting the Mean Opinion Score (MOS) of speech processed by speech enhancement systems.
Data
Training / Development Data
We provide scripts to prepare and train on the following datasets in our official baseline repo. For ease of access, datasets with redistributable licenses are mirrored on Hugging Face. As most existing MOS datasets are designed for TTS or VC, participants are encouraged to use additional public datasets and apply SE-specific data curation.
| | Corpus | #Samples | #Systems | Duration (hours) | Links | License |
|---|---|---|---|---|---|---|
| Training | BC19 | 136 | 21 | 0.32 | [Original] | Custom |
| | BVCC | 4973 | 175 | 5.56 | [Original] | Custom |
| | NISQA | 11020 | N/A | 27.21 | [Original] | Mixed |
| | PSTN | 58709 | N/A | 163.08 | [Original] | Unknown |
| | SOMOS | 14100 | 181 | 18.32 | [Original] [Huggingface] | CC BY-NC-SA 4.0 |
| | TCD-VoIP | 384 | 24 | 0.87 | [Original] [Huggingface] | CC BY-NC-SA 4.0 |
| | Tencent | 11563 | N/A | 23.51 | [Original] [Huggingface] | Apache |
| | TMHINT-QI | 12937 | 98 | 11.35 | [Original] [Huggingface] | MIT |
| | TTSDS2 | 460 | 80 | 0.96 | [Original] [Huggingface] | MIT |
| | urgent2024-sqa | 238000 | 238 | 429.34 | [Huggingface] | CC BY-NC-SA 4.0 |
| | urgent2025-sqa | 100000 | 100 | 261.31 | [Huggingface] | CC BY-NC-SA 4.0 |
| Development | CHiME-7 UDASE Eval | 640 | 5 | 0.84 | [Original] [Huggingface] | CC BY-SA 4.0 |
Validation Data
The validation data is available on Hugging Face, and the data preparation script can be found in the official GitHub repository.
Non-Blind Test Data
The non-blind test data will be available after the non-blind test phase opens. Please stay tuned.
Blind Test Data
The blind test data will be available after the blind test phase opens. Please stay tuned.
Baseline
Please refer to the official GitHub repository for more details; it provides an implementation of Uni-VERSA-Ext.
You can also try the pretrained models on Colab.
Rules
- During the first month of the challenge, participants may propose additional public datasets for inclusion in the official dataset list. The organizers will review all requests and may update the list accordingly. Any updates will be announced in the Notice tab.
- Participants may use any publicly available dataset to develop their prediction systems. All datasets used must be clearly reported in the system description. The use of proprietary datasets (including collecting your own MOS ratings) is not permitted unless those resources are publicly released. These rules follow the precedent set by the AudioMOS Challenge 2025.
- Participants may incorporate publicly available models into their systems for tasks such as generating additional data, producing pseudo-MOS labels, initializing model parameters, or serving as system components. All such pre-trained resources must be explicitly cited and described in the system description.
- The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are NOT allowed.
- Registration is required to submit results to the challenge (check the How to Participate tab for more information). Note that the team information (including affiliation, team name, and team members) should be provided when submitting the results. For detailed submission requirements, please check the Submission part.
  - Only the team name will be shown in the leaderboard, while the affiliation and team members will be kept confidential.
Submission
- Each submission should be a zip file containing two parts:
  - a mos.scp file containing the mapping from audio name (without extension) to the predicted MOS score, e.g.:

    ```
    fileid_0001 4.031136
    fileid_0002 3.545564
    ```

  - a YAML file (README.yaml) containing the basic information about the submission (as listed below). The template can be found here.
    - team information (team name, affiliation, team members)
    - description of the training & validation data used for the submission
    - description of pre-trained models used for the submission (if applicable)
Submissions that fail to conform to the above requirements may be rejected.
Should you encounter any problems during submission, please feel free to contact the organizers.
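For convenience, here is a minimal sketch of how a submission archive could be assembled with Python's standard library. It is not official tooling: the prediction values, file paths, and the submission.zip name are placeholders, and README.yaml is assumed to have already been filled in from the official template.

```python
import zipfile

# Hypothetical predictions: audio name (without extension) -> predicted MOS
predictions = {"fileid_0001": 4.031136, "fileid_0002": 3.545564}

# Write mos.scp with one "<utterance-id> <score>" pair per line
with open("mos.scp", "w") as f:
    for utt_id, score in sorted(predictions.items()):
        f.write(f"{utt_id} {score:.6f}\n")

# Package the two required files; README.yaml is assumed to already exist,
# filled in from the official template (team info, data, pre-trained models).
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("mos.scp")
    zf.write("README.yaml")
```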
Ranking
Systems will be evaluated based on both correlation and error metrics between the predicted Mean Opinion Scores (MOS) and the ground-truth MOS labels. The following metrics will be used:
- Mean Squared Error (MSE) – Measures the average squared difference between predicted and true scores.
- Linear Correlation Coefficient (LCC) – Assesses the strength of the linear relationship between predicted and true scores.
- Spearman's Rank Correlation Coefficient (SRCC) and Kendall's Tau (KTAU) – Evaluate the rank correlation between predicted and true scores.
| Category | Level | Metric | Value Range |
|---|---|---|---|
| Error | System level | MSE ↓ | [0, ∞) |
| Error | Utterance level | MSE ↓ | [0, ∞) |
| Linear Correlation | System level | LCC ↑ | [-1, 1] |
| Linear Correlation | Utterance level | LCC ↑ | [-1, 1] |
| Rank Correlation | System level | SRCC ↑ | [-1, 1] |
| Rank Correlation | Utterance level | SRCC ↑ | [-1, 1] |
| Rank Correlation | System level | KTAU ↑ | [-1, 1] |
| Rank Correlation | Utterance level | KTAU ↑ | [-1, 1] |
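To make the metric definitions concrete, the sketch below computes the utterance-level metrics with NumPy and SciPy. The array names are placeholders, and the system-level variants are assumed here to be obtained by first averaging predicted and ground-truth scores per system and then applying the same functions; the official scoring scripts may differ in detail.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def utterance_level_metrics(pred, true):
    """MSE, LCC, SRCC, and KTAU between predicted and ground-truth MOS."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return {
        "MSE": float(np.mean((pred - true) ** 2)),
        "LCC": float(pearsonr(pred, true)[0]),
        "SRCC": float(spearmanr(pred, true)[0]),
        "KTAU": float(kendalltau(pred, true)[0]),
    }

# Assumed system-level evaluation: average predicted and true scores per
# system, then apply the same metrics to the per-system means.
def system_level_metrics(pred, true, system_ids):
    systems = sorted(set(system_ids))
    pred_sys = [np.mean([p for p, s in zip(pred, system_ids) if s == sid]) for sid in systems]
    true_sys = [np.mean([t for t, s in zip(true, system_ids) if s == sid]) for sid in systems]
    return utterance_level_metrics(pred_sys, true_sys)
```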
Overall ranking method
The overall ranking will be determined via the following procedure:
1. Calculate the average score of each metric for each submission.
2. Calculate the per-metric ranking based on the average score.
   - We adopt the dense ranking ("1223" ranking) strategy (https://en.wikipedia.org/wiki/Ranking#Dense_ranking_("1223"_ranking)) for handling ties.
3. Calculate the per-category ranking by averaging the rankings within each category.
4. Calculate the overall ranking by averaging the per-category rankings.
```python
# Pseudocode: `all_submissions` is the list of submissions (each an iterable of
# its evaluated samples), `metric_categories` maps each category to its list of
# metric functions, and `get_ranking` assigns dense ("1223") ranks, treating
# lower as better for errors and averaged ranks, higher as better for correlations.

# Step 1: Calculate the average score of each metric for each submission
scores = {}
for submission in all_submissions:
    scores[submission] = {}
    for metrics in metric_categories.values():
        for metric in metrics:
            scores[submission][metric] = mean(metric(sample) for sample in submission)

# Step 2: Calculate the per-metric ranking based on the average score
rank_per_metric = {}
for metrics in metric_categories.values():
    for metric in metrics:
        rank_per_metric[metric] = get_ranking([scores[submission][metric] for submission in all_submissions])

# Step 3: Calculate the per-category ranking by averaging the rankings within each category
rank_per_category = {}
for category, metrics in metric_categories.items():
    avg_rank = [mean(rank_per_metric[metric][i] for metric in metrics)
                for i in range(len(all_submissions))]
    rank_per_category[category] = get_ranking(avg_rank)

# Step 4: Calculate the overall ranking by averaging the per-category rankings
avg_category_rank = [mean(rank_per_category[category][i] for category in metric_categories)
                     for i in range(len(all_submissions))]
rank_overall = get_ranking(avg_category_rank)
```
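Because ties are resolved with dense ranking, a minimal, self-contained `get_ranking` sketch consistent with the pseudocode above might look as follows. The `higher_is_better` flag is an assumption introduced here to cover both error metrics (lower is better) and correlation metrics (higher is better); the default lower-is-better setting is the right one when re-ranking averaged ranks in Steps 3 and 4.

```python
def get_ranking(values, higher_is_better=False):
    """Dense ("1223") ranking: the best value gets rank 1, tied values share
    a rank, and the next distinct value gets the next consecutive rank."""
    distinct = sorted(set(values), reverse=higher_is_better)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [rank_of[v] for v in values]

# Lower is better (e.g., MSE or averaged ranks): two submissions tie for first.
print(get_ranking([0.25, 0.25, 0.31, 0.40]))                   # [1, 1, 2, 3]
# Higher is better (e.g., LCC, SRCC, KTAU).
print(get_ranking([0.92, 0.87, 0.92], higher_is_better=True))  # [1, 2, 1]
```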