Track 2: Speech Quality Assessment

This track focuses on predicting the Mean Opinion Score (MOS) of speech processed by speech enhancement (SE) systems. This contrasts with existing challenges, which have primarily targeted MOS prediction for text-to-speech (TTS) and voice conversion (VC) systems.

Data

Training / Development Data

We provide scripts to prepare and train on the following datasets in our official baseline repository. For ease of access, datasets with redistributable licenses are mirrored on Hugging Face (a loading sketch is given after the table below). As most existing MOS datasets are designed for TTS or VC, participants are encouraged to use additional public datasets and to apply SE-specific data curation.

| Split | Corpus | #Samples | #Systems | Duration (hours) | Links | License |
|---|---|---|---|---|---|---|
| Training | BC19 | 136 | 21 | 0.32 | [Original] | Custom |
| Training | BVCC | 4973 | 175 | 5.56 | [Original] | Custom |
| Training | NISQA | 11020 | N/A | 27.21 | [Original] | Mixed |
| Training | PSTN | 58709 | N/A | 163.08 | [Original] | Unknown |
| Training | SOMOS | 14100 | 181 | 18.32 | [Original] [Huggingface] | CC BY-NC-SA 4.0 |
| Training | TCD-VoIP | 384 | 24 | 0.87 | [Original] [Huggingface] | CC BY-NC-SA 4.0 |
| Training | Tencent | 11563 | N/A | 23.51 | [Original] [Huggingface] | Apache |
| Training | TMHINT-QI | 12937 | 98 | 11.35 | [Original] [Huggingface] | MIT |
| Training | TTSDS2 | 460 | 80 | 0.96 | [Original] [Huggingface] | MIT |
| Training | urgent2024-sqa | 238000 | 238 | 429.34 | [Huggingface] | CC BY-NC-SA 4.0 |
| Training | urgent2025-sqa | 100000 | 100 | 261.31 | [Huggingface] | CC BY-NC-SA 4.0 |
| Development | CHiME-7 UDASE Eval | 640 | 5 | 0.84 | [Original] [Huggingface] | CC BY-SA 4.0 |
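
The Hugging Face mirrors can typically be loaded with the datasets library, as in the minimal sketch below. The repository ID and column names are placeholders rather than the official ones; consult the dataset card of each corpus for the actual values.

# Minimal sketch: loading one of the Hugging Face-mirrored MOS corpora.
# NOTE: the repository ID and column names below are placeholders.
from datasets import load_dataset

ds = load_dataset("urgent-challenge/urgent2025_sqa", split="train")  # placeholder repo ID

for example in ds.select(range(3)):
    audio = example["audio"]   # e.g. {"array": ..., "sampling_rate": ...}
    mos = example["mos"]       # ground-truth MOS label
    print(audio["sampling_rate"], mos)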

Validation Data

The validation data is available on Hugging Face, and the data preparation script can be found in the official GitHub repository.

Non-Blind Test Data

The non-blind test data will be available after the non-blind test phase opens; please stay tuned.

Blind Test Data

The blind test data will be available after the blind test phase opens; please stay tuned.


Baseline

Please refer to the official GitHub repository, which provides an implementation of Uni-VERSA-Ext, for more details.

You can also try the pretrained models on Colab.


Rules

  1. During the first month of the challenge, participants may propose additional public datasets for inclusion in the official dataset list. The organizers will review all requests and may update the list accordingly. Any updates will be announced in the Notice tab.

  2. Participants may use any publicly available dataset to develop their prediction systems. All datasets used must be clearly reported in the system description. The use of proprietary datasets (including collecting your own MOS ratings) is not permitted unless those resources are publicly released. These rules follow the precedent set by the AudioMOS Challenge 2025.

  3. Participants may incorporate publicly available models into their systems for tasks such as generating additional data, producing pseudo-MOS labels, initializing model parameters, or serving as system components. All such pre-trained resources must be explicitly cited and described in the system description.

  4. The test data should only be used for evaluation purposes. Techniques such as test-time adaptation, unsupervised domain adaptation, and self-training on the test data are NOT allowed.

  5. Registration is required to submit results to the challenge (check the How to Participate tab for more information). Note that team information (including affiliation, team name, and team members) must be provided when submitting the results. For detailed submission requirements, please check the Submission section.

    • Only the team name will be shown on the leaderboard; the affiliation and team members will be kept confidential.


Submission

  • Each submission should be a zip file containing two parts (a minimal packaging sketch is given after this list):
    1. a mos.scp file containing the mapping from audio name (without extension) to predicted MOS score, e.g.
        fileid_0001 4.031136
        fileid_0002 3.545564 
      
    2. a YAML (README.yaml) file containing the basic information about the submission (as listed below). The template can be found here.
      • team information (team name, affiliation, team members)
      • description of the training & validation data used for the submission
      • description of pre-trained models used for the submission (if applicable)

Submissions that fail to conform to the above requirements may be rejected.
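
For reference, below is a minimal sketch of how a submission archive could be packaged. The YAML keys and values are placeholders only; the official README.yaml template linked above defines the required fields.

# Minimal sketch: write mos.scp and README.yaml, then bundle them into a zip.
# NOTE: the YAML keys/values are placeholders; follow the official template.
import zipfile
import yaml  # pip install pyyaml

predictions = {"fileid_0001": 4.031136, "fileid_0002": 3.545564}

with open("mos.scp", "w") as f:
    for utt_id, mos in predictions.items():
        f.write(f"{utt_id} {mos:.6f}\n")

readme = {
    "team_name": "my_team",                       # placeholder team information
    "affiliation": "My University",
    "team_members": ["Member A", "Member B"],
    "training_data": "BVCC, NISQA, urgent2025-sqa",
    "pretrained_models": "none",
}
with open("README.yaml", "w") as f:
    yaml.safe_dump(readme, f, sort_keys=False)

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("mos.scp")
    zf.write("README.yaml")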

Should you encounter any problem during the submission, please feel free to contact the organizers.

Ranking

Systems will be evaluated based on both correlation and error metrics between the predicted Mean Opinion Scores (MOS) and the ground-truth MOS labels. The following metrics will be used:

  • Mean Squared Error (MSE) – Measures the average squared difference between predicted and true scores.

  • Linear Correlation Coefficient (LCC) – Assesses the strength of the linear relationship between predicted and true scores.

  • Spearman’s Rank Correlation Coefficient (SRCC) and Kendall’s Tau (KTAU) – Evaluate the rank correlation between predicted and true scores. (A sketch of how these metrics can be computed is given after the table below.)

| Category | Metric | Value Range |
|---|---|---|
| Error | System-level MSE ↓ | [0, ∞) |
| Error | Utterance-level MSE ↓ | [0, ∞) |
| Linear Correlation | System-level LCC ↑ | [-1, 1] |
| Linear Correlation | Utterance-level LCC ↑ | [-1, 1] |
| Rank Correlation | System-level SRCC ↑ | [-1, 1] |
| Rank Correlation | Utterance-level SRCC ↑ | [-1, 1] |
| Rank Correlation | System-level KTAU ↑ | [-1, 1] |
| Rank Correlation | Utterance-level KTAU ↑ | [-1, 1] |
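
A minimal sketch of how these metrics can be computed with numpy/scipy is shown below. The variable names and toy data are illustrative (they are not taken from the official scoring code); system-level metrics are obtained by first averaging the predicted and ground-truth MOS per system.

# Minimal sketch: utterance- and system-level metrics with numpy/scipy.
# `pred` and `true` are aligned arrays of per-utterance MOS; `system_ids`
# holds the system label of each utterance (illustrative toy data).
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def mos_metrics(pred, true):
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return {
        "MSE": float(np.mean((pred - true) ** 2)),
        "LCC": pearsonr(pred, true)[0],
        "SRCC": spearmanr(pred, true)[0],
        "KTAU": kendalltau(pred, true)[0],
    }

pred = np.array([3.9, 4.2, 2.8, 3.1])                 # predicted MOS (toy example)
true = np.array([4.0, 4.5, 2.5, 3.0])                 # ground-truth MOS
system_ids = np.array(["sysA", "sysA", "sysB", "sysB"])

# Utterance-level metrics
utt_metrics = mos_metrics(pred, true)

# System-level metrics: average per system before computing the metrics
systems = sorted(set(system_ids))
sys_pred = [np.mean(pred[system_ids == sid]) for sid in systems]
sys_true = [np.mean(true[system_ids == sid]) for sid in systems]
sys_metrics = mos_metrics(sys_pred, sys_true)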


Overall ranking method

The overall ranking will be determined via the following procedure:

  1. Calculate the average score of each metric for each submission.
  2. Calculate the per-metric ranking based on the average score.
  3. Calculate the per-category ranking by averaging the rankings within each category.
  4. Calculate the overall ranking by averaging the per-category rankings.

# Ranking pseudocode. all_submissions, metric_categories (a dict mapping each
# category name to its list of metric functions), and get_ranking (which maps
# scores to ranks with 1 = best, taking the metric direction into account:
# lower is better for MSE and for averaged ranks, higher for correlations) are
# assumed to be defined elsewhere; a sketch of get_ranking is given below.
import numpy as np

# Step 1: Calculate the average score of each metric for each submission
scores = {}
for submission in all_submissions:
  scores[submission] = {}
  for category, metrics in metric_categories.items():
    for metric in metrics:
      scores[submission][metric] = np.mean(
        [metric(each_sample) for each_sample in submission]
      )

# Step 2: Calculate the per-metric ranking based on the average score
rank_per_metric = {}
rank_per_category = {}
for category, metrics in metric_categories.items():
  for metric in metrics:
    rank_per_metric[metric] = get_ranking(
      [scores[submission][metric] for submission in all_submissions]
    )

  # Step 3: Calculate the `per-category ranking` by averaging the per-metric
  # rankings within each category (and ranking the averages)
  rank_per_category[category] = get_ranking(
    np.mean([rank_per_metric[metric] for metric in metrics], axis=0)
  )

# Step 4: Calculate the overall ranking by averaging the `per-category rankings`
rank_overall = get_ranking(
  np.mean([rank_per_category[category] for category in metric_categories], axis=0)
)
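
For concreteness, get_ranking could be implemented with scipy.stats.rankdata as sketched below. This helper is illustrative only (it is not taken from the official scoring code), and the lower_is_better flag must be set according to the direction of the metric being ranked.

import numpy as np
from scipy.stats import rankdata

def get_ranking(values, lower_is_better=True):
    """Return ranks (1 = best) for a list of values, averaging ties.

    Use lower_is_better=True for error metrics (MSE) and for averaged
    ranks, and lower_is_better=False for LCC, SRCC, and KTAU.
    """
    values = np.asarray(values, dtype=float)
    # rankdata assigns rank 1 to the smallest value, so negate the values
    # when larger values are better.
    return rankdata(values if lower_is_better else -values, method="average")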