WenetSpeech-Yue: A Large-Scale Cantonese Speech Corpus with Multi-dimensional Annotation
Longhao Li1, Zhao Guo1, Hongjie Chen2, Yuhang Dai1, Ziyu Zhang1, Hongfei Xue1, Tianlun Zuo1, Chengyou Wang1, Shuiyuan Wang1, Xin Xu3, Hui Bu3, Jie Li2, Jian Kang2, Binbin Zhang4, Ruibin Yuan5, Ziya Zhou5, Wei Xue5, Lei Xie1
1 ASLP, Northwest Polytechnical University
2 Institute of Artificial Intelligence (TeleAI), China Telecom
3 Beijing AISHELL Technology Co., Ltd.
4 WeNet Open Source Community
5 Hong Kong University of Science and Technology
📑 Paper | 🐙 GitHub | 🤗 HuggingFace | 🖥️ HuggingFace Space | 🎤 Demo Page | 💬 Contact Us
Abstract
The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among speech-processing tasks, ASR and TTS are the most established and fundamental. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Post-Processing, and Recognizer Output Voting, enabling rich and high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline. The dataset, benchmark, and the ASR and TTS models built upon WenetSpeech-Yue will be open-sourced. Demos can be found in the supplementary material.
WenetSpeech-Pipe
WenetSpeech-Pipe is an automated pipeline specifically designed for building large-scale Cantonese datasets with multi-dimensional annotation. It consists of six components: (A) Audio Collection, (B) Speaker Attributes Annotation, (C) Speech Quality Annotation, (D) Automatic Speech Recognition, (E) Text Post-Processing, and (F) Recognizer Output Voting. The figure below provides an overview of WenetSpeech-Pipe.
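Recognizer Output Voting fuses the hypotheses of several ASR systems into a consensus transcript with an associated confidence score. The sketch below is a simplified, character-level illustration of the idea, not the pipeline's actual implementation: each hypothesis is aligned to an anchor via edit-based alignment, positions are majority-voted, and the vote agreement serves as a rough text confidence.

```python
from collections import Counter
from difflib import SequenceMatcher

def align_to_anchor(anchor: str, hyp: str) -> list[str]:
    """Map each character position of `anchor` to the corresponding
    character in `hyp` ('' where `hyp` has no counterpart)."""
    aligned = [""] * len(anchor)
    sm = SequenceMatcher(a=anchor, b=hyp, autojunk=False)
    for tag, a0, a1, b0, b1 in sm.get_opcodes():
        if tag in ("equal", "replace"):
            for i, j in zip(range(a0, a1), range(b0, b1)):
                aligned[i] = hyp[j]
        # 'delete': anchor chars keep ''; 'insert': extra hyp chars are
        # dropped in this simplified sketch.
    return aligned

def rover_vote(hyps: list[str]) -> tuple[str, float]:
    """Majority-vote a consensus transcript; return it with an
    agreement-based confidence in [0, 1]."""
    anchor = hyps[0]
    columns = list(zip(*[align_to_anchor(anchor, h) for h in hyps]))
    consensus, agree = [], 0.0
    for col in columns:
        ch, votes = Counter(col).most_common(1)[0]
        if ch:  # skip positions voted as deletions
            consensus.append(ch)
        agree += votes / len(hyps)
    return "".join(consensus), agree / max(len(columns), 1)

hyps = ["今日天氣好好", "今日天氣好咗", "今日天气好好"]
text, conf = rover_vote(hyps)
print(text, round(conf, 2))  # consensus transcript and rough confidence
```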
WenetSpeech-Yue
- Contains 21,800 hours of Cantonese speech with rich annotations, making it the largest open-source resource for Cantonese speech research.
- Stores all metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps; additional metadata tags may be added in the future (see the illustrative entry after this list).
- Covers ten domains: Storytelling, Entertainment, Drama, Culture, Vlog, Commentary, Education, Podcast, News, and Others.
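As a rough illustration of the per-utterance annotation, the snippet below builds and prints a metadata entry with the fields listed above. All field names and values here are hypothetical; consult the released JSON for the actual schema.

```python
import json

# Hypothetical entry; the released JSON may use different field names.
example = {
    "audio_path": "audios/podcast/example_0001.wav",
    "duration": 6.42,            # seconds
    "text": "今日天氣好好",
    "confidence": 0.93,          # text confidence from recognizer voting
    "speaker": "spk_00123",
    "age": "adult",
    "gender": "female",
    "snr": 28.4,                 # dB
    "dnsmos": 3.51,              # speech quality score
    "timestamp": [["今", 0.12, 0.30], ["日", 0.30, 0.48]],  # char-level, truncated
}
print(json.dumps(example, ensure_ascii=False, indent=2))
```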
Dataset Overview
(Figure: overview of the WenetSpeech-Yue dataset.)
Data Samples
| Domain | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| Storytelling | 两只小企鹅都有嘢食 | 刘备仲马鞭一指蜀兵一齐掩杀过去打到吴兵大败唉刘备八路兵马以雷霆万钧之势啊杀到吴兵啊尸横遍野血流成河 | 第日长老就请咗成班盗木师傅嚟啦 |
| Entertainment | 叫做诶诶直入式你个脑部里边咧记得呢一个嘅以前香港有一个广告好出名嘅佢乜嘢都冇噶净系影住喺弥敦道佢哋间铺头嘅啫但系就不停有人嗌啦平平吧平吧 | 原来王力宏咧系佢家中里面咧成就最低个吓哇 | 大约系八分钟之后或者十分钟啦佢就会开始滚翻起 |
| Drama | 忽然从光线死角嘅阴影度窜出一只大猫 | 无论你提出任何嘅要求 | 跟住佢就坐低马双手运力 |
| Vlog | 今日我带大家去见识一位九零后嘅靓仔咧 | 咁咁多样材料咁我哋首先第一步处理咗一件 | 咁咧当然睇月路咧我哋要睇 |
| Commentary | 香港嘅消费市场从此不一样 | 啲点样对于佢哋嘅服务态度啊不透过呢一年左右嘅时间啦其实大家都静一静啦咁你就会见到香港嘅经济其实 | 二零零三年中央政府推出个人游计划挽救沙士后一蹶不振的香港经济 |
ASR Leaderboard
All values are recognition error rates (%); lower is better. Dialogue and Reading are in-house test sets; yue, HK, MDCC, Daily_Use, and Commands are open-source test sets; Short and Long are the WSYue-ASR-eval subsets. ⭐ marks models trained on WenetSpeech-Yue.

| Model | #Params (M) | Dialogue | Reading | yue | HK | MDCC | Daily_Use | Commands | Short | Long |
|---|---|---|---|---|---|---|---|---|---|---|
| **w/o LLM** | | | | | | | | | | |
| Conformer-Yue⭐ | 130 | 16.57 | 7.82 | 7.72 | 11.42 | 5.73 | 5.73 | 8.97 | 5.05 | 8.89 |
| Paraformer | 220 | 83.22 | 51.97 | 70.16 | 68.49 | 47.67 | 79.31 | 69.32 | 73.64 | 89.00 |
| SenseVoice-small | 234 | 21.08 | 6.52 | 8.05 | 7.34 | 6.34 | 5.74 | 6.65 | 6.69 | 9.95 |
| SenseVoice-s-Yue⭐ | 234 | 19.19 | 6.71 | 6.87 | 8.68 | 5.43 | 5.24 | 6.93 | 5.23 | 8.63 |
| Dolphin-small | 372 | 59.20 | 7.38 | 39.69 | 51.29 | 26.39 | 7.21 | 9.68 | 32.32 | 58.20 |
| TeleASR | 700 | 37.18 | 7.27 | 7.02 | 7.88 | 6.25 | 8.02 | 5.98 | 6.23 | 11.33 |
| Whisper-medium | 769 | 75.50 | 68.69 | 59.44 | 62.50 | 62.31 | 64.41 | 80.41 | 80.82 | 50.96 |
| Whisper-m-Yue⭐ | 769 | 18.69 | 6.86 | 6.86 | 11.03 | 5.49 | 4.70 | 8.51 | 5.05 | 8.05 |
| FireRedASR-AED-L | 1100 | 73.70 | 18.72 | 43.93 | 43.33 | 34.53 | 48.05 | 49.99 | 55.37 | 50.26 |
| Whisper-large-v3 | 1550 | 45.09 | 15.46 | 12.85 | 16.36 | 14.63 | 17.84 | 20.70 | 12.95 | 26.86 |
| **w/ LLM** | | | | | | | | | | |
| Qwen2.5-Omni-3B | 3000 | 72.01 | 7.49 | 12.59 | 11.75 | 38.91 | 10.59 | 25.78 | 67.95 | 88.46 |
| Kimi-Audio | 7000 | 68.65 | 24.34 | 40.90 | 38.72 | 30.72 | 44.29 | 45.54 | 50.86 | 33.49 |
| FireRedASR-LLM-L | 8300 | 73.70 | 18.72 | 43.93 | 43.33 | 34.53 | 48.05 | 49.99 | 49.87 | 45.92 |
| Conformer-LLM-Yue⭐ | 4200 | 17.22 | 6.21 | 6.23 | 9.52 | 4.35 | 4.57 | 6.98 | 4.73 | 7.91 |
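For reproducing error-rate numbers like those above, a character-level error rate on Cantonese transcripts can be computed with the jiwer package. This is a minimal sketch; the official scoring script may apply additional text normalization (e.g. for code-switched English).

```python
import jiwer  # pip install jiwer

refs = ["今日天氣好好", "佢哋去咗香港"]
hyps = ["今日天气好好", "佢地去咗香港"]

# jiwer.cer scores at the character level, which suits space-free scripts.
print(f"CER: {jiwer.cer(refs, hyps) * 100:.2f}%")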
TTS Demo
Audio samples are available on the Demo Page linked above.
TTS System Performance Comparison
Evaluation Overview
- The table below presents both objective and subjective evaluation results of different TTS systems on the WSYue-TTS-eval benchmark. Objective metrics include Mixed Error Rate (MER) and speaker similarity (SIM) on both the Base and Coverage test sets. Subjective metrics include UTMOSv2, Intelligibility MOS (I-MOS), Speaker Similarity MOS (S-MOS), and Audio Naturalness MOS (A-MOS).
- Llasa-1B-Yue is our model trained on large-scale Cantonese data; it achieves the best scores on all listener-rated metrics (I-MOS, S-MOS, A-MOS).
| Model | Base MER (%) | Base SIM | Coverage MER (%) | Coverage SIM | UTMOSv2 | I-MOS | S-MOS | A-MOS |
|---|---|---|---|---|---|---|---|---|
| Llasa-1B | 53.31 | 0.732 | 43.68 | 0.754 | 2.360 | 2.60 ± 1.01 | 3.05 ± 0.87 | 2.32 ± 0.98 |
| Step-Audio-TTS-3B | 27.79 | 0.762 | 24.25 | 0.781 | 2.496 | 3.22 ± 0.70 | 3.14 ± 0.58 | 2.82 ± 0.69 |
| CosyVoice2 | 14.38 | 0.812 | 13.74 | 0.826 | 2.989 | 3.72 ± 0.50 | 3.52 ± 0.36 | 3.22 ± 0.60 |
| Edge-TTS† | 8.30 | - | 9.27 | - | 2.997 | 4.12 ± 0.28 | - | 3.48 ± 0.56 |
| Llasa-1B-Yue | 10.89 | 0.762 | 12.78 | 0.772 | 2.696 | 4.30 ± 0.23 | 4.11 ± 0.37 | 4.34 ± 0.34 |
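Speaker similarity (SIM) is commonly computed as the cosine similarity between speaker embeddings of the synthesized utterance and the reference speaker prompt. Below is a minimal sketch assuming the off-the-shelf Resemblyzer encoder; the benchmark may use a different speaker model, so treat the absolute numbers as illustrative.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

encoder = VoiceEncoder()

def speaker_sim(ref_wav_path: str, syn_wav_path: str) -> float:
    """Cosine similarity between speaker embeddings of two utterances."""
    ref = encoder.embed_utterance(preprocess_wav(ref_wav_path))
    syn = encoder.embed_utterance(preprocess_wav(syn_wav_path))
    return float(np.dot(ref, syn) / (np.linalg.norm(ref) * np.linalg.norm(syn)))

# Hypothetical file names for illustration only.
print(speaker_sim("prompt.wav", "synthesized.wav"))
```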