Dataocean AI Has Participated in Creating the Open-Source Dataset GigaSpeech 2: A Large-Scale and Multi-Domain ASR Corpus for Low-Resource Languages
Dataocean AI has collaborated with Shanghai Jiao Tong University, The Chinese University of Hong Kong, Tsinghua University, Pengcheng Lab, AISpeech, Birch AI, and Seasalt AI to successfully develop GigaSpeech 2. The development and test sets of GigaSpeech 2 are labeled by a professional team from Dataocean AI.
GigaSpeech 2 Overview
GigaSpeech 2 is an ever-expanding, large-scale, multi-domain, and multilingual speech recognition corpus designed to promote research and development in low-resource language speech recognition. GigaSpeech 2 raw contains about 30,000 hours of automatically transcribed audio covering Thai, Indonesian, and Vietnamese. After multiple rounds of refinement and iteration, GigaSpeech 2 refined offers about 10,000 hours of Thai, 6,000 hours of Indonesian, and 6,000 hours of Vietnamese. The Thai and Indonesian development and test sets, labeled by Dataocean AI, each contain 10 hours of audio. The team has also open-sourced multilingual speech recognition models trained on the GigaSpeech 2 data, which achieve performance comparable to commercial speech recognition services.
Dataset Construction
The construction process of GigaSpeech 2 has also been open-sourced. It is an automated pipeline for building large-scale speech recognition datasets from the vast amounts of unlabeled audio available on the internet, and it consists of four stages: data crawling, transcription, alignment, and refinement. Whisper first produces a preliminary transcription, TorchAudio then force-aligns the text to the audio, and multi-dimensional filtering of the aligned segments yields GigaSpeech 2 raw. The dataset is then refined with an improved Noisy Student Training (NST) method, which raises the quality of the pseudo-labels over repeated iterations and ultimately produces GigaSpeech 2 refined.
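The control flow of an iterative relabel-and-filter loop of this kind can be sketched roughly as follows. This is not the GigaSpeech 2 implementation: the function `refine_labels` and its `train`, `transcribe`, and `keep` callables are hypothetical placeholders, the filtering rule is left entirely to the caller, and the sketch omits the noise injection (for example, data augmentation) that a real NST setup applies when training the student model.

```python
from typing import Any, Callable, List, Tuple

# One training segment: (audio identifier, current transcript).
Segment = Tuple[str, str]


def refine_labels(
    segments: List[Segment],
    train: Callable[[List[Segment]], Any],
    transcribe: Callable[[Any, str], str],
    keep: Callable[[str, str], bool],
    iterations: int = 3,
) -> List[Segment]:
    """Iteratively relabel and filter pseudo-labeled segments.

    All three callables are hypothetical stand-ins:
      train(segments)        -> a model fitted on the current labels
      transcribe(model, id)  -> the model's transcript for one audio id
      keep(old, new)         -> True if the segment should be retained
    """
    current = list(segments)
    for _ in range(iterations):
        # Train a teacher on the labels produced by the previous round.
        model = train(current)
        relabeled: List[Segment] = []
        for audio_id, old_text in current:
            # The teacher's output becomes the new pseudo-label.
            new_text = transcribe(model, audio_id)
            if keep(old_text, new_text):
                relabeled.append((audio_id, new_text))
        current = relabeled
    return current


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any speech model.
    toy = [("clip-a", "hello"), ("clip-b", "wrold")]
    oracle = {"clip-a": "hello", "clip-b": "world"}
    result = refine_labels(
        toy,
        train=lambda data: oracle,
        transcribe=lambda model, audio_id: model[audio_id],
        keep=lambda old, new: old == new,
        iterations=2,
    )
    print(result)
```

Passing the filter in as a callable keeps the loop independent of any particular quality criterion, which the release does not specify.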
GigaSpeech 2 encompasses a wide range of thematic domains, including agriculture, art, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel. Additionally, it covers various content formats such as audiobooks, documentaries, lectures, monologues, movies and TV shows, news, interviews, and video blogs.
Training Set Details
GigaSpeech 2 offers a comprehensive and diverse training set, which is meticulously designed to support the development of robust and high-performing speech recognition models. The training set details are as follows:
- Thai: The raw version consists of 12,901.8 hours of speech data, while the refined version encompasses 10,262.0 hours.
- Indonesian: The raw data amounts to 8,112.9 hours, and the refined data comprises 5,714.0 hours.
- Vietnamese: The raw dataset includes 7,324.0 hours of speech recordings, with the refined dataset totaling 6,039.0 hours.
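As a quick check on the figures listed above, the short script below totals the raw and refined hours and computes the share of raw audio that survives refinement for each language. The hour values come from the list; the names `HOURS` and `summarize` are illustrative only.

```python
# (raw hours, refined hours) per language, taken from the list above.
HOURS = {
    "Thai": (12901.8, 10262.0),
    "Indonesian": (8112.9, 5714.0),
    "Vietnamese": (7324.0, 6039.0),
}


def summarize(hours):
    """Return total raw hours, total refined hours, and per-language retention."""
    raw_total = sum(raw for raw, _ in hours.values())
    refined_total = sum(refined for _, refined in hours.values())
    retention = {lang: refined / raw for lang, (raw, refined) in hours.items()}
    return raw_total, refined_total, retention


raw_total, refined_total, retention = summarize(HOURS)
print(f"raw total:     {raw_total:.1f} hours")
print(f"refined total: {refined_total:.1f} hours")
for lang, share in retention.items():
    print(f"{lang}: {share:.1%} of raw hours retained")
```

The totals come to 28,338.7 raw hours and 22,015.0 refined hours, which matches the rounded figures of roughly 30,000 and 22,000 hours given in the overview.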
Development and Test Set Details
Ke Li, Dataocean AI’s COO and one of the paper's authors, led the GigaSpeech 2 test set project. Drawing on nearly 20 years of project experience, the team labeled the Thai and Indonesian sets with a word accuracy of over 97%. Beyond these two Southeast Asian languages, Dataocean AI’s team covers more than 200 languages and dialects around the world. The company also offers 1,600+ high-quality off-the-shelf datasets for scenarios such as generative AI, autonomous driving, smart home, and customer service, meeting the evolving needs of the AI industry.
Experimental Results
We compared speech recognition models trained on the GigaSpeech 2 dataset against industry-leading models, including OpenAI Whisper (large-v3, large-v2, base), Meta MMS L1107, Azure Speech CLI 1.37.0, and Google USM Chirp v2. The comparison covered Thai, Indonesian, and Vietnamese and used three test sets: GigaSpeech 2, Common Voice 17.0, and FLEURS, with Character Error Rate (CER) or Word Error Rate (WER) as the metric. The results indicate:
- Thai: Our model surpassed all competitors, including the commercial interfaces from Microsoft and Google. Notably, it achieved this result with only one-tenth as many parameters as Whisper large-v3.
- Indonesian and Vietnamese: Our system performed competitively with the existing baseline models in both languages.
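For readers unfamiliar with the two metrics named above, here is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. It is a generic illustration, not the evaluation code used for GigaSpeech 2. The function names, the whitespace tokenization, and the choice to ignore spaces in CER are assumptions, and real evaluations usually apply text normalization first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance, spaces ignored."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One of four reference words is missing from the hypothesis.
print(wer("saya suka makan nasi", "saya suka nasi"))  # 0.25
# One of four reference characters is substituted.
print(cer("abcd", "abed"))  # 0.25
```

CER is generally the more natural choice for scripts such as Thai, which are written without spaces between words, while WER suits languages with whitespace-delimited words.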
Resource Links
The GigaSpeech 2 dataset is now available for download:
https://huggingface.co/datasets/speechcolab/gigaspeech2
The automated process for constructing large-scale speech recognition datasets is available at:
https://github.com/SpeechColab/GigaSpeech2
The preprint paper is available at:
https://arxiv.org/pdf/2406.11546
Dataocean AI website:
https://www.dataoceanai.com
View source version on businesswire.com: https://www.businesswire.com/news/home/20240924609911/en/
Contacts
contact@dataoceanai.com