Dataocean AI Has Participated in Creating the Open-Source Dataset GigaSpeech 2: A Large-Scale and Multi-Domain ASR Corpus for Low-Resource Languages

24.9.2024 22:00:00 EEST | Business Wire | Press release

Dataocean AI has collaborated with Shanghai Jiao Tong University, The Chinese University of Hong Kong, Tsinghua University, Pengcheng Lab, AISpeech, Birch AI, and Seasalt AI to successfully develop GigaSpeech 2. The development and test sets of GigaSpeech 2 are labeled by a professional team from Dataocean AI.

This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20240924609911/en/

GigaSpeech 2 Overview

GigaSpeech 2 is an ever-expanding, large-scale, multi-domain, and multilingual speech recognition corpus designed to promote research and development in low-resource language speech recognition. GigaSpeech 2 raw contains about 30,000 hours of automatically transcribed audio covering Thai, Indonesian, and Vietnamese. After multiple rounds of refinement and iteration, GigaSpeech 2 refined offers about 10,000 hours of Thai, 6,000 hours of Indonesian, and 6,000 hours of Vietnamese. The development and test sets for Thai and Indonesian, labeled by Dataocean AI, each contain 10 hours of audio. The team has also open-sourced multilingual speech recognition models trained on the GigaSpeech 2 data, achieving performance comparable to commercial speech recognition services.

Dataset Construction

The construction process of GigaSpeech 2 has also been open-sourced. It is an automated pipeline for building large-scale speech recognition datasets from the vast amounts of unlabeled audio available on the internet, and it involves data crawling, transcription, alignment, and refinement. Whisper first produces a preliminary transcription, TorchAudio then performs forced alignment, and multi-dimensional filtering of the result yields GigaSpeech 2 raw. The dataset is then refined iteratively using an improved Noisy Student Training (NST) method, which raises the quality of the pseudo-labels over repeated iterations and ultimately produces GigaSpeech 2 refined.
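The order of the stages described above can be pictured with a small Python sketch. It is illustrative only and is not taken from the GigaSpeech 2 repository: the `Segment` type, the function names, the score threshold, and the toy stand-ins for transcription, alignment, training, and relabeling are all assumptions. Real model training (including the noise applied to the student in NST) is not modeled; the sketch shows only the transcribe, align, filter, and repeated relabel-and-filter flow.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Segment:
    """Hypothetical aligned segment: audio id, pseudo-label, quality score."""
    audio_id: str
    text: str
    score: float  # assumed confidence in [0, 1]


def build_raw(audio_ids: Iterable[str],
              transcribe: Callable[[str], str],
              align: Callable[[str, str], List[Segment]],
              min_score: float) -> List[Segment]:
    """Transcribe each recording, align it into segments, then filter."""
    segments: List[Segment] = []
    for audio_id in audio_ids:
        text = transcribe(audio_id)             # preliminary transcription
        segments.extend(align(audio_id, text))  # forced alignment
    return [s for s in segments if s.score >= min_score]


def refine(segments: List[Segment],
           train: Callable[[List[Segment]], object],
           relabel: Callable[[object, Segment], Segment],
           keep: Callable[[Segment], bool],
           rounds: int) -> List[Segment]:
    """Repeat: train on current labels, relabel every segment, filter."""
    current = list(segments)
    for _ in range(rounds):
        model = train(current)
        relabeled = [relabel(model, s) for s in current]
        current = [s for s in relabeled if keep(s)]
    return current


# Toy stand-ins so the sketch runs without any speech model.
def toy_align(audio_id: str, text: str) -> List[Segment]:
    return [Segment(audio_id, text, 0.9), Segment(audio_id, text, 0.2)]


raw = build_raw(["a", "b"], lambda a: f"draft {a}", toy_align, min_score=0.5)
refined = refine(
    raw,
    train=lambda segs: len(segs),  # placeholder "model"
    relabel=lambda model, s: replace(s, text=s.text.upper()),
    keep=lambda s: s.score >= 0.5,
    rounds=2,
)
print(len(raw), [s.text for s in refined])
```

Passing each stage in as a callable keeps the control flow visible while leaving the actual tools (Whisper, TorchAudio, the trained models) outside the sketch.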

GigaSpeech 2 encompasses a wide range of thematic domains, including agriculture, art, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel. Additionally, it covers various content formats such as audiobooks, documentaries, lectures, monologues, movies and TV shows, news, interviews, and video blogs.

Training Set Details

GigaSpeech 2 offers a comprehensive and diverse training set, which is meticulously designed to support the development of robust and high-performing speech recognition models. The training set details are as follows:

- Thai: The raw version consists of 12,901.8 hours of speech data, while the refined version encompasses 10,262.0 hours.
- Indonesian: The raw data amounts to 8,112.9 hours, and the refined data comprises 5,714.0 hours.
- Vietnamese: The raw dataset includes 7,324.0 hours of speech recordings, with the refined dataset totaling 6,039.0 hours.
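As a quick check on how much audio survives refinement, the per-language figures above can be summed and compared. The short Python sketch below uses only the hours listed in this section; the variable names are illustrative.

```python
# Hours of training audio per language, taken from the list above.
hours = {
    "Thai": {"raw": 12901.8, "refined": 10262.0},
    "Indonesian": {"raw": 8112.9, "refined": 5714.0},
    "Vietnamese": {"raw": 7324.0, "refined": 6039.0},
}

# Totals across the three languages.
total_raw = sum(h["raw"] for h in hours.values())
total_refined = sum(h["refined"] for h in hours.values())

# Fraction of raw audio retained after refinement, per language.
retained = {lang: h["refined"] / h["raw"] for lang, h in hours.items()}

for lang, frac in retained.items():
    print(f"{lang}: {frac:.1%} of raw hours retained")
print(f"Total: {total_raw:.1f} h raw, {total_refined:.1f} h refined "
      f"({total_refined / total_raw:.1%} retained)")
```

The raw total comes to roughly 28,300 hours and the refined total to roughly 22,000 hours, consistent with the rounded figures given in the overview.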

Development and Test Set Details

Ke Li, Dataocean AI’s COO and one of the paper's authors, led the GigaSpeech 2 test set project. Drawing on nearly 20 years of project experience, the team labeled the Thai and Indonesian sets with word accuracy of over 97%. Beyond these two Southeast Asian languages, Dataocean AI’s team covers more than 200 languages and dialects around the world. The company offers 1,600+ high-quality off-the-shelf datasets applicable to scenarios such as generative AI, autonomous driving, smart home, and customer service, meeting the evolving needs of the AI industry.

Experimental Results

We conducted a comparative evaluation of speech recognition models trained on the GigaSpeech 2 dataset against industry-leading models, including OpenAI Whisper (large-v3, large-v2, base), Meta MMS L1107, Azure Speech CLI 1.37.0, and Google USM Chirp v2. The comparison covered Thai, Indonesian, and Vietnamese. Performance was evaluated on three test sets (GigaSpeech 2, Common Voice 17.0, and FLEURS), using Character Error Rate (CER) or Word Error Rate (WER) as the metric. The results indicate:

Thai: Our model surpassed all competitors, including the commercial interfaces from Microsoft and Google. Notably, it achieved this result with only one-tenth of the parameters of Whisper large-v3.

Indonesian and Vietnamese: Our system was competitive with the existing baseline models in both languages.
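For readers unfamiliar with the two metrics named above, the following Python sketch shows how WER and CER are commonly computed from the edit (Levenshtein) distance between a reference and a hypothesis transcript. It is a generic illustration, not the evaluation code used for these results; text normalization is omitted, and stripping spaces before computing CER is an assumption of this sketch. CER is often preferred for scripts such as Thai, which is written without spaces between words.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(wer("a b c d", "a x c"))  # one substitution and one deletion over 4 words
print(cer("abcd", "abxd"))      # one substitution over 4 characters
```

Both metrics are error rates, so lower values indicate better recognition.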

Resource Links

The GigaSpeech 2 dataset is now available for download:
https://huggingface.co/datasets/speechcolab/gigaspeech2

The automated process for constructing large-scale speech recognition datasets is available at:
https://github.com/SpeechColab/GigaSpeech2

The preprint paper is available at:
https://arxiv.org/pdf/2406.11546

Dataocean AI website:
https://www.dataoceanai.com

View source version on businesswire.com: https://www.businesswire.com/news/home/20240924609911/en/

Contacts

contact@dataoceanai.com
