Apr 26, 2022
How to Start Your Machine Learning Projects with MagicData-RAMC?

As a collection of high-quality, richly annotated training data, MagicData-RAMC (available as a free download at magichub.com) lends itself to a wide range of research. This article introduces three experiments on speech recognition, speaker diarization, and keyword search based on MagicData-RAMC, conducted by Magic Data together with the Institute of Acoustics of the Chinese Academy of Sciences, Shanghai Jiao Tong University, and Northwestern Polytechnical University.


1. Automatic Speech Recognition

The research team uses a Conformer-based end-to-end (E2E) model implemented with the ESPnet2 toolkit for speech recognition. The model is composed of a Conformer encoder and a Transformer decoder: 14 Conformer blocks in the encoder and 6 Transformer blocks in the decoder. Connectionist temporal classification (CTC) is employed on top of the encoder as an auxiliary task to perform joint CTC-attention (CTC/Att) training and decoding within the multi-task learning framework. During beam search decoding, the beam size is set to 10.
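Joint CTC/Att decoding ranks each beam-search hypothesis by a weighted sum of the two log-probabilities. A minimal sketch, where the decoding weight and the per-hypothesis scores are illustrative assumptions rather than values from the experiment:

```python
def joint_score(log_p_ctc: float, log_p_att: float, ctc_weight: float = 0.3) -> float:
    """Combine CTC and attention-decoder log-probabilities for one hypothesis.

    score = w * log P_ctc + (1 - w) * log P_att
    """
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

# Rank two illustrative partial hypotheses during beam search.
hyps = {"hyp_a": (-4.2, -3.1), "hyp_b": (-3.8, -3.6)}
scored = {h: joint_score(c, a) for h, (c, a) in hyps.items()}
best = max(scored, key=scored.get)
```

The 0.3 weight here simply reuses the training-time CTC weight; the weight actually used at decoding time is not stated in the text.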

The research team computes 80-dimensional logarithmic filterbank features from the raw waveform and applies utterance-wise mean-variance normalization. The frame length is 25 ms with a stride of 10 ms. SpecAugment is applied for data augmentation with 2 frequency masks and 2 time masks, whose maximum widths are F = 30 and T = 40 respectively. The input sequence is sub-sampled by a factor of 4 through a 2D convolutional layer. The inner dimension of the position-wise feed-forward networks in both encoder and decoder is 2048. Dropout is applied in the encoder layers with a rate of 0.1, and the label smoothing weight is set to 0.1 for regularization.
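The SpecAugment step above can be sketched in a few lines. The mask counts and maximum widths follow the text; the details of the masking policy (uniform widths, zero fill) are common-practice assumptions:

```python
import numpy as np

def spec_augment(feats: np.ndarray, num_freq_masks: int = 2, num_time_masks: int = 2,
                 max_f: int = 30, max_t: int = 40, rng=None) -> np.ndarray:
    """Zero out random frequency and time bands of a (frames, bins) feature matrix."""
    rng = rng or np.random.default_rng()
    out = feats.copy()
    n_frames, n_bins = out.shape
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, max_f + 1))        # mask width in [0, F]
        f0 = int(rng.integers(0, n_bins - f + 1))  # mask start bin
        out[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = int(rng.integers(0, max_t + 1))        # mask width in [0, T]
        t0 = int(rng.integers(0, n_frames - t + 1))
        out[t0:t0 + t, :] = 0.0
    return out

# 3 seconds of 80-dim log-filterbank frames (10 ms stride -> 300 frames).
masked = spec_augment(np.random.randn(300, 80))
```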

The multi-head attention layer contains 8 heads with 256 attention transformation dimensions. The CTC weight used for multi-task training is 0.3. The model is trained with the Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 10⁻⁹), a gradient clipping norm of 5 and no weight decay. The Noam learning rate scheduler is used with 25000 warm-up steps. The maximum number of training epochs is 100. The final model is obtained by averaging the last 5 checkpoints.
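The Noam schedule mentioned above warms the learning rate up linearly for 25000 steps and then decays it with the inverse square root of the step count. A sketch; the overall scale factor is an assumption, and the 256-dimensional model size is taken from the attention dimension above:

```python
def noam_lr(step: int, model_dim: int = 256, warmup: int = 25000, factor: float = 1.0) -> float:
    """Noam schedule: linear warm-up followed by inverse-square-root decay.

    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    """
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The learning rate peaks exactly at the warm-up boundary, then decays.
lrs = [noam_lr(s) for s in (1, 12500, 25000, 100000)]
```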

The training set is made up of two parts: the 150-hour training set of MagicData-RAMC and the 755-hour MAGICDATA Mandarin Chinese Read Speech Corpus (MAGICDATAREAD). Together they provide over 900 hours of training data. The experimental results are reported in terms of character error rate (CER) in Table 1.
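Character error rate itself is just the Levenshtein distance between hypothesis and reference, divided by the reference length in characters. A self-contained sketch for scoring a hypothesis against a reference transcript:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] holds the edit distance for prefixes
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / m if m else 0.0

# One substitution over six characters -> CER of 1/6.
score = cer("今天天气不错", "今天天气很错")
```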

Table 1: ASR Results (CER%) on dev and test set

2. Speaker Diarization

For the SD task, the baseline system consists of three components: speaker activity detection (SAD), a speaker embedding extractor, and clustering.

Following the Variational Bayes HMM x-vectors (VBx) experimental setting, a Kaldi-based SAD module is used to detect speech activity. The research team trains a ResNet on the VoxCeleb dataset (openslr-49), the CN-Celeb corpus (openslr-82) and the split training set of MagicData-RAMC to obtain the speaker embedding extractor.

For training details, the SAD module takes 40-dimensional Mel-frequency cepstral coefficients (MFCCs) with a 25 ms frame length and 10 ms stride as input features to detect speech activity. A ResNet-101 with two fully connected layers performs the speaker classification task on 64-dimensional filterbank features extracted every 10 ms with a 25 ms window, and additive margin softmax is used to obtain a more distinct decision boundary. The raw waveform is split into 4 s segments (400 frames) to form the ResNet input. The speaker embedding network is trained with a stochastic gradient descent (SGD) optimizer using a momentum factor of 0.9 and an L2 regularization factor of 0.0001.
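Additive margin softmax tightens the decision boundary by subtracting a fixed margin from the true speaker's cosine similarity before scaling. A sketch on random data; the margin and scale values here are common defaults, not values reported in the text:

```python
import numpy as np

def am_softmax_logits(embeddings: np.ndarray, weights: np.ndarray, labels: np.ndarray,
                      margin: float = 0.2, scale: float = 30.0) -> np.ndarray:
    """Compute additive-margin-softmax logits for a batch of embeddings."""
    # L2-normalise embeddings and class weights so logits are cosine similarities.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                    # (batch, num_speakers)
    cos[np.arange(len(labels)), labels] -= margin    # penalise the true class only
    return scale * cos

rng = np.random.default_rng(0)
logits = am_softmax_logits(rng.normal(size=(4, 64)),   # 4 chunk embeddings
                           rng.normal(size=(10, 64)),  # 10 speaker prototypes
                           np.array([0, 1, 2, 3]))
```

The margin only shifts the target-class logit, so the model must beat competing speakers by a fixed cosine gap to classify correctly.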

The 256-dimensional embeddings are then reduced to 128 dimensions using probabilistic linear discriminant analysis (PLDA). Embeddings are extracted on the SAD output every 240 ms with a chunk length of 1.5 s. For the clustering part, the research team uses a Variational Bayes HMM [6] on this task: an agglomerative hierarchical clustering algorithm combined with VBx produces the clustering result. In VBx, the acoustic scaling factor Fa is set to 0.3, the speaker regularization coefficient is set to 17, and the probability of not switching speakers between frames is 0.99. The experimental results are shown in Table 2.
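The agglomerative hierarchical clustering stage can be sketched with SciPy. The cosine metric, average linkage, and stopping threshold are illustrative choices, and in the actual system the AHC output initializes the VBx refinement rather than being the final result:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def ahc_labels(embeddings: np.ndarray, threshold: float) -> np.ndarray:
    """Cluster chunk-level speaker embeddings with agglomerative hierarchical
    clustering on cosine distance; the labels would seed the VBx refinement."""
    dists = pdist(embeddings, metric="cosine")        # condensed distance matrix
    tree = linkage(dists, method="average")           # average-linkage merge tree
    return fcluster(tree, t=threshold, criterion="distance")

# Two well-separated illustrative "speakers" in a toy 16-dim embedding space.
rng = np.random.default_rng(0)
spk_a = rng.normal(size=(5, 16)) + np.eye(16)[0] * 10.0
spk_b = rng.normal(size=(5, 16)) + np.eye(16)[1] * 10.0
labels = ahc_labels(np.vstack([spk_a, spk_b]), threshold=0.5)
```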

Table 2: Speaker diarization results of VBx system

3. Keyword Search

The research team carries out the KWS task following the DTA Att-E2E-KWS approach, which relies on an attention-based E2E ASR framework and frame-synchronous phoneme alignments. The KWS system is built on the Conformer-based E2E ASR system described in Section 1. The dynamic time alignment (DTA) algorithm connects a frame-wise phoneme classifier's output posteriors with the label-wise ASR result candidates to generate accurate time alignments and reliable confidence scores for recognized words. Keyword occurrences are retrieved within the N-best hypotheses generated in the joint CTC/Att decoding process.
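The alignment step matches the phoneme classifier's frame-wise posteriors against a hypothesized label sequence. A heavily simplified stand-in using a left-to-right, no-skip Viterbi-style dynamic program (the real DTA algorithm differs in its details):

```python
import numpy as np

def align_labels(log_post: np.ndarray, labels: list) -> list:
    """Monotonically align a phoneme label sequence to (frames, phones)
    log-posteriors; returns one (start_frame, end_frame) span per label."""
    T, _ = log_post.shape
    L = len(labels)
    score = np.full((T, L), -np.inf)
    back = np.zeros((T, L), dtype=int)  # 0 = stay on label, 1 = advance
    score[0, 0] = log_post[0, labels[0]]
    for t in range(1, T):
        for l in range(min(t + 1, L)):
            stay = score[t - 1, l]
            adv = score[t - 1, l - 1] if l > 0 else -np.inf
            back[t, l] = int(adv > stay)
            score[t, l] = max(stay, adv) + log_post[t, labels[l]]
    # Backtrace to recover per-label frame spans.
    spans, l, end = [], L - 1, T - 1
    for t in range(T - 1, 0, -1):
        if back[t, l]:
            spans.append((t, end))
            end, l = t - 1, l - 1
    spans.append((0, end))
    return spans[::-1]

# Toy posteriors over 2 phones: phone 0 dominates frames 0-1, phone 1 frames 2-3.
log_post = np.log(np.array([[0.9, 0.05], [0.9, 0.05], [0.05, 0.9], [0.05, 0.9]]))
spans = align_labels(log_post, [0, 1])
```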

The keyword list is built by picking 200 words from the dev set and is provided together with the dataset. In the DTA Att-E2E-KWS system, the frame-wise phoneme classifier shares 12 Conformer blocks with the E2E ASR encoder while keeping the top 2 Conformer blocks unshared. The classifier outputs posteriors over 61 phonemes, including silence and noise. The KWS system is optimized following the setup in Section 1. During inference, keywords are retrieved within the ASR 2-best hypotheses. During KWS scoring, a predicted keyword occurrence is considered correct when there is at least 50% time overlap between the predicted occurrence and a reference occurrence of the same keyword. The results are shown in Table 3.
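The 50% time-overlap criterion can be checked per predicted occurrence. The denominator of the overlap ratio is an assumption here (the shorter of the two spans), since the text only specifies "at least 50% time overlap":

```python
def is_hit(pred: tuple, ref: tuple, min_overlap: float = 0.5) -> bool:
    """Check whether a predicted keyword occurrence (start, end) in seconds
    overlaps a reference occurrence by at least min_overlap of the shorter span."""
    overlap = min(pred[1], ref[1]) - max(pred[0], ref[0])  # negative if disjoint
    shorter = min(pred[1] - pred[0], ref[1] - ref[0])
    return overlap >= min_overlap * shorter

hit = is_hit((1.00, 1.60), (1.10, 1.70))  # 0.5 s overlap on 0.6 s spans
```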

Table 3: Results on dev and test set for the Conformer-based DTA Att-E2E-KWS system

We hope this introduction is helpful for your further studies and that MagicData-RAMC can facilitate a wide variety of speech-related research.

You are welcome to send any insights or comments to open@magicdatatech.com.
