A Deep Dive into Automatic Speech Recognition Technology


ASR, or automatic speech recognition, is a technology that aims to convert spoken utterances into a textual representation such as words, syllables, or phonemes. Speech recognition involves three models: the lexicon model, which describes how words are pronounced; the acoustic model, which analyzes speech patterns; and the language model, which predicts word sequences. These models work together during decoding to produce accurate transcriptions of spoken language.
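
As a rough sketch of how decoding combines these models, the toy Python snippet below scores two made-up hypotheses by adding the acoustic log-probability to a weighted language-model log-probability and keeps the best one. The hypotheses, scores, and weight are invented for illustration; a real decoder searches a huge space of such hypotheses.

```python
# Toy hypotheses with made-up log-probabilities:
# acoustic_score ~ log P(audio | words), lm_score ~ log P(words).
hypotheses = [
    {"words": "recognize speech", "acoustic_score": -120.0, "lm_score": -4.2},
    {"words": "wreck a nice beach", "acoustic_score": -118.5, "lm_score": -9.7},
]

LM_WEIGHT = 10.0  # language-model scaling factor, a typical decoder knob

def combined_score(hyp):
    # MAP decoding criterion: argmax over W of  log P(O | W) + w * log P(W)
    return hyp["acoustic_score"] + LM_WEIGHT * hyp["lm_score"]

print(max(hypotheses, key=combined_score)["words"])  # -> "recognize speech"
```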

A common metric used to evaluate the accuracy of the decoded sentence generated by ASR is the Word Error Rate.

Levenshtein Distance, also known as the edit distance, measures the minimum number of single-character edits required to change one word or text into another. It is a general metric used in various fields such as computational linguistics, computer science, and bioinformatics for tasks like spell checking and DNA sequence analysis. In contrast, Word Error Rate (WER) is specifically designed for evaluating speech recognition and transcription systems. It measures the performance of a system by calculating the ratio of the total number of errors to the total number of words in the reference, expressed as a percentage.
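
Here is a minimal sketch of character-level Levenshtein Distance using the standard two-row dynamic-programming recurrence (the function and variable names are mine, not from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```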

Levenshtein Distance is calculated as the smallest number of insertions, deletions, and substitutions required to turn one string into another. The distance is not normalized, meaning it’s an absolute number that can range from 0 (identical strings) to the length of the longer string involved. On the other hand, WER is calculated as the sum of substitutions, deletions, and insertions, divided by the number of words in the reference, often expressed as a percentage. This normalization by the reference length makes it a relative measure of the error rate.
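
Computing WER is essentially the same edit-distance calculation applied to words rather than characters, then normalized by the length of the reference. A small sketch, assuming simple whitespace tokenization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```

Note that because insertions are counted against the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.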

Levenshtein Distance is not inherently normalized, and to compare across different lengths, it might be normalized by the length of the longer string. Word Error Rate is inherently normalized against the length of the reference sequence, making it a percentage that can be directly compared across different tasks, regardless of the length of the input.

While Levenshtein Distance is versatile and can be applied to any scenario that involves comparing two sequences of characters or symbols, not just words, Word Error Rate is tailored for evaluating the performance of speech recognition systems, focusing on the accuracy of word recognition and transcription in spoken language processing.

The Hidden Markov Model (HMM) is the basis of ASR. It is designed for sequential processes in which each state depends only on the previous state. The model is used to determine the probability of a sequence of observations, such as MFCCs (Mel Frequency Cepstral Coefficients), given a sequence of hidden states, such as phonemes. The HMM also has emission probabilities, which represent the probability that a state emits a given observation.
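
To make these pieces concrete, here is a toy HMM sketch with two hypothetical phoneme states and discrete (quantized) observations. The probabilities are invented, and real acoustic models use many more states and continuous MFCC vectors scored with GMMs, as discussed below.

```python
import numpy as np

states = ["ih", "t"]              # hidden phoneme states (hypothetical)
observations = [0, 0, 1]          # indices of quantized acoustic frames

initial = np.array([0.8, 0.2])            # P(first state)
transition = np.array([[0.6, 0.4],        # P(next state | current state)
                       [0.1, 0.9]])
emission = np.array([[0.7, 0.3],          # P(observation | state): emission probabilities
                     [0.2, 0.8]])

# Probability of observing this frame sequence along one assumed state path.
path = [0, 0, 1]                  # "ih", "ih", "t"
p = initial[path[0]] * emission[path[0], observations[0]]
for t in range(1, len(observations)):
    p *= transition[path[t - 1], path[t]] * emission[path[t], observations[t]]
print(p)
```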

To improve the accuracy of ASR, various techniques are used, such as supervised training of HMMs from fully observed data and unsupervised training with the Baum-Welch EM algorithm using data where state sequences are not observed. Additionally, decoding algorithms like the Viterbi algorithm are used to find the most likely sequence of hidden states given a sequence of observable events.
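
A compact sketch of Viterbi decoding in log space, reusing the same invented two-state HMM parameters:

```python
import numpy as np

def viterbi(obs, initial, transition, emission):
    """Most likely hidden state sequence for a sequence of discrete observations."""
    n_states, T = len(initial), len(obs)
    logp = np.full((T, n_states), -np.inf)   # best log-probability ending in each state
    back = np.zeros((T, n_states), dtype=int)
    logp[0] = np.log(initial) + np.log(emission[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(transition[:, s])
            back[t, s] = np.argmax(scores)
            logp[t, s] = scores[back[t, s]] + np.log(emission[s, obs[t]])
    # Trace back the best path from the last time step.
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

initial = np.array([0.8, 0.2])
transition = np.array([[0.6, 0.4], [0.1, 0.9]])
emission = np.array([[0.7, 0.3], [0.2, 0.8]])
print(viterbi([0, 0, 1], initial, transition, emission))  # -> [0, 0, 1]
```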

HMM is an extension of the Markov process for modeling hidden states. The states are phonemes, i.e. a small number of basic sounds that can be produced. The observations are frames of audio, which are represented using MFCCs. Given a sequence of MFCCs, i.e. the audio, we want to know what the sequence of phonemes was. Once we have the phonemes, we can work out the words using a phoneme-to-word dictionary. The probability of an MFCC observation given a state is modeled using Gaussian Mixture Models (GMMs). We only observe the speech signal corresponding to each word and need to deduce the states from the observations.
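
Below is a sketch of how a per-state GMM might score a single MFCC frame. It assumes scipy is available, and the mixture weights, means, covariances, and the 3-dimensional frame are made up purely for illustration (real front ends typically use 13 or more coefficients per frame).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(frame, weights, means, covariances):
    """Log P(MFCC frame | state) under a Gaussian Mixture Model for one HMM state."""
    component_logs = np.array([
        np.log(w) + multivariate_normal.logpdf(frame, mean=m, cov=c)
        for w, m, c in zip(weights, means, covariances)
    ])
    # Log-sum-exp over mixture components for numerical stability.
    m = component_logs.max()
    return m + np.log(np.exp(component_logs - m).sum())

# Toy 2-component GMM over 3-dimensional "MFCC" frames.
rng = np.random.default_rng(0)
frame = rng.normal(size=3)
weights = [0.6, 0.4]
means = [np.zeros(3), np.ones(3)]
covariances = [np.eye(3), 2 * np.eye(3)]
print(gmm_log_likelihood(frame, weights, means, covariances))
```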

Hidden states emit observations with a certain probability, and these emission probabilities are part of the HMM's parameters. The Baum-Welch algorithm is an unsupervised learning algorithm that iteratively adjusts the probabilities of events occurring in each state to better fit the data. The Viterbi algorithm is a dynamic programming algorithm that finds the most likely sequence of hidden states given a sequence of observable events.

To summarize, working with Hidden Markov Models involves finding the best state sequence for a given observation sequence, computing the probability of the observations through the forward and backward algorithms, supervised training of HMMs from fully observed data, and unsupervised training with the Baum-Welch EM algorithm from data where state sequences are not observed.
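
For completeness, here is a sketch of the forward algorithm, which computes the total probability of an observation sequence by summing over all hidden state paths, again using the invented two-state HMM from the earlier sketches:

```python
import numpy as np

def forward(obs, initial, transition, emission):
    """Total probability of an observation sequence, summed over all hidden state paths."""
    alpha = initial * emission[:, obs[0]]          # P(obs[0], state at t=0)
    for o in obs[1:]:
        alpha = (alpha @ transition) * emission[:, o]
    return alpha.sum()

initial = np.array([0.8, 0.2])
transition = np.array([[0.6, 0.4], [0.1, 0.9]])
emission = np.array([[0.7, 0.3], [0.2, 0.8]])
print(forward([0, 0, 1], initial, transition, emission))
```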