Getting started with Whisper by OpenAI


Whisper is a cutting-edge speech recognition model released by OpenAI in October 2022. Its primary purpose is to convert audio files into text with remarkable accuracy, supporting 99 languages, including Japanese. The model was trained through a technique called weakly supervised learning, leveraging a vast dataset of over 680,000 hours of speech. This approach enabled it to outperform models trained only on traditional academic datasets.
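
As a quick start, here is a minimal sketch using the open-source `whisper` Python package (assuming it is installed with `pip install openai-whisper` and that `audio.mp3` is a local file you want to transcribe):

```python
import whisper

# Load a pretrained checkpoint; "base" is small and fast, larger checkpoints are more accurate.
model = whisper.load_model("base")

# Transcribe an audio file; the language is detected automatically unless specified.
result = model.transcribe("audio.mp3")
print(result["text"])
```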

Whisper’s architecture is an encoder-decoder Transformer. The encoder is responsible for extracting a latent representation from the speech, while the decoder generates text from that latent representation. The encoder takes a log-Mel spectrogram as input and is executed only once per 30-second segment. Positional information is added to the encoder input with sinusoidal position embeddings, and the decoder produces, at each step, the probability of occurrence for every token given the latent representation.
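
The same `whisper` package exposes these individual stages, which makes the 30-second segmentation and the spectrogram input visible. A small sketch (the file name is just a placeholder):

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the encoder expects.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that is fed to the encoder.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Run the encoder and decoder on this single 30-second segment.
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```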

To determine the tokens, the model applies either greedy search or beam search to the token probabilities in the decoder output. The chosen token sequence is then decoded back into text with GPT2TokenizerFast. Whisper’s tokenizer is a byte-level BPE text tokenizer, which operates on UTF-8 bytes rather than on whole words.
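
In the `whisper` package, the decoding strategy is chosen through `DecodingOptions`: with `temperature=0` and no `beam_size` it performs greedy search, while setting `beam_size` switches to beam search. A sketch (reusing the `mel` spectrogram from the previous snippet):

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Greedy search: at each step pick the single most probable token.
greedy = whisper.decode(model, mel, whisper.DecodingOptions(temperature=0.0))

# Beam search: keep the 5 most probable partial sequences at each step.
beam = whisper.decode(model, mel, whisper.DecodingOptions(temperature=0.0, beam_size=5))

print(greedy.text)
print(beam.text)
```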

Getting a bit sidetracked, but it’s important for understanding the Whisper model: subword-based tokenization is a technique that emerged as a middle ground between word-based and character-based tokenization. Its main objective is to overcome the limitations of both of those methods.

Word-based tokenization suffers from a large vocabulary size, a large number of out-of-vocabulary (OOV) tokens, and the fact that very similar words end up as completely unrelated tokens. Character-based tokenization, on the other hand, generates very long sequences and individually less meaningful tokens.
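
A tiny illustration of the two extremes (plain Python, no particular library):

```python
text = "The boys are unhappier today"

# Word-level: short sequence, but a fixed word vocabulary cannot cover unseen words (OOV problem).
word_tokens = text.split()

# Character-level: every word is covered, but the sequence is long and each token carries little meaning.
char_tokens = list(text)

print(word_tokens)
print(char_tokens)
```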

Subword-based tokenization algorithms aim to split rare words into smaller, meaningful subwords while leaving frequently used words intact. This approach helps the model understand the relationship between words that share a root but differ slightly in meaning. For instance, the word “boys” is split into “boy” and “s” to signify that it is formed from the word “boy” but has a slightly different meaning.
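
To see this in practice, one can inspect how a pretrained byte-level BPE tokenizer splits common versus rare words. The sketch below uses GPT-2’s tokenizer from the Hugging Face transformers library as an illustration (not Whisper’s own multilingual vocabulary); the exact splits depend on the learned vocabulary:

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

# Frequent words tend to stay as a single token,
# while rarer or inflected words are split into subword pieces.
for word in ["boy", "boys", "tokenization", "electroencephalography"]:
    print(word, "->", tok.tokenize(word))
```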

There are several popular subword tokenization algorithms available, including WordPiece, Byte-Pair Encoding (BPE), Unigram, and SentencePiece. WordPiece was introduced by Google for natural language processing tasks and merges the pair of symbols that most increases the likelihood of the training data. BPE is another widely used algorithm that works by iteratively merging the most frequent pairs of consecutive symbols. Unigram goes in the opposite direction: it starts from a large candidate vocabulary and prunes it down using a unigram language model. Lastly, SentencePiece is a framework for unsupervised text segmentation that implements BPE and Unigram and operates directly on raw text.

Byte-Pair Encoding (BPE) is a subword-based tokenization algorithm that addresses the limitations of both word-based and character-based tokenization. It works by merging frequently occurring pairs of consecutive symbols to form a single token. This ensures that the most common words are represented in the vocabulary as a single token, which reduces the vocabulary size and helps the model learn more efficiently.

On the other hand, rare words are broken down into two or more subword tokens. This enables the model to understand the relationship between words with similar roots and different meanings. BPE therefore follows the subword-tokenization principle of splitting rare words into smaller, meaningful subwords to improve the model’s performance.

The BPE vocabulary is built with the following steps (a toy Python sketch follows the list):

  1. Initialize the vocabulary with all the bytes or characters in the text corpus.
  2. Calculate the frequency of each byte or character in the text corpus.
  3. Repeat the following steps until the desired vocabulary size is reached:

    a. Find the most frequent pair of consecutive bytes or characters in the text corpus.
    b. Merge the pair to create a new subword unit.
    c. Update the frequency counts of all the bytes or characters that contain the merged pair.
    d. Add the new subword unit to the vocabulary.
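
A toy sketch of these steps on a character level (the corpus and number of merges are made up for illustration; real implementations such as Whisper’s tokenizer work on bytes and far larger corpora):

```python
from collections import Counter

# Toy corpus: word (pre-split into characters) -> frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

vocab = {ch for word in corpus for ch in word}   # step 1: start from single characters
num_merges = 10                                  # stand-in for the target vocabulary size

for _ in range(num_merges):
    # steps 2-3a: count every pair of consecutive symbols, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        break
    best = max(pairs, key=pairs.get)             # most frequent pair

    # steps 3b-3c: merge that pair everywhere it occurs and update the corpus
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    corpus = merged

    vocab.add(best[0] + best[1])                 # step 3d: add the new subword unit
    print("merged", best, "->", best[0] + best[1])

print(sorted(vocab))
```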

References

  1. https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0
  2. https://medium.com/@pierre_guillou/byte-level-bpe-an-universal-tokenizer-but-aff932332ffe
  3. https://www.youtube.com/watch?v=AwJf8aQfChE