Speech Recognition
Speech Recognition Process

The common way to recognize speech is the following: we take a waveform, split it into utterances by silences, and then try to recognize what is being said in each utterance. To do so, we take all possible combinations of words and try to match them against the audio, and we choose the best-matching combination. The architecture of a speech recognition system is presented below:
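This search for the best-matching word combination can be sketched as follows. Here acoustic_score and language_score are placeholders for the HMM/GMM acoustic model and the n-gram language model described in the following sections, not functions of any particular toolkit.

```python
def recognize(observations, candidate_word_sequences, acoustic_score, language_score):
    """Pick the word sequence W that best matches the audio: the one
    maximizing acoustic score plus language-model score (in log space)."""
    best_words, best_score = None, float("-inf")
    for words in candidate_word_sequences:
        score = acoustic_score(observations, words) + language_score(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```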
The basic elements of the Speech Recognition (SR) architecture are as follows:

Feature Extraction

The first step in obtaining the sequence of acoustic observations O is to convert the analog audio signal into a digital representation. During this analog-to-digital conversion, the amplitude of the signal is measured at fixed time intervals and translated into a floating-point number. Because the information in this sequence of numbers is highly redundant, it is transformed into a reduced representation that keeps the relevant information while discarding the redundancy. This step is called feature extraction. A more extensive discussion of feature extraction can be found in [1].
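As a minimal illustration, the sketch below turns an audio file into a sequence of feature vectors. The choice of MFCC features and of the librosa library is an assumption made for the example; the section only requires that the digitized signal be reduced to compact feature vectors O.

```python
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Load an audio file and return one MFCC feature vector per 10 ms frame."""
    signal, sr = librosa.load(wav_path, sr=sr)             # digitized samples
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                hop_length=sr // 100)      # 10 ms frame shift
    return mfcc.T                                          # shape: (frames, n_mfcc)
```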
Hidden Markov Models

In ASR, each phone is typically modelled by a Hidden Markov Model (HMM): a small chain of states connected by transition probabilities, with an output probability distribution attached to each state. Figure 2.1 is a graphical representation of a typical HMM topology used to model phones. It consists of three states, State1, State2 and State3, and each state is connected to itself and to the following state. State1 is the only initial state and State3 is the final state.

Gaussian Mixture Models

In ASR, the output probability distributions of the HMM states are often Gaussian Mixture Models (GMMs). A GMM is a continuous function modelled as a mixture of Gaussian functions, where the output of each Gaussian is multiplied by a weight w. The weights sum to 1, and each Gaussian is defined by its mean vector and covariance matrix. The GMM thus maps a feature vector to an observation probability for the corresponding HMM state.

HMM Lexicon and N-gram Grammar

The a priori probability P(W), where W is a sequence of words, is calculated using an n-gram language model. In n-gram models, the probability of the next word is stored for each possible sequence of n-1 preceding words. Obtaining these statistics is only possible when a vocabulary is defined before the n-gram model is created. In a typical large vocabulary continuous speech recognition (LVCSR) system, the HMM lexicon (vocabulary) contains more than 50K words; some systems even use vocabularies with more than 300K words. With these large vocabularies, the risk of not recognizing a word is minimized.

Viterbi Decoder

The Viterbi algorithm decodes the observations using the HMM lexicon and recognizes the individual words. Finally, these words are combined using the n-gram grammar to form the recognized sentence. A more extensive discussion of the Viterbi decoder can be found in [1].
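To make the HMM, GMM and Viterbi pieces concrete, here is a minimal sketch of a three-state left-to-right HMM such as the one in Figure 2.1, with a small GMM attached to each state and decoded by the Viterbi algorithm. The parameters are synthetic and hand-written purely for illustration; a real recognizer learns them from training data.

```python
import numpy as np

n_states, n_mix, dim = 3, 2, 2
trans = np.array([[0.6, 0.4, 0.0],     # State1 -> State1 or State2
                  [0.0, 0.7, 0.3],     # State2 -> State2 or State3
                  [0.0, 0.0, 1.0]])    # State3 is final
with np.errstate(divide="ignore"):     # log(0) = -inf marks forbidden transitions
    log_trans = np.log(trans)

weights   = np.full((n_states, n_mix), 0.5)        # GMM weights, sum to 1 per state
means     = np.random.randn(n_states, n_mix, dim)  # GMM mean vectors
variances = np.ones((n_states, n_mix, dim))        # diagonal covariances

def gmm_log_likelihood(x, state):
    """log p(x | state) under that state's diagonal-covariance GMM."""
    diff = x - means[state]                                        # (n_mix, dim)
    log_gauss = -0.5 * np.sum(diff ** 2 / variances[state]
                              + np.log(2 * np.pi * variances[state]), axis=1)
    return np.logaddexp.reduce(np.log(weights[state]) + log_gauss)

def viterbi(features):
    """Most likely state sequence for a (T, dim) matrix of feature vectors."""
    T = len(features)
    score = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    score[0, 0] = gmm_log_likelihood(features[0], 0)   # State1 is the only initial state
    for t in range(1, T):
        for s in range(n_states):
            prev = score[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(prev)
            score[t, s] = prev[back[t, s]] + gmm_log_likelihood(features[t], s)
    path = [n_states - 1]                              # State3 is the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return list(reversed(path))

print(viterbi(np.random.randn(10, dim)))   # e.g. [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
```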
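The n-gram grammar can be illustrated in the same spirit. The sketch below estimates a bigram model (n = 2) from a toy corpus by simple counting and without smoothing; real LVCSR language models are trained on far larger text collections over the 50K+ word vocabularies mentioned above.

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

# For each preceding word, count how often each next word follows it.
unigrams = Counter(w for sent in corpus for w in sent[:-1])
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(prev_word, word):
    """P(word | prev_word) by maximum-likelihood estimation (no smoothing)."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob("the", "cat"))   # 2/3: "cat" follows "the" in two of three sentences
```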