SpeechRecognition  

Speech Recognition Process

The common way to recognize speech is the following: we take a waveform, split it into utterances at silences, and then try to recognize what is being said in each utterance. To do so, we take all possible combinations of words, try to match them against the audio, and choose the best-matching combination. The architecture of the speech recognition process is presented below:
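
In probabilistic terms, this search is commonly written as choosing the word sequence W that maximizes P(W) · P(O|W), where O is the sequence of acoustic observations extracted from the audio, P(W) is supplied by the language model and P(O|W) by the acoustic model. The components described below correspond to the pieces of this decision rule.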

The basic elements of the Speech Recognition (SR) architecture are as follows:

Feature Extraction

The first step in obtaining the sequence of acoustic observations O is to convert an analog audio signal into a digital representation. During this analog-to-digital conversion, the amplitude of the signal is measured at fixed time intervals and translated to a floating point number. Because the information in this sequence of numbers is highly redundant, it is transformed into a reduced representation so that the relevant information is maintained but the data is less redundant. This step is called feature extraction. A more extensive discussion on feature extraction can be found in [1].
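
The sketch below illustrates this step with a minimal, self-contained Python/NumPy example: the digitised waveform is split into short overlapping frames, and each frame is reduced to a log power spectrum. The frame length, hop size and the use of a plain log spectrum (rather than the mel filterbank or MFCC features used in practice) are illustrative assumptions, not a description of any particular recognizer.

    # A minimal sketch of the framing step of feature extraction, assuming a
    # 16 kHz mono signal stored in a NumPy array; real systems typically go on
    # to compute mel filterbank or MFCC features from these frames.
    import numpy as np

    def frame_log_spectra(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
        """Split a digitised waveform into overlapping frames and return the
        log power spectrum of each frame (a crude stand-in for MFCCs)."""
        frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
        hop_len = int(sample_rate * hop_ms / 1000)       # samples between frame starts
        window = np.hamming(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop_len):
            frame = signal[start:start + frame_len] * window
            power = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum of the frame
            frames.append(np.log(power + 1e-10))         # log compresses the dynamic range
        return np.array(frames)                          # shape: (n_frames, n_bins)

    # Example: one second of a 440 Hz tone yields roughly 98 feature vectors.
    t = np.arange(16000) / 16000.0
    feats = frame_log_spectra(np.sin(2 * np.pi * 440 * t))
    print(feats.shape)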

Hidden Markov Models

  • The statistical model most often used to calculate the likelihood P(O|λ) is the Hidden Markov Model (HMM). An HMM consists of a finite number of states that are connected in a fixed topology. The inputs of the HMM, the feature vectors, are called observations.
  • Each HMM state can 'emit' an observation from the observation sequence O with a certain probability defined by its Probability Distribution Function (PDF). The first observation must be emitted by a state that is defined to be one of the initial states. After this observation has been processed, one of the states connected to the initial state is chosen to emit the next observation.
  • The probability that a particular transition from one state to another is taken is modelled by the transition probability. Each subsequent observation is emitted by a state connected to the state that emitted the previous observation, and the final observation must be emitted by one of the final states.
  • Because the actual path taken through the states is hidden from an outside observer, this type of Markov model is called a Hidden Markov Model.

Figure 2.1 is a graphical representation of a typical HMM topology used to model phones. It consists of three states State1, State2 and State3, and each state is connected to itself and to the following state. State1 is the only initial state and State3 is the final state.
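
A minimal sketch of this topology, with made-up transition probabilities, might look as follows; in a real recognizer each state would also carry an emission PDF (see the next section) that scores the observations.

    # A minimal sketch of the three-state left-to-right topology described above,
    # assuming illustrative (made-up) transition probabilities.
    import numpy as np

    # Rows/columns: State1, State2, State3.  Each state loops back to itself
    # or moves on to the next state, matching Figure 2.1.
    transitions = np.array([
        [0.6, 0.4, 0.0],   # State1 -> State1 or State2
        [0.0, 0.7, 0.3],   # State2 -> State2 or State3
        [0.0, 0.0, 1.0],   # State3 is the final state
    ])

    initial = np.array([1.0, 0.0, 0.0])  # State1 is the only initial state

    def path_probability(states):
        """Probability of one particular (hidden) state sequence, ignoring emissions."""
        p = initial[states[0]]
        for prev, cur in zip(states, states[1:]):
            p *= transitions[prev, cur]
        return p

    print(path_probability([0, 0, 1, 2]))  # 0.6 * 0.4 * 0.3 = 0.072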

Gaussian Mixture Models

In ASR, the probability distribution functions of the HMMs are often Gaussian Mixture Models (GMM). A GMM is a continuous function modelled as a mixture of Gaussian functions, where the output of each Gaussian is multiplied by a certain weight w. The Gaussian weights sum to 1, and the Gaussian functions themselves are defined by their mean vector and covariance matrix. In other words, the GMM maps each feature vector to an observation probability for the corresponding HMM state.
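
The following sketch evaluates a diagonal-covariance GMM density for a single feature vector; the weights, means and variances are illustrative values, not parameters of a trained model.

    # A minimal sketch of evaluating a diagonal-covariance GMM density at one
    # feature vector x; all parameters below are made up for illustration.
    import numpy as np

    def gmm_pdf(x, weights, means, variances):
        """Weighted sum of diagonal-covariance Gaussian densities at point x."""
        total = 0.0
        for w, mu, var in zip(weights, means, variances):
            norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.prod(var))
            exponent = -0.5 * np.sum((x - mu) ** 2 / var)
            total += w * norm * np.exp(exponent)
        return total

    weights = np.array([0.5, 0.3, 0.2])                       # mixture weights, sum to 1
    means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])   # one mean vector per Gaussian
    variances = np.array([[1.0, 1.0], [0.5, 0.5], [2.0, 1.0]])  # diagonal covariances

    print(gmm_pdf(np.array([0.2, -0.1]), weights, means, variances))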

HMM Lexicon and N-gram Grammar

The a priori probability P(W), where W is a sequence of words, is calculated using an n-gram language model. In an n-gram model, the probability of the next word is stored for each possible sequence of n-1 preceding words. Obtaining these statistics is only possible when a vocabulary is defined before the n-gram model is created. In a typical large-vocabulary continuous speech recognition (LVCSR) system, the HMM lexicon (vocabulary) contains more than 50K words; some systems even use vocabularies of more than 300K words. With these large vocabularies, the risk of not recognizing a word is minimized.
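
As a toy illustration of how such statistics are obtained, the sketch below estimates bigram (n=2) probabilities by simple counting over two sentences; real LVCSR language models are trained on far larger corpora and use smoothing to handle unseen word pairs.

    # A minimal sketch of estimating bigram probabilities by counting over a
    # toy corpus; the two sentences are made up for illustration.
    from collections import Counter

    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
    ]

    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))

    def bigram_prob(prev_word, word):
        """P(word | prev_word) estimated by relative frequency (no smoothing)."""
        if unigrams[prev_word] == 0:
            return 0.0
        return bigrams[(prev_word, word)] / unigrams[prev_word]

    print(bigram_prob("the", "cat"))  # 1 of the 4 occurrences of "the" is followed by "cat"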

Viterbi Decoder

The Viterbi algorithm decodes the observation sequence using the HMM lexicon to recognize individual words. These words are then combined using the n-gram grammar to form the running speech sentence. A more extensive discussion of the Viterbi decoder can be found in [1].
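
The sketch below shows the core Viterbi recursion on a tiny HMM with discrete observations and made-up probabilities; it stands in for the real decoder, which scores continuous feature vectors with GMMs and searches over word networks built from the lexicon and the n-gram grammar.

    # A minimal sketch of the Viterbi algorithm on a small HMM; all
    # probabilities are illustrative, not from any trained model.
    import numpy as np

    initial = np.array([1.0, 0.0, 0.0])              # only State1 can start
    transitions = np.array([[0.6, 0.4, 0.0],
                            [0.0, 0.7, 0.3],
                            [0.0, 0.0, 1.0]])
    # emissions[s, o]: probability that state s emits discrete observation o
    # (a stand-in for the GMM scores of continuous feature vectors).
    emissions = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.6, 0.3],
                          [0.2, 0.2, 0.6]])

    def viterbi(observations):
        """Return the most likely state sequence and its probability."""
        n_states = len(initial)
        T = len(observations)
        delta = np.zeros((T, n_states))              # best path probability so far
        backptr = np.zeros((T, n_states), dtype=int)
        delta[0] = initial * emissions[:, observations[0]]
        for t in range(1, T):
            for s in range(n_states):
                scores = delta[t - 1] * transitions[:, s]
                backptr[t, s] = np.argmax(scores)
                delta[t, s] = scores[backptr[t, s]] * emissions[s, observations[t]]
        # Backtrace from the best final state.
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return list(reversed(path)), float(delta[-1].max())

    # Prints the most likely hidden state path and its probability.
    print(viterbi([0, 1, 2]))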
