Language model
Two types of language model are used in speech recognition: statistical and finite-state.
What is a Statistical Language Model?
A language model is a statistical model that captures word and word-sequence probabilities. It's used in the decoder to constrain search (which speeds things up) and generally makes a significant contribution to accuracy. A good language model is one that closely models the expected input language; a bad one doesn't and most likely will do more harm than good.
A language model is characterized by its order, expressed as an "n-gram", where n is the size of the window (in words) over which statistics are computed. In general, the higher the n, the more accurate the model, provided there is enough training data: the higher the n, the more data you need to ensure that the statistics are estimated soundly. Models available for download are typically "general", meaning not optimized for any particular domain of discourse. They should be fine to start out with.
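The role of the order n can be made concrete with the standard n-gram approximation. This is the usual textbook formulation rather than something stated on this page: the probability of a sentence is factored into per-word conditionals, each of which looks back at most n-1 words.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

For a trigram model (n = 3) each factor is P(w_i | w_{i-2}, w_{i-1}). With a vocabulary of V words there are up to V^n distinct n-grams, which is one way to see why higher orders need more training data.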
For example a trigram language model consists of:
- Unigrams: The entire set of words in this LM, and their individual probabilities of occurrence in the language. The unigrams must include the special beginning-of-sentence and end-of-sentence tokens, <s> and </s> respectively.
- Bigrams: A bigram is mathematically P(word2 | word1). That is, the conditional probability that word2 immediately follows word1 in the language. An LM typically contains this information for some subset of the possible word pairs. That is, not all possible word1 word2 pairs need be covered by the bigrams.
- Trigrams: Similar to a bigram, a trigram is P(word3 | word1, word2), or the conditional probability that word3 immediately follows a word1 word2 sequence in the language. Not all possible 3-word combinations need be covered by the trigrams.
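The three components listed above can be sketched with plain counting. This is a minimal illustration on a made-up three-sentence corpus, using unsmoothed maximum-likelihood estimates. The corpus and the function names `ngram_counts` and `conditional_prob` are assumptions for the example; real toolkits also apply smoothing and back-off, which this sketch omits.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams over sentences padded with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def conditional_prob(counts_n, counts_prev, ngram):
    """Unsmoothed estimate of P(last word | preceding words).

    Returns 0.0 when the history was never observed.
    """
    history = ngram[:-1]
    if counts_prev[history] == 0:
        return 0.0
    return counts_n[ngram] / counts_prev[history]

# Hypothetical toy corpus, not taken from any Sphinx model.
corpus = ["turn the light on", "turn the radio on", "turn the light off"]
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

# P(light | the): "the light" occurs twice, "the" three times.
p_bigram = conditional_prob(bigrams, unigrams, ("the", "light"))
# P(on | the, light): "the light on" occurs once, "the light" twice.
p_trigram = conditional_prob(trigrams, bigrams, ("the", "light", "on"))
```

Note that word pairs and triples never seen in the corpus simply have no entry, which matches the point that the bigrams and trigrams need not cover every possible combination.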
Want to know more gory details?
What it looks like
\data\
ngram 1=4999
ngram 2=1894274
ngram 3=10990492

\1-grams:
-2.7387 </s> -99.0000
-99.0000 <s> -1.7618
-2.7454 a -1.1295
-3.1367 a. -0.2181
-5.0597 a.'s -0.2564
-3.8554 abandoned -0.5314
-4.0791 abdel -0.9960
-4.0335 abdul -0.8713
...

\2-grams:
-1.5955 <s> a -0.7368
-3.0845 <s> a. -1.7420
-5.2405 <s> abandoned -1.4005
...
-3.1129 aboard buses -0.6727
-2.8328 aboard but -0.6652
-3.5958 aboard by -0.6563
-3.4223 aboard c. -0.6045
-3.4223 aboard cars -0.4831
-3.4223 aboard caught -0.7119
-3.4223 aboard causing -99.0000
-4.8403 aboard charter -0.5107

\3-grams:
-3.6654 <s> a </s>
-3.2998 <s> a a
-5.1508 <s> a a.
-4.6757 <s> a about
...
-2.3190 <s> burundi ambassador
-1.5141 <s> burundi and
-1.5545 <s> burundi authorities
-2.3190 <s> burundi borders
-1.7046 <s> burundi erupted
...
-0.9002 after swimming or
-0.9002 after swimming the
-0.8135 after swimming to
-1.0085 after swimming up
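The listing above follows the ARPA text format. Under the usual ARPA convention (assumed here, since this page does not spell it out), each entry is a log10 probability, then the words of the n-gram, then an optional log10 back-off weight. The sketch below parses a small excerpt and looks up a probability with back-off. The function names, the header counts and the closing `\end\` marker in the excerpt are illustrative additions; the probabilities are copied from the listing.

```python
import re

# Small excerpt in ARPA layout; header counts are adjusted to this excerpt.
ARPA_SNIPPET = r"""
\data\
ngram 1=4
ngram 2=2

\1-grams:
-2.7387 </s> -99.0000
-99.0000 <s> -1.7618
-2.7454 a -1.1295
-3.8554 abandoned -0.5314

\2-grams:
-1.5955 <s> a -0.7368
-5.2405 <s> abandoned -1.4005

\end\
"""

def parse_arpa(text):
    """Return {order: {ngram_tuple: (log10_prob, log10_backoff)}}."""
    tables = {}
    order = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("ngram ") or line in ("\\data\\", "\\end\\"):
            continue
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))
            tables[order] = {}
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional; treat a missing one as 0.0 (log10 of 1).
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        tables[order][words] = (logprob, backoff)
    return tables

def log10_prob(tables, words):
    """Use the stored n-gram if present; otherwise back off to a shorter history."""
    words = tuple(words)
    order = len(words)
    if words in tables.get(order, {}):
        return tables[order][words][0]
    if order == 1:
        raise KeyError(f"word not in vocabulary: {words[0]}")
    history = words[:-1]
    backoff = tables.get(order - 1, {}).get(history, (0.0, 0.0))[1]
    return backoff + log10_prob(tables, words[1:])

tables = parse_arpa(ARPA_SNIPPET)
stored = log10_prob(tables, ("<s>", "a"))            # bigram is listed directly
backed_off = log10_prob(tables, ("a", "abandoned"))  # not listed: bo(a) + P(abandoned)
```

In this excerpt the bigram "a abandoned" is not listed, so the lookup adds the back-off weight of "a" to the unigram log probability of "abandoned". This is how a model that stores only a subset of word pairs can still score sequences it has no entry for.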
Alternative: Finite State Grammar
A finite state grammar (FSG) is a drop-in replacement for a statistical language model (SLM).
- Which is better? It depends. If you have a fixed language (say for device control or for a tightly scripted IVR application) where it's reasonable to expect users to stay within the language, the FSG works better (which is why it's by far the most popular approach in commercial systems). If you want to allow users to say things with less constraint (a dictation task is a typical example), you should use an SLM. The FSG will recognize only word sequences that are legal in the language specified by the grammar. The SLM will be able to decode novel word sequences (but not, of course, ones that are entirely unpredictable from the corpus used to train the model).
- Which is faster? Usually the FSG. But it really depends on the details of the language.
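The all-or-nothing behaviour of an FSG can be illustrated with a toy acceptor. This is a sketch under stated assumptions: the device-control grammar, the state names and the `accepts` function are invented for the example and do not reflect the FSG file format used by any Sphinx decoder.

```python
# Hypothetical device-control grammar: state -> {word: next_state}
TRANSITIONS = {
    "start": {"turn": "verb"},
    "verb": {"the": "det"},
    "det": {"light": "noun", "radio": "noun"},
    "noun": {"on": "end", "off": "end"},
}
FINAL_STATES = {"end"}

def accepts(words, start="start"):
    """Return True only if the word sequence follows a complete path through the grammar."""
    state = start
    for word in words:
        state = TRANSITIONS.get(state, {}).get(word)
        if state is None:
            # No transition for this word from the current state: sequence is illegal.
            return False
    return state in FINAL_STATES

in_grammar = accepts("turn the light on".split())
out_of_grammar = accepts("turn on the light".split())
```

Here "turn on the light" is rejected outright even though it is ordinary English, whereas an SLM would still assign it some probability. That is the trade-off described above: tight constraint for fixed languages, flexibility for open-ended input.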
Note that sphinx4 supports FSGs well. sphinx3 has apparently just added support, but FSGs seemed to be ignored in version 3.0.6 (see kbcore.c).
Building Language Models
Online LM Tool
LM tool: generates an SLM plus a dictionary, given a text corpus as input.
Python Script to automatically use online LM tool
QuickLM online tool
QuickLM allows you to quickly generate language models or to play with variations. If you are configuring a recognizer, you should be using the LM tool instead.
CMU Statistical Language Model toolkit
CMU Statistical Language Model toolkit
Other Documentation - not sure if this is the most current
SimpleLM
Download - not sure what the difference between this and CMU Statistical Language Model toolkit is
