Language model
What it is
A language model (LM) is a statistical model that assigns probabilities to words and word sequences. For example, a trigram language model consists of:
- Unigrams: The entire set of words in this LM, and their individual probabilities of occurrence in the language. The unigrams must include the special beginning-of-sentence and end-of-sentence tokens, <s> and </s>, respectively.
- Bigrams: Mathematically, a bigram is P(word2 | word1), the conditional probability that word2 immediately follows word1 in the language. An LM typically contains this information for only a subset of the possible word pairs; not all word1 word2 pairs need be covered by the bigrams.
- Trigrams: Similar to a bigram, a trigram is P(word3 | word1, word2), or the conditional probability that word3 immediately follows a word1 word2 sequence in the language. Not all possible 3-word combinations need be covered by the trigrams.
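The three levels above can be illustrated with plain relative-frequency counts. The following is a minimal sketch, not how the Sphinx tools build their models: the function names and the toy corpus are made up for illustration, and real toolkits apply discounting and smoothing rather than raw counts.

```python
from collections import Counter

def train_ngram_counts(sentences):
    """Count 1-, 2- and 3-grams over sentences padded with <s> and </s>."""
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def unigram_prob(counts, word):
    """P(word): share of all tokens (including <s> and </s>) that are word."""
    total = sum(counts[1].values())
    return counts[1][(word,)] / total

def bigram_prob(counts, word1, word2):
    """P(word2 | word1) = count(word1 word2) / count(word1)."""
    history = counts[1][(word1,)]
    return counts[2][(word1, word2)] / history if history else 0.0

def trigram_prob(counts, word1, word2, word3):
    """P(word3 | word1, word2) = count(word1 word2 word3) / count(word1 word2)."""
    history = counts[2][(word1, word2)]
    return counts[3][(word1, word2, word3)] / history if history else 0.0

# Hypothetical three-sentence corpus, for illustration only.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts = train_ngram_counts(corpus)

print(unigram_prob(counts, "the"))
print(bigram_prob(counts, "the", "cat"))
print(trigram_prob(counts, "the", "cat", "sat"))
```

Word pairs and triples that never occur in the corpus get a raw probability of zero here, which is why an LM does not need to list every possible combination.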
Want to know more gory details?
What it looks like
\data\
ngram 1=4999
ngram 2=1894274
ngram 3=10990492

\1-grams:
-2.7387 </s> -99.0000
-99.0000 <s> -1.7618
-2.7454 a -1.1295
-3.1367 a. -0.2181
-5.0597 a.'s -0.2564
-3.8554 abandoned -0.5314
-4.0791 abdel -0.9960
-4.0335 abdul -0.8713
...

\2-grams:
-1.5955 <s> a -0.7368
-3.0845 <s> a. -1.7420
-5.2405 <s> abandoned -1.4005
...
-3.1129 aboard buses -0.6727
-2.8328 aboard but -0.6652
-3.5958 aboard by -0.6563
-3.4223 aboard c. -0.6045
-3.4223 aboard cars -0.4831
-3.4223 aboard caught -0.7119
-3.4223 aboard causing -99.0000
-4.8403 aboard charter -0.5107

\3-grams:
-3.6654 <s> a </s>
-3.2998 <s> a a
-5.1508 <s> a a.
-4.6757 <s> a about
...
-2.3190 <s> burundi ambassador
-1.5141 <s> burundi and
-1.5545 <s> burundi authorities
-2.3190 <s> burundi borders
-1.7046 <s> burundi erupted
...
-0.9002 after swimming or
-0.9002 after swimming the
-0.8135 after swimming to
-1.0085 after swimming up
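In this ARPA-style layout, each n-gram line holds a log10 probability, the words, and (except at the highest order) a backoff weight. A minimal sketch of reading such a file and looking up a probability with backoff might look like the following. The names parse_arpa and log10_prob are hypothetical and this is not the Sphinx loader. The sample reuses a few entries from the listing above, with the header counts adjusted to match and the closing \end\ marker added.

```python
def parse_arpa(text):
    """Parse ARPA-format text into {order: {ngram_tuple: (logprob, backoff)}}."""
    model = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the header, and "..." elision markers.
        if not line or line == "\\data\\" or line.startswith("ngram ") or line == "...":
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            model[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # A trailing field after the words, if present, is the backoff weight.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        model[order][words] = (logprob, backoff)
    return model

def log10_prob(model, words):
    """log10 P(last word | preceding words), backing off to shorter n-grams."""
    words = tuple(words)
    n = len(words)
    entry = model.get(n, {}).get(words)
    if entry is not None:
        return entry[0]
    if n == 1:
        return float("-inf")  # unknown word: not handled by this sketch
    # Back off: add the history's backoff weight (0.0 if the history is
    # not listed) and retry with the oldest word dropped.
    history = model.get(n - 1, {}).get(words[:-1])
    backoff = history[1] if history is not None else 0.0
    return backoff + log10_prob(model, words[1:])

sample = r"""
\data\
ngram 1=3
ngram 2=1
ngram 3=2

\1-grams:
-2.7387 </s> -99.0000
-99.0000 <s> -1.7618
-2.7454 a -1.1295

\2-grams:
-1.5955 <s> a -0.7368

\3-grams:
-3.6654 <s> a </s>
-3.2998 <s> a a

\end\
"""

model = parse_arpa(sample)
print(log10_prob(model, ["<s>", "a", "</s>"]))  # listed trigram, read directly
print(log10_prob(model, ["a", "a"]))            # unlisted bigram, backs off to unigrams
```

The second lookup shows why not every pair or triple has to be listed: a missing bigram "a a" is scored as the backoff weight of "a" plus the unigram log probability of "a".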
Alternative: Finite State Grammar
A finite state grammar is a drop-in replacement for a language model.
Which is better? XXX
Which is faster? XXX
Note that sphinx4 supports finite state grammars well. sphinx3 has apparently just added support, but in version 3.0.6 they appear to be ignored in kbcore.c.
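To make the contrast with an n-gram model concrete, a finite state grammar can be pictured as a set of states connected by word-labelled transitions: a word sequence is either accepted or rejected, rather than scored. The sketch below is a toy illustration only. The grammar, state numbers, and function name are invented, and it does not reflect the grammar file format used by sphinx3 or sphinx4.

```python
# Hypothetical toy grammar: each state maps a word to the next state.
TRANSITIONS = {
    0: {"turn": 1},
    1: {"left": 2, "right": 2},
    2: {},
}
START_STATE = 0
FINAL_STATES = {2}

def accepts(words):
    """Return True if the word sequence is a complete path through the grammar."""
    state = START_STATE
    for word in words:
        next_state = TRANSITIONS[state].get(word)
        if next_state is None:
            return False  # no transition for this word from the current state
        state = next_state
    return state in FINAL_STATES

print(accepts(["turn", "left"]))
print(accepts(["left", "turn"]))
```

Unlike the n-gram case, nothing outside the listed paths can be recognized, which suits small command-style vocabularies.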
Building Language Models
Online LM Tool
Python Script to automatically use online LM tool
QuickLM online tool
CMU Statistical Language Model toolkit
CMU Statistical Language Model toolkit
Other documentation (may not be the most current)
SimpleLM
Download (it is unclear how this differs from the CMU Statistical Language Model toolkit)
