Language model

What it is

A language model is a statistical model of word probabilities. For example, a trigram language model consists of:

  • Unigrams: The entire set of words in this LM and their individual probabilities of occurrence in the language. The unigrams must include the special beginning-of-sentence and end-of-sentence tokens, <s> and </s> respectively.
  • Bigrams: Mathematically, a bigram is P(word2 | word1), the conditional probability that word2 immediately follows word1 in the language. An LM typically contains this information for only a subset of the possible word pairs; not every word1 word2 pair need be covered by the bigrams.
  • Trigrams: Similar to a bigram, a trigram is P(word3 | word1, word2), the conditional probability that word3 immediately follows the sequence word1 word2 in the language. Not all possible 3-word combinations need be covered by the trigrams.
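
To make the three kinds of entries concrete, here is a minimal sketch that estimates them from a tiny made-up corpus using plain relative frequencies (maximum likelihood). The corpus and the helper names `ngram_counts` and `conditional_prob` are assumptions for illustration only. Real toolkits also apply smoothing and compute backoff weights, which this sketch leaves out.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams over sentences padded with <s> and </s>."""
    counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def conditional_prob(ngram, counts_n, counts_prefix):
    """Relative-frequency estimate of P(last word | preceding words).

    Returns 0.0 when the preceding words were never seen.
    """
    prefix_count = counts_prefix[ngram[:-1]]
    return counts_n[ngram] / prefix_count if prefix_count else 0.0

# Hypothetical three-sentence corpus, already split into words.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

# Unigram: P(the), counting <s> and </s> as tokens.
p_the = unigrams[("the",)] / sum(unigrams.values())
# Bigram: P(cat | the)
p_cat_given_the = conditional_prob(("the", "cat"), bigrams, unigrams)
# Trigram: P(sat | the, cat)
p_sat_given_the_cat = conditional_prob(("the", "cat", "sat"), trigrams, bigrams)

print(p_the, p_cat_given_the, p_sat_given_the_cat)
```

Pairs and triples that never occur in the corpus simply have no entry, which matches the point above that not all combinations need be covered.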

Want to know more gory details?

What it looks like


\data\
ngram 1=4999
ngram 2=1894274
ngram 3=10990492

\1-grams:
-2.7387 </s>    -99.0000
-99.0000        <s>     -1.7618
-2.7454 a       -1.1295
-3.1367 a.      -0.2181
-5.0597 a.'s    -0.2564
-3.8554 abandoned       -0.5314
-4.0791 abdel   -0.9960
-4.0335 abdul   -0.8713
...
\2-grams:
-1.5955 <s> a   -0.7368
-3.0845 <s> a.  -1.7420
-5.2405 <s> abandoned   -1.4005
...
-3.1129 aboard buses    -0.6727
-2.8328 aboard but      -0.6652
-3.5958 aboard by       -0.6563
-3.4223 aboard c.       -0.6045
-3.4223 aboard cars     -0.4831
-3.4223 aboard caught   -0.7119
-3.4223 aboard causing  -99.0000
-4.8403 aboard charter  -0.5107
\3-grams:
-3.6654 <s> a </s>
-3.2998 <s> a a
-5.1508 <s> a a.
-4.6757 <s> a about
...
-2.3190 <s> burundi ambassador
-1.5141 <s> burundi and
-1.5545 <s> burundi authorities
-2.3190 <s> burundi borders
-1.7046 <s> burundi erupted
...
-0.9002 after swimming or
-0.9002 after swimming the
-0.8135 after swimming to
-1.0085 after swimming up
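
The listing above appears to follow the ARPA text layout: each entry is a base-10 log probability, then the words of the n-gram, then (for every order except the highest) a backoff weight, with -99 standing in for the log of zero. The sketch below reads that layout into dictionaries and shows how a bigram lookup can fall back to unigrams. It is a sketch under those assumptions, not the Sphinx loader; the function names and the shortened sample are made up for illustration.

```python
def parse_arpa(text):
    """Return {order: {ngram_tuple: (log10_prob, backoff_or_None)}}."""
    models = {}
    order = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, elisions, the header, and the n-gram count lines.
        if not line or line == "..." or line in ("\\data\\", "\\end\\"):
            continue
        if line.startswith("ngram "):
            continue
        # Section markers look like \1-grams:, \2-grams:, ...
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            models[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is absent for the highest-order n-grams.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        models[order][words] = (logprob, backoff)
    return models

def bigram_logprob(models, w1, w2):
    """log10 P(w2 | w1): use the bigram if listed, else back off to the unigram."""
    entry = models[2].get((w1, w2))
    if entry is not None:
        return entry[0]
    backoff = models[1].get((w1,), (0.0, 0.0))[1] or 0.0
    return backoff + models[1][(w2,)][0]

# Shortened sample built from entries in the listing above.
sample = """\\data\\
ngram 1=4
ngram 2=2
ngram 3=1

\\1-grams:
-2.7387 </s> -99.0000
-99.0000 <s> -1.7618
-2.7454 a -1.1295
-3.1367 a. -0.2181

\\2-grams:
-1.5955 <s> a -0.7368
-3.0845 <s> a. -1.7420

\\3-grams:
-3.6654 <s> a </s>
"""

models = parse_arpa(sample)
print(bigram_logprob(models, "<s>", "a"))  # listed bigram, used directly
print(bigram_logprob(models, "a", "a."))   # not listed: backoff(a) + log P(a.)
```

The second lookup illustrates why the bigram section can stay sparse: an unlisted pair is scored from the first word's backoff weight plus the second word's unigram log probability.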

Alternative: Finite State Grammar

A finite state grammar is a drop-in replacement for a language model.
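
As a rough illustration of the difference: an n-gram model assigns some probability to any word sequence, while a finite state grammar only admits sequences that follow a path through its states. The toy grammar below is hypothetical and does not use the Sphinx grammar file format; it only sketches the idea of states and word-labelled transitions.

```python
# Hypothetical toy grammar: (state, word) -> next state.
TRANSITIONS = {
    (0, "turn"): 1,
    (1, "on"): 2,
    (1, "off"): 2,
    (2, "the"): 3,
    (3, "light"): 4,
    (3, "radio"): 4,
}
START_STATE = 0
FINAL_STATES = {4}

def accepts(words):
    """Return True if the word sequence follows a path from start to a final state."""
    state = START_STATE
    for word in words:
        key = (state, word)
        if key not in TRANSITIONS:
            return False  # no transition for this word from the current state
        state = TRANSITIONS[key]
    return state in FINAL_STATES

print(accepts("turn on the light".split()))
print(accepts("turn the light on".split()))
```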

Which is better? XXX

Which is faster? XXX

Note that sphinx4 supports finite state grammars well. sphinx3 has apparently just added support, but in version 3.0.6 they seemed to be ignored in kbcore.c.

Building Language Models

Online LM Tool

LM tool

Python Script to automatically use online LM tool

QuickLM online tool

QuickLM

CMU Statistical Language Model toolkit

CMU Statistical Language Model toolkit

Official Documentation

Other Documentation - not sure if this is the most current

SimpleLM

Download - not sure how this differs from the CMU Statistical Language Model toolkit

related