N-Gram Language Models

Predicting the next word in a sentence

Example: Get probability of house given This is a.... -> Fourgram

Process:

  1. Get a corpus

This is a house
The turtle is fat
This is a turtle
They lay on the ground

  • 1 fourgram has house after This is a (the sentence This is a house)

  • 2 fourgrams in total start with This is a, so P(house|This is a) = 1/2
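
A minimal counting sketch for this estimate, assuming whitespace tokenization and lowercasing (both are assumptions, not part of the notes):

```python
from collections import Counter

# Toy corpus from step 1 above.
sentences = [
    "this is a house",
    "the turtle is fat",
    "this is a turtle",
    "they lay on the ground",
]

# Count every fourgram (window of 4 consecutive words) in the corpus.
fourgrams = Counter()
for s in sentences:
    words = s.split()
    for i in range(len(words) - 3):
        fourgrams[tuple(words[i:i + 4])] += 1

context = ("this", "is", "a")
context_total = sum(c for ng, c in fourgrams.items() if ng[:3] == context)

# P(house | this is a) = count(this is a house) / count(this is a *)
print(fourgrams[context + ("house",)] / context_total)  # 1/2 = 0.5
```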

Example: Get probability of Jack given that -> Bigram

Same process as above, with the corpus:

This is a house
This is a turtle
This is a house that Jack built
They lay on the ground
The turtle is fat
That lay in the house that Jack built

  • 2 bigrams have Jack after that (that Jack appears twice)

  • 3 bigrams in total start with that, so P(Jack|that) = 2/3
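
The same counting idea works for the bigram case; lowercasing folds That and that together, which is again an assumption of this sketch:

```python
from collections import Counter

# Toy corpus from the bigram example above.
sentences = [
    "this is a house",
    "this is a turtle",
    "this is a house that jack built",
    "they lay on the ground",
    "the turtle is fat",
    "that lay in the house that jack built",
]

bigrams = Counter()
for s in sentences:
    words = s.split()
    for w1, w2 in zip(words, words[1:]):
        bigrams[(w1, w2)] += 1

that_total = sum(c for (w1, _), c in bigrams.items() if w1 == "that")

# P(jack | that) = count(that jack) / count(that *)
print(bigrams[("that", "jack")] / that_total)  # 2/3 ≈ 0.67
```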

Predicting probability of sequence of words

Sequence of words = (w1, w2, w3, ..., wk)

Bigram Language Model (n=2)

  1. P(Sequence of words) = P(w1) P(w2|w1) ... P(wk|wk-1)

  2. P(Sequence of words) = P(w1|start) P(w2|w1) ... P(wk|wk-1) P(end|wk)
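
A minimal sketch of the second formulation (with start and end markers), assuming count-based (maximum-likelihood) bigram estimates with no smoothing; the <s> and </s> marker names are hypothetical:

```python
from collections import Counter

def bigram_sequence_prob(sentences, sequence):
    # `sentences` is a list of token lists (the corpus); `sequence` is the
    # token list whose probability we want. "<s>"/"</s>" mark start/end.
    bigrams = Counter()
    contexts = Counter()
    for s in sentences:
        padded = ["<s>"] + s + ["</s>"]
        for w1, w2 in zip(padded, padded[1:]):
            bigrams[(w1, w2)] += 1
            contexts[w1] += 1

    prob = 1.0
    padded = ["<s>"] + sequence + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        if contexts[w1] == 0:
            return 0.0  # unseen context; no smoothing in this sketch
        prob *= bigrams[(w1, w2)] / contexts[w1]
    return prob
```

Dropping the two marker tokens and estimating the first factor as a unigram P(w1) gives formulation 1 instead.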

Example:

Corpus:

This is the malt
That lay in the house that Jack built

P(this is the house) = P(this) P(is|this) P(the|is) P(house|the)

Note: P(this) was not calculated as the plain unigram P(w1); it was calculated as P(w1|start), where the count of start is 2 since there are two sentence starts in the corpus.
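
Working through the counts for this corpus (using P(this|start) as in the note, and ignoring the end marker as the formula above does):

  • P(this|start) = 1/2 (one of the two sentence starts is This)

  • P(is|this) = 1/1 = 1

  • P(the|is) = 1/1 = 1

  • P(house|the) = 1/2 (the is followed once by malt and once by house)

So P(this is the house) = 1/2 · 1 · 1 · 1/2 = 1/4.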

Question:

  • Is this normalized for all sequence lengths? (i.e. Do they all add up to probability 1.0?)

  • No! They are only normalized within each sequence length.

    • P(this) + P(that) = 1.0

    • P(this this) + P(this is) + ... + P(built built) = 1.0
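
A quick check of this claim in code, using the corpus above; the count-based estimates (P(w|start) from sentence starts, P(w2|w1) from bigram counts) are assumptions of this sketch:

```python
from collections import Counter
from itertools import product

# Corpus from the example above, lowercased.
sentences = [
    "this is the malt".split(),
    "that lay in the house that jack built".split(),
]

# P(w | start): how often w begins a sentence.
starts = Counter(s[0] for s in sentences)
p_start = {w: c / len(sentences) for w, c in starts.items()}

# Count-based bigram estimates P(w2 | w1).
bigrams = Counter()
contexts = Counter()
for s in sentences:
    for w1, w2 in zip(s, s[1:]):
        bigrams[(w1, w2)] += 1
        contexts[w1] += 1

def p_bigram(w2, w1):
    return bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0

vocab = sorted({w for s in sentences for w in s})

# All length-1 sequences: sum of P(w | start).
total_1 = sum(p_start.get(w, 0.0) for w in vocab)

# All length-2 sequences: sum of P(w1 | start) * P(w2 | w1).
total_2 = sum(p_start.get(w1, 0.0) * p_bigram(w2, w1)
              for w1, w2 in product(vocab, repeat=2))

print(total_1, total_2)  # both come out as 1.0 for this corpus
```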

Note: This approach has limitations with memory: an n-gram model only conditions on the previous n-1 words. LSTMs are better for next-word prediction because they can remember longer contexts.
