N-Gram Language Models
Predicting the next word in a sentence
Example: get the probability of "house" given "This is a" → four-gram

Process:
Get a corpus:
This is a house
The turtle is fat
This is a turtle
They lay on the ground
1 sentence has "house" given "This is a"
2 four-grams in total have "This is a"
So P(house | This is a) = 1/2
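A minimal sketch of this counting in Python, assuming whitespace tokenization and lowercasing (the notes above don't specify either):

```python
from collections import Counter

# Corpus from the example above
corpus = [
    "This is a house",
    "The turtle is fat",
    "This is a turtle",
    "They lay on the ground",
]

def ngrams(tokens, n):
    """All n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

fourgram_counts = Counter()
for sentence in corpus:
    fourgram_counts.update(ngrams(sentence.lower().split(), 4))

# Count four-grams that start with the context "this is a", and the one that
# continues with "house", then divide: P(house | this is a)
context = ("this", "is", "a")
context_total = sum(c for gram, c in fourgram_counts.items() if gram[:3] == context)
p_house = fourgram_counts[context + ("house",)] / context_total
print(p_house)  # 1 / 2 = 0.5
```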
Example: get the probability of "Jack" given "that" → bigram

Same process as above, with the corpus:
This is a house
This is a turtle
This is a house that Jack built
They lay on the ground
The turtle is fat
That lay in the house that Jack built
2 bigrams have "Jack" given "that"
3 bigrams in total have "that"
So P(Jack | that) = 2/3
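The same counting works for the bigram case; the sketch below assumes "That" and "that" are treated as the same word via lowercasing:

```python
from collections import Counter

# Corpus from the example above
corpus = [
    "This is a house",
    "This is a turtle",
    "This is a house that Jack built",
    "They lay on the ground",
    "The turtle is fat",
    "That lay in the house that Jack built",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = sentence.lower().split()
    bigram_counts.update(zip(tokens, tokens[1:]))   # pairs of adjacent words
    context_counts.update(tokens[:-1])              # words that start a bigram

p_jack = bigram_counts[("that", "jack")] / context_counts["that"]
print(p_jack)  # 2 / 3 ≈ 0.667
```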
Predicting the probability of a sequence of words
Sequence of words = (w1, w2, w3, ...)
Bigram Language Model (n=2)
Without sentence boundaries: P(Sequence of words) = P(w1) P(w2|w1) ... P(wk|wk-1)
With start and end symbols: P(Sequence of words) = P(w1|start) P(w2|w1) ... P(wk|wk-1) P(end|wk)
Example:
Corpus: "This is the malt", "That lay in the house that Jack built"
P(this is the house) = P(this) P(is|this) P(the|is) P(house|the)
Note: P(this) is not calculated as a plain unigram P(w1); it is calculated as P(w1|start), where the count of the start symbol is 2, since the corpus has two sentences and therefore two ways to start.
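A sketch of this worked example, using an explicit <s> symbol for "start" (the symbol name is an arbitrary choice, and the end symbol is left out, matching the product above):

```python
from collections import Counter

# Corpus from the example above
corpus = [
    "This is the malt",
    "That lay in the house that Jack built",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.lower().split()     # prepend the start symbol
    bigram_counts.update(zip(tokens, tokens[1:]))
    context_counts.update(tokens[:-1])

def p(word, context):
    """MLE bigram probability P(word | context)."""
    return bigram_counts[(context, word)] / context_counts[context]

# P(this is the house) = P(this|<s>) P(is|this) P(the|is) P(house|the)
prob = p("this", "<s>") * p("is", "this") * p("the", "is") * p("house", "the")
print(prob)  # (1/2) * 1 * 1 * (1/2) = 0.25
```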
Question:
Is this normalized for all sequence lengths? (i.e. Do they all add up to probability 1.0?)
No! They are normalized separately for each sequence length.
Length 1: P(this) + P(that) = 1.0
Length 2: P(this this) + P(this is) + ... + P(built built) = 1.0
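This can be checked numerically on the same corpus; the sketch below sums the bigram-model probabilities of every length-1 and length-2 sequence over the vocabulary (no end symbol, matching the sums above):

```python
from collections import Counter

corpus = ["This is the malt", "That lay in the house that Jack built"]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.lower().split()
    bigram_counts.update(zip(tokens, tokens[1:]))
    context_counts.update(tokens[:-1])

def p(word, context):
    """MLE bigram probability; 0.0 if the context never appears as a context."""
    if context_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, word)] / context_counts[context]

vocab = sorted({w for s in corpus for w in s.lower().split()})

len1 = sum(p(w, "<s>") for w in vocab)                                # P(this) + P(that) + ...
len2 = sum(p(w1, "<s>") * p(w2, w1) for w1 in vocab for w2 in vocab)  # all two-word sequences
print(len1, len2)  # both equal 1.0
```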
Note: This approach has limitations with memory, as reported here. LSTMs are better for next-word prediction, since they remember longer context.