Let's begin with the task of computing $P(w \mid h)$, the probability of a word $w$ given some history $h$. Suppose the history $h$ is "its water is so transparent that" and we want to know the probability that the next word is the:

$$P(\text{the} \mid \text{its water is so transparent that})$$

One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see its water is so transparent that, and count the number of times this is followed by the. This would be answering the question "Out of the times we saw the history $h$, how many times was it followed by the word $w$?", as follows:

$$P(\text{the} \mid \text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})} \tag{eq-n-gram-1}$$
With a large enough corpus, such as the web, we can compute these counts and estimate the probability from eq-n-gram-1.
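To make the counting concrete, here is a minimal Python sketch of this relative-frequency estimate. The tiny corpus, the helper name `relative_frequency`, and the printed value are all illustrative assumptions, not counts from any real web-scale collection.

```python
# A sketch of the relative-frequency estimate in eq-n-gram-1.
# The toy corpus and the example result are invented for illustration.

def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history word) / C(history)."""
    h = len(history)
    history_count = 0
    followed_count = 0
    for i in range(len(tokens) - h + 1):
        if tokens[i:i + h] == history:
            history_count += 1
            if i + h < len(tokens) and tokens[i + h] == word:
                followed_count += 1
    return followed_count / history_count if history_count else 0.0

corpus = ("its water is so transparent that the fish are visible "
          "its water is so transparent that you can see the bottom").split()
history = "its water is so transparent that".split()
print(relative_frequency(corpus, history, "the"))  # 0.5 on this toy corpus
```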
While this method of estimating probabilities directly from counts works fine in many cases, it turns out that even the web isn't big enough to give us good estimates in most cases. This is because language is creative; new sentences are created all the time, and we won't always be able to count entire sentences. Even simple extensions of the example sentence may have counts of zero on the web (such as "Walden Pond's water is so transparent that the"; well, used to have counts of zero).
Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so transparent, we could do it by asking "Out of all possible sequences of five words, how many of them are its water is so transparent?" We would have to get the count of its water is so transparent and divide by the sum of the counts of all possible five-word sequences. That seems rather a lot to estimate!
For this reason, we'll need to introduce more clever ways of estimating the probability of a word $w$ given a history $h$, or the probability of an entire word sequence $W$. To represent the probability of a particular random variable $X_i$ taking on the value "the", or $P(X_i = \text{the})$, we will use the simplification $P(\text{the})$. We'll represent a sequence of $n$ words either as $w_1 \ldots w_n$ or $w_{1:n}$ (so the expression $w_{1:n-1}$ means the string $w_1, w_2, \ldots, w_{n-1}$). For the joint probability of each word in a sequence having a particular value $P(X_1 = w_1, X_2 = w_2, \ldots, X_n = w_n)$ we'll use $P(w_1, w_2, \ldots, w_n)$.
Now, how can we compute probabilities of entire sequences like $P(w_1, w_2, \ldots, w_n)$? One thing we can do is decompose this probability using the chain rule of probability:

$$P(X_1 \ldots X_n) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_{1:2}) \cdots P(X_n \mid X_{1:n-1}) = \prod_{k=1}^{n} P(X_k \mid X_{1:k-1})$$
Applying the chain rule to words, we get

$$P(w_{1:n}) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_{1:2}) \cdots P(w_n \mid w_{1:n-1}) = \prod_{k=1}^{n} P(w_k \mid w_{1:k-1})$$
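As a sanity check on the chain rule, the short Python sketch below multiplies a set of conditional probabilities for the sentence "its water is so transparent"; every number in the table is made up purely to show how the product $P(w_1)\,P(w_2 \mid w_1) \cdots$ is assembled.

```python
# A sketch of the chain-rule expansion for a toy sentence. Every probability
# below is invented purely to show how the product is assembled.
from math import prod

sentence = "its water is so transparent".split()

# Hypothetical conditionals P(w_k | w_1 .. w_{k-1}), keyed by the prefix w_1 .. w_k.
conditionals = {
    ("its",): 0.02,                                     # P(its)
    ("its", "water"): 0.10,                             # P(water | its)
    ("its", "water", "is"): 0.30,                       # P(is | its water)
    ("its", "water", "is", "so"): 0.05,                 # P(so | its water is)
    ("its", "water", "is", "so", "transparent"): 0.01,  # P(transparent | its water is so)
}

# P(w_1 .. w_n) = product over k of P(w_k | w_1 .. w_{k-1})
joint = prod(conditionals[tuple(sentence[:k + 1])] for k in range(len(sentence)))
print(joint)  # roughly 3e-07 with these invented numbers
```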
The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. The assumption that the probability of a word depends only on the previous word is called a Markov assumption.
The general equation for this n-gram approximation to the conditional probability of the next word in a sequence is

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})$$
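The sketch below illustrates the approximation itself: a hypothetical helper (here called `ngram_context`) keeps only the last $N-1$ words of the history, and a toy bigram table stands in for estimated probabilities. Both the helper and the table are assumptions for illustration, not part of the chapter.

```python
# A sketch of the Markov truncation behind the n-gram approximation: condition
# on only the last N-1 words. The helper name and the toy bigram table are
# illustrative assumptions.

def ngram_context(history, n):
    """Keep only the last n-1 words of the history (the Markov assumption)."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

# Toy table of P(w_n | w_{n-1}) values for a bigram (N = 2) model; numbers invented.
bigram_probs = {("that",): {"the": 0.4, "you": 0.2}}

history = "its water is so transparent that".split()
context = ngram_context(history, n=2)   # ('that',)
print(bigram_probs[context]["the"])     # stands in for P(the | full history)
```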
For the general case of maximum likelihood n-gram parameter estimation:

$$P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})} \tag{eq-n-gram-2}$$
Equation eq-n-gram-2 estimates the n-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of a prefix. This ratio is called a relative frequency. This use of relative frequencies as a way to estimate probabilities is an example of maximum likelihood estimation (MLE). In MLE, the resulting parameter set maximizes the likelihood of the training set $T$ given the model $M$ (i.e., $P(T \mid M)$). For example, suppose the word Chinese occurs 400 times in a corpus of a million words like the Brown corpus. What is the probability that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is $\frac{400}{1{,}000{,}000}$ or .0004. Now .0004 is not the best possible estimate of the probability of Chinese occurring in all situations; it might turn out that in some other corpus or context Chinese is a very unlikely word. But it is the probability that makes it most likely that Chinese will occur 400 times in a million-word corpus. Ways to modify the MLE estimates slightly to get better probability estimates are discussed in Jurafsky and Martin (2020), chapter 3.4.
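The following Python sketch applies eq-n-gram-2 with $N = 2$ (bigrams) to a toy corpus; the corpus and the printed values are invented for illustration and are unrelated to the Brown-corpus example above.

```python
# A sketch of maximum likelihood (relative-frequency) estimation of bigram
# probabilities, i.e. eq-n-gram-2 with N = 2. The corpus is a toy stand-in.
from collections import Counter

corpus = "its water is so transparent that the water is so clear".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram(prev, word):
    """P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("is", "so"))     # 1.0: both occurrences of 'is' are followed by 'so'
print(mle_bigram("so", "clear"))  # 0.5: 'so' occurs twice, once followed by 'clear'
```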