Taking the logarithm of a probability turns a very small number into a more manageable one, and it turns products of probabilities into sums.
Category Log Probabilities
The Log Probability of a category is merely the log of the proportion of texts in a sample that belong to that category.
If you take a data set of text messages in which 70% are not spam and 30% are spam, the proportion that is not spam is 0.70 and the proportion that is spam is 0.30.
The log probabilities are as follows:
>>> import numpy as np
>>> np.log(0.70)
np.float64(-0.35667494393873245)
>>> np.log(0.30)
np.float64(-1.2039728043259361)
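As a minimal sketch of this step (the labels and variable names here are illustrative, not from any particular library):
>>> from collections import Counter
>>> labels = ['not spam'] * 7 + ['spam'] * 3  # hypothetical 70/30 data set
>>> counts = Counter(labels)
>>> total = sum(counts.values())
>>> {category: np.log(n / total) for category, n in counts.items()}
{'not spam': np.float64(-0.35667494393873245), 'spam': np.float64(-1.2039728043259361)}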
Word Log Probabilities
The Log Probability of a word is merely the log of the proportion of times the term appears in the sample of the author.
For instance, if an author uses the word 'the' 23 times and has a sample of 1000 words, the proportion is 0.023.
Likewise, the log is as follows:
>>> np.log(0.023)
np.float64(-3.7722610630529876)
However, before generating the log probability of each term, we want to ensure that every word used by any author appears in the dictionary of every author.
This is because we want each author to have a count above zero for every term, even if they never used it; otherwise, the log probability of an unused term would be undefined (the log of zero).
To do this, we increment the total count of each term by one before calculating the probability (a form of add-one smoothing). As such, the word 'the' will have a count of 24, and a word not used by the stated author, but used by a different author, will have a count of 1.
As such, the log probability for the word 'the', where it appears 23 times in a sample of 1000 words, is:
>>> np.log(0.024)
np.float64(-3.7297014486341915)
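As a sketch of the smoothing step (the counts, vocabulary, and sample size here are illustrative; note that, following the convention above, the denominator stays at the raw sample size):
>>> author_counts = {'the': 23, 'of': 11}    # words the stated author used
>>> vocabulary = {'the', 'of', 'balcony'}    # every word used by any author
>>> total = 1000                             # words in this author's sample
>>> smoothed = {w: author_counts.get(w, 0) + 1 for w in vocabulary}
>>> word_logprobs = {w: np.log(n / total) for w, n in smoothed.items()}
>>> word_logprobs['the']
np.float64(-3.7297014486341915)
>>> word_logprobs['balcony']    # never used by this author: count becomes 1, not 0
np.float64(-6.907755278982137)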
Now that we have collected the log probability of each category and the log probability of each word, we can use these values to 'predict' which category a piece of text belongs to.
Suppose we have the category log probabilities set to:
Category | Log Probability | Proportion |
---|---|---|
1 | -2.3025850929940455 | 10% |
2 | -1.6094379124341003 | 20% |
3 | -1.2039728043259361 | 30% |
4 | -0.9162907318741551 | 40% |
Categories with a higher proportion have larger (that is, less negative) log probabilities.
Assume we have a dictionary of Word Log Probabilities for category one set to the following:
Word | Log Probability |
---|---|
of | -4.685212894 |
wood | -8.207038263 |
balcony | -8.612503371 |
the | -3.915152898 |
to | -4.574398527 |
red | -8.581731713 |
doors | -8.581731713 |
which | -6.589301548 |
is | -5.748518368 |
extraordinary | -8.900185444 |
do | -7.282448728 |
not | -5.673010816 |
believe | -8.581731713 |
mother | -7.600902460 |
would | -6.124995940 |
smiling | -8.294049640 |
Let us find the log probability of category one for the following sentence:
"** Do not believe mother! The extraordinary red balcony of wood is strong!**"
The formula for a sentence's log prob is:
$y = \log P(\text{category}) + \sum_{w \in \text{sentence}} \log P(w \mid \text{category})$
For this sentence, the calculation is:
>>> -2.3025850929940455 + -7.282448728 + -5.673010816 + -8.581731713 + -7.600902460 + -3.915152898 + -8.581731713 + -8.612503371 + -4.685212894 + -8.207038263 + -5.748518368 + -8.294049640
-79.48488595699406
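The same total can be computed programmatically. Here is a minimal sketch, assuming the table above is stored as a plain dict (the result is rounded because the final digits depend on the order of floating-point addition):
>>> category_one = -2.3025850929940455
>>> word_logprobs = {
...     'do': -7.282448728, 'not': -5.673010816, 'believe': -8.581731713,
...     'mother': -7.600902460, 'the': -3.915152898, 'smiling': -8.294049640,
...     'red': -8.581731713, 'balcony': -8.612503371, 'is': -5.748518368,
...     'of': -4.685212894, 'wood': -8.207038263,
... }
>>> words = 'do not believe mother the smiling red balcony is of wood'.split()
>>> round(category_one + sum(word_logprobs[w] for w in words), 6)
-79.484886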
To figure out which category best fits a piece of text, run the same steps against each category! The category with the highest total (the least negative score) is the model's prediction!
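Putting it all together, a minimal prediction sketch might look like this (the names are illustrative; it assumes one Word Log Probability dictionary per category, built with the add-one smoothing above so that every word has an entry in every category):
>>> def predict(words, category_logprobs, word_logprobs_by_category):
...     # Score each category: its own log probability plus the log
...     # probabilities of every word in the text, then keep the highest.
...     scores = {
...         category: log_p + sum(word_logprobs_by_category[category][w] for w in words)
...         for category, log_p in category_logprobs.items()
...     }
...     return max(scores, key=scores.get)
...

Calling predict with the sentence's words, the category log probabilities, and the per-category word dictionaries returns the name of the winning category.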