Taking the logarithm of a probability turns a very small number into a more manageable one, and it turns products of probabilities into sums.
Category Log Probabilities
The Log Probability of a category is merely the log of the proportion of texts in a sample that belong to that category.
If you take a data set of text messages in which 70% are not spam and 30% are spam, the proportion that is not spam is 0.70 and the proportion that is spam is 0.30.
The log probabilities are as follows:
>>> import numpy as np
>>> np.log(0.70)
np.float64(-0.35667494393873245)
>>> np.log(0.30)
np.float64(-1.2039728043259361)
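As a minimal sketch of this step (the labels and variable names here are illustrative, not from any particular library):
>>> from collections import Counter
>>> labels = ['not spam'] * 7 + ['spam'] * 3  # hypothetical 70/30 data set
>>> counts = Counter(labels)
>>> total = sum(counts.values())
>>> {category: np.log(n / total) for category, n in counts.items()}
{'not spam': np.float64(-0.35667494393873245), 'spam': np.float64(-1.2039728043259361)}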
Word Log Probabilities
The Log Probability of a word is merely the log of the proportion of times the term appears in the sample of the author.
For instance, if an author uses the word 'the' 23 times and has a sample of 1000 words, the proportion is 0.023.
Likewise, the log is as follows:
>>> np.log(0.023)
np.float64(-3.7722610630529876)
However, before generating the log probability of each term, we want to ensure that every word used by any author appears in the dictionary of every author.
This is because we want each author to have a count above zero for every term, even if they never used it; otherwise, the log probability of an unused term would be undefined (the log of zero).
To do this, we increment the total count of each term by one before calculating the probability (a form of add-one smoothing). As such, the word 'the' will have a count of 24, and a word not used by the stated author, but used by a different author, will have a count of 1.
As such, the log probability for the word 'the', where it appears 23 times in a sample of 1000 words, is:
>>> np.log(0.024)
np.float64(-3.7297014486341915)
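As a sketch of the smoothing step (the counts, vocabulary, and sample size here are illustrative; note that, following the convention above, the denominator stays at the raw sample size):
>>> author_counts = {'the': 23, 'of': 11}    # words the stated author used
>>> vocabulary = {'the', 'of', 'balcony'}    # every word used by any author
>>> total = 1000                             # words in this author's sample
>>> smoothed = {w: author_counts.get(w, 0) + 1 for w in vocabulary}
>>> word_logprobs = {w: np.log(n / total) for w, n in smoothed.items()}
>>> word_logprobs['the']
np.float64(-3.7297014486341915)
>>> word_logprobs['balcony']    # never used by this author: count becomes 1, not 0
np.float64(-6.907755278982137)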
Now that we have collected the log probability of each category and the log probability of each word, we can use these values to 'predict' which category a piece of text belongs to.
Suppose we have the category log probabilities set to:
Category | Log Probability | Proportion |
---|---|---|
1 | -2.3025850929940455 | 10% |
2 | -1.6094379124341003 | 20% |
3 | -1.2039728043259361 | 30% |
4 | -0.9162907318741551 | 40% |
Categories with a higher proportion have larger (that is, less negative) log probabilities.
Assume we have a dictionary of Word Log Probabilities for category one set to the following:
Word | Log Probability |
---|---|
of | -4.685212894 |
wood | -8.207038263 |
balcony | -8.612503371 |
the | -3.915152898 |
to | -4.574398527 |
red | -8.581731713 |
doors | -8.581731713 |
which | -6.589301548 |
is | -5.748518368 |
extraordinary | -8.900185444 |
do | -7.282448728 |
not | -5.673010816 |
believe | -8.581731713 |
mother | -7.600902460 |
would | -6.124995940 |
smiling | -8.294049640 |
Let us find the log probability of category one for the following sentence:
"** Do not believe mother! The extraordinary red balcony of wood is strong!**"
The formula for a sentence's log prob is:
$y = \log P(\text{category}) + \sum_{w \in \text{sentence}} \log P(w \mid \text{category})$
For this sentence, the calculation is:
>>> -2.3025850929940455 + -7.282448728 + -5.673010816 + -8.581731713 + -7.600902460 + -3.915152898 + -8.581731713 + -8.612503371 + -4.685212894 + -8.207038263 + -5.748518368 + -8.294049640
-79.48488595699406
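The same total can be computed programmatically. Here is a minimal sketch, assuming the table above is stored as a plain dict (the result is rounded because the final digits depend on the order of floating-point addition):
>>> category_one = -2.3025850929940455
>>> word_logprobs = {
...     'do': -7.282448728, 'not': -5.673010816, 'believe': -8.581731713,
...     'mother': -7.600902460, 'the': -3.915152898, 'smiling': -8.294049640,
...     'red': -8.581731713, 'balcony': -8.612503371, 'is': -5.748518368,
...     'of': -4.685212894, 'wood': -8.207038263,
... }
>>> words = 'do not believe mother the smiling red balcony is of wood'.split()
>>> round(category_one + sum(word_logprobs[w] for w in words), 6)
-79.484886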
To figure out which category best fits a piece of text, run the same steps against each category! The category with the highest total (the least negative score) is the model's prediction!
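Putting it all together, a minimal prediction sketch might look like this (the names are illustrative; it assumes one Word Log Probability dictionary per category, built with the add-one smoothing above so that every word has an entry in every category):
>>> def predict(words, category_logprobs, word_logprobs_by_category):
...     # Score each category: its own log probability plus the log
...     # probabilities of every word in the text, then keep the highest.
...     scores = {
...         category: log_p + sum(word_logprobs_by_category[category][w] for w in words)
...         for category, log_p in category_logprobs.items()
...     }
...     return max(scores, key=scores.get)
...

Calling predict with the sentence's words, the category log probabilities, and the per-category word dictionaries returns the name of the winning category.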