Moving on in the last CS50 AI lecture (Language), we then look at information retrieval, still within the 'bag of words' model; in other words, we are doing topic modelling. A three-step dive into the problem illustrates what can be done.
Step 1. We want to know what each text is about, and use simple term frequency. The example looks at the corpus of Sherlock Holmes stories. We extract the most frequently used terms, sure enough, but the result is not very helpful: English contains a lot of helper words like 'the', 'and', and 'to', and these are common to all texts. So for the next iteration, we will ignore them.
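
A minimal sketch of what Step 1 might look like in Python (not the lecture's actual source; the 'holmes/' directory of plain-text stories and the tokenizer are my assumptions):

    # Step 1 sketch: raw term frequency over an assumed corpus layout.
    import os
    import re
    from collections import Counter

    def tokenize(text):
        # Lowercase and keep alphabetic runs only.
        return re.findall(r"[a-z]+", text.lower())

    counts = Counter()
    for filename in os.listdir("holmes"):   # assumed directory of .txt stories
        with open(os.path.join("holmes", filename), encoding="utf-8") as f:
            counts.update(tokenize(f.read()))

    for word, n in counts.most_common(10):
        print(word, n)    # dominated by 'the', 'and', 'to', ...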
Step 2. The program is fed a list of function words to be ignored in the count. Better, but the top of the list is still uninformative: 'Holmes' appears in every story.
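
Continuing the sketch, dropping the function words is just a filter over the counts; the stopword set below is a tiny hand-made sample standing in for the real list:

    # Step 2 sketch: filter out function ("stop") words before ranking.
    STOPWORDS = {"the", "and", "to", "of", "a", "in", "that", "it", "i", "he"}

    filtered = Counter({w: n for w, n in counts.items() if w not in STOPWORDS})
    for word, n in filtered.most_common(10):
        print(word, n)    # 'holmes' now floats to the top of every story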
Step 3. Getting clever, one then uses inverse document frequency (idf): we want the words that are frequent in one story, but we penalize those that occur in all of them. The winning formula multiplies term frequency by inverse document frequency, giving the tf-idf score.
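
In formula terms, score(word, story) = tf(word, story) × log(N / df(word)), where N is the number of stories and df counts how many stories contain the word, so a word appearing in every story scores zero. A sketch, reusing tokenize and the assumed 'holmes/' directory from above:

    # Step 3 sketch: tf-idf over per-story counts.
    import math

    tfs = {}                                 # story -> Counter of term frequencies
    for filename in os.listdir("holmes"):
        with open(os.path.join("holmes", filename), encoding="utf-8") as f:
            tfs[filename] = Counter(tokenize(f.read()))

    n_docs = len(tfs)

    def idf(word):
        # log(N / number of stories containing the word); 0 if it is in all of them.
        containing = sum(1 for tf in tfs.values() if word in tf)
        return math.log(n_docs / containing)

    story = next(iter(tfs))                  # pick any one story
    scores = {w: n * idf(w) for w, n in tfs[story].items()}
    for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(word, round(score, 2))         # frequent here, rare elsewhere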