Moving on in the last CS50 AI lecture (Language), we then look at information retrieval, still within the 'bag of words' model; in other words, we are doing topic modelling. A three-step dive into the problem illustrates what can be done.
Step 1. We want to know what each text is about, and use simple term frequency. The example looks at the corpus of Sherlock Holmes stories. We extract the most frequently used terms, sure enough, but the result is not very helpful: English contains a lot of helper words like 'the', 'and', and 'to', and these are common to all texts. So for the next iteration, we will ignore them.
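
A minimal sketch of what Step 1 might look like in Python (not the lecture's actual source; the 'holmes/' directory of plain-text stories and the tokenizer are my assumptions):

    # Step 1 sketch: raw term frequency over an assumed corpus layout.
    import os
    import re
    from collections import Counter

    def tokenize(text):
        # Lowercase and keep alphabetic runs only.
        return re.findall(r"[a-z]+", text.lower())

    counts = Counter()
    for filename in os.listdir("holmes"):   # assumed directory of .txt stories
        with open(os.path.join("holmes", filename), encoding="utf-8") as f:
            counts.update(tokenize(f.read()))

    for word, n in counts.most_common(10):
        print(word, n)    # dominated by 'the', 'and', 'to', ...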
Step 2. The program is fed a list of function words to be ignored in the count. Better, but the top of the list is still uninformative: 'Holmes' appears in every story.
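
Continuing the sketch, dropping the function words is just a filter over the counts; the stopword set below is a tiny hand-made sample standing in for the real list:

    # Step 2 sketch: filter out function ("stop") words before ranking.
    STOPWORDS = {"the", "and", "to", "of", "a", "in", "that", "it", "i", "he"}

    filtered = Counter({w: n for w, n in counts.items() if w not in STOPWORDS})
    for word, n in filtered.most_common(10):
        print(word, n)    # 'holmes' now floats to the top of every story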
Step 3. Getting clever, one then uses inverse document frequency (idf): we want the words that are frequent in one story, but we penalize those that occur in all of them. The winning formula multiplies term frequency by inverse document frequency, giving the tf-idf score.
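
In formula terms, score(word, story) = tf(word, story) × log(N / df(word)), where N is the number of stories and df counts how many stories contain the word, so a word appearing in every story scores zero. A sketch, reusing tokenize and the assumed 'holmes/' directory from above:

    # Step 3 sketch: tf-idf over per-story counts.
    import math

    tfs = {}                                 # story -> Counter of term frequencies
    for filename in os.listdir("holmes"):
        with open(os.path.join("holmes", filename), encoding="utf-8") as f:
            tfs[filename] = Counter(tokenize(f.read()))

    n_docs = len(tfs)

    def idf(word):
        # log(N / number of stories containing the word); 0 if it is in all of them.
        containing = sum(1 for tf in tfs.values() if word in tf)
        return math.log(n_docs / containing)

    story = next(iter(tfs))                  # pick any one story
    scores = {w: n * idf(w) for w, n in tfs[story].items()}
    for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(word, round(score, 2))         # frequent here, rare elsewhere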