Friday, January 30, 2015

Reading notes for week 4

Chapter 6

With Boolean queries, a document either matches (1) or does not match (0). When the number of matching documents is too large, we need a ranking method to order the results. This chapter introduces methods for scoring a document.

The first part covers parametric and zone indexes. A parametric index is designed to select only the documents matching a specified field value (for example, a date). Zones are similar to fields, except that the contents of a zone can be arbitrary free text. The difference between parametric and zone indexes is that the dictionary for a parametric index comes from a fixed vocabulary (a set of languages, or a set of dates), whereas the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone. The weighted zone score is defined to be a linear combination of zone scores, where each zone of the document contributes a Boolean value.
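To make the weighted zone score concrete, here is a minimal sketch in Python. The zone names and weights are made up for illustration (they are not from the book), and the weights are assumed to sum to 1.

```python
def weighted_zone_score(zone_matches, zone_weights):
    """Linear combination of Boolean zone scores.

    zone_matches: dict mapping zone name -> True/False (did the query match there?)
    zone_weights: dict mapping zone name -> weight; assumed to sum to 1
    """
    return sum(
        weight * (1 if zone_matches.get(zone, False) else 0)
        for zone, weight in zone_weights.items()
    )


# Hypothetical zones and weights, chosen only for this example.
weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}
matches = {"title": True, "abstract": False, "body": True}

score = weighted_zone_score(matches, weights)
print(score)  # title and body match, so the score is about 0.5 + 0.2
```

A document that matches in no zone gets 0, and one that matches in every zone gets 1, so the score always falls in the interval [0, 1].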

In term frequency and weighting, we assign each term in each document a weight that depends on the number of occurrences of the term in the document, and compute a score between a query term t and a document d based on the weight of t in d. Another problem mentioned in this part is that certain terms have little or no discriminating power in determining relevance. We can scale down the weights of terms with high collection frequency, defined to be the total number of occurrences of a term in the collection.
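The weighting idea above can be sketched as follows. Note one assumption: this sketch scales weights down with inverse document frequency, idf = log10(N / df), which counts the documents containing a term rather than its total occurrences in the collection. The toy documents and the function name tf_idf are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf.

    docs: list of token lists. idf is log10(N / df), where df is the
    number of documents that contain the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append(
            {t: term_freq[t] * math.log10(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weights


# Toy collection, made up for illustration.
docs = [
    ["the", "car", "insurance"],
    ["the", "car", "car"],
    ["the", "best", "auto"],
]
weights = tf_idf(docs)
print(weights[0]["the"])  # "the" occurs in every document, so its weight is 0
print(weights[1]["car"])  # "car" occurs twice here and in only 2 of 3 documents
```

A term that appears in every document ends up with weight 0, which is exactly the "little or no discriminating power" case.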

The vector space model for scoring represents each document as a vector that captures the relative importance of the terms in the document. A first attempt at computing the similarity between two documents is simply to compute the magnitude of the vector difference between their vectors, though this penalizes documents that differ mainly in length, which is why cosine similarity is used instead. The author introduces many algorithms for the vector space model in this chapter; they are hard to summarize, and I think I have to spend more time reading this chapter and strengthen my mathematics background.
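As a small illustration of comparing document vectors, the sketch below computes both the magnitude of the vector difference and the cosine similarity for sparse {term: weight} vectors. The vectors and function names are assumptions made up for this example, not code from the book.

```python
import math


def difference_magnitude(u, v):
    """Euclidean length of the vector difference u - v (sparse dict vectors)."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# d2 has the same term proportions as d1 but is "twice as long".
d1 = {"car": 1.0, "insurance": 2.0}
d2 = {"car": 2.0, "insurance": 4.0}
d3 = {"auto": 3.0}

print(difference_magnitude(d1, d2))  # clearly non-zero despite identical content mix
print(cosine_similarity(d1, d2))     # close to 1: same direction
print(cosine_similarity(d1, d3))     # 0.0: no terms in common
```

The first two lines of output show why length matters: d1 and d2 point in the same direction, so cosine similarity treats them as nearly identical even though their difference vector is not small.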

Chapter 7
In chapter 6, we developed the theory of computing term weights in a document for scoring, and introduced the vector space model and the basic cosine scoring algorithm. In chapter 7, the author presents methods for speeding up this computation and outlines a complete search engine and how the pieces fit together.
Inexact top-K document retrieval: aims to dramatically lower the cost of computing the top K documents without materially altering the user's perceived relevance of the top K results.
Index elimination:
1). We consider only documents containing terms whose idf exceeds a preset threshold.
2). We consider only documents that contain many of the query terms.
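The two index elimination heuristics can be sketched together as below. Combining them in one function is my own choice for the example, and the names candidate_docs, idf_threshold and min_terms, as well as the toy postings and idf values, are hypothetical.

```python
def candidate_docs(query_terms, postings, idf, idf_threshold, min_terms):
    """Pick the documents worth scoring for a query.

    1) Drop query terms whose idf does not exceed idf_threshold.
    2) Keep only documents that contain at least min_terms of the remaining terms.
    """
    kept_terms = [t for t in query_terms if idf.get(t, 0.0) > idf_threshold]

    hits = {}  # doc id -> number of kept query terms it contains
    for term in kept_terms:
        for doc_id in postings.get(term, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1

    return {doc_id for doc_id, count in hits.items() if count >= min_terms}


# Toy index, made up for illustration.
postings = {"the": [1, 2, 3, 4], "car": [1, 2], "insurance": [2, 3]}
idf = {"the": 0.0, "car": 0.3, "insurance": 0.3}

docs_to_score = candidate_docs(
    ["the", "car", "insurance"], postings, idf, idf_threshold=0.1, min_terms=2
)
print(docs_to_score)  # {2}: the only document containing both "car" and "insurance"
```

Skipping the low-idf term "the" means its long postings list is never traversed, which is where most of the savings come from.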
Champion lists: In many search engines, a measure of quality g(d) is available for each document d that is query-independent and thus static.

Components of an information retrieval system:

