Friday, January 30, 2015

Reading notes for week 4

Chapter 6

We use Boolean queries to do searches: 0 represents "does not match" and 1 represents "match". When the number of matching documents is too large, we need a ranking method to order those result documents. This chapter introduces methods for scoring a document.

The first part covers parametric and zone indexes. Parametric indexes are designed to select only the documents matching a specific query on metadata fields. Zones are similar to fields, except that the contents of a zone can be arbitrary free text. The difference between parametric and zone indexes is that the dictionary for a parametric index comes from a fixed vocabulary (a set of languages, or a set of dates), while the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone. The weighted zone score is defined as a linear combination of zone scores, where each zone of the document contributes a Boolean value.
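Here is a minimal sketch of weighted zone scoring as I understand it; the zone names and weights below are made-up examples, not values from the book.

```python
def weighted_zone_score(query_terms, doc_zones, zone_weights):
    """Score a document as a linear combination of Boolean zone matches.

    doc_zones: zone name -> text of that zone.
    zone_weights: zone name -> weight g_i (the weights should sum to 1).
    """
    score = 0.0
    for zone, weight in zone_weights.items():
        words = doc_zones.get(zone, "").lower().split()
        # The zone contributes 1 if every query term occurs in it, 0 otherwise.
        s_i = 1 if all(t in words for t in query_terms) else 0
        score += weight * s_i
    return score

# Example with hypothetical zones and weights.
doc = {"title": "shakespeare plays", "body": "the complete plays of william shakespeare"}
print(weighted_zone_score(["shakespeare"], doc, {"title": 0.6, "body": 0.4}))  # 1.0
```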

In term frequency and weighting, we assign each term in each document a weight that depends on the number of occurrences of the term in the document, and compute a score between a query term t and a document d based on the weight of t in d. Another problem mentioned in this part is that certain terms have little or no discriminating power in determining relevance. We can scale down the weights of terms with high collection frequency, defined as the total number of occurrences of a term in the collection.
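As a sketch of this tf-idf idea (a toy example of my own, not code from the book): the weight of term t in document d is its term frequency scaled down by how many documents the term appears in.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Toy tf-idf: docs is a list of token lists; returns one {term: weight} dict per doc."""
    N = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log10(N / df[t]) for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "the", "cat"]]
print(tf_idf_weights(docs))  # "the" appears in every document, so its weight is 0
```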

The vector space model for scoring represents each document as a vector that captures the relative importance of its terms. The similarity between two documents is then computed not from the magnitude of the vector difference but from the cosine of the angle between the two vectors, which requires normalizing by the vectors' magnitudes. In this chapter, the author introduced many algorithms in the vector space model that are hard to summarize here; I think I have to spend more time reading this chapter and strengthening my background in mathematics.
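A minimal cosine-similarity sketch (my own illustration of the basic idea, not the book's pseudocode):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors, given as dicts."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity({"cat": 1.2, "sat": 0.5}, {"cat": 0.9, "dog": 1.1}))
```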

Chapter 7
In chapter 6 we developed the theory of computing term weights in a document in order to score documents, and introduced the vector space model and the basic cosine scoring algorithm. In chapter 7 the author comes up with methods for speeding up this computation and outlines a complete search engine, showing how the pieces fit together.
Inexact top-k document retrieval --> aims to dramatically lower the cost of computing the k highest-scoring documents without materially altering the users' perceived relevance of the top-k results.
Index elimination:
1) We only consider documents containing terms whose idf exceeds a preset threshold.
2) We only consider documents that contain many (or all) of the query terms.
Champion lists: for each term in the dictionary, precompute the set of the r documents with the highest weights for that term, and at query time compute scores only over the union of these lists. Static quality scores: in many search engines, we have available a measure of quality g(d) for each document d that is query-independent and thus static.
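A rough sketch of the champion-list idea (my own toy version; the variable names and the simple additive scoring are assumptions, not the book's exact method):

```python
import heapq

def build_champion_lists(postings, r):
    """postings: term -> {docID: weight}. Keep only the r highest-weighted docs per term."""
    return {t: {d for d, _ in heapq.nlargest(r, docs.items(), key=lambda kv: kv[1])}
            for t, docs in postings.items()}

def top_k(query_terms, postings, champions, k):
    """Score only documents that appear in a champion list of some query term."""
    candidates = set().union(*(champions.get(t, set()) for t in query_terms))
    scores = {d: sum(postings.get(t, {}).get(d, 0.0) for t in query_terms) for d in candidates}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

postings = {"brutus": {1: 2.0, 2: 0.5, 3: 1.0}, "caesar": {2: 1.5, 3: 0.2}}
champions = build_champion_lists(postings, r=2)
print(top_k(["brutus", "caesar"], postings, champions, k=2))
```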

Components of an information retrieval system:


Muddiest points for week 3

I am still confused about the gamma code. It is said that 1 and 0 can be switched in unary code, so 3 can be coded either as 0001 or 1110, but how do we know which convention is being used? I think the instructor should give us more examples in class, for the gamma code and for the other material.

Friday, January 23, 2015

Reading notes for week 3

Chapter 4. Index Construction

Some basic knowledge of computer hardware:

1. We keep disk data that is frequently used in main memory (caching).
2. Seek time: we want to minimize the time the disk head spends seeking, for example by reading contiguous chunks of data.
3. The operating system reads and writes entire blocks, which are held in a buffer in main memory.

Index methods for large collections:
1. Represent terms by termIDs (instead of the strings used in the methods introduced in chapter 1).

Blocked sort-based indexing algorithm (BSBI):

1. Segment the collection into parts of equal size.
2. Sort the termID-docID pairs of each part in memory.
3. Store the intermediate sorted results on disk and merge them into the final index.
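Below is a very simplified in-memory imitation of the BSBI idea (split into blocks, sort each block, then merge); real BSBI writes each sorted block to disk before the merge, which I skip here.

```python
import heapq

def bsbi_index(term_doc_pairs, block_size):
    """term_doc_pairs: iterable of (termID, docID). Returns termID -> sorted list of docIDs."""
    pairs = list(term_doc_pairs)
    # 1. Split into equal-sized blocks and sort each block (in real BSBI each block goes to disk).
    blocks = [sorted(pairs[i:i + block_size]) for i in range(0, len(pairs), block_size)]
    # 2. Merge the sorted blocks into the final index.
    index = {}
    for term_id, doc_id in heapq.merge(*blocks):
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:   # skip duplicate docIDs
            postings.append(doc_id)
    return index

print(bsbi_index([(2, 1), (1, 3), (1, 2), (2, 5), (1, 1)], block_size=2))
```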


Single-pass in-memory indexing (SPIMI)

(writes each block’s dictionary to disk → can index collections of any size as long as there is enough disk space available)

Each postings list is dynamic, and the SPIMI algorithm is faster than BSBI because no sorting of postings is required, and it saves memory because termIDs do not need to be stored. We still have to sort the terms of a block (line 11 of the book's pseudocode) before writing it out, because we want to write the postings lists in lexicographic order to facilitate the final merging step.
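A stripped-down, single-block version of the SPIMI idea (my own sketch; the real algorithm writes each block to disk when memory fills up and merges the blocks at the end):

```python
def spimi_invert(token_stream):
    """token_stream: iterable of (term, docID) pairs for one block.
    Builds the dictionary and postings lists in a single pass, without a global sort of postings."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])   # add the term to the dictionary on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                  # append the docID to the growing postings list
    # Sort the terms so postings lists are written out in lexicographic order for the final merge.
    return {term: dictionary[term] for term in sorted(dictionary)}

print(spimi_invert([("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 2)]))
```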

Distributed indexing (for the web) is an application of MapReduce. The map phase splits the input data and emits key-value pairs, and in the reduce phase all values for a given key are brought together and processed in one place.
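A toy, single-machine imitation of the map and reduce phases for building an inverted index (only meant to illustrate the data flow; it is not a distributed implementation):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, docID) key-value pairs for one input split (here, one document)."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Collect all values for a given key and emit (term, postings list)."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return {term: sorted(set(doc_ids)) for term, doc_ids in grouped.items()}

pairs = map_phase(1, "caesar was killed") + map_phase(2, "brutus killed caesar")
print(reduce_phase(pairs))
```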

Dynamic indexing:
Dynamic indexing is complex, so some large search engines adopt a reconstruction-from-scratch strategy: instead of maintaining their indexes dynamically, they periodically build a new index from scratch.

Chapter 5. Index compression
The benefits of compression are very clear:
less disk space →
increased use of caching →
faster transfer of data from disk to memory.

This chapter introduced lossless compression, which preserves all of the information (in contrast to the lossy compression mentioned in chapter 6).
Heaps' law estimates the number of distinct terms (the vocabulary size M) as a function of the number of tokens T in the collection: M = kT^b. It has two implications (see the small sketch after them):

(i)   the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached.

(ii)   the size of the dictionary is quite large for large collections
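A quick numerical illustration of Heaps' law M = kT^b; the parameter values k = 44 and b = 0.49 are, as far as I remember, the ones the book fits for the RCV1 collection, so treat them as placeholders.

```python
def heaps_law(num_tokens, k=44, b=0.49):
    """Estimated vocabulary size M = k * T^b for a collection with T tokens."""
    return k * num_tokens ** b

# The vocabulary keeps growing with the collection instead of levelling off.
for T in (10**5, 10**6, 10**7, 10**8):
    print(T, int(heaps_law(T)))
```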

Zipf's law models the distribution of terms across the collection: roughly, the i-th most frequent term has a collection frequency proportional to 1/i.
→ It helps us understand how terms are distributed across documents, which in turn helps us characterize the properties of algorithms for compressing postings lists.

In the next section, the author introduced dictionary compression, the main goal of which is to fit the dictionary in main memory, or at least a large portion of it, to support high query throughput. After reading this section, I realize that even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections.

The compression of the inverted (postings) file covers variable byte codes and γ codes. The former encodes a gap with an integral number of bytes, while the latter adapts the length of the code at the finer-grained bit level. The entropy H(p) gives a lower bound that determines the coding properties.
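A small sketch of both encodings as I understand them from the chapter (my own implementation, so details such as the exact bit conventions may differ from the book's presentation):

```python
def vb_encode(gap):
    """Variable byte code: 7 payload bits per byte, high bit set on the last byte of a gap."""
    out = []
    while True:
        out.insert(0, gap % 128)
        if gap < 128:
            break
        gap //= 128
    out[-1] += 128                 # mark the final byte with the continuation bit
    return bytes(out)

def gamma_encode(gap):
    """Gamma code: unary length (N ones then a zero) followed by the N-bit offset."""
    offset = bin(gap)[3:]          # binary representation of the gap without its leading 1
    length = "1" * len(offset) + "0"
    return length + offset

print(vb_encode(5).hex(), vb_encode(214577).hex())   # 85, 0d0cb1
print(gamma_encode(3), gamma_encode(13))             # 101, 1110101
```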



Muddiest points for week 2

How does the search engine deal with word segmentation that does not guarantee a unique tokenization? For example, the week 2 slides gave some examples in Chinese showing that different word segmentation methods lead to different results. The n-gram method was introduced with the example of "新西兰花", and I want to know whether we can still use n-grams when dealing with a sentence that has more words.
The instructor went too fast when introducing phrase recognition, so I failed to keep up and am quite confused about that content.
What are the differences between terms and tokens?

Friday, January 9, 2015

Muddiest point for Week 1

I have no muddiest point for week 1

Reading notes for week 1 & 2

After reading the reading assignments for weeks 1 and 2, I got a brief understanding of information retrieval. Originally, IR was developed in libraries, but with the spread of the World Wide Web it is now widely used on the web. As defined in section 1.1 of Introduction to Information Retrieval, IR is about finding material of an unstructured nature that satisfies an information need from within large collections. I think the biggest difference between IR and databases is that the data IR processes is unstructured.

In section 1.2, I learned how to build an inverted index. We build an index over the linguistic content of the documents, assign a docID to each document, and store each term's postings sorted by docID. This is, to some extent, an efficient structure and can reduce storage space. Two good ways to store and access the postings efficiently, linked lists and variable-length arrays, are briefly discussed in this section.
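A minimal sketch of building such an inverted index (my own toy version):

```python
def build_inverted_index(docs):
    """docs: dict mapping docID -> text. Returns term -> list of docIDs sorted by docID."""
    index = {}
    for doc_id in sorted(docs):                           # process documents in docID order
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)     # so postings stay sorted by docID
    return index

docs = {1: "new home sales top forecasts", 2: "home sales rise in july"}
print(build_inverted_index(docs))
```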

Linguistic processing and tokenization were briefly introduced in chapter 1, and chapter 2 gives details on how to tokenize and do linguistic preprocessing of tokens. As introduced in this chapter, when determining the vocabulary of terms, tokenization is the process of chopping the original character streams into tokens, and the main issue is choosing the correct tokens to use, which is language-specific. Moreover, word segmentation methods include hidden Markov models and conditional random fields.

The first time I saw assignment 1, I was confused about stop words. Chapter 2 helped me a lot, especially regarding why we drop stop words and how we determine them. Using a stop list to filter documents lets the machine reduce storage space. This chapter also introduced stemming and lemmatization, which aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Porter's algorithm is introduced as the most common algorithm for stemming English. The exact stemmed form does not matter; what matters most is the equivalence class it produces, and we can see that this differs between languages. In the next part, the author introduced combination schemes, which fruitfully combine biword indexes and positional indexes.
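A small sketch combining stop word removal with stemming; I use the PorterStemmer from the nltk package here (assuming it is installed), and the stop list is just an illustrative fragment.

```python
from nltk.stem.porter import PorterStemmer   # assumes the nltk package is available

STOP_WORDS = {"a", "an", "and", "are", "the", "of", "to", "in", "is"}   # tiny illustrative stop list
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, drop stop words, and reduce the remaining tokens to their stems."""
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats are operating the operative systems"))
# e.g. ['cat', 'oper', 'oper', 'system']
```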

Now that we have basic knowledge of queries and the inverted index, chapter 3 teaches us to develop data structures that support searching for terms in the vocabulary of an inverted index.
Search structures for dictionaries include hashing and search trees. The B-tree is very interesting: it is a generalization of the binary search tree in which the number of sub-trees of each internal node is allowed to vary within a fixed interval [a, b], which mitigates the need for rebalancing.

Wildcard queries are used in situations such as these:
1. uncertain of the spelling of a query term;
2. aware of multiple variants of spelling a term;
3. seeking documents containing variants of a term and unsure about stemming;
4. uncertain of the correct rendition of a foreign word or phrase.

The main techniques for handling general wildcard queries are permuterm indexes and k-gram indexes.
I find there is a close relationship between the B-tree and the permuterm index. But the distinctive problem of the permuterm index is that its dictionary becomes quite large, since it includes all rotations of each term.
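A tiny sketch of the permuterm idea (my own illustration): every rotation of term$ is added to the permuterm index, and a wildcard query X*Y is rotated into Y$X*, which a B-tree over the rotations could answer as a prefix lookup; here I just use a dict and a linear scan to keep it short.

```python
def permuterm_rotations(term):
    """All rotations of term + '$', each of which will map back to the original term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(terms):
    return {rot: term for term in terms for rot in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Handle a single-* query X*Y by rotating it to Y$X and doing a prefix match."""
    x, y = query.split("*")
    prefix = y + "$" + x
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})

index = build_permuterm_index(["hello", "help", "held", "moon"])
print(wildcard_lookup(index, "hel*"))   # -> ['held', 'hello', 'help']
print(wildcard_lookup(index, "m*n"))    # -> ['moon']
```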

The principles for spelling correction:
1. Choose the nearest one, i.e. the correction requiring the fewest edits.
2. When two correctly spelled candidates are (nearly) equally close, select the one that is more common.
The author focuses on two kinds of spelling correction: isolated-term correction and context-sensitive correction.
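"Nearest" is usually measured by edit distance, so here is the standard Levenshtein (edit) distance computation as a sketch of the core step of isolated-term correction (my own implementation of the usual dynamic-programming recurrence):

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                              # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution (or match)
    return dist[m][n]

print(edit_distance("fast", "cats"))   # 3
```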