Friday, January 23, 2015

Reading notes for week 3

Chapter 4. Index Construction

Some basic knowledge about computer hardware:

1. We keep disk data that is frequently used in main memory (caching).
2. Seek time: we have to reduce the time spent on disk seeks.
3. Data is read and written in blocks, which are stored in a buffer.

Indexing methods for large collections:
1. Represent each term by a termID (instead of the string representation used in the methods introduced in Chapter 1).

Blocked sort-based indexing (BSBI) algorithm:

1. Segments the collection into parts of equal size.
2. Sorts the (termID, docID) pairs of each part in memory.
3. Stores the intermediate sorted results on disk and merges them into the final index (see the sketch below).
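Below is a minimal sketch of the blocked sort-based idea, under my own assumptions about the data layout (a block is simply a Python list of (termID, docID) pairs); it is not the book's exact pseudocode.

import heapq
import pickle
import tempfile

def write_sorted_block(pairs):
    """Sort one block of (termID, docID) pairs in memory and spill it to disk."""
    pairs.sort()
    run = tempfile.TemporaryFile()        # stands in for an on-disk run file
    pickle.dump(pairs, run)
    run.seek(0)
    return run

def merge_runs(run_files):
    """k-way merge of the sorted runs into one postings list per termID."""
    runs = [pickle.load(f) for f in run_files]
    index = {}
    for term_id, doc_id in heapq.merge(*runs):
        index.setdefault(term_id, []).append(doc_id)
    return index

# usage: two tiny "blocks" of the collection
r1 = write_sorted_block([(2, 1), (1, 1), (3, 2)])
r2 = write_sorted_block([(1, 3), (2, 4)])
print(merge_runs([r1, r2]))               # {1: [1, 3], 2: [1, 4], 3: [2]}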


Single-pass in-memory indexing

(writes each block's dictionary to disk → can index collections of any size as long as there is enough disk space available)

Each postings list is dynamic, and the SPIMI algorithm is faster than BSBI and saves memory, since it avoids sorting the individual (termID, docID) pairs and does not need to store termIDs. We have to sort the terms (line 11 of the book's pseudocode) before writing a block to disk because we want to write the postings lists in lexicographic order to facilitate the final merging step. A rough sketch follows below.
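This sketch assumes a token stream of (term, docID) pairs and a simple count-based memory limit; both are my own simplifications, not the book's exact SPIMI-Invert.

def spimi_invert(token_stream, max_postings=1_000_000):
    """Build one in-memory block: postings lists grow dynamically per term."""
    dictionary = {}                        # term -> list of docIDs
    n_postings = 0
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)
        n_postings += 1
        if n_postings >= max_postings:     # memory exhausted: close this block
            break
    # sort the terms so this block can later be merged sequentially with others
    return sorted(dictionary.items())

print(spimi_invert([("home", 1), ("sales", 1), ("home", 2)]))
# [('home', [1, 2]), ('sales', [1])]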

Distributed indexing (for the web) is an application of MapReduce. The map phase processes splits of the input data into key-value pairs, and in the reduce phase all values for a given key are stored close together so they can be processed in one place.
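A toy sketch of the two phases (my own minimal interface, not a real MapReduce framework): the map phase (parser) emits (term, docID) pairs from each split, and the reduce phase (inverter) collects all docIDs for a given term into its postings list.

from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit a (term, docID) key-value pair for every token in a split."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter: group all values (docIDs) belonging to the same key (term)."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        postings[term].append(doc_id)
    return dict(postings)

pairs = map_phase(1, "new home sales") + map_phase(2, "home sales rise")
print(reduce_phase(pairs))
# {'home': [1, 2], 'new': [1], 'rise': [2], 'sales': [1, 2]}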

Dynamic indexing:
Dynamic indexes are complex, so some large search engines adopt a reconstruction-from-scratch strategy: rather than constructing indexes dynamically, they periodically build a new index from scratch.

Chapter 5. Index compression
The benefits of compression are clear:
less disk space needed →
increased use of caching →
faster transfer of data from disk to memory.

This chapter introduces lossless compression, which preserves all the information (in contrast to the lossy compression introduced in Chapter 6).
Heaps' law estimates the number of terms as a function of collection size (a numeric sketch follows the two points below). It suggests that:

(i)   the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached.

(ii)   the size of the dictionary is quite large for large collections.
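A quick numeric sketch of Heaps' law, M = k * T^b; the parameter values k = 44 and b = 0.49 are those usually quoted for the Reuters-RCV1 collection and are used here only as an illustration.

def heaps_law(num_tokens, k=44, b=0.49):
    """Estimate the vocabulary size M from the number of tokens T."""
    return k * num_tokens ** b

print(int(heaps_law(1_000_000)))   # roughly 38,000 distinct terms for a million tokens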

Zipf's law models the distribution of terms: it helps us understand how terms are distributed across documents, which in turn helps us characterize the properties of algorithms for compressing postings lists.
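A small sketch of Zipf's law, cf_i proportional to 1/i: the collection frequency of the i-th most frequent term falls off roughly as the reciprocal of its rank (the constant c below is purely illustrative).

def zipf_cf(rank, c=1_000_000):
    """Approximate collection frequency of the term at the given rank."""
    return c / rank

for i in (1, 2, 10, 100):
    print(i, int(zipf_cf(i)))   # 1000000, 500000, 100000, 10000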

In the next section, the author introduces dictionary compression, the main goal of which is to fit the dictionary (or at least a large portion of it) in main memory to support high query throughput. After reading this section, I realize that even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections.
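One of the chapter's dictionary compression techniques is dictionary-as-a-string: terms are stored in one long string with term pointers instead of fixed-width entries. The sketch below is my own minimal illustration of that idea.

terms = ["auto", "automata", "automate", "automatic"]   # sorted dictionary terms

term_string = "".join(terms)     # one long string, no fixed-width padding
offsets, pos = [], 0
for t in terms:
    offsets.append(pos)          # term pointer: where this term starts
    pos += len(t)

def term_at(i):
    """Recover the i-th term from its pointer and the next pointer."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(term_string)
    return term_string[offsets[i]:end]

print(term_at(2))                # 'automate'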

Postings file compression covers variable byte (VB) codes and γ codes. The former uses an integral number of bytes to encode a gap, while the latter adapts the code length at the finer-grained bit level. The entropy H(P) of the gap distribution determines the coding properties (it gives a lower bound on the expected code length).
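A hedged sketch of both encodings (following the usual conventions; layout details may differ slightly from the book's figures): VB encoding packs 7 payload bits per byte and sets the high bit of the last byte as a termination flag, while a γ code writes the length of a gap's binary offset in unary, followed by the offset itself (the binary representation with the leading 1 removed).

def vb_encode_number(n):
    """Variable byte encoding of a single gap, returned as a list of byte values."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # set the high bit of the last byte
    return out

def gamma_encode(g):
    """Gamma code of a gap g >= 1 as a bit string: unary length, then offset."""
    offset = bin(g)[3:]            # binary representation with the leading 1 removed
    return "1" * len(offset) + "0" + offset

print(vb_encode_number(214577))    # [13, 12, 177] = 00001101 00001100 10110001
print(gamma_encode(13))            # '1110101' (length 1110, offset 101)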


