Chapter 4. Index Construction
Some basic knowledge of computer hardware:
1. Disk data that is used frequently is kept in main memory (caching).
2. Seek time dominates disk access, so we have to reduce the number of seeks.
3. Data is read and written in blocks, which are stored in a buffer.
Indexing methods for large collections:
1. Represent each term by a termID instead of the term string used in the methods introduced in Chapter 1 (a minimal sketch follows this list).
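A minimal sketch of that term-to-termID mapping, assuming a simple in-memory dictionary (the `term_to_id` name and `get_term_id` helper are illustrative, not from the book):

```python
term_to_id = {}                      # term string -> compact integer termID

def get_term_id(term):
    """Assign each distinct term a termID the first time it is seen."""
    if term not in term_to_id:
        term_to_id[term] = len(term_to_id)
    return term_to_id[term]

# get_term_id("brutus") == 0, get_term_id("caesar") == 1, ...
```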
Blocked sort-based indexing (BSBI) algorithm (see the sketch after this list):
1. Segments the collection into parts of equal size.
2. Sorts the termID-docID pairs of each part in memory.
3. Stores the intermediate sorted results on disk and merges them into the final index.
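A minimal Python sketch of the three BSBI steps above, assuming the collection has already been split into blocks of (termID, docID) pairs (the `blocks` input, the run file naming, and the pickle-based intermediate files are illustrative choices, not the book's implementation):

```python
import heapq
import pickle
from itertools import groupby

def bsbi_index(blocks, run_prefix="run"):
    """Blocked sort-based indexing sketch.

    `blocks` yields one list of (termID, docID) pairs per
    equal-sized part of the collection (assumed helper input).
    """
    run_files = []
    # Steps 1-2: sort each block's termID-docID pairs in memory,
    # then write the sorted run to disk (first half of step 3).
    for i, pairs in enumerate(blocks):
        pairs.sort()                               # by termID, then docID
        path = f"{run_prefix}_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(pairs, f)
        run_files.append(path)

    # Step 3: merge all sorted runs into the final index
    # (termID -> postings list of docIDs).
    runs = []
    for path in run_files:
        with open(path, "rb") as f:
            runs.append(pickle.load(f))
    merged = heapq.merge(*runs)                    # k-way merge of sorted runs
    index = {}
    for term_id, group in groupby(merged, key=lambda p: p[0]):
        index[term_id] = sorted({doc_id for _, doc_id in group})
    return index
```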
Single-pass in-memory indexing (SPIMI) writes each block’s dictionary to disk → it can index collections of any size as long as there is enough disk space available. Each postings list is dynamic (it grows as postings arrive), and SPIMI is faster than BSBI and saves memory because postings are added directly to their lists instead of first collecting and sorting all termID-docID pairs. We have to sort the terms (line 11 of the SPIMI-INVERT pseudocode) before writing a block to disk because we want to write postings lists in lexicographic order to facilitate the final merging step. A sketch follows.
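A rough sketch of one SPIMI pass in the spirit of the book's SPIMI-INVERT, assuming a `token_stream` of (term, docID) pairs for a single block (the output file format and helper names are illustrative assumptions):

```python
from collections import defaultdict

def spimi_invert(token_stream, block_path):
    """One SPIMI pass: build a block's dictionary in memory, then
    write it to disk with terms in lexicographic order.
    """
    dictionary = defaultdict(list)         # term -> dynamic postings list
    for term, doc_id in token_stream:
        postings = dictionary[term]        # list created on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)        # grow the list; no global sorting of pairs
    # Sort the terms (the "line 11" step) so postings lists are written in
    # lexicographic order, which makes the final multi-way merge easy.
    with open(block_path, "w") as f:
        for term in sorted(dictionary):
            f.write(f"{term}\t{' '.join(map(str, dictionary[term]))}\n")
    return block_path
```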
Distributed indexing (used for the web) is an application of MapReduce. The map phase turns splits of the input data into key-value pairs, and in the reduce phase all values for a given key are stored close together so they can be processed in one place. A toy sketch follows.
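A toy, single-machine imitation of the two phases (not a real MapReduce framework; the whitespace tokenization and function names are assumptions for illustration):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit (key, value) = (term, docID) pairs for one input split.
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(term, doc_ids):
    # Reduce: all values for one key arrive together; build its postings list.
    return term, sorted(set(doc_ids))

def toy_mapreduce(splits):
    grouped = defaultdict(list)            # the "shuffle": group values by key
    for doc_id, text in splits:
        for term, value in map_phase(doc_id, text):
            grouped[term].append(value)
    return dict(reduce_phase(t, ids) for t, ids in grouped.items())

# Example: toy_mapreduce([(1, "new home sales"), (2, "home prices rise")])
```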
Dynamic indexing:
Maintaining a dynamic index is complex, so some large search engines adopt a reconstruction-from-scratch strategy: they do not update the index incrementally, but instead periodically build a new index from scratch.
Chapter 5. Index Compression
The benefits of compression are clear: less disk space → increased use of caching → faster transfer of data from disk to memory.
This chapter introduces lossless compression, which preserves all the information (as opposed to the lossy compression introduced in Chapter 6).
Heaps’ law estimates the number of distinct terms (the vocabulary size) as a function of collection size:
(i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached;
(ii) the size of the dictionary is quite large for large collections.
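As a concrete illustration, Heaps’ law has the form M = k·T^b, where T is the number of tokens in the collection; the rough fit reported for Reuters-RCV1 is about k ≈ 44 and b ≈ 0.49, though both parameters must be estimated per collection:

```python
def heaps_vocab_size(num_tokens, k=44, b=0.49):
    """Estimate vocabulary size M = k * T**b (Heaps' law).

    k and b default to the rough Reuters-RCV1 fit; real values
    must be estimated for each collection.
    """
    return k * num_tokens ** b

# The vocabulary keeps growing with collection size instead of saturating:
for tokens in (10**6, 10**8, 10**10):
    print(f"{tokens:>12} tokens -> ~{heaps_vocab_size(tokens):,.0f} terms")
```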
Zipf’s law models the distribution of terms: the collection frequency of the i-th most common term is proportional to 1/i. It helps us understand how terms are distributed across documents, which in turn helps characterize the properties of algorithms for compressing postings lists. A tiny illustration follows.
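The snippet below shows what Zipf’s law predicts, assuming the proportionality cf_i ∝ 1/i holds exactly (real collections only follow it approximately):

```python
def zipf_expected_cf(cf_most_common, rank):
    """Expected collection frequency of the term at `rank`
    if Zipf's law (cf_i proportional to 1/i) held exactly."""
    return cf_most_common / rank

# If the top term occurs 1,000,000 times, the 10th- and 100th-ranked
# terms are expected to occur roughly 100,000 and 10,000 times:
print(zipf_expected_cf(1_000_000, 10), zipf_expected_cf(1_000_000, 100))
```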
In the next section, the author introduces dictionary compression, the main goal of which is to fit the dictionary (or at least a large portion of it) in main memory to support high query throughput. After reading this section, I realize that even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections.
Postings file compression covers variable byte (VB) codes and γ codes. The former uses an integral number of bytes to encode a gap, while the latter adapts the length of the code at the finer-grained bit level. The entropy H(P) of the gap distribution determines the coding properties, giving a lower bound on the expected code length.
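A sketch of variable byte encoding of gap lists in Python, following the 7-bits-of-payload-per-byte scheme in which the high bit marks the last byte of a number (the function names are mine, and a real implementation would add error handling):

```python
def vb_encode_number(n):
    """Variable byte code: 7 payload bits per byte; the high bit
    is set only on the last byte of the number."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128                   # continuation bit on last byte
    return bytes(bytes_out)

def vb_encode_gaps(doc_ids):
    """Encode a sorted postings list as gap-encoded VB codes."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - prev)   # store gaps, not docIDs
        prev = doc_id
    return bytes(out)

def vb_decode(stream):
    """Decode a VB-coded gap stream back into docIDs."""
    doc_ids, n, prev = [], 0, 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte             # more bytes of this number follow
        else:
            n = 128 * n + (byte - 128)     # last byte: finish the gap
            prev += n
            doc_ids.append(prev)
            n = 0
    return doc_ids

# vb_decode(vb_encode_gaps([824, 829, 215406])) == [824, 829, 215406]
```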