Friday, April 10, 2015
muddiest points for week 12
In k-means clustering, if we don't know how many clusters the data has, how do we choose the value of k?
Friday, April 3, 2015
reading notes for week 12
Chapter 13 text classification and Naive Bayes
Classification in IR appears in many applications, for example:
1). The automatic detection of spam pages (which are then not included in the search engine index).
2). The automatic detection of sexually explicit content (which is included in search results only if the user turns an option such as SafeSearch off)
3). Sentiment detection or the automatic classification of a movie or product review as positive or negative.
4). Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
5). Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic.
This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
The first supervised learning method we introduce is the multinomial Naive Bayes (multinomial NB) model, a probabilistic learning method. The probability of a document d being in class c is computed as P(c|d) ∝ P(c) ∏ P(tk|c), where the product runs over the tokens t1, ..., tnd of d and P(tk|c) is the conditional probability of term tk occurring in a document of class c.
An alternative to the multinomial model is the multivariate Bernoulli model, or Bernoulli model. It is equivalent to the binary independence model, which generates an indicator for each term of the vocabulary: 1 indicating presence of the term in the document and 0 indicating absence.
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption: attribute values are independent of each other given the class, i.e. P(d|c) = ∏ P(tk|c).
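To make the training and classification steps concrete, here is a minimal sketch of multinomial NB in Python with add-one (Laplace) smoothing. The toy documents, labels, and function names are my own invented example, not the book's exact algorithm listing.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token_list, class_label). Returns priors, smoothed term probabilities, vocabulary."""
    class_doc_counts = defaultdict(int)
    class_term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in docs:
        class_doc_counts[c] += 1
        class_term_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_doc_counts.items()}
    cond_prob = {}
    for c, counts in class_term_counts.items():
        total = sum(counts.values())
        # add-one (Laplace) smoothing so unseen terms never get zero probability
        cond_prob[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond_prob, vocab

def classify(tokens, priors, cond_prob, vocab):
    """Return the class maximizing log P(c) + sum over tokens of log P(t|c)."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:               # terms unseen in training are ignored
                score += math.log(cond_prob[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# toy training data (invented)
train = [("goal match striker goal".split(), "sports"),
         ("election vote parliament".split(), "politics"),
         ("match referee stadium".split(), "sports")]
priors, cond_prob, vocab = train_multinomial_nb(train)
print(classify("goal match vote".split(), priors, cond_prob, vocab))
```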
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes.
First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
Second, feature selection often increases classification accuracy by eliminating noise features.
As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.
For the multinomial model (MI feature selection), the peak occurs later, at 100 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.
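The MI criterion mentioned above scores a term by the mutual information between term occurrence and class membership, computed from the 2x2 contingency counts. Below is a rough sketch with invented toy documents and class names; zero cells are simply skipped rather than smoothed.

```python
import math

def mutual_information(docs, term, cls):
    """docs: list of (set_of_terms, class_label). Returns I(U_term; C_cls) in bits."""
    n = len(docs)
    n11 = sum(1 for terms, c in docs if term in terms and c == cls)
    n10 = sum(1 for terms, c in docs if term in terms and c != cls)
    n01 = sum(1 for terms, c in docs if term not in terms and c == cls)
    n00 = n - n11 - n10 - n01
    mi = 0.0
    # iterate over the four cells: (cell count, row marginal, column marginal)
    for n_tc, n_t, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def select_features(docs, cls, k):
    """Return the k terms with the highest MI for the given class."""
    vocab = set().union(*(terms for terms, _ in docs))
    return sorted(vocab, key=lambda t: mutual_information(docs, t, cls), reverse=True)[:k]

# invented toy collection
docs = [({"export", "poultry"}, "poultry"),
        ({"export", "oil"}, "trade"),
        ({"poultry", "farm"}, "poultry"),
        ({"oil", "price"}, "trade")]
print(select_features(docs, "poultry", 2))
```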
Chapter 14 Vector space classification
The basic hypothesis in using the vector space model for classification is the contiguity hypothesis: documents in the same class form a contiguous region, and regions of different classes do not overlap.
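One simple classifier built on this hypothesis is Rocchio (centroid) classification from the same chapter: represent each class by the centroid of its training document vectors and assign a new document to the class with the most similar centroid. The sketch below assumes documents are already vectorized as term-weight dictionaries; the toy vectors and labels are made up.

```python
import math
from collections import defaultdict

def add(v, w):
    out = dict(v)
    for k, x in w.items():
        out[k] = out.get(k, 0.0) + x
    return out

def cosine(v, w):
    dot = sum(x * w.get(k, 0.0) for k, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

def train_rocchio(labeled_vectors):
    """labeled_vectors: list of (term_weight_dict, class). Returns class -> centroid vector."""
    sums, counts = defaultdict(dict), defaultdict(int)
    for vec, c in labeled_vectors:
        sums[c] = add(sums[c], vec)
        counts[c] += 1
    return {c: {k: x / counts[c] for k, x in s.items()} for c, s in sums.items()}

def classify_rocchio(vec, centroids):
    # assign to the class whose centroid is most similar (cosine) to the document
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# invented toy training vectors
train = [({"wheat": 2.0, "farm": 1.0}, "agriculture"),
         ({"wheat": 1.0, "corn": 1.0}, "agriculture"),
         ({"stocks": 2.0, "trade": 1.0}, "finance")]
centroids = train_rocchio(train)
print(classify_rocchio({"wheat": 1.0, "trade": 0.5}, centroids))
```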
Monday, March 30, 2015
Muddiest Point for week 11 (3/30)
When implementing adaptive IR, we should collect the user's profile and build personalized search queries or design an adaptive system for them, but how can we collect information about the user? How can we collect a user profile without violating the user's privacy?
Saturday, March 28, 2015
Reading notes for week 11
Intelligent Information Retrieval:
In the section on user profiles for personalized information access, we see how to collect information about users and how to represent and build user profiles. Explicit collection techniques, which rely on personal information entered by users (typically via HTML forms), are contrasted with implicit collection of user information through browser caches, proxy servers, browser agents, desktop agents, and search logs. More sophisticated personalization projects based on explicit feedback have focused on navigation. Implicit collection, by contrast, only tracks browsing activity; proxy servers seem to be a good compromise between easily capturing information and not placing a large burden on the user.
In order to construct an individual user’s profile, information may be collected explicitly, through direct user intervention, or implicitly, through agents that monitor user activity.
Collecting Information About Users
Methods for user identification: collecting user information and sharing it with a server via some protocol, logins, cookies, and session IDs.
Methods for User Information Collection
Explicit User Information Collection/ Implicit User Information Collection
Comparing Implicit and Explicit User Information Collection.
Personalized Web Exploration with Task Models:
TaskSieve is a Web search system that utilizes a relevance feedback based profile, called a “task model”, for personalization. Its innovations include flexible and user controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models:
1). Retrieve documents along with their relevance scores by submitting the user query to a search engine.
2). Calculate similarity scores between retrieved documents and the model.
3). Calculate the combined score of each document as alpha * Task_Model_Score + (1 – alpha) * Search_Score.
4). Re-rank the initial list by the combined score from step 3.
The idea of re-ranking is to promote documents that are more relevant to the user's task, as measured by their similarity to the task model.
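A tiny sketch of steps 3 and 4 above: combine the two scores with the alpha-weighted sum and sort by the result. The document IDs, score values, and alpha are invented, and the two score sets are assumed to be normalized to a comparable range, which the real system would need to ensure.

```python
def rerank(search_scores, task_model_scores, alpha=0.5):
    """Combine scores as in step 3 and re-rank as in step 4.
    Both arguments: dict doc_id -> score, assumed comparable (e.g. normalized to [0, 1])."""
    combined = {d: alpha * task_model_scores.get(d, 0.0) + (1 - alpha) * s
                for d, s in search_scores.items()}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# made-up example: doc2 is promoted because it matches the task model well
search = {"doc1": 0.9, "doc2": 0.6, "doc3": 0.4}
task = {"doc1": 0.1, "doc2": 0.8, "doc3": 0.3}
print(rerank(search, task, alpha=0.6))
```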
Muddiest points for week 11
In link analysis, we use a simple iterative logic: good nodes won't point to bad nodes, and vice versa. But how do we define a bad node? And if we use a search engine, would it return only the good nodes, even when a bad node is just as relevant as the good ones?
Friday, March 20, 2015
Muddiest Point for week 10
When Shuguang introduced the Google interface, I wondered why Google uses the "I'm Feeling Lucky" button. It has confused me ever since I first used Google.
Reading Notes for Week 10
19.Web search basics
CPC model: clicking on the advertisement leads the user to a web page set up by the advertiser, where the user is induced to make a purchase.
The essential feature that led to the explosive growth of the web – decentralized content publishing with essentially no central control of authorship – turned out to be the biggest challenge for web search engines in their quest to index and retrieve this content. Spam detection therefore became an important part of search engine development from early on.
Advertising as the economic model
Goto: for every query term q it accepted bids from companies who wanted their web page shown on the query q. In response to the query q, Goto would return the pages of all advertisers who bid for q, ordered by their bids. Furthermore, when the user clicked on one of the returned results, the corresponding advertiser would make a payment to Goto.
The search user experience
It is crucial that we understand the users of web search as well; the more user traffic a web search engine can attract, the more revenue it stands to earn from sponsored search.
There are three types of web search queries: (1) informational, (2) navigational, (3) transactional.
Index size and estimation
It is difficult to reason about the fraction of the Web indexed by a search engine, because dynamic pages make the number of web pages effectively unbounded, and even the static Web is very large.
Estimating the relative sizes of search engine indexes is also imprecise, because:
1). In response to queries, a search engine can return web pages whose contents it has not (fully or even partially) indexed.
2). Search engines generally organize their indexes in various tiers and partitions, not all of which are examined on every search. As a result, there is no single measure of index size.
Construct the random page samples:
1. Random searches
2. Random IP addresses
3. Random walks
4. Random queries
Near-duplicates and shingling
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages.
The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say, 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other.
Illustration of shingle sketches (figure caption): two documents go through four stages of shingle sketch computation. In the first step, we apply a 64-bit hash to each shingle from each document to obtain H(d1) and H(d2); next, we apply a random permutation Π to permute H(d1) and H(d2), obtaining Π(d1) and Π(d2).
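To make the shingling idea concrete, here is a rough sketch: hash k-word shingles, compare two documents by the Jaccard overlap of their shingle sets, and approximate that overlap with a min-hash signature built from random linear permutations. The hash function, the value of k, and the number of permutations are arbitrary choices for illustration, not the exact scheme from the chapter.

```python
import hashlib
import random

def shingles(text, k=4):
    """Set of k-word shingles, each hashed to a 64-bit integer."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_signature(shingle_set, num_perms=64, seed=0):
    """One minimum per random linear permutation of the hashed shingles."""
    rng = random.Random(seed)
    p = (1 << 61) - 1                       # a large prime for the permutations
    perms = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_perms)]
    return [min((a * s + b) % p for s in shingle_set) for a, b in perms]

def estimated_jaccard(sig1, sig2):
    # fraction of permutations on which the two minima agree
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

d1 = "the quick brown fox jumps over the lazy dog near the river bank"
d2 = "the quick brown fox jumps over the lazy dog near the old bridge"
s1, s2 = shingles(d1), shingles(d2)
print(jaccard(s1, s2), estimated_jaccard(minhash_signature(s1), minhash_signature(s2)))
```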
21.Link analysis
In this chapter the author focuses on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query.
PageRank
Definition: the first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank.
In assigning a PageRank score to each node of the web graph, we use the teleport operation in two ways:
(1) When at a node with no out-links, the surfer invokes the teleport operation.
(2) At any node that has outgoing links, the surfer invokes the teleport operation with probability 0 < α < 1 and the standard random walk with probability 1 − α.
Markov chains
A Markov chain is characterized by an N × N transition probability matrix P, each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1.
In this example we consider a user whose interests are 60% sports and 40% politics. If the teleportation probability is 10%, this user is modeled as teleporting 6% of the time to sports pages and 4% to politics pages.
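A minimal power-iteration sketch of PageRank with teleporting, following the rules described above: from a node with no out-links the surfer always teleports; otherwise she teleports with probability alpha and follows a random out-link with probability 1 - alpha. The toy graph, alpha value, and fixed iteration count are assumptions, and this uses uniform teleporting rather than the topic-weighted teleporting of the sports/politics example.

```python
def pagerank(links, alpha=0.1, iterations=100):
    """links: dict node -> list of out-neighbours (all targets must be keys).
    Teleport with probability alpha, and always from a node with no out-links."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for u in nodes:
            out = links[u]
            if out:
                # with probability (1 - alpha) follow a uniformly chosen out-link
                for v in out:
                    new_rank[v] += (1 - alpha) * rank[u] / len(out)
                # with probability alpha teleport to a uniformly chosen node
                for v in nodes:
                    new_rank[v] += alpha * rank[u] / n
            else:
                # dead end: always teleport
                for v in nodes:
                    new_rank[v] += rank[u] / n
        rank = new_rank
    return rank

# toy web graph (made up): d3 has no out-links
graph = {"d1": ["d2", "d3"], "d2": ["d3"], "d3": []}
print(pagerank(graph))
```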
Friday, March 13, 2015
Week 9 reading notes
Chapter 10 of MIR
In this chapter, the author introduces user interfaces, which should help users express their information needs and formulate queries. The chapter also discusses recent developments in this area.
There are three design rules for user interfaces:
- Offer informative feedback
- Reduce working memory load
- Provide alternative interfaces for novice and expert users
- Information access interfaces must also contend with special kinds of simplicity/power trade-offs
How to judge an interactive system
An important aspect of human-computer interaction is the methodology for evaluation of user interface techniques. Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems. The standard evaluations emphasize high recall levels; in the TREC tasks systems are compared to see how well they return the top 1000 documents.
The standard procedure for interacting with an information access system:
- Start with an information need
- Select a system and a collection to search on
- Formulate a query
- Send the query to the system
- Receive the results in the form of information items
- Scan, evaluate, and interpret the results
- Either stop, or,
- Reformulate the query and go to step 4
About the starting points:
One study found that for non-expert users the results of clustering were difficult to use, and that graphical depictions (for example, representing clusters with circles and lines connecting documents) were much harder to use than textual representations (for example, showing titles and topical words, as in Scatter/Gather), because documents' contents are difficult to discern without actually reading some text .
About the query specification:
Boolean queries are problematic because the basic syntax is counter-intuitive.
Solutions to this problem are: 1. allow users to choose from a selection of common, simple ways of combining query terms; 2. provide a simpler or more intuitive syntax.
Visualization For Text Mining:
One of the most common strategies used in text mining is to identify important entities within the text and attempt to show connections among those entities.
Visualizing Document Concordances And Word Frequencies
In the field of literature analysis it is commonplace to analyze a text or a collection of texts by extracting a concordance : an alphabetical index of all the words in a text, showing those words in the contexts in which they appear. The standard way to view the concordance is to place the word of interest in the center of the view with “gutters” on either side, and then sort the surrounding text in some way.
Visualizing Literature And Citation Relationships
This information is used to assess the importance of the authors, the papers, and the topics in the field, and how these measures change over time.
Some more recent approaches have moved away from nodes-and-links in order to break the interesting relationships into pieces and show their connections via brushing-and-linking interactions.
Friday, February 20, 2015
Reading notes for week 6
I realize that I missed some reading notes for the papers required in the past weeks, and I already read chapter 9 last week, so I decided to add some notes for the previous papers.
Cumulated Gain-Based Evaluation of IR Techniques
In this paper, the authors present novel evaluation measures which compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position.
Users of IR systems have long struggled with the large amount of output returned.
There are three new evaluation measures. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques. An essential feature of the proposed measures is the weighting of documents at different levels of relevance.
For example, the discounted cumulated gain: a discounting function is needed that progressively reduces the document score as its rank increases, but not too steeply, to allow for user persistence in examining further documents. A simple way of discounting with this requirement is to divide the document score by the log of its rank.
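A short sketch of the measures: cumulated gain is the running sum of graded relevance scores; discounted cumulated gain divides the gain at rank i by log_b(i) once i reaches the log base b; and the normalized version divides by the DCG of an ideal reordering of the same scores. The graded relevance values below are invented, and b = 2 is just a common choice of base.

```python
import math

def dcg(gains, base=2):
    """Discounted cumulated gain: ranks below 'base' are not discounted
    (with base 2 this reduces to G[1] + sum over i >= 2 of G[i] / log2(i))."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < base else g / math.log(i, base)
    return total

def ndcg(gains, base=2):
    """Normalize by the DCG of the best possible ordering of the same gains."""
    ideal = sorted(gains, reverse=True)
    return dcg(gains, base) / dcg(ideal, base) if any(gains) else 0.0

# invented graded relevance scores (0-3) of the top-10 results, in ranked order
run = [3, 2, 3, 0, 1, 2, 0, 0, 3, 0]
print(round(dcg(run), 3), round(ndcg(run), 3))
```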
Relevance judgments: possible differences between IR techniques in retrieving highly relevant documents might be evened up by their possible indifference in retrieving marginal documents. The net differences might seem practically marginal and statistically insignificant.
Muddiest Point for week 6
In the pooling method we talked about on Monday, you mentioned using the top n documents from each search result to build a pool for judgment. What I want to ask is: how do you decide the most appropriate value of n?
Friday, February 13, 2015
Week 5 Reading Notes
Chapter 10 & Chapter 11
Most of chapter 10 talks about XML and its use in encoding documents and text. An XML document is an ordered, labeled tree; each node of the tree is an XML element and is written with an opening and closing tag. XPath is a standard for enumerating paths in an XML document collection.
Then the author mentions some challenges in structured retrieval (where the collection consists of structured documents and queries may themselves be structured). The first challenge is that users want us to return parts of documents (i.e., XML elements), not entire documents as conventional IR systems do. So we use a helpful principle, the structured document retrieval principle. When I read through the chapter, I was impressed by one method: using one of the largest elements as the indexing unit.
The picture is an example of the separation of an XML document using the vector space model. We can interpret all queries as extended queries – that is, there can be an arbitrary number of intervening nodes in the document for any parent-child node pair in the query.
We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR.
In chapter 11, the author more systematically introduces the probabilistic approach to IR, which provides a different formal basis for a retrieval model and results in different techniques for setting term weights. We can straightforwardly start to estimate the probability of a term t appearing in a relevant document, P(t|R = 1), and this could be the basis of a classifier that decides whether documents are relevant or not.
Binary Independence Model: it is exactly the same as the multivariate Bernoulli Naive Bayes model.
Under the BIM, we model the probability P(R|d, q) that a document is relevant via the probability in terms of term incidence vectors, P(R|⃗x, ⃗q), and then apply Bayes' rule as described above.
In probability estimates in practice, it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection.
Okapi BM25, a non-binary model: the BM25 weighting scheme, often called Okapi weighting after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to term frequency and document length while not introducing too many additional parameters into the model.
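A compact sketch of a common form of the BM25 scoring function: for each query term, an idf factor times a saturating, length-normalized term-frequency factor controlled by k1 and b. The toy collection, the simple idf = log(N/df), and the defaults k1 = 1.2, b = 0.75 are assumptions; real systems usually use a smoothed idf and tuned parameters.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len, k1=1.2, b=0.75):
    """Sum over query terms of idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len/avg_len))."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in doc_freqs or tf[t] == 0:
            continue
        idf = math.log(n_docs / doc_freqs[t])
        norm = tf[t] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[t] * (k1 + 1) / norm
    return score

# toy collection (made up)
docs = {"d1": "the cat sat on the mat".split(),
        "d2": "the dog chased the cat".split(),
        "d3": "dogs and cats living together".split()}
df = Counter(t for toks in docs.values() for t in set(toks))
avg_len = sum(len(toks) for toks in docs.values()) / len(docs)
query = ["cat", "dog"]
for d, toks in docs.items():
    print(d, round(bm25_score(query, toks, df, len(docs), avg_len), 3))
```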
muddiest point for week 5
I am quite confused about the language model. Is it based on the vector space model? If we have the same document, can we construct different language models from it?
Saturday, February 7, 2015
muddiest point for week 4
Actually, I am a little bit confused about the question on slide 45 of the PPT. Which word is a better search term? Is it "insurance"?
Thursday, February 5, 2015
Week 4 Reading Notes
Chapter 8: Evaluation in Information Retrieval
& Chapter 9: Relevance feedback and query expansion
In the previous chapters I read over the past month, the authors mentioned many alternatives in designing an IR system; in this chapter they discuss how to measure the effectiveness of an IR system and how to develop further measures for evaluating ranked retrieval results. User utility is also considered: the user's happiness is very important for a good IR system, and the speed of response and the size of the index are among the factors that affect it. The chapter also notes that user satisfaction is determined by many other factors, for example the design of the user interface. The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. A document is relevant not merely because it contains all the words in the query, but because it addresses the underlying information need, which is not always stated overtly.
In addition, the most standard test collections and evaluation series are introduced in this chapter, for example the Cranfield collection, TREC, and the NII Test Collections. For evaluation of unranked retrieval sets, the two most frequent and basic measures of information retrieval effectiveness are precision and recall. An alternative might seem to be accuracy, the fraction of classifications that are correct, but accuracy is not an appropriate measure for information retrieval problems.
The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, the precision and recall values can be plotted to give a precision-recall curve.
Moreover, to evaluate a system we need relevance judgments for the test information needs. The most standard approach is called pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems. But relevance judgments are quite idiosyncratic and variable, and the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.
Finally, user utility is discussed, which means making the user happy. For a web search engine, happy search users are those who find what they want and come back to the search engine again. However, it is very hard to measure user satisfaction directly.
Actually, we already read about synonymy in chapter 1: the same concept may be referred to using different words. In chapter 9, the author discusses ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
Global methods: techniques for expanding or reformulating the query
1). Query expansion/ reformulation with a thesaurus or WordNet
2). Query expansion via automatic thesaurus generation
3). Techniques like spelling correction
Local methods adjust a query relative to the documents that initially appear to match it. 1). Relevance feedback (most commonly used) --> interactive relevance feedback can give very substantial gains in retrieval performance.
2). Pseudo relevance feedback, also known as blind relevance feedback --> provides a method for automatic local analysis: do normal retrieval to find an initial set of the most relevant documents, then assume the top-ranked ones are relevant and proceed as in relevance feedback.
3). Global indirect relevance feedback
The core idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set --> the Rocchio algorithm (a small sketch follows after this list).
Probabilistic relevance feedback --> Naive Bayes probabilistic model
Relevance feedback on the Web --> general relevance feedback has been little used in web search; the successful use of web links can instead be viewed as a form of implicit feedback.
Automatic thesaurus generation --> an alternative to the cost of a manual thesaurus: analyze a collection of documents, either exploiting word co-occurrence or using a shallow grammatical analysis of the text to exploit grammatical relations or dependencies.
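The Rocchio update mentioned in the list above moves the query vector toward the centroid of the documents judged relevant and away from the centroid of the nonrelevant ones. Here is a minimal sketch with invented term weights; the weights alpha = 1.0, beta = 0.75, gamma = 0.15 are commonly cited defaults, taken here as an assumption.

```python
def centroid(vectors):
    """Component-wise mean of a list of term-weight dicts."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()} if vectors else {}

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant),
    with negative weights clipped to zero as is usually done."""
    rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
    terms = set(query) | set(rel_c) | set(nonrel_c)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0) + beta * rel_c.get(t, 0.0) - gamma * nonrel_c.get(t, 0.0)
        if w > 0:
            new_q[t] = w
    return new_q

# invented example: user marks two documents relevant and one nonrelevant
q = {"jaguar": 1.0}
relevant = [{"jaguar": 0.9, "cat": 0.8}, {"jaguar": 0.7, "wildlife": 0.6}]
nonrelevant = [{"jaguar": 0.5, "car": 0.9}]
print(rocchio(q, relevant, nonrelevant))
```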
Friday, January 30, 2015
Reading notes for week 4
Chapter 6
We use Boolean queries to do searches: 0 represents no match, 1 represents a match. When the number of matching documents is too large, we need a ranking method to rank those result documents. This chapter introduces methods for scoring a document.
In the first part, on parametric and zone indexes: parametric indexes are designed to select only the documents matching a query on document metadata. Zones are similar to fields, except that the contents of a zone can be arbitrary free text. The difference between parametric and zone indexes is that the dictionary for a parametric index comes from a fixed vocabulary (a set of languages, or a set of dates), while the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone. The weighted zone score is defined to be a linear combination of zone scores, where each zone of the document contributes a Boolean value.
In term frequency and weighting, we assign each term in each document a weight that depends on the number of occurrences of the term in the document, and compute a score between a query term t and a document d based on the weight of t in d. Another problem mentioned in this part is that certain terms have little or no discriminating power in determining relevance, so we can scale down the term weights of terms with high collection frequency, defined to be the total number of occurrences of a term in the collection.
The vector space model for scoring represents each document as a vector that captures the relative importance of its terms; it lets us compute the similarity between two documents simply from the magnitude of the vector difference, or, as the chapter develops further, from the cosine of the angle between the vectors. In this chapter, the author introduces many algorithms in the vector space model which are hard to summarize, so I think I have to spend more time reading this chapter and strengthening my background in mathematics.
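A small sketch of one tf-idf/cosine variant over a toy collection: weight each term by raw term frequency times log(N/df) and rank documents by the cosine between the query vector and each document vector. The documents, the query, and this particular weighting variant are illustrative choices, since the chapter describes several.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Raw tf times log(N/df) for every token known to the collection."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}

def cosine(v, w):
    dot = sum(x * w.get(t, 0.0) for t, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

# invented toy collection and query
docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
n = len(docs)
df = Counter(t for toks in docs.values() for t in set(toks))
vecs = {d: tfidf_vector(toks, df, n) for d, toks in docs.items()}
query_vec = tfidf_vector("new new times".split(), df, n)
ranking = sorted(vecs, key=lambda d: cosine(query_vec, vecs[d]), reverse=True)
print([(d, round(cosine(query_vec, vecs[d]), 3)) for d in ranking])
```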
Chapter 7
In chapter 6, we developed the theory of computing term weights in a document to score those terms, and introduced the vector space model and the basic cosine scoring algorithm. In chapter 7, the author presents methods to speed up this computation, outlines a complete search engine, and shows how the pieces fit together.
Inexact top-k document retrieval --> aims to dramatically lower the cost of computing the top k documents without materially altering the user's perceived relevance of the top k results.
Index elimination:
1). we can consider documents containing terms whose idf exceeds a preset threshold.
2). we can consider documents that contain many query terms.
Champion lists and static quality scores: for each term, we can precompute a champion list of the r documents with the highest weights for that term; in many search engines we also have available a measure of quality g(d) for each document d that is query-independent and thus static.
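A toy sketch combining the two heuristics above: keep, for each term, a champion list of the r documents with the highest weight for that term, and at query time score only documents that appear in some query term's champion list, returning the top k with a heap. The per-term weights, r, and k are invented.

```python
import heapq
from collections import defaultdict

def build_champion_lists(weights, r=2):
    """weights: dict term -> dict doc -> weight. Keep only the r highest-weighted docs per term."""
    return {t: dict(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:r])
            for t, d in weights.items()}

def topk(query_terms, champions, k=2):
    scores = defaultdict(float)
    for t in query_terms:
        for doc, w in champions.get(t, {}).items():   # only champion docs are scored
            scores[doc] += w
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# invented per-term document weights (e.g. tf-idf)
weights = {"car": {"d1": 0.9, "d2": 0.3, "d3": 0.8, "d4": 0.1},
           "insurance": {"d2": 0.7, "d3": 0.6, "d5": 0.2}}
champions = build_champion_lists(weights, r=2)
print(topk(["car", "insurance"], champions, k=2))
```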
Components of an information retrieval system:
Muddiest points for week 3
I am still confused about the gamma code. It is said that 1 and 0 can be switched in unary code, so 3 can be coded as either 0001 or 1110, but how do we know which convention is used? I think the instructor should give us more examples in class, for the gamma code and for the other material.
Friday, January 23, 2015
Reading notes for week 3
Chapter 4. Index Construction
Some basic knowledge for computer hardware:
1. We always keep frequently used disk data in main memory (caching).
2. Seek time: we have to reduce the seek time.
3. Store blocks in a buffer.
Index methods for large collections:
1. Represent terms by termIDs (instead of the strings used in the methods introduced in chapter 1).
Blocked sort-based indexing (BSBI) algorithm:
1. Segment the collection into parts of equal size.
2. Sort the termID-docID pairs of each part in memory.
3. Store the intermediate sorted results on disk and merge them into the final index.
Single-pass in-memory indexing (SPIMI) writes each block's dictionary to disk --> it can index a collection of any size as long as there is enough disk space available. Each postings list is dynamic, and the SPIMI algorithm is faster than BSBI because it saves memory. We have to sort the terms before writing a block (line 11 of the book's pseudocode) because we want to write postings lists in lexicographic order to facilitate the final merging step.
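A very small sketch in the spirit of SPIMI's per-block step: grow postings lists dynamically in an in-memory dictionary for one block, then sort the terms so the block's postings lists come out in lexicographic order, ready for the final merge. Real SPIMI also writes compressed blocks to disk; the documents here are invented.

```python
from collections import defaultdict

def spimi_invert_block(doc_stream):
    """doc_stream: iterable of (doc_id, tokens). Returns term -> postings list for one block."""
    dictionary = defaultdict(list)              # postings lists grow dynamically
    for doc_id, tokens in doc_stream:
        for term in tokens:
            postings = dictionary[term]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    # sort terms so postings lists are written in lexicographic order for the final merge
    return {term: dictionary[term] for term in sorted(dictionary)}

block = [(1, "new home sales top forecasts".split()),
         (2, "home sales rise in july".split()),
         (3, "new home construction".split())]
for term, postings in spimi_invert_block(block).items():
    print(term, postings)
```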
Distributed indexing (for the web) is an application of MapReduce. The map phase splits the input data into key-value pairs, and in the reduce phase we store all values for a given key close together.
Dynamic indexing: a dynamic index is complex, so some large search engines adopt a reconstruction-from-scratch strategy: they do not construct indexes dynamically, but instead build a new index from scratch periodically.
Chapter 5. Index compression
The benefits of compression are very clear: less disk space --> increased use of caching --> faster transfer of data from disk to memory.
This chapter introduces lossless compression, which preserves all the information (compared to the lossy compression introduced in chapter 6).
Heaps' law estimates the number of terms (the vocabulary size M) as a function of the number of tokens T in the collection, M = k·T^b:
(i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached;
(ii) the size of the dictionary is quite large for large collections.
Zipf's law models the distribution of terms (the collection frequency of the i-th most common term is proportional to 1/i) --> it helps us understand how terms are distributed across documents, which helps characterize the properties of algorithms for compressing postings lists.
In the next section, the author introduces dictionary compression, the main goal of which is to fit the dictionary in main memory, or at least a large portion of it, to support high query throughput. After reading this section, I realize that even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections.
Postings file compression covers variable byte codes and γ codes. The former encoding uses an integral number of bytes to encode a gap, while the latter adapts the length of the code at the finer-grained bit level; the entropy H(P) determines its coding properties.
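A small sketch of gamma encoding and, for comparison, variable byte encoding of positive integer gaps. Following the convention in the book, the length part is written in unary as 1s terminated by a 0, and the offset is the gap's binary representation with its leading 1 removed; the opposite 0/1 convention (the one my earlier muddiest point asks about) would work equally well as long as encoder and decoder agree.

```python
def gamma_encode(gap):
    """Gamma code of a positive integer: unary(length of offset) + offset,
    where offset is the gap's binary representation without its leading 1."""
    assert gap >= 1
    offset = bin(gap)[3:]                   # bin(13) == '0b1101' -> offset '101'
    length_unary = "1" * len(offset) + "0"  # e.g. length 3 -> '1110'
    return length_unary + offset

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    length = bits.index("0")                # number of leading 1s = offset length
    offset = bits[length + 1:length + 1 + length]
    return int("1" + offset, 2) if length else 1

def vb_encode(gap):
    """Variable byte code: 7 payload bits per byte, high bit set on the last byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, gap % 128)
        if gap < 128:
            break
        gap //= 128
    bytes_out[-1] += 128                    # continuation bit terminates the number
    return bytes(bytes_out)

for g in [1, 2, 3, 9, 13, 1025]:
    print(g, gamma_encode(g), gamma_decode(gamma_encode(g)), vb_encode(g).hex())
```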