Friday, April 10, 2015
muddiest points for week 12
In k-means clustering, if we don't know how many clusters the data has, how do we choose the value of k?
Friday, April 3, 2015
reading notes for week 12
Chapter 13 text classification and Naive Bayes
Classification in IR appears in many applications, for example:
1). The automatic detection of spam pages (which are then not included in the search engine index).
2). The automatic detection of sexually explicit content (which is included in search results only if the user turns an option such as SafeSearch off)
3). Sentiment detection or the automatic classification of a movie or product review as positive or negative.
4). Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
5). Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic.
This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
The first supervised learning method we introduce is the multinomial Naive Bayes (multinomial NB) model, a probabilistic learning method. The probability of a document d being in class c is computed as P(c|d) ∝ P(c) ∏ P(tk|c), where the product runs over the tokens t1, ..., tnd of d and P(tk|c) is the conditional probability of term tk occurring in a document of class c.
An alternative to the multinomial model is the multivariate Bernoulli model, or Bernoulli model. It is equivalent to the binary independence model, which generates an indicator for each term of the vocabulary: 1 indicating presence of the term in the document and 0 indicating absence.
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption: attribute values are independent of each other given the class, i.e. P(d|c) = ∏ P(tk|c).
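To make the training and classification steps concrete, here is a minimal sketch of multinomial NB in Python with add-one (Laplace) smoothing. The toy documents, labels, and function names are my own invented example, not the book's exact algorithm listing.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token_list, class_label). Returns priors, smoothed term probabilities, vocabulary."""
    class_doc_counts = defaultdict(int)
    class_term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in docs:
        class_doc_counts[c] += 1
        class_term_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_doc_counts.items()}
    cond_prob = {}
    for c, counts in class_term_counts.items():
        total = sum(counts.values())
        # add-one (Laplace) smoothing so unseen terms never get zero probability
        cond_prob[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond_prob, vocab

def classify(tokens, priors, cond_prob, vocab):
    """Return the class maximizing log P(c) + sum over tokens of log P(t|c)."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:               # terms unseen in training are ignored
                score += math.log(cond_prob[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# toy training data (invented)
train = [("goal match striker goal".split(), "sports"),
         ("election vote parliament".split(), "politics"),
         ("match referee stadium".split(), "sports")]
priors, cond_prob, vocab = train_multinomial_nb(train)
print(classify("goal match vote".split(), priors, cond_prob, vocab))
```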
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes.
First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
Second, feature selection often increases classification accuracy by eliminating noise features.
As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.
For the multinomial model (MI feature selection), the peak occurs later, at 100 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.
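The MI criterion mentioned above scores a term by the mutual information between term occurrence and class membership, computed from the 2x2 contingency counts. Below is a rough sketch with invented toy documents and class names; zero cells are simply skipped rather than smoothed.

```python
import math

def mutual_information(docs, term, cls):
    """docs: list of (set_of_terms, class_label). Returns I(U_term; C_cls) in bits."""
    n = len(docs)
    n11 = sum(1 for terms, c in docs if term in terms and c == cls)
    n10 = sum(1 for terms, c in docs if term in terms and c != cls)
    n01 = sum(1 for terms, c in docs if term not in terms and c == cls)
    n00 = n - n11 - n10 - n01
    mi = 0.0
    # iterate over the four cells: (cell count, row marginal, column marginal)
    for n_tc, n_t, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def select_features(docs, cls, k):
    """Return the k terms with the highest MI for the given class."""
    vocab = set().union(*(terms for terms, _ in docs))
    return sorted(vocab, key=lambda t: mutual_information(docs, t, cls), reverse=True)[:k]

# invented toy collection
docs = [({"export", "poultry"}, "poultry"),
        ({"export", "oil"}, "trade"),
        ({"poultry", "farm"}, "poultry"),
        ({"oil", "price"}, "trade")]
print(select_features(docs, "poultry", 2))
```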
Chapter 14 Vector space classification
The basic hypothesis in using the vector space model for classification is the contiguity hypothesis: documents in the same class form a contiguous region, and regions of different classes do not overlap.
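One simple classifier built on this hypothesis is Rocchio (centroid) classification from the same chapter: represent each class by the centroid of its training document vectors and assign a new document to the class with the most similar centroid. The sketch below assumes documents are already vectorized as term-weight dictionaries; the toy vectors and labels are made up.

```python
import math
from collections import defaultdict

def add(v, w):
    out = dict(v)
    for k, x in w.items():
        out[k] = out.get(k, 0.0) + x
    return out

def cosine(v, w):
    dot = sum(x * w.get(k, 0.0) for k, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

def train_rocchio(labeled_vectors):
    """labeled_vectors: list of (term_weight_dict, class). Returns class -> centroid vector."""
    sums, counts = defaultdict(dict), defaultdict(int)
    for vec, c in labeled_vectors:
        sums[c] = add(sums[c], vec)
        counts[c] += 1
    return {c: {k: x / counts[c] for k, x in s.items()} for c, s in sums.items()}

def classify_rocchio(vec, centroids):
    # assign to the class whose centroid is most similar (cosine) to the document
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# invented toy training vectors
train = [({"wheat": 2.0, "farm": 1.0}, "agriculture"),
         ({"wheat": 1.0, "corn": 1.0}, "agriculture"),
         ({"stocks": 2.0, "trade": 1.0}, "finance")]
centroids = train_rocchio(train)
print(classify_rocchio({"wheat": 1.0, "trade": 0.5}, centroids))
```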
Monday, March 30, 2015
Muddiest Point for week 11 (3/30)
When implementing adaptive IR, we should collect the user's profile and build personalized search queries or design an adaptive system for them, but how can we collect information about the user? How can we collect a user profile without violating the user's privacy?
Saturday, March 28, 2015
Reading notes for week 11
Intelligent Information Retrieval:
In the section on user profiles for personalized information access, we see how to collect information about users and how to represent and build user profiles. Explicit collection techniques, which rely on personal information entered by users (typically via HTML forms), are contrasted with implicit collection of user information through browser caches, proxy servers, browser agents, desktop agents, and search logs. More sophisticated personalization projects based on explicit feedback have focused on navigation. Implicit collection, by contrast, only tracks browsing activity; proxy servers seem to be a good compromise between easily capturing information and not placing a large burden on the user.
In order to construct an individual user’s profile, information may be collected explicitly, through direct user intervention, or implicitly, through agents that monitor user activity.
Collecting Information About Users
Methods for user identification: collecting user information and sharing it with a server via some protocol, logins, cookies, and session IDs.
Methods for User Information Collection
Explicit User Information Collection/ Implicit User Information Collection
Comparing Implicit and Explicit User Information Collection.
Personalized Web Exploration with Task Models:
TaskSieve is a Web search system that utilizes a relevance feedback based profile, called a “task model”, for personalization. Its innovations include flexible and user controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models:
1). Retrieve documents along with their relevance scores by submitting the user query to a search engine.
2). Calculate similarity scores between retrieved documents and the model.
3). Calculate the combined score of each document as alpha * Task_Model_Score + (1 – alpha) * Search_Score.
4). Re-rank the initial list by the combined score from step 3.
The idea of re-ranking is to promote documents that are more relevant to the user's task, as measured by their similarity to the task model.
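A tiny sketch of steps 3 and 4 above: combine the two scores with the alpha-weighted sum and sort by the result. The document IDs, score values, and alpha are invented, and the two score sets are assumed to be normalized to a comparable range, which the real system would need to ensure.

```python
def rerank(search_scores, task_model_scores, alpha=0.5):
    """Combine scores as in step 3 and re-rank as in step 4.
    Both arguments: dict doc_id -> score, assumed comparable (e.g. normalized to [0, 1])."""
    combined = {d: alpha * task_model_scores.get(d, 0.0) + (1 - alpha) * s
                for d, s in search_scores.items()}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# made-up example: doc2 is promoted because it matches the task model well
search = {"doc1": 0.9, "doc2": 0.6, "doc3": 0.4}
task = {"doc1": 0.1, "doc2": 0.8, "doc3": 0.3}
print(rerank(search, task, alpha=0.6))
```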
Muddiest points for week 11
In link analysis, we use a simple iterative logic: good nodes won't point to bad nodes, and vice versa. But how do we define a bad node? And if we use a search engine, would it return only the good nodes, even when a bad node is just as relevant as the good ones?
Friday, March 20, 2015
Muddiest Point for week 10
When Shuguang introduced the Google interface, I wondered why Google uses the "I'm Feeling Lucky" button. It has confused me ever since I first used Google.
Reading Notes for Week 10
19.Web search basics
CPC model: clicking on the advertisement leads the user to a web page set up by the advertiser, where the user is induced to make a purchase.
The essential feature that led to the explosive growth of the web – decentralized content publishing with essentially no central control of authorship – turned out to be the biggest challenge for web search engines in their quest to index and retrieve this content. Spam detection therefore became an important part of search engine development from early on.
Advertising as the economic model
Goto: for every query term q it accepted bids from companies who wanted their web page shown on the query q. In response to the query q, Goto would return the pages of all advertisers who bid for q, ordered by their bids. Furthermore, when the user clicked on one of the returned results, the corresponding advertiser would make a payment to Goto.
The search user experience
It is crucial that we understand the users of web search as well; the more user traffic a web search engine can attract, the more revenue it stands to earn from sponsored search.
There are three types of web search queries: (1) informational, (2) navigational, (3) transactional.
Index size and estimation
It is difficult to reason about the fraction of the Web indexed by a search engine, because dynamic pages make the number of web pages effectively unbounded, and even the static Web is very large.
Estimating the relative sizes of search engine indexes is also imprecise, because:
1). In response to queries, a search engine can return web pages whose contents it has not (fully or even partially) indexed.
2). Search engines generally organize their indexes in various tiers and partitions, not all of which are examined on every search. As a result, there is no single measure of index size.
Construct the random page samples:
1. Random searches
2. Random IP addresses
3. Random walks
4. Random queries
Near-duplicates and shingling
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages.
The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say, 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other.
Illustration of shingle sketches (figure caption): two documents go through four stages of shingle sketch computation. In the first step, we apply a 64-bit hash to each shingle from each document to obtain H(d1) and H(d2); next, we apply a random permutation Π to permute H(d1) and H(d2), obtaining Π(d1) and Π(d2).
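To make the shingling idea concrete, here is a rough sketch: hash k-word shingles, compare two documents by the Jaccard overlap of their shingle sets, and approximate that overlap with a min-hash signature built from random linear permutations. The hash function, the value of k, and the number of permutations are arbitrary choices for illustration, not the exact scheme from the chapter.

```python
import hashlib
import random

def shingles(text, k=4):
    """Set of k-word shingles, each hashed to a 64-bit integer."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_signature(shingle_set, num_perms=64, seed=0):
    """One minimum per random linear permutation of the hashed shingles."""
    rng = random.Random(seed)
    p = (1 << 61) - 1                       # a large prime for the permutations
    perms = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_perms)]
    return [min((a * s + b) % p for s in shingle_set) for a, b in perms]

def estimated_jaccard(sig1, sig2):
    # fraction of permutations on which the two minima agree
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

d1 = "the quick brown fox jumps over the lazy dog near the river bank"
d2 = "the quick brown fox jumps over the lazy dog near the old bridge"
s1, s2 = shingles(d1), shingles(d2)
print(jaccard(s1, s2), estimated_jaccard(minhash_signature(s1), minhash_signature(s2)))
```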
21.Link analysis
In this chapter the author focuses on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query.
PageRank
Definition: the first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank.
In assigning a PageRank score to each node of the web graph, we use the teleport operation in two ways:
(1) When at a node with no out-links, the surfer invokes the teleport operation.
(2) At any node that has outgoing links, the surfer invokes the teleport operation with probability 0 < α < 1 and the standard random walk with probability 1 − α.
Markov chains
A Markov chain is characterized by an N × N transition probability matrix P, each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1.
In this example we consider a user whose interests are 60% sports and 40% politics. If the teleportation probability is 10%, this user is modeled as teleporting 6% of the time to sports pages and 4% to politics pages.
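A minimal power-iteration sketch of PageRank with teleporting, following the rules described above: from a node with no out-links the surfer always teleports; otherwise she teleports with probability alpha and follows a random out-link with probability 1 - alpha. The toy graph, alpha value, and fixed iteration count are assumptions, and this uses uniform teleporting rather than the topic-weighted teleporting of the sports/politics example.

```python
def pagerank(links, alpha=0.1, iterations=100):
    """links: dict node -> list of out-neighbours (all targets must be keys).
    Teleport with probability alpha, and always from a node with no out-links."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for u in nodes:
            out = links[u]
            if out:
                # with probability (1 - alpha) follow a uniformly chosen out-link
                for v in out:
                    new_rank[v] += (1 - alpha) * rank[u] / len(out)
                # with probability alpha teleport to a uniformly chosen node
                for v in nodes:
                    new_rank[v] += alpha * rank[u] / n
            else:
                # dead end: always teleport
                for v in nodes:
                    new_rank[v] += rank[u] / n
        rank = new_rank
    return rank

# toy web graph (made up): d3 has no out-links
graph = {"d1": ["d2", "d3"], "d2": ["d3"], "d3": []}
print(pagerank(graph))
```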
Friday, March 13, 2015
Week 9 reading notes
Chapter 10 of MIR
In this chapter, the author introduces user interfaces, which should help users express their information needs and formulate queries. The chapter also discusses recent developments in this area.
There are three design rules for user interfaces:
- Offer informative feedback
- Reduce working memory load
- Provide alternative interfaces for novice and expert users
- Information access interfaces must also contend with special kinds of simplicity/power trade-offs
How to judge an interactive system
An important aspect of human-computer interaction is the methodology for evaluation of user interface techniques. Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems. The standard evaluations emphasize high recall levels; in the TREC tasks systems are compared to see how well they return the top 1000 documents.
The standard procedure for interacting with an information access system:
- Start with an information need
- Select a system and a collection to search on
- Formulate a query
- Send the query to the system
- Receive the results in the form of information items
- Scan, evaluate, and interpret the results
- Either stop, or,
- Reformulate the query and go to step 4
About the starting points:
One study found that for non-expert users the results of clustering were difficult to use, and that graphical depictions (for example, representing clusters with circles and lines connecting documents) were much harder to use than textual representations (for example, showing titles and topical words, as in Scatter/Gather), because documents' contents are difficult to discern without actually reading some text .
About the query specification:
Boolean queries are problematic because the basic syntax is counter-intuitive.
Solutions to this problem are: 1. allow users to choose from a selection of common, simple ways of combining query terms; 2. provide a simpler or more intuitive syntax.
Visualization For Text Mining:
One of the most common strategies used in text mining is to identify important entities within the text and attempt to show connections among those entities.
Visualizing Document Concordances And Word Frequencies
In the field of literature analysis it is commonplace to analyze a text or a collection of texts by extracting a concordance : an alphabetical index of all the words in a text, showing those words in the contexts in which they appear. The standard way to view the concordance is to place the word of interest in the center of the view with “gutters” on either side, and then sort the surrounding text in some way.
Visualizing Literature And Citation Relationships
This information is used to assess the importance of the authors, the papers, and the topics in the field, and how these measures change over time.
Some more recent approaches have moved away from nodes-and-links in order to break the interesting relationships into pieces and show their connections via brushing-and-linking interactions.
Friday, February 20, 2015
Reading notes for week 6
I realize that I missed some reading notes for the papers required in the past weeks, and I already read chapter 9 last week, so I decided to add some notes for the previous papers.
Cumulated Gain-Based Evaluation of IR Techniques
In this paper, the authors present novel evaluation measures which compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position.
Users of IR systems have long struggled with the large amount of output returned.
There are three new evaluation measures. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques. An essential feature of the proposed measures is the weighting of documents at different levels of relevance.
For example, the discounted cumulated gain: a discounting function is needed that progressively reduces the document score as its rank increases, but not too steeply, to allow for user persistence in examining further documents. A simple way of discounting with this requirement is to divide the document score by the log of its rank.
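A short sketch of the measures: cumulated gain is the running sum of graded relevance scores; discounted cumulated gain divides the gain at rank i by log_b(i) once i reaches the log base b; and the normalized version divides by the DCG of an ideal reordering of the same scores. The graded relevance values below are invented, and b = 2 is just a common choice of base.

```python
import math

def dcg(gains, base=2):
    """Discounted cumulated gain: ranks below 'base' are not discounted
    (with base 2 this reduces to G[1] + sum over i >= 2 of G[i] / log2(i))."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < base else g / math.log(i, base)
    return total

def ndcg(gains, base=2):
    """Normalize by the DCG of the best possible ordering of the same gains."""
    ideal = sorted(gains, reverse=True)
    return dcg(gains, base) / dcg(ideal, base) if any(gains) else 0.0

# invented graded relevance scores (0-3) of the top-10 results, in ranked order
run = [3, 2, 3, 0, 1, 2, 0, 0, 3, 0]
print(round(dcg(run), 3), round(ndcg(run), 3))
```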
Relevance judgments: possible differences between IR techniques in retrieving highly relevant documents might be evened up by their possible indifference in retrieving marginal documents. The net differences might seem practically marginal and statistically insignificant.
Muddiest Point for week 6
In the pooling method we talked about on Monday, you mentioned using the top n documents from each search result to build a pool for judgment. What I want to ask is: how do you decide the most appropriate value of n?
Friday, February 13, 2015
Week 5 Reading Notes
Chapter 10 & Chapter 11
Most of chapter 10 talks about XML and its use in encoding documents and text. An XML document is an ordered, labeled tree; each node of the tree is an XML element and is written with an opening and closing tag. XPath is a standard for enumerating paths in an XML document collection.
Then the author mentions some challenges in structured retrieval (where the collection consists of structured documents and queries may themselves be structured). The first challenge is that users want us to return parts of documents (i.e., XML elements), not entire documents as conventional IR systems do. So we use a helpful principle, the structured document retrieval principle. When I read through the chapter, I was impressed by one method: using one of the largest elements as the indexing unit.
The picture is an example of the separation of an XML document using the vector space model. We can interpret all queries as extended queries – that is, there can be an arbitrary number of intervening nodes in the document for any parent-child node pair in the query.
We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR.
In chapter 11, the author more systematically introduces the probabilistic approach to IR, which provides a different formal basis for a retrieval model and results in different techniques for setting term weights. We can straightforwardly start to estimate the probability of a term t appearing in a relevant document, P(t|R = 1), and this could be the basis of a classifier that decides whether documents are relevant or not.
Binary Independence Model: it is exactly the same as the multivariate Bernoulli Naive Bayes model.
Under the BIM, we model the probability P(R|d, q) that a document is relevant via the probability in terms of term incidence vectors, P(R|⃗x, ⃗q), and then apply Bayes' rule as described above.
In probability estimates in practice, it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection.
Okapi BM25, a non-binary model: the BM25 weighting scheme, often called Okapi weighting after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to term frequency and document length while not introducing too many additional parameters into the model.
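A compact sketch of a common form of the BM25 scoring function: for each query term, an idf factor times a saturating, length-normalized term-frequency factor controlled by k1 and b. The toy collection, the simple idf = log(N/df), and the defaults k1 = 1.2, b = 0.75 are assumptions; real systems usually use a smoothed idf and tuned parameters.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len, k1=1.2, b=0.75):
    """Sum over query terms of idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len/avg_len))."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in doc_freqs or tf[t] == 0:
            continue
        idf = math.log(n_docs / doc_freqs[t])
        norm = tf[t] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[t] * (k1 + 1) / norm
    return score

# toy collection (made up)
docs = {"d1": "the cat sat on the mat".split(),
        "d2": "the dog chased the cat".split(),
        "d3": "dogs and cats living together".split()}
df = Counter(t for toks in docs.values() for t in set(toks))
avg_len = sum(len(toks) for toks in docs.values()) / len(docs)
query = ["cat", "dog"]
for d, toks in docs.items():
    print(d, round(bm25_score(query, toks, df, len(docs), avg_len), 3))
```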
muddiest point for week 5
I am quite confused about the language model. Is it based on the vector space model? If we have the same document, can we construct different language models from it?
Saturday, February 7, 2015
muddiest point for week 4
Actually, I am a little bit confused about the question on slide 45 of the PPT. Which word is a better search term? Is it "insurance"?
Thursday, February 5, 2015
Week 4 Reading Notes
Chapter 8: Evaluation in Information Retrieval
& Chapter 9: Relevance feedback and query expansion
In the previous chapters I read over the past month, the authors mentioned many alternatives in designing an IR system; in this chapter they discuss how to measure the effectiveness of an IR system and how to develop further measures for evaluating ranked retrieval results. User utility is also considered: the user's happiness is very important for a good IR system, and the speed of response and the size of the index are among the factors that affect it. The chapter also notes that user satisfaction is determined by many other factors, for example the design of the user interface. The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. A document is relevant not merely because it contains all the words in the query, but because it addresses the underlying information need, which is not always stated overtly.
In addition, the most standard test collections and evaluation series are introduced in this chapter, for example the Cranfield collection, TREC, and the NII Test Collections. For evaluation of unranked retrieval sets, the two most frequent and basic measures of information retrieval effectiveness are precision and recall. An alternative might seem to be accuracy, the fraction of classifications that are correct, but accuracy is not an appropriate measure for information retrieval problems.
The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, the precision and recall values can be plotted to give a precision-recall curve.
Moreover, to evaluate a system we need relevance judgments for the test information needs. The most standard approach is called pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems. But relevance judgments are quite idiosyncratic and variable, and the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.
Finally, user utility is discussed, which means making the user happy. For a web search engine, happy search users are those who find what they want and come back to the search engine again. However, it is very hard to measure user satisfaction directly.
Actually, we already read about synonymy in chapter 1: the same concept may be referred to using different words. In chapter 9, the author discusses ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
Global methods: techniques for expanding or reformulating the query
1). Query expansion/ reformulation with a thesaurus or WordNet
2). Query expansion via automatic thesaurus generation
3). Techniques like spelling correction
Local methods adjust a query relative to the documents that initially appear to match it. 1). Relevance feedback (most commonly used) --> interactive relevance feedback can give very substantial gains in retrieval performance.
2). Pseudo relevance feedback, also known as blind relevance feedback --> provides a method for automatic local analysis: do normal retrieval to find an initial set of the most relevant documents, then assume the top-ranked ones are relevant and proceed as in relevance feedback.
3). Global indirect relevance feedback
The core idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set --> the Rocchio algorithm (a small sketch follows after this list).
Probabilistic relevance feedback --> Naive Bayes probabilistic model
Relevance feedback on the Web --> general relevance feedback has been little used in web search; the successful use of web links can instead be viewed as a form of implicit feedback.
Automatic thesaurus generation --> an alternative to the cost of a manual thesaurus: analyze a collection of documents, either exploiting word co-occurrence or using a shallow grammatical analysis of the text to exploit grammatical relations or dependencies.
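The Rocchio update mentioned in the list above moves the query vector toward the centroid of the documents judged relevant and away from the centroid of the nonrelevant ones. Here is a minimal sketch with invented term weights; the weights alpha = 1.0, beta = 0.75, gamma = 0.15 are commonly cited defaults, taken here as an assumption.

```python
def centroid(vectors):
    """Component-wise mean of a list of term-weight dicts."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()} if vectors else {}

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant),
    with negative weights clipped to zero as is usually done."""
    rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
    terms = set(query) | set(rel_c) | set(nonrel_c)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0) + beta * rel_c.get(t, 0.0) - gamma * nonrel_c.get(t, 0.0)
        if w > 0:
            new_q[t] = w
    return new_q

# invented example: user marks two documents relevant and one nonrelevant
q = {"jaguar": 1.0}
relevant = [{"jaguar": 0.9, "cat": 0.8}, {"jaguar": 0.7, "wildlife": 0.6}]
nonrelevant = [{"jaguar": 0.5, "car": 0.9}]
print(rocchio(q, relevant, nonrelevant))
```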
Friday, January 30, 2015
Reading notes for week 4
Chapter 6
We use Boolean queries to do searches: 0 represents no match, 1 represents a match. When the number of matching documents is too large, we need a ranking method to rank those result documents. This chapter introduces methods for scoring a document.
In the first part, on parametric and zone indexes: parametric indexes are designed to select only the documents matching a query on document metadata. Zones are similar to fields, except that the contents of a zone can be arbitrary free text. The difference between parametric and zone indexes is that the dictionary for a parametric index comes from a fixed vocabulary (a set of languages, or a set of dates), while the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone. The weighted zone score is defined to be a linear combination of zone scores, where each zone of the document contributes a Boolean value.
In term frequency and weighting, we assign each term in each document a weight that depends on the number of occurrences of the term in the document, and compute a score between a query term t and a document d based on the weight of t in d. Another problem mentioned in this part is that certain terms have little or no discriminating power in determining relevance, so we can scale down the term weights of terms with high collection frequency, defined to be the total number of occurrences of a term in the collection.
The vector space model for scoring represents each document as a vector that captures the relative importance of its terms; it lets us compute the similarity between two documents simply from the magnitude of the vector difference, or, as the chapter develops further, from the cosine of the angle between the vectors. In this chapter, the author introduces many algorithms in the vector space model which are hard to summarize, so I think I have to spend more time reading this chapter and strengthening my background in mathematics.
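A small sketch of one tf-idf/cosine variant over a toy collection: weight each term by raw term frequency times log(N/df) and rank documents by the cosine between the query vector and each document vector. The documents, the query, and this particular weighting variant are illustrative choices, since the chapter describes several.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Raw tf times log(N/df) for every token known to the collection."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}

def cosine(v, w):
    dot = sum(x * w.get(t, 0.0) for t, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

# invented toy collection and query
docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
n = len(docs)
df = Counter(t for toks in docs.values() for t in set(toks))
vecs = {d: tfidf_vector(toks, df, n) for d, toks in docs.items()}
query_vec = tfidf_vector("new new times".split(), df, n)
ranking = sorted(vecs, key=lambda d: cosine(query_vec, vecs[d]), reverse=True)
print([(d, round(cosine(query_vec, vecs[d]), 3)) for d in ranking])
```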
Chapter 7
In chapter 6, we developed the theory of computing term weights in a document to score those terms, and introduced the vector space model and the basic cosine scoring algorithm. In chapter 7, the author presents methods to speed up this computation, outlines a complete search engine, and shows how the pieces fit together.
Inexact top-k document retrieval --> aims to dramatically lower the cost of computing the top k documents without materially altering the user's perceived relevance of the top k results.
Index elimination:
1). we can consider documents containing terms whose idf exceeds a preset threshold.
2). we can consider documents that contain many query terms.
Champion lists and static quality scores: for each term, we can precompute a champion list of the r documents with the highest weights for that term; in many search engines we also have available a measure of quality g(d) for each document d that is query-independent and thus static.
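A toy sketch combining the two heuristics above: keep, for each term, a champion list of the r documents with the highest weight for that term, and at query time score only documents that appear in some query term's champion list, returning the top k with a heap. The per-term weights, r, and k are invented.

```python
import heapq
from collections import defaultdict

def build_champion_lists(weights, r=2):
    """weights: dict term -> dict doc -> weight. Keep only the r highest-weighted docs per term."""
    return {t: dict(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:r])
            for t, d in weights.items()}

def topk(query_terms, champions, k=2):
    scores = defaultdict(float)
    for t in query_terms:
        for doc, w in champions.get(t, {}).items():   # only champion docs are scored
            scores[doc] += w
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# invented per-term document weights (e.g. tf-idf)
weights = {"car": {"d1": 0.9, "d2": 0.3, "d3": 0.8, "d4": 0.1},
           "insurance": {"d2": 0.7, "d3": 0.6, "d5": 0.2}}
champions = build_champion_lists(weights, r=2)
print(topk(["car", "insurance"], champions, k=2))
```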
Components of an information retrieval system:
Muddiest points for week 3
I am still confused about the gamma code. It is said that 1 and 0 can be switched in unary code, so 3 can be coded as either 0001 or 1110, but how do we know which convention is used? I think the instructor should give us more examples in class, for the gamma code and for the other material.
Friday, January 23, 2015
Reading notes for week 3
Chapter 4. Index Construction
Some basic knowledge for computer hardware:
1. We always keep frequently used disk data in main memory (caching).
2. Seek time: we have to reduce the seek time.
3. Store blocks in a buffer.
Index methods for large collections:
1. Represent terms by termIDs (instead of the strings used in the methods introduced in chapter 1).
Blocked sort-based indexing (BSBI) algorithm:
1. Segment the collection into parts of equal size.
2. Sort the termID-docID pairs of each part in memory.
3. Store the intermediate sorted results on disk and merge them into the final index.
Single-pass in-memory indexing (SPIMI) writes each block's dictionary to disk --> it can index a collection of any size as long as there is enough disk space available. Each postings list is dynamic, and the SPIMI algorithm is faster than BSBI because it saves memory. We have to sort the terms before writing a block (line 11 of the book's pseudocode) because we want to write postings lists in lexicographic order to facilitate the final merging step.
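A very small sketch in the spirit of SPIMI's per-block step: grow postings lists dynamically in an in-memory dictionary for one block, then sort the terms so the block's postings lists come out in lexicographic order, ready for the final merge. Real SPIMI also writes compressed blocks to disk; the documents here are invented.

```python
from collections import defaultdict

def spimi_invert_block(doc_stream):
    """doc_stream: iterable of (doc_id, tokens). Returns term -> postings list for one block."""
    dictionary = defaultdict(list)              # postings lists grow dynamically
    for doc_id, tokens in doc_stream:
        for term in tokens:
            postings = dictionary[term]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    # sort terms so postings lists are written in lexicographic order for the final merge
    return {term: dictionary[term] for term in sorted(dictionary)}

block = [(1, "new home sales top forecasts".split()),
         (2, "home sales rise in july".split()),
         (3, "new home construction".split())]
for term, postings in spimi_invert_block(block).items():
    print(term, postings)
```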
Distributed indexing (for the web) is an application of MapReduce. The map phase splits the input data into key-value pairs, and in the reduce phase we store all values for a given key close together.
Dynamic indexing: a dynamic index is complex, so some large search engines adopt a reconstruction-from-scratch strategy: they do not construct indexes dynamically, but instead build a new index from scratch periodically.
Chapter 5. Index compression
The benefits of compression are very clear: less disk space --> increased use of caching --> faster transfer of data from disk to memory.
This chapter introduces lossless compression, which preserves all the information (compared to the lossy compression introduced in chapter 6).
Heaps' law estimates the number of terms (the vocabulary size M) as a function of the number of tokens T in the collection, M = k·T^b:
(i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached;
(ii) the size of the dictionary is quite large for large collections.
Zipf's law models the distribution of terms (the collection frequency of the i-th most common term is proportional to 1/i) --> it helps us understand how terms are distributed across documents, which helps characterize the properties of algorithms for compressing postings lists.
In the next section, the author introduces dictionary compression, the main goal of which is to fit the dictionary in main memory, or at least a large portion of it, to support high query throughput. After reading this section, I realize that even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections.
Postings file compression covers variable byte codes and γ codes. The former encoding uses an integral number of bytes to encode a gap, while the latter adapts the length of the code at the finer-grained bit level; the entropy H(P) determines its coding properties.
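A small sketch of gamma encoding and, for comparison, variable byte encoding of positive integer gaps. Following the convention in the book, the length part is written in unary as 1s terminated by a 0, and the offset is the gap's binary representation with its leading 1 removed; the opposite 0/1 convention (the one my earlier muddiest point asks about) would work equally well as long as encoder and decoder agree.

```python
def gamma_encode(gap):
    """Gamma code of a positive integer: unary(length of offset) + offset,
    where offset is the gap's binary representation without its leading 1."""
    assert gap >= 1
    offset = bin(gap)[3:]                   # bin(13) == '0b1101' -> offset '101'
    length_unary = "1" * len(offset) + "0"  # e.g. length 3 -> '1110'
    return length_unary + offset

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    length = bits.index("0")                # number of leading 1s = offset length
    offset = bits[length + 1:length + 1 + length]
    return int("1" + offset, 2) if length else 1

def vb_encode(gap):
    """Variable byte code: 7 payload bits per byte, high bit set on the last byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, gap % 128)
        if gap < 128:
            break
        gap //= 128
    bytes_out[-1] += 128                    # continuation bit terminates the number
    return bytes(bytes_out)

for g in [1, 2, 3, 9, 13, 1025]:
    print(g, gamma_encode(g), gamma_decode(gamma_encode(g)), vb_encode(g).hex())
```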