Chapter 8: Evaluation in Information Retrieval
& Chapter 9: Relevance feedback and query expansion
In the previous chapter, which I read in the past month, the authors presented many alternatives in designing an IR system; in this chapter, they discuss how to measure the effectiveness of an IR system and how to develop further measures for evaluating ranked retrieval results. User utility is also measured, since the user's happiness is very important for a good IR system; for example, the speed of response and the size of the index are among the factors in users' happiness. The chapter also mentions that users' satisfaction is determined by many other factors, for example the design of the user interface.
The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. But a document is relevant not merely because it contains all the words in the query; it is relevant because it addresses the underlying information need, which is often not stated overtly.
In addition, a list of the most standard test collections and evaluation series is introduced in this chapter, for example the Cranfield collection, TREC, and the NII Test Collection. In evaluation of unranked retrieval sets, the two most frequent and basic measures of information retrieval effectiveness are precision and recall. An alternative might seem to be accuracy, which is the fraction of classifications that are correct. But accuracy is not an appropriate measure for the information retrieval problem: in a typical collection almost all documents are nonrelevant, so a system that returns nothing at all can still achieve very high accuracy.
The advantage of having two separate numbers, precision and recall, is that in many circumstances one is more important than the other.
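The measures above can be sketched in a few lines of Python. This is a minimal illustration with made-up document IDs and collection size, not code from the book; it also shows why accuracy can be misleadingly high when nonrelevant documents dominate.

```python
def precision_recall_accuracy(retrieved, relevant, collection_size):
    """Compute the basic unranked evaluation measures.

    retrieved, relevant: sets of document IDs; collection_size: total number
    of documents N in the collection.
    """
    tp = len(retrieved & relevant)          # relevant docs we retrieved
    fp = len(retrieved - relevant)          # nonrelevant docs we retrieved
    fn = len(relevant - retrieved)          # relevant docs we missed
    tn = collection_size - tp - fp - fn     # everything correctly left out

    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    accuracy = (tp + tn) / collection_size
    return precision, recall, accuracy

# Toy example: in a collection of 1000 docs, 10 are relevant and we retrieve
# 8 documents, of which 4 are actually relevant.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
relevant = {1, 2, 3, 4, 101, 102, 103, 104, 105, 106}
p, r, a = precision_recall_accuracy(retrieved, relevant, 1000)
print(p, r, a)  # 0.5 0.4 0.99
```

Note that accuracy comes out at 0.99 even though the system found fewer than half of the relevant documents, which is exactly why precision and recall are preferred.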
In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, the precision and recall values can be computed and plotted as a precision-recall curve.
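Computing the points of such a curve is straightforward: walk down the ranking and record precision and recall after each rank k. The ranking and relevance set below are invented for illustration.

```python
def pr_points(ranking, relevant):
    """Precision and recall after each rank k of a ranked result list."""
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / k, hits / len(relevant)))  # (P@k, R@k)
    return points

ranking = ["d3", "d1", "d7", "d2", "d9"]   # system output, best first
relevant = {"d1", "d2", "d4"}              # judged relevant documents
points = pr_points(ranking, relevant)
for p, r in points:
    print(f"P={p:.2f} R={r:.2f}")
```

Plotting these (P, R) pairs gives the characteristic sawtooth shape: precision jumps up whenever the next retrieved document is relevant and drifts down otherwise, while recall only ever increases.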
Moreover, to evaluate the system, we have to gather relevance judgments. The most standard approach is called pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems. But relevance judgments are quite idiosyncratic and variable, and the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.
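The pooling step itself is just a set union over the per-system top-k lists; only the pooled documents are shown to human judges. A minimal sketch, with hypothetical run outputs:

```python
def pool(runs, k):
    """Form the judging pool: the union of the top-k documents of each run.

    runs: list of ranked result lists (best document first).
    """
    pooled = set()
    for run in runs:
        pooled.update(run[:k])
    return pooled

run_a = ["d1", "d2", "d3", "d4"]
run_b = ["d3", "d5", "d1", "d6"]
print(sorted(pool([run_a, run_b], 3)))  # ['d1', 'd2', 'd3', 'd5']
```

Documents outside the pool are simply assumed nonrelevant, which is the main approximation pooling makes.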
Finally, user utility is discussed, which means making the user happy. For a web search engine, happy search users are those who find what they want and desire to use this search engine again. But it is very hard to investigate the satisfaction of the user.
Actually, we have already read about synonymy in chapter 1, which means the same concept may be referred to using different words. In chapter 9, the authors discuss the ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
Global methods: techniques for expanding or reformulating a query
1). Query expansion/reformulation with a thesaurus or WordNet
2). Query expansion via automatic thesaurus generation
3). Techniques like spelling correction
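Thesaurus-based expansion (item 1 above) can be sketched very simply: for each query term, add its known synonyms alongside the original term. The tiny hand-built thesaurus below is hypothetical; a real system would draw on WordNet or an automatically generated thesaurus.

```python
# Hypothetical toy thesaurus, standing in for WordNet or an
# automatically generated one.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand_query(terms, thesaurus):
    """Add every known synonym of each query term, keeping the originals."""
    expanded = list(terms)
    for t in terms:
        for syn in thesaurus.get(t, []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded

print(expand_query(["fast", "car"], THESAURUS))
# ['fast', 'car', 'quick', 'rapid', 'automobile', 'vehicle']
```

The usual caveat applies: expansion tends to raise recall but can hurt precision when a synonym is appropriate only in some senses of the original term.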
Local methods: adjust a query to the documents that initially appear to match the query.
1). Relevance feedback (most commonly used) --> Interactive relevance feedback can give very substantial gains in retrieval performance.
2). Pseudo relevance feedback, also known as blind relevance feedback --> provides a method for automatic local analysis: do normal retrieval to find an initial set of most relevant documents, then assume the top-ranked ones are relevant and perform relevance feedback as usual.
3). Global indirect relevance feedback
The core idea of RF is to involve the user in the retrieval process so as to improve the final result set. --> The Rocchio Algorithm
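The Rocchio update can be sketched directly: move the original query vector toward the centroid of the documents the user marked relevant and away from the centroid of those marked nonrelevant. The default weights α=1, β=0.75, γ=0.15 are the reasonable values suggested in the book; the three-term vectors below are invented for illustration.

```python
def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query update over equal-length lists of term weights.

    q: original query vector; rel_docs / nonrel_docs: judged document vectors.
    """
    n = len(q)
    rel_centroid = [sum(d[i] for d in rel_docs) / len(rel_docs)
                    for i in range(n)]
    nonrel_centroid = [sum(d[i] for d in nonrel_docs) / len(nonrel_docs)
                       for i in range(n)]
    # Negative term weights make no sense in a query, so clip at zero.
    return [max(0.0, alpha * q[i] + beta * rel_centroid[i]
                - gamma * nonrel_centroid[i])
            for i in range(n)]

q = [1.0, 0.0, 0.0]                       # query mentions only term 0
rel = [[1.0, 1.0, 0.0], [1.0, 0.5, 0.0]]  # judged relevant
nonrel = [[0.0, 0.0, 1.0]]                # judged nonrelevant
q_new = rocchio(q, rel, nonrel)
print(q_new)  # [1.75, 0.5625, 0.0]
```

Note how term 1, absent from the original query, now carries positive weight because the relevant documents contain it, while term 2 is pushed to zero.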
Probabilistic relevance feedback --> Naive Bayes probabilistic model
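In the probabilistic variant, the judged documents are used to re-estimate, for each term, the probability that the term appears in a relevant versus a nonrelevant document. The sketch below uses the common add-0.5 smoothing and approximates the nonrelevant statistics from the rest of the collection; the documents (as term sets) are invented for illustration.

```python
def term_prob_estimates(term, relevant_docs, all_docs):
    """Smoothed estimates of P(term | relevant) and P(term | nonrelevant).

    relevant_docs: judged-relevant documents, each a set of terms.
    all_docs: the whole (toy) collection; nonrelevant stats are approximated
    from the documents outside the relevant set.
    """
    vr = len(relevant_docs)
    vrt = sum(1 for d in relevant_docs if term in d)
    n = len(all_docs)
    nt = sum(1 for d in all_docs if term in d)
    p_rel = (vrt + 0.5) / (vr + 1)                  # add-0.5 smoothing
    p_nonrel = (nt - vrt + 0.5) / (n - vr + 1)
    return p_rel, p_nonrel

docs = [{"jaguar", "car"}, {"car", "engine"}, {"jaguar", "animal"},
        {"animal", "zoo"}]
relevant = [docs[0], docs[1]]   # the user marked the "car" documents relevant
p_rel, p_non = term_prob_estimates("car", relevant, docs)
print(p_rel, p_non)
```

Terms with a high ratio of p_rel to p_non (here, "car" rather than "animal") are the ones a Naive Bayes model would promote in the reweighted query.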
Relevance feedback on the Web --> explicit relevance feedback has been little used in web search, and the successful use of web links can be viewed as a form of implicit feedback.
Automatic thesaurus generation --> an alternative to the cost of a manual thesaurus: analyze a collection of documents, either by exploiting word co-occurrence, or by using a shallow grammatical analysis of the text to exploit grammatical relations or dependencies.
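The co-occurrence route can be sketched with a term-document count matrix A: the similarity matrix C = A·Aᵀ scores how often (and how heavily) two terms appear in the same documents, and each term's nearest neighbor in C is its candidate thesaurus entry. The toy matrix below is invented for illustration.

```python
# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "banana"]
A = [
    [2, 0, 1, 0],   # car
    [1, 0, 2, 0],   # automobile
    [1, 1, 1, 0],   # engine
    [0, 0, 0, 3],   # banana
]

def cooccurrence_similarity(A):
    """C = A * A-transpose: C[u][v] is the weighted co-occurrence of u and v."""
    n = len(A)
    return [[sum(A[u][d] * A[v][d] for d in range(len(A[u])))
             for v in range(n)]
            for u in range(n)]

C = cooccurrence_similarity(A)
# Most similar term to "car", excluding itself:
u = terms.index("car")
best = max((v for v in range(len(terms)) if v != u), key=lambda v: C[u][v])
print(terms[best])  # automobile
```

As the chapter notes, such automatically generated thesauri are noisy (frequent co-occurrence does not guarantee synonymy), which is why the grammatical-analysis variant can give cleaner relations.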