Friday, February 13, 2015

Week 5 Reading Notes

Chapter 10 & Chapter 11

Most of Chapter 10 talks about XML and its use in encoding documents and text. An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag. XPath is a standard for enumerating paths in an XML document collection.
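To make the tree idea concrete for myself, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The play fragment is made up for illustration, and ElementTree only supports a limited subset of XPath, so this is not a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Illustrative XML fragment (an assumption, not taken from my notes).
XML_TEXT = """
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
""".strip()

root = ET.fromstring(XML_TEXT)


def labels(node, depth=0):
    """Return (depth, tag) pairs in document order (a preorder tree walk)."""
    out = [(depth, node.tag)]
    for child in node:
        out.extend(labels(child, depth + 1))
    return out


# The ordered, labeled tree: every element is a node with a tag as its label.
tree_labels = labels(root)

# Path expressions in ElementTree's XPath subset.
scene_titles = [e.text for e in root.findall("./act/scene/title")]
all_titles = [e.text for e in root.findall(".//title")]

for depth, tag in tree_labels:
    print("  " * depth + tag)
print(scene_titles)
print(all_titles)
```

The path ./act/scene/title only selects titles of scenes, while .//title selects every title element at any depth, which is why the second list also contains the title of the play.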
Then the author mentions some challenges that make structured retrieval (where the collection consists of structured documents and queries are either structured or unstructured) harder than unstructured retrieval. The first challenge is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do. To decide which part to return, we can use the structured document retrieval principle: a system should always retrieve the most specific part of a document answering the query. When I read through the chapter, I was impressed by one method for choosing the indexing unit: we can use one of the largest elements as the indexing unit.
The picture is an example of separating an XML document into subtrees for the vector space model. We can interpret all queries as extended queries, that is, there can be an arbitrary number of intervening nodes in the document for any parent-child node pair in the query.


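Users prefer documents that match the query structure closely, with fewer additional nodes inserted. We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR, which is larger when fewer extra nodes have to be inserted into cq to obtain cd.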
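As far as I understand the chapter, CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise, where |cq| and |cd| are the numbers of nodes in the two paths and cq matches cd if we can turn cq into cd by inserting extra nodes. A small sketch of that, with paths represented as Python lists of tag names (that representation and the function names are my own assumption):

```python
def matches(cq, cd):
    """True if cq can be turned into cd by inserting nodes,
    i.e. cq is an order-preserving subsequence of cd."""
    remaining = iter(cd)
    # 'node in remaining' advances the iterator until it finds node.
    return all(node in remaining for node in cq)


def context_resemblance(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    if not matches(cq, cd):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))


query_path = ["book", "title"]
print(context_resemblance(query_path, ["book", "title"]))
print(context_resemblance(query_path, ["book", "chapter", "title"]))
print(context_resemblance(query_path, ["title", "book"]))
```

An identical path gets the value 1.0, a path with one inserted node gets 3/4, and a path that does not match gets 0.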

In Chapter 11, the author introduces the probabilistic approach to IR more systematically. It provides a different formal basis for a retrieval model and results in different techniques for setting term weights. If we have some known relevant and nonrelevant documents, we can straightforwardly start to estimate the probability of a term t appearing in a relevant document, P(t|R = 1), and this could be the basis of a classifier that decides whether documents are relevant or not.

Binary Independence Model (BIM): it is exactly the same as the multivariate Bernoulli Naive Bayes model.


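Under the BIM, we model the probability P(R|d, q) that a document is relevant via the probability in terms of term incidence vectors, P(R|x⃗, q⃗). We then apply Bayes' rule, as described above, to rewrite this probability in terms of how likely the vector x⃗ is among relevant and among nonrelevant documents.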
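Written out, the Bayes' rule step for the two relevance classes looks roughly like this (a sketch in the chapter's notation as I understand it, where R = 1 means relevant and R = 0 means nonrelevant):

```latex
\begin{align}
P(R=1 \mid \vec{x}, \vec{q}) &= \frac{P(\vec{x} \mid R=1, \vec{q})\, P(R=1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})} \\
P(R=0 \mid \vec{x}, \vec{q}) &= \frac{P(\vec{x} \mid R=0, \vec{q})\, P(R=0 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}
\end{align}
```

Both expressions share the denominator, so it cancels when we only care about the odds of relevance for ranking.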
For probability estimates in practice, it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection, since relevant documents are usually a very small fraction of the collection.
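To see what this approximation does, here is a small sketch. It assumes the usual notation (not in my notes above): N is the number of documents, df_t is the number of documents containing term t, and u_t is the probability that t appears in a nonrelevant document. With u_t approximated by df_t / N, the nonrelevant part of the term weight, log((1 - u_t) / u_t), comes out almost the same as the familiar idf value log(N / df_t) for rare terms.

```python
import math


def nonrelevant_weight(df_t, n_docs):
    """log((1 - u_t) / u_t), with u_t approximated by df_t / N."""
    u_t = df_t / n_docs
    return math.log((1 - u_t) / u_t)


def idf(df_t, n_docs):
    """Plain inverse document frequency, log(N / df_t)."""
    return math.log(n_docs / df_t)


n_docs = 1_000_000  # hypothetical collection size
for df_t in (10, 1_000, 500_000):
    print(df_t, nonrelevant_weight(df_t, n_docs), idf(df_t, n_docs))
```

For the rare terms the two numbers are nearly identical. For a term that occurs in half of the collection, the weight drops to zero while the plain idf does not, so the approximation only looks like idf when df_t is small compared with N.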

Okapi BM25, a non-binary model: The BM25 weighting scheme, often called Okapi weighting after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to term frequency and document length while not introducing too many additional parameters into the model.
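Here is a minimal sketch of one common form of BM25 scoring, to show where term frequency and document length enter. It is my own illustration under some assumptions: documents are already tokenized into lists of terms, the idf part is the simple log(N / df_t), the query-term-frequency factor is left out, and k1 = 1.2 and b = 0.75 are just commonly quoted default values. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in docs against query with a basic BM25 formula.

    query: list of terms; docs: non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(n_docs / df[t])
            # Length normalization: b = 0 switches it off, b = 1 applies it fully.
            norm = k1 * ((1 - b) + b * len(d) / avg_len) + tf[t]
            score += idf * (k1 + 1) * tf[t] / norm
        scores.append(score)
    return scores


docs = [
    ["okapi", "weighting", "okapi"],
    ["probabilistic", "model"],
    ["vector", "space", "model", "retrieval"],
]
scores = bm25_scores(["okapi"], docs)
print(scores)
```

Only the first document contains the query term, so it is the only one with a nonzero score. The parameter k1 controls how quickly the contribution of repeated terms saturates, and b controls how strongly long documents are penalized.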
