Friday, January 9, 2015

Reading notes for week 1 & 2

After finishing the reading assignments for weeks 1 and 2, I got a basic understanding of information retrieval. IR was originally developed in libraries, but with the spread of the World Wide Web it is now used widely on the Web. As defined in section 1.1 of Introduction to Information Retrieval, IR means finding material of an unstructured nature that satisfies an information need from within large collections. I think the biggest difference between IR and databases is that the data IR processes is unstructured.

In section 1.2, I learned how to build an inverted index. We assign a docID to each document, build an index over the linguistic content of the documents, and keep each term's postings sorted by docID. This is, to some extent, an efficient structure and can reduce the storage space. Two ways to store and access the postings efficiently, linked lists and variable-length arrays, are briefly discussed in this section.
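To make sure I understood the construction, here is a minimal sketch of an in-memory inverted index over a toy collection (the three example documents are my own, not from the book):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings list: a sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings are kept sorted by docID, as the book describes.
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_inverted_index(docs)
print(index["sales"])  # → [0, 1, 2]: each docID listed once, in order
```

Using a set per term during construction means a docID is stored at most once even if the term repeats within a document.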

Linguistic processing and tokenization were briefly introduced in chapter 1, and chapter 2 gives the details of how to tokenize and do linguistic preprocessing of tokens. As introduced in this chapter, when determining the vocabulary of terms, tokenization is the process of chopping raw character streams into tokens, and the main problem is deciding what the correct tokens are, which is language-specific. For word segmentation, the methods include hidden Markov models and conditional random fields.
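A crude English-only tokenizer can be sketched in one regular expression; this is my own simplification, not the book's method, and it shows why tokenization is language-specific (it would fail completely on unsegmented Chinese text, for example):

```python
import re

def tokenize(text):
    """Chop a character stream into lowercase tokens.
    A crude, English-specific rule: runs of letters, digits,
    and internal apostrophes count as one token."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Mr. O'Neill thinks the boys' stories are amusing."))
# apostrophes stay inside tokens; punctuation is dropped
```

Even this tiny example raises the questions chapter 2 discusses: should "O'Neill" be one token or two, and should trailing apostrophes be stripped?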

When I first saw assignment 1, I was confused about stop words. Chapter 2 helped me a lot, especially on why we should drop stop words and how we determine them: filtering documents against a stop list reduces the storage space the index needs. This chapter also introduced stemming and lemmatization, which aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Porter's algorithm is introduced as the most common algorithm for stemming English. The stemmed form itself does not matter; what matters is the equivalence class it produces, and the classes differ across languages. In the next part, the author introduces combination schemes, which fruitfully combine biword indexes and positional indexes.
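The two ideas can be sketched together. The stop list and the suffix rules below are my own toy choices (this is NOT Porter's algorithm); the point is only that related forms collapse into the same equivalence class:

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def remove_stop_words(tokens):
    """Drop extremely common terms before indexing to save postings space."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Toy suffix-stripper: strip the first matching suffix, keeping
    at least three characters of the stem."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = remove_stop_words(["connecting", "the", "connected", "connects"])
print([crude_stem(t) for t in tokens])
# → ['connect', 'connect', 'connect']: one equivalence class
```

The exact string "connect" is irrelevant; any consistent label would do, which is the book's point that only the equivalence class matters.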

After we have basic knowledge of queries and the inverted index, chapter 3 teaches us to develop data structures that support searching for terms in the vocabulary of an inverted index.
Search structures for dictionaries include hashing and search trees. The B-tree is very interesting: it generalizes the binary search tree by allowing the number of sub-trees under each internal node to vary within a fixed interval [a, b], which mitigates rebalancing.

Wildcard queries are used in these four situations:
1. uncertain of the spelling of a query term;
2. aware of multiple variants of spelling a term;
3. seeking documents containing variants of a term while unsure about stemming;
4. uncertain of the correct rendition of a foreign word or phrase.

The main techniques for handling general wildcard queries are the permuterm index and the k-gram index.
I find there is a close relationship between the B-tree and the permuterm index. But the permuterm index has a distinctive problem: storing every rotation of every term makes its dictionary quite large.
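The relationship is that each rotation is looked up by prefix, exactly what a B-tree over the rotated dictionary is good at. A minimal sketch, using a linear scan where a real system would do a B-tree range scan (the vocabulary here is made up):

```python
def permuterm_rotations(term):
    """All rotations of term + '$', the end-of-term marker."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(vocabulary):
    """Each rotation points back to its original term; this is why
    the dictionary grows so much."""
    return {rot: term
            for term in vocabulary
            for rot in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Handle a single-* wildcard: rotate the query so * lands at the
    end, then prefix-match (a B-tree range scan in practice)."""
    head, tail = query.split("*")
    rotated_prefix = tail + "$" + head          # X*Y  ->  Y$X as prefix
    return sorted({t for rot, t in index.items()
                   if rot.startswith(rotated_prefix)})

idx = build_permuterm_index(["hello", "help", "held", "world"])
print(wildcard_lookup(idx, "hel*"))   # → ['held', 'hello', 'help']
print(wildcard_lookup(idx, "h*p"))    # → ['help']
```

A term of length n contributes n + 1 rotations, which makes the size problem above concrete.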

The principles for spelling correction:
1. choose the nearest correctly spelled term;
2. when two correctly spelled candidates are equally near, select the one that is more common.
The author focuses on two kinds of spelling correction: isolated-term correction and context-sensitive correction.
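For isolated-term correction, "nearest" is commonly measured by Levenshtein edit distance. A sketch of both principles together; the vocabulary and its frequency counts are made up for illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(query, vocab_counts):
    """Principle 1: pick the nearest term.
    Principle 2: break ties by collection frequency."""
    return min(vocab_counts,
               key=lambda t: (edit_distance(query, t), -vocab_counts[t]))

counts = {"grunt": 5, "grant": 50, "grand": 20}  # hypothetical frequencies
print(correct("grnt", counts))
# → "grant": tied with "grunt" at distance 1, but more common
```

This only handles isolated terms; context-sensitive correction would also consider the surrounding query words.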
