Friday, April 3, 2015

Reading notes for week 12

Chapter 13: Text classification and Naive Bayes

The notion of classification is very general and has many applications within and beyond information retrieval (IR). For instance, in computer vision, a classifier may be used to divide images into classes such as landscape, portrait, and neither.  

Classification in IR appears in many settings, including the following:


1). The automatic detection of spam pages (which are then not included in the search engine index).
2). The automatic detection of sexually explicit content (which is included in search results only if the user turns an option such as SafeSearch off).
3). Sentiment detection, or the automatic classification of a movie or product review as positive or negative.
4). Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
5). Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic.


Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes: γ : X → C, where X is the document space and C is the set of classes.

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as

P(c|d) ∝ P(c) ∏_{1 ≤ k ≤ nd} P(tk|c)

where t1, ..., tnd are the terms of d, P(tk|c) is the conditional probability of term tk occurring in a document of class c, and P(c) is the prior probability of class c.

An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model, which generates an indicator for each term of the vocabulary: 1 indicating presence of the term in the document, or 0 indicating absence.


To reduce the number of parameters, we make the Naive Bayes conditional independence assumption. We assume that attribute values are independent of each other given the class:

P(d|c) = P(⟨t1, ..., tnd⟩|c) = ∏_{1 ≤ k ≤ nd} P(Xk = tk|c)
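As a concrete illustration, here is a minimal multinomial NB trainer and classifier in Python. This is my own sketch rather than anything from the chapter: it uses add-one (Laplace) smoothing and sums log probabilities to avoid floating-point underflow, and the toy China/not-China training set is in the spirit of the chapter's worked example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token_list, class_label) pairs."""
    class_counts = Counter(c for _, c in docs)  # for the prior P(c)
    term_counts = defaultdict(Counter)          # term frequencies per class, for P(t|c)
    vocab = set()
    for tokens, c in docs:
        term_counts[c].update(tokens)
        vocab.update(tokens)
    return class_counts, len(docs), term_counts, vocab

def classify(tokens, model):
    class_counts, n_docs, term_counts, vocab = model
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        # log P(c) + sum over terms of log P(t_k|c), with add-one smoothing
        score = math.log(class_counts[c] / n_docs)
        total = sum(term_counts[c].values())
        for t in tokens:
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy training set (China vs. not-China):
docs = [("Chinese Beijing Chinese".split(), "china"),
        ("Chinese Chinese Shanghai".split(), "china"),
        ("Chinese Macao".split(), "china"),
        ("Tokyo Japan Chinese".split(), "not-china")]
model = train_multinomial_nb(docs)
print(classify("Chinese Chinese Chinese Tokyo Japan".split(), model))  # china
```

The test document contains Tokyo and Japan, but the three occurrences of Chinese outweigh them, which is exactly the behavior the multinomial model's use of term counts is meant to capture.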
  





Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
 Feature selection serves two main purposes. 
First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.  


Second, feature selection often increases classification accuracy by eliminating noise features.  


A common feature selection method is to compute A(t, c) as the expected mutual information (MI) of term t and class c. MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c. Formally:

I(U; C) = Σ_{et ∈ {1,0}} Σ_{ec ∈ {1,0}} P(U = et, C = ec) log2 [ P(U = et, C = ec) / (P(U = et) P(C = ec)) ]

where U is a random variable with et = 1 if the document contains t and et = 0 otherwise, and C is a random variable with ec = 1 if the document is in class c and ec = 0 otherwise.

As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.
For the multinomial model with MI feature selection, the peak occurs later, at 100 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial model takes the number of occurrences into account in parameter estimation and classification, and therefore exploits a larger number of features better than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.
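The expected-MI score above can be estimated from the four document counts of a term/class contingency table. The following is a minimal sketch (the function name and count naming are mine); it treats 0 · log 0 as 0, as is standard.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected MI of term t and class c, from document counts:
    n11: docs in c containing t,  n10: docs not in c containing t,
    n01: docs in c without t,     n00: docs not in c without t."""
    n = n11 + n10 + n01 + n00
    cells = [(n11, n11 + n10, n11 + n01),  # (joint, docs with t,    docs in c)
             (n10, n11 + n10, n10 + n00),  # (joint, docs with t,    docs not in c)
             (n01, n01 + n00, n11 + n01),  # (joint, docs without t, docs in c)
             (n00, n01 + n00, n10 + n00)]  # (joint, docs without t, docs not in c)
    mi = 0.0
    for joint, marg_t, marg_c in cells:
        if joint > 0:  # 0 * log(0) is taken as 0
            mi += (joint / n) * math.log2(n * joint / (marg_t * marg_c))
    return mi

# A term perfectly correlated with a balanced class carries 1 bit of information:
print(mutual_information(5, 0, 0, 5))  # 1.0
# A term independent of the class carries none:
print(mutual_information(2, 2, 2, 2))  # 0.0
```

Ranking all terms by this score per class and keeping the top k is the feature selection step the notes describe.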

Chapter 14: Vector space classification

The basic hypothesis in using the vector space model for classification is the contiguity hypothesis.
Contiguity hypothesis. Documents in the same class form a contiguous region and regions of different classes do not overlap. 




















Monday, March 30, 2015

Muddiest Point for week 11 (3.30)

When implementing adaptive IR, we should collect the user's profile and build personalized search queries or design an adaptive system for them, but how can we collect information about the user? And how can we collect a user profile without violating the user's privacy?

Saturday, March 28, 2015

Reading notes for week 11



Intelligent Information Retrieval:




User profiles for personalized information access are built by collecting information about users and representing it as a user information profile. Explicit techniques rely on personal information input by the users, typically via HTML forms; they are contrasted with implicit collection of user information through browser caches, proxy servers, browser agents, desktop agents, and search logs. More sophisticated personalization projects based on explicit feedback have focused on navigation. Implicit collection, however, only tracks browsing activity; among these approaches, proxy servers seem to be a good compromise between easily capturing information and not placing a large burden on the user.


In order to construct an individual user’s profile, information may be collected explicitly, through direct user intervention, or implicitly, through agents that monitor user activity.


Collecting Information About Users


Methods for User Identification: collecting user information and sharing it with a server via some protocol, logins, cookies, and session IDs.


Methods for User Information Collection


Explicit User Information Collection/ Implicit User Information Collection


Comparing Implicit and Explicit User Information Collection.










Personalized Web Exploration with Task Models:


TaskSieve is a Web search system that utilizes a relevance feedback based profile, called a “task model”, for personalization. Its innovations include flexible and user controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models:


1). Retrieve documents along with their relevance scores by submitting the user query to a search engine.


2). Calculate similarity scores between retrieved documents and the model.


3). Calculate the combined score of each document as alpha * Task_Model_Score + (1 - alpha) * Search_Score.


4). Re-rank the initial list by the combined score from step 3.


The idea of re-ranking is to promote documents that are more relevant to the user's task, as measured by their similarity to the task model.
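Steps 2-4 above can be sketched as a small re-ranking function. This is my own sketch, not TaskSieve's actual implementation: it assumes the search scores and task-model similarities have already been normalized to comparable ranges, and the function and variable names are mine.

```python
def rerank(results, task_model_scores, alpha=0.5):
    """results: list of (doc_id, search_score) from the search engine.
    task_model_scores: doc_id -> similarity between document and task model.
    alpha weights the task model against the original query score."""
    combined = [(doc_id,
                 alpha * task_model_scores.get(doc_id, 0.0)
                 + (1 - alpha) * search_score)
                for doc_id, search_score in results]
    # Promote documents more similar to the task model.
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

results = [("d1", 0.9), ("d2", 0.4)]
task_scores = {"d1": 0.0, "d2": 1.0}
print([d for d, _ in rerank(results, task_scores, alpha=0.8)])  # ['d2', 'd1']
print([d for d, _ in rerank(results, task_scores, alpha=0.0)])  # ['d1', 'd2']
```

Note how alpha controls the user-visible trade-off: alpha = 0 reduces to the original search ranking, while larger values let the task model dominate, which matches the paper's emphasis on user-controlled integration of queries and task models.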

Muddiest points for week 11

In link analysis, we define simple iterative logic such that good nodes won't point to bad nodes, and vice versa. But how can we define a bad node? If we use the search engine, would it only return the good nodes, regardless of whether a bad node is as relevant as the good nodes?

Friday, March 20, 2015

Muddiest Point for week 10

When Shuguang introduced the interface of Google, I wondered why Google uses "I'm Feeling Lucky". It has confused me ever since I first used Google.