Qian's Information Retrieval Blog
Friday, April 10, 2015
muddiest points for week 12
In k-means clustering, if we don't know in advance how many clusters the data has, how do we choose the value of k?
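One common heuristic for this is the elbow method: run k-means for a range of k values and look for the point where the within-cluster sum of squares stops dropping sharply. Below is a minimal sketch, assuming scikit-learn is available; `docs_matrix` and the function name are my own, not from the lecture:

```python
# A sketch of the elbow heuristic for choosing k; assumes scikit-learn, and
# docs_matrix is a hypothetical (n_docs x n_features) document-term array.
from sklearn.cluster import KMeans

def elbow_curve(docs_matrix, max_k=10):
    """Within-cluster sum of squares (inertia) for k = 1..max_k."""
    inertias = []
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(docs_matrix)
        inertias.append(km.inertia_)
    return inertias

# Plot the returned values against k and pick the "elbow" where the
# curve stops dropping sharply; that k is a reasonable cluster count.
```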
Friday, April 3, 2015
reading notes for week 12
Chapter 13 text classification and Naive Bayes
Classification in IR shows up in many applications, including:
1). The automatic detection of spam pages (which are then not included in the search engine index).
2). The automatic detection of sexually explicit content (which is included in search results only if the user turns an option such as SafeSearch off)
3). Sentiment detection or the automatic classification of a movie or product review as positive or negative.
4). Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
5). Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic.
This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as P(c|d) ∝ P(c) · ∏ P(tk|c), where the product runs over the tokens t1, …, tnd of d, P(tk|c) is the conditional probability of term tk occurring in a document of class c, and P(c) is the prior probability of class c.
An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model introduced earlier for probabilistic retrieval, which generates an indicator for each term of the vocabulary: 1 indicating presence of the term in the document and 0 indicating absence.
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption: attribute values are independent of each other given the class, so that P(d|c) = P(⟨t1, …, tnd⟩|c) = ∏ P(tk|c).
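To make the prior and conditional estimates concrete, here is a minimal multinomial NB trainer and classifier, following the add-one smoothing and log-space scoring described in the chapter; the function names and data layout are my own:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (class_label, token_list) training pairs."""
    vocab = {t for _, tokens in docs for t in tokens}
    classes = {c for c, _ in docs}
    prior, cond = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for label, tokens in docs if label == c]
        prior[c] = len(class_docs) / len(docs)              # estimate of P(c)
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts.values())
        for t in vocab:
            # Add-one (Laplace) smoothing so unseen terms never get P = 0.
            cond[c][t] = (counts[t] + 1) / (total + len(vocab))
    return vocab, prior, cond

def apply_multinomial_nb(vocab, prior, cond, tokens):
    """Return the class maximizing log P(c) + sum over tokens of log P(t|c)."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c])
        for t in tokens:
            if t in vocab:
                scores[c] += math.log(cond[c][t])
    return max(scores, key=scores.get)
```

Working in log space avoids floating-point underflow from multiplying many small probabilities.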
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes.
First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
Second, feature selection often increases classification accuracy by eliminating noise features.
As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.
For the multinomial model (MI feature selection), the peak occurs later, at 100 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial model takes the number of occurrences into account in parameter estimation and classification and therefore exploits a larger number of features better than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.
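For reference, the mutual information score used to rank features can be computed from the standard 2x2 document-count contingency table for a term and a class. A minimal sketch; the function name and argument layout are mine:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) for one term/class pair from document counts:
    n11: docs in the class that contain the term
    n10: docs outside the class that contain the term
    n01: docs in the class that lack the term
    n00: docs outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    def part(n_ab, n_a, n_b):
        # One cell's contribution; zero-count cells contribute nothing.
        return 0.0 if n_ab == 0 else (n_ab / n) * math.log2(n * n_ab / (n_a * n_b))
    return (part(n11, n11 + n10, n11 + n01) +
            part(n10, n11 + n10, n10 + n00) +
            part(n01, n01 + n00, n11 + n01) +
            part(n00, n01 + n00, n10 + n00))
```

To select features for a class, compute this score for every term in the vocabulary and keep the k highest-scoring terms.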
Chapter 14 Vector space classification
The basic hypothesis in using the vector space model for classification is the contiguity hypothesis: documents in the same class form a contiguous region, and regions of different classes do not overlap.
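One classifier that directly exploits this hypothesis is Rocchio (nearest-centroid) classification: represent each class by the centroid of its training documents and assign a test document to the class with the nearest centroid. A minimal sketch with NumPy; the array layout and names are my assumptions, and in practice the vectors would be tf-idf weighted:

```python
import numpy as np

def train_centroids(X, y):
    """X: (n_docs, n_terms) array of document vectors; y: one label per row.
    Each class is represented by the centroid of its training documents."""
    y = np.asarray(y)
    return {c: X[y == c].mean(axis=0) for c in set(y)}

def classify(centroids, doc_vec):
    """Assign doc_vec to the class with the nearest centroid (Euclidean)."""
    return min(centroids, key=lambda c: np.linalg.norm(doc_vec - centroids[c]))
```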
Monday, March 30, 2015
Muddiest Point for week 11 (3.30)
When implementing adaptive IR, we need to collect the user's profile and build personalized search queries or design an adaptive system for the user. But how can we collect information about the user? And how can we build a user profile without violating the user's privacy?
Saturday, March 28, 2015
Reading notes for week 11
Intelligent Information Retrieval:
The chapter on user profiles for personalized information access shows how systems collect information about users and represent it as user information profiles. Explicit techniques rely on personal information entered by the users themselves, typically via HTML forms; they are contrasted with implicit techniques that collect user information through browser caches, proxy servers, browser agents, desktop agents, and search logs. More sophisticated personalization projects based on explicit feedback have focused on navigation. Among the implicit approaches, which only track browsing activity, proxy servers seem to be a good compromise: they capture information easily without placing a large burden on the user.
In order to construct an individual user’s profile, information may be collected explicitly, through direct user intervention, or implicitly, through agents that monitor user activity.
Collecting Information About Users
Methods for User Identification: software that collects user information and shares it with a server via some protocol, logins, cookies, and session IDs
Methods for User Information Collection
Explicit User Information Collection/ Implicit User Information Collection
Comparing Implicit and Explicit User Information Collection.
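To illustrate the implicit side, here is a toy sketch that builds a term-weight profile from logged search queries; the log format and the plain term-frequency weighting are my assumptions, not something the paper specifies:

```python
from collections import Counter

def build_profile(query_log):
    """query_log: queries implicitly collected for one user (e.g. from a
    proxy server or search log). Returns a normalized term-weight profile."""
    counts = Counter(t.lower() for q in query_log for t in q.split())
    total = sum(counts.values())
    return {term: freq / total for term, freq in counts.items()}

# Example: build_profile(["naive bayes", "bayes text classification"])
# -> {'naive': 0.2, 'bayes': 0.4, 'text': 0.2, 'classification': 0.2}
```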
Personalized Web Exploration with Task Models:
TaskSieve is a Web search system that utilizes a relevance-feedback-based profile, called a "task model", for personalization. Its innovations include flexible, user-controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models:
1). Retrieve documents along with their relevance scores by submitting the user query to a search engine.
2). Calculate similarity scores between retrieved documents and the model.
3). Calculate the combined score of each document as alpha * Task_Model_Score + (1 - alpha) * Search_Score.
4). Re-rank the initial list by the combined score from step 3.
The idea of re-ranking is to promote documents that are more relevant to the user's task, as measured by their similarity to the task model.
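The four steps map directly onto a small re-ranking routine. A minimal sketch, assuming each retrieved document already carries a normalized search score and task-model similarity score; the data layout is my assumption:

```python
def rerank(results, alpha=0.5):
    """results: (doc_id, search_score, task_model_score) triples, both scores
    assumed normalized to [0, 1]. alpha controls how much the task model
    outweighs the original search ranking."""
    combined = [(alpha * tm + (1 - alpha) * s, doc) for doc, s, tm in results]
    return [doc for _, doc in sorted(combined, key=lambda p: p[0], reverse=True)]
```

With alpha = 0 this reduces to the original search ranking; with alpha = 1 documents are ordered purely by task-model similarity.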
Muddiest points for week 11
In link analysis, we define a simple iterative logic: good nodes won't point to bad nodes, and vice versa. But how can we define a bad node? If we use a search engine, will it return only the good nodes, even when a bad node is as relevant as the good ones?
Friday, March 20, 2015
Muddiest Point for week 10
When Shuguang introduced the Google interface, I wondered why Google uses the "I'm Feeling Lucky" button. It has confused me ever since I first used Google.