Friday, April 3, 2015

Reading notes for week 12

Chapter 13 Text classification and Naive Bayes

The notion of classification is very general and has many applications within and beyond information retrieval (IR). For instance, in computer vision, a classifier may be used to divide images into classes such as landscape, portrait, and neither.  

Classification in IR arises in many applications, including the following:


1) The automatic detection of spam pages (which are then not included in the search engine index).
2) The automatic detection of sexually explicit content (which is included in search results only if the user turns an option such as SafeSearch off).
3) Sentiment detection, or the automatic classification of a movie or product review as positive or negative.
4) Personal email sorting. A user may have folders such as talk announcements, electronic bills, and email from family and friends, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
5) Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic.


Using a learning method or learning algorithm, we wish to learn a classifier or classification function γ that maps documents to classes: γ: X → C, where X is the document space and C = {c1, c2, ..., cJ} is a fixed set of classes.

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as

P(c|d) ∝ P(c) · ∏ P(t_k|c),   where the product runs over the token positions 1 ≤ k ≤ n_d of d,

P(t_k|c) is the conditional probability of term t_k occurring in a document of class c, and P(c) is the prior probability of a document occurring in class c.
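A minimal Python sketch may make the estimates concrete. The function names, the add-one smoothing choice, and the toy China/Japan documents are my own illustrative choices, modeled on the book's running example, not code from the book:

    import math
    from collections import Counter, defaultdict

    def train_multinomial_nb(docs):
        # docs: list of (token_list, class_label) pairs
        vocab = set()
        doc_counts = Counter()                 # number of training docs per class
        term_counts = defaultdict(Counter)     # term frequencies per class
        for tokens, c in docs:
            doc_counts[c] += 1
            term_counts[c].update(tokens)
            vocab.update(tokens)
        priors = {c: doc_counts[c] / len(docs) for c in doc_counts}
        cond_probs = {}
        for c in doc_counts:
            total = sum(term_counts[c].values())
            # add-one (Laplace) smoothing over the vocabulary
            cond_probs[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab))
                             for t in vocab}
        return priors, cond_probs, vocab

    def apply_multinomial_nb(priors, cond_probs, vocab, tokens):
        # sum log probabilities to avoid floating-point underflow
        scores = {}
        for c in priors:
            score = math.log(priors[c])
            for t in tokens:
                if t in vocab:                 # terms unseen in training are ignored
                    score += math.log(cond_probs[c][t])
            scores[c] = score
        return max(scores, key=scores.get)

    # toy training set in the spirit of the book's China example
    train = [
        (["chinese", "beijing", "chinese"], "china"),
        (["chinese", "chinese", "shanghai"], "china"),
        (["chinese", "macao"], "china"),
        (["tokyo", "japan", "chinese"], "not-china"),
    ]
    priors, cond_probs, vocab = train_multinomial_nb(train)
    print(apply_multinomial_nb(priors, cond_probs, vocab,
                               ["chinese", "chinese", "chinese", "tokyo", "japan"]))

On this toy data the classifier assigns the test document to china, because the three occurrences of "chinese" outweigh the single occurrences of "tokyo" and "japan".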

An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model discussed earlier in the book, which generates an indicator for each term of the vocabulary: 1 indicates presence of the term in the document and 0 indicates absence.
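For contrast, here is a similarly hedged Bernoulli sketch (again the names and the add-one smoothing are my own choices). The key differences are that only presence or absence of a term is recorded, and that terms absent from the document also contribute to the score:

    import math

    def train_bernoulli_nb(docs):
        vocab = {t for tokens, _ in docs for t in tokens}
        classes = {c for _, c in docs}
        priors, cond_probs = {}, {}
        for c in classes:
            class_docs = [set(tokens) for tokens, label in docs if label == c]
            priors[c] = len(class_docs) / len(docs)
            # P(t|c): fraction of class-c documents containing t, add-one smoothed
            cond_probs[c] = {t: (sum(t in d for d in class_docs) + 1) / (len(class_docs) + 2)
                             for t in vocab}
        return priors, cond_probs, vocab

    def apply_bernoulli_nb(priors, cond_probs, vocab, tokens):
        present = set(tokens) & vocab
        scores = {}
        for c in priors:
            score = math.log(priors[c])
            for t in vocab:
                p = cond_probs[c][t]
                # unlike the multinomial model, absent terms contribute (1 - p)
                score += math.log(p if t in present else 1.0 - p)
            scores[c] = score
        return max(scores, key=scores.get)

Because the number of occurrences of a term plays no role here, the two models can classify the same document differently.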


To reduce the number of parameters, we make the Naive Bayes conditional independence assumption: attribute values are independent of each other given the class,

P(d|c) = P(⟨t_1, ..., t_{n_d}⟩ | c) = ∏ P(X_k = t_k | c),   with the product over 1 ≤ k ≤ n_d,

where X_k is the random variable for position k in the document.

Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. Second, feature selection often increases classification accuracy by eliminating noise features.

A common feature selection method is to compute A(t, c) as the expected mutual information (MI) of term t and class c. MI measures how much information the presence or absence of a term contributes to making the correct classification decision on c. Formally:

I(U; C) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(U = e_t, C = e_c) · log2 [ P(U = e_t, C = e_c) / (P(U = e_t) · P(C = e_c)) ]

where U is a random variable that takes the value e_t = 1 if the document contains term t and e_t = 0 otherwise, and C takes the value e_c = 1 if the document is in class c and e_c = 0 otherwise.
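In practice the probabilities are estimated from document counts in a 2x2 contingency table. A small, self-contained sketch (the function name and the toy counts below are hypothetical, not taken from the book):

    import math

    def mutual_information(n11, n10, n01, n00):
        # n11: docs containing the term and in the class
        # n10: docs containing the term, not in the class
        # n01: docs not containing the term, in the class
        # n00: docs neither containing the term nor in the class
        n = n11 + n10 + n01 + n00
        n1_, n0_ = n11 + n10, n01 + n00     # term present / term absent
        n_1, n_0 = n11 + n01, n10 + n00     # in class / not in class
        mi = 0.0
        for n_joint, n_term, n_class in [(n11, n1_, n_1), (n10, n1_, n_0),
                                         (n01, n0_, n_1), (n00, n0_, n_0)]:
            if n_joint > 0:                  # treat 0 * log 0 as 0
                mi += (n_joint / n) * math.log2(n * n_joint / (n_term * n_class))
        return mi

    # hypothetical counts for one (term, class) pair
    print(mutual_information(n11=60, n10=40, n01=940, n00=8960))

To select k features for a class, one would compute this score for every term in the vocabulary and keep the k terms with the largest values.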

As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.
For the multinomial model (with MI feature selection), the peak occurs later than for the Bernoulli model, at about 100 features, and its effectiveness recovers somewhat when all features are used. The reason is that the multinomial model takes the number of occurrences into account in parameter estimation and classification and therefore exploits a larger number of features better than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.

Chapter 14 Vector space classification

The basic hypothesis in using the vector space model for classification is the contiguity hypothesis:
Contiguity hypothesis. Documents in the same class form a contiguous region, and regions of different classes do not overlap.