Friday, April 10, 2015
muddiest points for week 12
In k-means clustering, if we don't know how many clusters there are, how do we choose the value of k?
Friday, April 3, 2015
reading notes for week 12
Chapter 13 text classification and Naive Bayes
Classification in IR has many applications, including the following:
1). The automatic detection of spam pages (which then are not included in the search engine index).
2). The automatic detection of sexually explicit content (which is included in search results only if the user turns an option such as SafeSearch off)
3). Sentiment detection or the automatic classification of a movie or product review as positive or negative.
4). Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
5). Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic.
This type of learning is called supervised learning because a supervisor (the
human who defines the classes and labels training documents) serves as a
teacher directing the learning process. We denote the supervised learning
method by Γ and write Γ(D) = γ. The learning method Γ takes the training
set D as input and returns the learned classification function γ.
The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as P(c|d) ∝ P(c) · ∏ P(tk|c), where the product runs over the tokens t1, ..., tnd of d, P(tk|c) is the conditional probability of term tk occurring in a document of class c, and P(c) is the prior probability of class c.
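Below is a minimal sketch of how training and applying such a model might look; the input format, tokenization, and add-one smoothing are my own assumptions for illustration, not taken from the book. Here train_multinomial_nb plays the role of the learning method Γ, and apply_multinomial_nb plays the role of the learned classifier γ.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    # docs: list of (class_label, token_list) pairs -- an assumed input format
    n_docs = len(docs)
    class_doc_counts = Counter(c for c, _ in docs)
    term_counts = defaultdict(Counter)   # per-class term frequencies
    vocab = set()
    for c, tokens in docs:
        term_counts[c].update(tokens)
        vocab.update(tokens)
    prior = {c: n / n_docs for c, n in class_doc_counts.items()}
    cond_prob = {}
    for c in class_doc_counts:
        total = sum(term_counts[c].values())
        # add-one (Laplace) smoothing keeps unseen terms from zeroing out the product
        cond_prob[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab))
                        for t in vocab}
    return vocab, prior, cond_prob

def apply_multinomial_nb(model, tokens):
    vocab, prior, cond_prob = model
    scores = {}
    for c in prior:
        # work in log space to avoid floating-point underflow on long documents
        score = math.log(prior[c])
        for t in tokens:
            if t in vocab:
                score += math.log(cond_prob[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

model = train_multinomial_nb([("china", ["chinese", "beijing", "chinese"]),
                              ("not-china", ["tokyo", "japan", "chinese"])])
print(apply_multinomial_nb(model, ["chinese", "chinese", "tokyo"]))  # prints "china" for this toy example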
An alternative to the multinomial model is the multivariate Bernoulli model, or Bernoulli model. It is equivalent to the binary independence model introduced earlier, which generates an indicator for each term of the vocabulary: 1 indicating the presence of the term in the document and 0 indicating its absence.
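To make the contrast concrete, here is a rough sketch of how the Bernoulli model scores a document (my own illustration; cond_prob[c][t] is assumed to estimate the fraction of class-c documents that contain t). Every vocabulary term contributes to the score, whether or not it occurs in the document, whereas the multinomial model above only looks at the terms that actually occur.

import math

def apply_bernoulli_nb(vocab, prior, cond_prob, tokens):
    # cond_prob[c][t]: estimated probability that a document of class c contains term t
    present = set(tokens)
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:                       # iterate over the entire vocabulary
            if t in present:
                score += math.log(cond_prob[c][t])
            else:
                # absent terms also contribute in the Bernoulli model
                score += math.log(1.0 - cond_prob[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

Parameter estimation differs accordingly: the Bernoulli probabilities are estimated from document frequencies (with smoothing) rather than from term frequencies.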
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption: given the class, attribute values are independent of each other, so that P(⟨t1, ..., tnd⟩ | c) = ∏ P(tk|c), with the product again running over the tokens of d.
Feature selection is the process of selecting a subset of the terms occurring
in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes.
First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
Second, feature selection often increases classification accuracy by eliminating noise features.
As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.
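As an illustration of one common selection criterion, here is a small sketch of mutual-information feature selection over document counts; the function names and the input format are my own assumptions.

import math

def mutual_information(n11, n10, n01, n00):
    # n11: docs in the class containing the term, n10: docs outside the class containing it,
    # n01: docs in the class without the term,    n00: docs outside the class without it
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_tc > 0:   # 0 * log 0 is taken to be 0
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def select_features(term_counts, k):
    # term_counts: {term: (n11, n10, n01, n00)} for one class -- an assumed input format
    ranked = sorted(term_counts, key=lambda t: mutual_information(*term_counts[t]), reverse=True)
    return ranked[:k]   # keep the k terms with the highest mutual information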
For the multinomial model with MI feature selection, the effectiveness peak occurs later than for the Bernoulli model, at around 100 features, and effectiveness recovers somewhat when all features are used. The reason is that the multinomial model takes the number of term occurrences into account in parameter estimation and classification, and therefore exploits a larger feature set better than the Bernoulli model does. Regardless of the differences between the two models, using a carefully selected subset of the features results in better effectiveness than using all features.
Chapter 14 Vector space classification
The basic hypothesis in using the vector space model for classification is the contiguity hypothesis: documents in the same class form a contiguous region, and regions of different classes do not overlap.
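One simple classifier that relies directly on this hypothesis is nearest-centroid (Rocchio) classification: represent each class by the centroid of its training vectors and assign a test document to the class with the closest centroid. A minimal sketch, assuming documents are already given as numeric (e.g. tf-idf) vectors and using plain Euclidean distance as a simplification:

import numpy as np

def train_rocchio(docs):
    # docs: list of (class_label, vector) pairs -- an assumed input format
    centroids = {}
    for c in set(label for label, _ in docs):
        vectors = np.array([v for label, v in docs if label == c])
        centroids[c] = vectors.mean(axis=0)   # class centroid = mean of its document vectors
    return centroids

def apply_rocchio(centroids, vector):
    # assign the document to the class whose centroid is nearest
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - np.asarray(vector)))

If the contiguity hypothesis holds, the boundaries induced by these centroids separate the classes well; if the classes overlap, a simple decision rule like this will necessarily misclassify some documents.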