The essential feature that led to the explosive growth of the web – decentralized content publishing with essentially no central control of authorship –
turned out to be the biggest challenge for web search engines in their quest to
index and retrieve this content. And the search engine developed because of the spam detection in the early time.
Advertising as the economic model
Goto pioneered this model: for every query term q, it accepted bids from companies who wanted their web page shown on the query q. In response to the query q, Goto would return the pages of all advertisers who bid for q, ordered by their bids. Furthermore, when the user clicked on one of the returned results, the corresponding advertiser would make a payment to Goto.
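A toy sketch of this mechanism in Python (the advertiser names and bid amounts are invented for illustration):

    bids = {"flowers": [("aaa-flowers.example", 0.10), ("bbb-blooms.example", 0.25)]}

    def goto_results(query):
        """Return the pages of all advertisers who bid on the query, highest bid first."""
        return [page for page, bid in sorted(bids.get(query, []), key=lambda pb: pb[1], reverse=True)]

    def on_click(query, page):
        """A click charges the advertiser its bid for that query."""
        return dict(bids[query])[page]

    print(goto_results("flowers"))                    # ['bbb-blooms.example', 'aaa-flowers.example']
    print(on_click("flowers", "bbb-blooms.example"))  # 0.25 paid to Goto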
The search user experience
It is crucial that we understand the users of web search as well: the more user traffic a web search engine can attract, the more revenue it stands to earn from sponsored search.
Web search queries fall into three broad categories: (1) informational, (2) navigational, and (3) transactional. Informational queries seek general information on a broad topic; navigational queries seek the website of a particular entity; transactional queries are a prelude to a transaction, such as a purchase or a download.
Index size and estimation
It is difficult to reason about the fraction of the Web indexed by a search engine: the number of static web pages is enormous, and dynamically generated pages are effectively infinite in number.
Even the relative sizes of two engines' indexes are imprecise quantities, for two reasons:
1) In response to queries, a search engine can return web pages whose contents it has not (fully or even partially) indexed.
2) Search engines generally organize their indexes in various tiers and partitions, not all of which are examined on every search.
Thus there is no single, well-defined measure of index size.
Construct random page samples in one of four ways (a sampling-based size comparison is sketched after the list):
1. Random searches
2. Random IP addresses
3. Random walks
4. Random queries
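One way to use such samples is a capture-recapture comparison of two engines E1 and E2: if a fraction x of pages sampled from E1 are also indexed by E2, and a fraction y of pages sampled from E2 are also indexed by E1, then x·|E1| and y·|E2| both estimate |E1 ∩ E2|, so |E1|/|E2| ≈ y/x. A minimal sketch, assuming hypothetical sampling and probe helpers:

    def relative_index_size(sample_e1, sample_e2, in_e1, in_e2):
        """Capture-recapture estimate of |E1| / |E2|.

        sample_e1, sample_e2: random pages drawn from each engine's index.
        in_e1(page), in_e2(page): probes reporting whether the *other*
        engine has indexed the page (hypothetical stand-ins).
        """
        x = sum(map(in_e2, sample_e1)) / len(sample_e1)  # fraction of E1's sample found in E2
        y = sum(map(in_e1, sample_e2)) / len(sample_e2)  # fraction of E2's sample found in E1
        return y / x  # x*|E1| and y*|E2| both estimate |E1 ∩ E2|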
Near-duplicates and shingling
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages.
The simplest approach to detecting duplicates is to compute, for each web
page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that
page. Then, whenever the fingerprints of two web pages are equal, we test
whether the pages themselves are equal and if so declare one of them to be a
duplicate copy of the other.
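A minimal sketch of this scheme, using the first 8 bytes of SHA-1 as a stand-in for whatever 64-bit digest an engine actually uses:

    import hashlib

    def fingerprint(page_text):
        """Succinct 64-bit digest of the characters on a page."""
        return int.from_bytes(hashlib.sha1(page_text.encode()).digest()[:8], "big")

    def find_duplicates(pages):
        """pages maps URL -> page text; returns (kept, duplicate) URL pairs."""
        seen = {}   # fingerprint -> URL of the first page with that fingerprint
        dups = []
        for url, text in pages.items():
            fp = fingerprint(text)
            # Equal fingerprints only suggest duplication; verify that the
            # pages themselves are equal before declaring a duplicate copy.
            if fp in seen and pages[seen[fp]] == text:
                dups.append((seen[fp], url))
            else:
                seen.setdefault(fp, url)
        return dups

    pages = {"u1": "some text", "u2": "some text", "u3": "other text"}
    print(find_duplicates(pages))  # [('u1', 'u2')]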
Figure: illustration of shingle sketches. Two documents pass through four stages of shingle sketch computation. In the first step (top row), a 64-bit hash is applied to each shingle from each document to obtain H(d1) and H(d2) (circles). Next, a random permutation Π permutes H(d1) and H(d2), obtaining Π(d1) and Π(d2) (squares).
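A compact sketch of this computation (k = 4 word shingles and 200 permutations are illustrative choices; Python's built-in hash stands in for the 64-bit hash H):

    import random

    P = (1 << 89) - 1  # a Mersenne prime larger than any 64-bit hash value

    def shingles(text, k=4):
        """H(d): a 64-bit hash of each k-shingle (k consecutive words) of d."""
        words = text.split()
        return {hash(tuple(words[i:i + k])) & 0xFFFFFFFFFFFFFFFF
                for i in range(len(words) - k + 1)}

    def sketch(hashed, perms):
        """Apply each random permutation Π, approximated by x -> (a*x + b) mod P,
        and keep the minimum permuted value for each."""
        return [min((a * x + b) % P for x in hashed) for a, b in perms]

    def jaccard_estimate(sk1, sk2):
        """The fraction of permutations whose minima agree estimates the
        Jaccard coefficient of the two shingle sets."""
        return sum(m1 == m2 for m1, m2 in zip(sk1, sk2)) / len(sk1)

    perms = [(random.randrange(1, P), random.randrange(P)) for _ in range(200)]
    d1 = shingles("a rose is a rose is a rose")
    d2 = shingles("a rose is a flower which is a rose")
    print(jaccard_estimate(sketch(d1, perms), sketch(d2, perms)))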
21. Link analysis
This chapter focuses on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query.
PageRank
Definition: PageRank, the first link analysis technique, assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank.
In assigning a PageRank score to each node of the web graph, we use the
teleport operation in two ways:
(1) When at a node with no out-links, the surfer invokes the teleport operation.
(2) At any node that has outgoing links, the surfer invokes the teleport operation with probability α and the standard random walk with probability 1 − α, where 0 < α < 1 (a typical value might be α = 0.1); a power-iteration sketch of the resulting walk follows this list.
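A minimal power-iteration sketch of this random surfer, assuming a tiny hand-made graph and the illustrative value α = 0.1:

    import numpy as np

    def pagerank(adj, alpha=0.1, iters=100):
        """adj[i, j] = 1 iff page i links to page j; returns the PageRank vector."""
        n = adj.shape[0]
        P = np.empty((n, n))
        for i in range(n):
            out = adj[i].sum()
            if out == 0:
                P[i] = 1.0 / n   # rule (1): dead end, always teleport
            else:
                # rule (2): follow a link w.p. 1 - alpha, teleport w.p. alpha
                P[i] = (1 - alpha) * adj[i] / out + alpha / n
        x = np.full(n, 1.0 / n)  # start the surfer at a uniformly random page
        for _ in range(iters):
            x = x @ P            # one step of the teleporting random walk
        return x

    # Example: page 0 -> 1, page 1 -> 0 and 2, page 2 has no out-links.
    adj = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 0, 0]])
    print(pagerank(adj))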
Markov chains
A Markov chain is characterized by an N × N transition probability matrix P, each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1.
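For instance, a two-state chain (the numbers are invented) can be checked and stepped as follows:

    import numpy as np

    P = np.array([[0.9, 0.1],   # row i holds the probabilities of moving from state i
                  [0.5, 0.5]])
    assert np.allclose(P.sum(axis=1), 1.0)  # the entries in each row add up to 1
    x = np.array([1.0, 0.0])    # current probability distribution over states
    print(x @ P)                # distribution after one step: [0.9, 0.1]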
In this example we consider a user whose interests are 60% sports and 40% politics. If the teleportation probability is 10%, this user is modeled as teleporting to sports pages 6% of the time and to politics pages 4% of the time.
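A sketch of the resulting teleport distribution, assuming (hypothetically) three sports pages, two politics pages, and a uniform split within each topic:

    import numpy as np

    pages = ["sports1", "sports2", "sports3", "politics1", "politics2"]  # hypothetical pages
    topic_pages = {"sports": [0, 1, 2], "politics": [3, 4]}
    interest = {"sports": 0.6, "politics": 0.4}
    teleport_prob = 0.10

    v = np.zeros(len(pages))
    for topic, idxs in topic_pages.items():
        # 10% teleport x 60% sports interest = 6% mass on sports pages;
        # 10% teleport x 40% politics interest = 4% mass on politics pages.
        v[idxs] = teleport_prob * interest[topic] / len(idxs)
    print(v, v.sum())  # v.sum() == 0.1; the other 90% follows ordinary out-links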