Monday, March 30, 2015
Muddiest Point for Week 11 (3.30)
When implementing adaptive IR, we need to collect a user profile and then build personalized search queries or design an adaptive system for the user. But how can we collect information about the user? And how can we collect a user profile without violating the user's privacy?
Saturday, March 28, 2015
Reading notes for week 11
Intelligent Information Retrieval:
In user profiles for personalized information access, we collect information about users and use it to represent and build user profiles. Explicit collection techniques rely on personal information entered by the users themselves, typically via HTML forms; they are contrasted with implicit collection, which gathers user information from browser caches, proxy servers, browser agents, desktop agents, and search logs. More sophisticated personalization projects based on explicit feedback have focused on navigation. Implicit collection, by contrast, only tracks browsing activity; among the implicit techniques, proxy servers seem to be a good compromise between easily capturing information and not placing a large burden on the user.
In order to construct an individual user’s profile, information may be collected explicitly, through direct user intervention, or implicitly, through agents that monitor user activity.
Collecting Information About Users
Methods for User Identification: collecting user information and sharing it with a server via some protocol; examples include logins, cookies, and session IDs.
Methods for User Information Collection
Explicit User Information Collection / Implicit User Information Collection
Comparing Implicit and Explicit User Information Collection.
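To make the implicit side concrete, here is a minimal sketch (my own illustration, not from the reading) of building a simple term-frequency user profile from the text of pages a user has visited; the tokenization and the idea of feeding it visited-page text are assumptions for illustration.

```python
from collections import Counter
import re

def build_implicit_profile(visited_pages):
    """Build a naive user profile as a term-frequency vector
    from the text of pages the user has visited (implicit collection)."""
    profile = Counter()
    for text in visited_pages:
        # Lowercase and split on non-letters; a real system would also
        # remove stopwords and weight terms by dwell time or recency.
        terms = re.findall(r"[a-z]+", text.lower())
        profile.update(terms)
    return profile

# Example: two visited pages yield a profile dominated by "python".
pages = ["Python tutorial for beginners", "Advanced Python tips"]
print(build_implicit_profile(pages).most_common(3))
```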
Personalized Web Exploration with Task Models:
TaskSieve is a Web search system that utilizes a relevance-feedback-based profile, called a "task model", for personalization. Its innovations include flexible and user-controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models:
1). Retrieve documents along with their relevance scores by submitting the user query to a search engine.
2). Calculate similarity scores between retrieved documents and the model.
3). Calculate a combined score for each document as Combined_Score = alpha * Task_Model_Score + (1 − alpha) * Search_Score.
4). Re-rank the initial list by the combined score from step 3.
The idea of re-ranking is to promote documents that are more relevant to the user's task, as measured by their similarity to the task model.
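A minimal sketch of this re-ranking step, assuming the search engine and task model already give us per-document scores (the function and variable names are illustrative, not TaskSieve's actual API):

```python
def rerank(results, task_model_scores, alpha=0.5):
    """Re-rank search results by a convex combination of the
    task-model similarity score and the original search score.

    results: list of (doc_id, search_score) pairs
    task_model_scores: dict mapping doc_id -> similarity to the task model
    alpha: weight on the task model (alpha=0 keeps the original ranking)
    """
    combined = []
    for doc_id, search_score in results:
        task_score = task_model_scores.get(doc_id, 0.0)
        score = alpha * task_score + (1 - alpha) * search_score
        combined.append((doc_id, score))
    # Higher combined score ranks first.
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

# Example: d2 is promoted because it matches the task model well.
results = [("d1", 0.9), ("d2", 0.7)]
task_scores = {"d1": 0.1, "d2": 0.9}
print(rerank(results, task_scores, alpha=0.5))
```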
Muddiest points for week 11
In link analysis, we define simple iterative logic such as "good nodes won't point to bad nodes", and vice versa. But how can we define a bad node? If we use a search engine, will it return only the good nodes, even when a bad node is just as relevant as the good ones?
Friday, March 20, 2015
Muddiest Point for week 10
When Shuguang introduced Google's interface, I wondered why Google has the "I'm Feeling Lucky" button. It has confused me ever since I first used Google.
Reading Notes for Week 10
19. Web search basics
CPC model: clicking on the advertisement leads the user to a web page set up by the advertiser, where the user is induced to make a purchase.
The essential feature that led to the explosive growth of the web – decentralized content publishing with essentially no central control of authorship – turned out to be the biggest challenge for web search engines in their quest to index and retrieve this content. Search engines consequently had to develop spam detection early on.
Advertising as the economic model
Goto: for every query term q, it accepted bids from companies who wanted their web page shown on the query q. In response to the query q, Goto would return the pages of all advertisers who bid for q, ordered by their bids. Furthermore, when the user clicked on one of the returned results, the corresponding advertiser would make a payment to Goto.
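A minimal sketch of Goto's scheme as described above (the data structures and names are my illustration):

```python
def goto_results(bids, query):
    """Return advertiser pages bidding on a query, ordered by bid (highest first).

    bids: dict mapping query term -> list of (advertiser_page, bid_amount)
    """
    return sorted(bids.get(query, []), key=lambda ad: ad[1], reverse=True)

def on_click(advertiser_page, bid_amount, revenue):
    """On each click, the advertiser pays Goto its bid (pay-per-click)."""
    revenue[advertiser_page] = revenue.get(advertiser_page, 0.0) + bid_amount

bids = {"flowers": [("florist-a.com", 0.50), ("florist-b.com", 0.75)]}
revenue = {}
ranked = goto_results(bids, "flowers")  # florist-b.com first (bid 0.75)
on_click(*ranked[0], revenue)           # florist-b.com pays Goto 0.75
print(ranked, revenue)
```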
The search user experience
It is crucial that we understand the users of web search as well; the more user traffic a web search engine can attract, the more revenue it stands to earn from sponsored search.
There are three broad categories of web search queries: (1) informational, (2) navigational, (3) transactional.
Index size and estimation
It is difficult to reason about the fraction of the Web indexed by a search engine: the number of web pages is very large, and dynamic pages make the Web effectively infinite.
Estimating the relative sizes of two indexes is imprecise, because:
1). In response to queries, a search engine can return web pages whose contents it has not (fully or even partially) indexed.
2). Search engines generally organize their indexes in various tiers and partitions, not all of which are examined on every search.
As a result, there is no single measure of index size.
Constructing random page samples:
1. Random searches
2. Random IP addresses
3. Random walks
4. Random queries
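Such samples support the classic capture-recapture estimate of relative index size: sample pages from engine E1, check whether engine E2 also indexes them, and vice versa. A minimal sketch, assuming we already have uniform samples and a containment test for each engine (both are the hard part in practice):

```python
def relative_index_size(sample_from_e1, sample_from_e2, in_e1, in_e2):
    """Estimate |E1| / |E2| by capture-recapture.

    sample_from_e1: pages sampled uniformly from engine E1's index
    sample_from_e2: pages sampled uniformly from engine E2's index
    in_e1, in_e2:  functions testing whether an engine indexes a page
    """
    # Fraction of E1's pages that E2 also indexes, and vice versa.
    x = sum(in_e2(p) for p in sample_from_e1) / len(sample_from_e1)
    y = sum(in_e1(p) for p in sample_from_e2) / len(sample_from_e2)
    # x * |E1| ≈ |E1 ∩ E2| ≈ y * |E2|, so |E1| / |E2| ≈ y / x.
    return y / x
```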
Near-duplicates and shingling
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages.
The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other.
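A minimal sketch of this fingerprint check (using a truncated SHA-256 as the 64-bit digest; the hash choice is my assumption, not the book's):

```python
import hashlib

def fingerprint(page_text):
    """64-bit digest of a page's characters."""
    digest = hashlib.sha256(page_text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

seen = {}  # fingerprint -> page text

def is_duplicate(page_text):
    fp = fingerprint(page_text)
    # Equal fingerprints only suggest equality; confirm by comparing pages.
    if fp in seen and seen[fp] == page_text:
        return True
    seen[fp] = page_text
    return False
```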
Figure caption from the book: illustration of shingle sketches. Two documents go through four stages of shingle sketch computation. In the first step, we apply a 64-bit hash to each shingle from each document to obtain H(d1) and H(d2). Next, we apply a random permutation Π to permute H(d1) and H(d2), obtaining Π(d1) and Π(d2).
21. Link analysis
In this chapter the author focuses on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query.
PageRank
Definition: PageRank, the first technique for link analysis, assigns to every node in the web graph a numerical score between 0 and 1.
In assigning a PageRank score to each node of the web graph, we use the teleport operation in two ways:
(1) When at a node with no out-links, the surfer invokes the teleport operation.
(2) At any node that has outgoing links, the surfer invokes the teleport operation with probability 0 < α < 1 and the standard random walk with probability 1 − α.
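A minimal power-iteration sketch of PageRank with teleporting, assuming a small adjacency-list graph (the parameter names are illustrative):

```python
def pagerank(graph, alpha=0.15, iters=50):
    """Power iteration for PageRank with teleporting.

    graph: dict mapping node -> list of out-neighbors
    alpha: teleport probability (rule 2 above); nodes with no
           out-links teleport with probability 1 (rule 1 above)
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: 0.0 for u in nodes}
        for u in nodes:
            out = graph[u]
            if out:  # follow a random out-link with probability 1 - alpha
                for v in out:
                    new[v] += (1 - alpha) * rank[u] / len(out)
                teleport_mass = alpha * rank[u]
            else:    # dangling node: teleport with probability 1
                teleport_mass = rank[u]
            for v in nodes:  # spread teleport mass uniformly
                new[v] += teleport_mass / n
        rank = new
    return rank

# Tiny example: three pages; the scores sum to 1.
g = {"a": ["b"], "b": ["a", "c"], "c": []}
print(pagerank(g))
```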
Markov chains
A Markov chain is characterized by an N × N transition probability matrix P, each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1.
In this example we consider a user whose interests are 60% sports and 40% politics. If the teleportation probability is 10%, this user is modeled as teleporting 6% of the time to sports pages and 4% of the time to politics pages.
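A minimal sketch of how such a personalized teleport distribution can be built (the page labels and function are my illustration of the arithmetic above, not the book's code):

```python
def teleport_distribution(page_topics, interests, teleport_prob=0.10):
    """Split the total teleport probability across pages in proportion
    to the user's topic interests (e.g. 10% * 60% = 6% to sports pages)."""
    per_topic_pages = {}
    for page, topic in page_topics.items():
        per_topic_pages.setdefault(topic, []).append(page)
    dist = {}
    for topic, pages in per_topic_pages.items():
        mass = teleport_prob * interests.get(topic, 0.0)
        for page in pages:
            dist[page] = mass / len(pages)  # uniform within a topic
    return dist

pages = {"espn.com": "sports", "nba.com": "sports", "politico.com": "politics"}
print(teleport_distribution(pages, {"sports": 0.6, "politics": 0.4}))
# Sports pages share 6% (0.03 each); the politics page gets the full 4%.
```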
Friday, March 13, 2015
Week 9 reading notes
Chapter 10 of MIR
In this chapter, the author introduces user interfaces, which should help users express their information needs and formulate queries. The chapter also covers recent developments in this area.
There are three design rules for users:
- Offer informative feedback
- Reduce working memory load
- Provide alternative interfaces for novice and expert users
Information access interfaces must also contend with special kinds of simplicity/power tradeoffs.
How to judge an interactive system
An important aspect of human-computer interaction is the methodology for evaluation of user interface techniques. Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems. The standard evaluations emphasize high recall levels; in the TREC tasks systems are compared to see how well they return the top 1000 documents.
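For reference, a minimal sketch of precision and recall over a ranked list, evaluated at a cutoff k (the function is my own illustration, not the chapter's):

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision@k and recall@k for a ranked result list.

    ranked: list of doc ids in rank order
    relevant: set of doc ids judged relevant
    """
    retrieved = ranked[:k]
    hits = sum(doc in relevant for doc in retrieved)
    precision = hits / k
    recall = hits / len(relevant)
    return precision, recall

# TREC-style evaluation uses deep cutoffs such as k = 1000,
# which emphasizes recall more than an interactive user would.
print(precision_recall_at_k(["d1", "d2", "d3"], {"d1", "d3", "d9"}, k=3))
```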
The procedure of interacting with the system:
- Start with an information need
- Select a system and a collection to search on
- Formulate a query
- Send the query to the system
- Receive the results in the form of information items
- Scan, evaluate, and interpret the results
- Either stop, or,
- Reformulate the query and go to step 4
About the starting points:
One study found that for non-expert users the results of clustering were difficult to use, and that graphical depictions (for example, representing clusters with circles and lines connecting documents) were much harder to use than textual representations (for example, showing titles and topical words, as in Scatter/Gather), because documents' contents are difficult to discern without actually reading some text.
About the query specification:
Boolean queries are problematic because the basic syntax is counter-intuitive.
Solutions to this problem are: 1. allow users to choose from a selection of common simple ways of combining query terms; 2. provide a simpler or more intuitive syntax.
Visualization For Text Mining:
One of the most common strategies used in text mining is to identify important entities within the text and attempt to show connections among those entities.
Visualizing Document Concordances And Word Frequencies
In the field of literature analysis it is commonplace to analyze a text or a collection of texts by extracting a concordance : an alphabetical index of all the words in a text, showing those words in the contexts in which they appear. The standard way to view the concordance is to place the word of interest in the center of the view with “gutters” on either side, and then sort the surrounding text in some way.
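A minimal keyword-in-context (KWIC) sketch of this concordance view, my own illustration of the "gutters" layout and the sort-by-context step:

```python
def kwic(text, keyword, width=3):
    """Keyword-in-context lines: the keyword centered, with up to
    `width` words of left/right context in the 'gutters'."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left:>30} | {w} | {right}")
    # Sort by the context following the keyword, one common choice.
    return sorted(lines, key=lambda line: line.split(" | ")[2].lower())

text = "to be or not to be that is the question"
print("\n".join(kwic(text, "be")))
```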
Visualizing Literature And Citation Relationships
This information is used to assess the importance of the authors, the papers, and the topics in the field, and how these measures change over time.
Some more recent approaches have moved away from nodes-and-links in order to break the interesting relationships into pieces and show their connections via brushing-and-linking interactions.