COSC282 BIG DATA ANALYTICS FALL 2015 LECTURE 11 - OCT 21
Topics for Today
- Assignment 6
- Vector Space Model
- Term Weighting: Term Frequency, Inverse Document Frequency
Something about Assignment 6
Search Engine
What Is Our Mission Here?
Query -> Search Results: a ranked list of documents
Document Retrieval
Document retrieval is the most important information retrieval task: finding relevant documents for a query. It is not about finding answers - that is question answering. Document retrieval is sometimes also known as ad-hoc retrieval.
Document Retrieval Process
[Flow diagram: an Information Need becomes a Query Representation; the Corpus goes through Document Representation and Indexing to build an Index; Retrieval Algorithms match the query against the Index to produce Retrieval Results, which feed Evaluation/Feedback.]
How do Retrieval Algorithms Work?
How to find the relevant documents for a query?
- By keyword matching: Boolean models
- By similarity: vector space model
- By imagining how a query would be written - how likely the query was generated, with some randomness, with this document in mind: query-generation language models
- By trusting how other people regard the documents/web pages: link-based methods (PageRank, HITS)
Vector Space Model
Ch. 6 Formal Definition of Document Retrieval
Task: rank-order the documents in the collection with respect to a query
Input: a query and all documents in your collection
Output: a ranked list of all documents in the collection (but you can stop early)
Basic Procedure
Assign a score (say, in [0, 1]) to each document in the collection. This score measures how well the document matches the given query. Then sort the documents by score, usually in descending order - from most relevant to least relevant.
Sec. 6.3 Vector Space Model Treat query as a tiny document Represent both query and documents as word vectors in a word space Rank documents according to their proximity to the query in the space of words
Sec. 6.3 Represent Documents in a Space of Word Vectors
Suppose the corpus only has two words: "jealous" and "gossip". They form a two-dimensional space of jealous and gossip.
d1: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip gossip
d2: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip jealous jealous jealous jealous jealous jealous jealous gossip jealous
d3: jealous gossip jealous jealous jealous jealous jealous jealous jealous jealous jealous
q: gossip gossip jealous gossip gossip gossip gossip gossip jealous jealous jealous jealous
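As a sketch of how such word vectors are built, the texts above can be counted into vectors with Python's collections.Counter (the docs dict and to_vector helper are illustrative, not part of the lecture):

```python
from collections import Counter

# Two-word vocabulary from the example corpus
vocab = ["gossip", "jealous"]

docs = {
    "d1": "gossip gossip jealous gossip gossip gossip gossip gossip "
          "gossip gossip gossip",
    "q":  "gossip gossip jealous gossip gossip gossip gossip gossip "
          "jealous jealous jealous jealous",
}

def to_vector(text):
    # Count how often each vocabulary word occurs in the text
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

for name, text in docs.items():
    print(name, to_vector(text))  # d1 -> [10, 1], q -> [7, 5]
```

Each document becomes one point in the (gossip, jealous) plane.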
Calculate the Query-Document Similarity
Sec. 6.3 Formalizing vector space proximity First cut: distance between the end points of the two vectors? How to do it?
Euclidean Distance
In mathematics, the Euclidean distance (or Euclidean metric) is the "ordinary" (i.e., straight-line) distance between two points in Euclidean space. If p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean space, their Euclidean distance is
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
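A minimal Python sketch of this formula (the function name is my own):

```python
import math

def euclidean_distance(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # -> 5.0
```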
Sec. 6.3 In a Space of Jealous and Gossip
If you look at the content (the word distributions) of each document, d2 is actually the most similar document to q. However, d2 produces a bigger distance score to q. (The documents d1, d2, d3 and the query q are as listed above.)
Sec. 6.3 In a Space of Jealous and Gossip
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Why Euclidean Distance Is a Bad Idea for Query-Document Similarity
Euclidean distance is large for vectors of different lengths: a short query and a long document will always have a big Euclidean distance. We therefore cannot rank documents fairly against each other, so a universal ranking is not possible.
How can we do better?
What matters is the content similarity. Here, the angle between the vectors captures this similarity better than a distance metric does.
Sec. 6.3 Use Angle Instead of Distance
Key idea: rank documents according to their angle with the query. The angle between similar vectors is small; between dissimilar vectors it is large. This is exactly what we need to score a query-document pair, and it is equivalent to performing document length normalization.
Sec. 6.3 Document Length Normalization
A vector can be (length-)normalized by dividing each of its components by its length:
v' = v / |v|, where |v| = sqrt(sum_i v_i^2)
Dividing a vector this way makes it a unit (length) vector (on the surface of the unit hypersphere). Long and short documents now have comparable weights.
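A short sketch of length normalization (the function name is illustrative):

```python
import math

def length_normalize(v):
    # Divide every component by the vector's Euclidean length,
    # yielding a unit vector on the unit hypersphere.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

print(length_normalize([3.0, 4.0]))  # -> [0.6, 0.8]
```

After normalization, the vector's own length is 1, so only its direction (i.e., its word distribution) matters.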
[Figure: cosine similarity illustrated]
Sec. 6.3 Cosine Similarity
cos(q, d) = (q . d) / (|q| |d|) = (sum_i q_i d_i) / (sqrt(sum_i q_i^2) * sqrt(sum_i d_i^2))
q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document. cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
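The formula translates directly into Python; a minimal sketch:

```python
import math

def cosine_similarity(q, d):
    # cos(q, d) = (q . d) / (|q| * |d|)
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    return dot / (norm_q * norm_d)

# Identical directions score 1; orthogonal directions score 0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # -> 0.0
```

Note that scaling a vector (making a document "longer") does not change the score, which is exactly the length normalization we wanted.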
Exercise - Please go to Piazza
Consider two documents D1, D2 and a query Q:
D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
Results
Consider two documents D1, D2 and a query Q:
D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
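As a worked sketch, the exercise vectors can be scored with cosine similarity (the helper is defined inline so the snippet stands alone):

```python
import math

def cosine_similarity(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    return dot / (norm_q * norm_d)

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q = (1.5, 1.0, 0.0)

print(round(cosine_similarity(Q, D1), 3))  # approx. 0.869
print(round(cosine_similarity(Q, D2), 3))  # approx. 0.966
```

So D2 scores higher and would be ranked above D1 for this query.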
What are the numbers in a vector? They are term weights indicating the importance of a term in a document.
Term Weighting
Sec. 1.1 Recall: Term-Document Matrix
Columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth

Antony     1  1  0  0  0  1
Brutus     1  1  0  1  0  0
Caesar     1  1  0  1  1  1
Calpurnia  0  1  0  0  0  0
Cleopatra  1  0  0  0  0  0
mercy      1  0  1  1  1  1
worser     1  0  1  1  1  0

1 if the document contains the term, 0 otherwise.
The numbers in the matrix could be more meaningful if they indicated the importance of each term in a document.
Term Frequency How many times a term appears in a document
Sec. 6.2 Term-Document Matrix with Term Frequency
Consider the number of occurrences of a term in a document. Each document is a word-count vector in N^|V|: a column below.
Columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth

Antony     157   73  0  0  0  0
Brutus       4  157  0  1  0  0
Caesar     232  227  0  2  1  1
Calpurnia    0   10  0  0  0  0
Cleopatra   57    0  0  0  0  0
mercy        2    0  3  5  5  1
worser       2    0  1  1  1  0
but,
Some terms are common - less common than stop words, but still common. E.g., "Georgetown" is uniquely important on NBA.com, but "Georgetown" appears on too many pages of our university web site, so it is not a very important term on those pages. How do we discount such terms?
Inverse Document Frequency
Document frequency: the number of documents in which a term appears.
Inverse document frequency: the inverse of the above - our way of discounting common terms.
Sec. 6.2.1 Inverse Document Frequency (idf)
df_t is the document frequency of t: the number of documents that contain t. df_t is an inverse measure of the informativeness of t, and df_t <= N, where N is the total number of documents. We define the idf (inverse document frequency) of t by
idf_t = log10(N / df_t)
We use log10(N / df_t) instead of N / df_t to dampen the effect of idf.
Sec. 6.2.1 Exercise: Calculate IDF (Suppose N = 1 million)

term        df_t       idf_t
calpurnia          1
animal           100
sunday         1,000
fly           10,000
under        100,000
the        1,000,000

idf_t = log10(N / df_t). There is one idf value for each term t in a collection.
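The idf_t column can be filled in with a few lines of Python; a sketch:

```python
import math

N = 1_000_000  # total number of documents

df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

# idf_t = log10(N / df_t), one value per term in the collection
idf = {t: math.log10(N / d) for t, d in df.items()}

for t in df:
    print(f"{t:10s} df={df[t]:>9,} idf={idf[t]:.0f}")
# calpurnia -> 6, animal -> 4, sunday -> 3, fly -> 2, under -> 1, the -> 0
```

A term that appears in every document ("the") gets idf 0: it carries no discriminating information.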
TF-IDF Term Weighting
Sec. 6.2.2 tf-idf Weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = log10(1 + tf_{t,d}) * log10(N / df_t)
It is the best-known weighting scheme in information retrieval. The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.
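A minimal sketch of the weighting formula (function name is my own; terms with tf = 0 or df = 0 get weight 0):

```python
import math

def tfidf_weight(tf, df, N):
    # w_{t,d} = log10(1 + tf_{t,d}) * log10(N / df_t)
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df)

# A term occurring 9 times in a document and appearing in 1,000 of
# 1,000,000 documents: log10(10) * log10(1000) = 1 * 3 = 3
print(tfidf_weight(9, 1_000, 1_000_000))  # -> 3.0
```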
Binary -> count -> weight matrix
Columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth

Antony     5.25  3.18  0     0     0     0.35
Brutus     1.21  6.1   0     1     0     0
Caesar     8.59  2.54  0     1.51  0.25  0
Calpurnia  0     1.54  0     0     0     0
Cleopatra  2.85  0     0     0     0     0
mercy      1.51  0     1.9   0.12  5.25  0.88
worser     1.37  0     0.11  4.15  0.25  1.95

Each document is now represented by a real-valued vector of tf-idf weights in R^|V|.
Sec. 6.4 tf-idf weighting has many variants
Sec. 6.4 Weighting May Differ in Queries vs. Documents
Many search engines allow different weightings for queries vs. documents. A very standard weighting scheme is lnc.ltc:
Document: logarithmic tf (l as the first character), no idf (n), and cosine normalization (c).
Query: logarithmic tf (l), idf (t), no normalization.
Vector Space Model
Advantages: a simple computational framework for ranking documents given a query; any similarity measure or term-weighting scheme can be used.
Disadvantages: assumes term independence.
Summary of Today
- Vector Space Model
- TF-IDF Term Weighting
Assignment 6 due today. Midterm next Wed.