COSC282 BIG DATA ANALYTICS FALL 2015 LECTURE 11 - OCT 21

Topics for Today: Assignment 6, Vector Space Model, Term Weighting, Term Frequency, Inverse Document Frequency

Something about Assignment 6

Search Engine

What is our mission here?

Query → Search Results: a ranked list of documents

Document Retrieval: It is the most important information retrieval task. It is to find relevant documents for a query, not to find answers (that is question answering). Sometimes also known as ad-hoc retrieval.

Document Retrieval Process (diagram): Information Need → Query Representation; Corpus → Document Representation → Indexing → Index; Retrieval Algorithms match the query against the index to produce Retrieval Results, followed by Evaluation/Feedback.

How do Retrieval Algorithms Work?

How to find the relevant documents for a query? By keyword matching: Boolean models. By similarity: vector space model. By imagining how a query would be written, i.e. how likely the query is to be generated (with some randomness) with this document in mind: query-generation language models. By trusting how other people think about the documents/web pages: link-based methods such as PageRank and HITS.

Vector Space Model

Ch. 6 Formal Definition of Document Retrieval. Task: rank-order the documents in the collection with respect to a query. Input: a query and all documents in your collection. Output: a ranked list of documents (of all documents in the collection, but you can stop early).

Basic Procedure: Assign a score (say in [0, 1]) to each document in the collection. This score measures how well the document and the given query match. Then sort the documents in descending order of score, from most relevant to least relevant.
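As an illustration (not from the lecture), a minimal Python sketch of this score-then-sort procedure; the names rank_documents and score are assumptions:

def rank_documents(query, documents, score):
    # score(query, doc) should return a number in [0, 1]; higher = better match
    scored = [(score(query, doc), doc_id) for doc_id, doc in documents.items()]
    # sort in descending order of score: most relevant first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored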

Sec. 6.3 Vector Space Model Treat query as a tiny document Represent both query and documents as word vectors in a word space Rank documents according to their proximity to the query in the space of words

Represent Documents in a Space of Word Vectors Sec. 6.3 Suppose the corpus only has two words: Jealous and Gossip They form a space of Jealous and Gossip d1: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip gossip d2: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip jealous jealous jealous jealous jealous jealous jealous gossip jealous d3: jealous gossip jealous jealous jealous jealous jealous jealous jealous jealous jealous q: gossip gossip jealous gossip gossip gossip gossip gossip jealous jealous jealous jealous
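As a minimal sketch (assuming simple whitespace tokenization; not from the slides), the documents above become count vectors in the (gossip, jealous) space like this:

from collections import Counter

def to_vector(text, vocabulary=("gossip", "jealous")):
    counts = Counter(text.split())
    # one coordinate per vocabulary word: here (gossip count, jealous count)
    return tuple(counts[word] for word in vocabulary)

d1 = "gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip gossip"
q = "gossip gossip jealous gossip gossip gossip gossip gossip jealous jealous jealous jealous"
print(to_vector(d1))  # (10, 1)
print(to_vector(q))   # (7, 5)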

Calculate the Query-Document Similarity

Sec. 6.3 Formalizing vector space proximity First cut: distance between the end points of the two vectors? How to do it?

Euclidean Distance: In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (i.e. straight-line) distance between two points in Euclidean space. If p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean space, their Euclidean distance is d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2).
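A small Python sketch of this formula (illustrative only):

import math

def euclidean_distance(p, q):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((10, 1), (7, 5)))  # 5.0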

Sec. 6.3 In a space of Jealous and Gossip Here, if you look at the content (or we say the word distributions) of each document, d2 is actually the most similar document to q. However, d2 produces a bigger distance score to q. d1: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip gossip d2: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip jealous jealous jealous jealous jealous jealous jealous gossip jealous d3: jealous gossip jealous jealous jealous jealous jealous jealous jealous jealous jealous q: gossip gossip jealous gossip gossip gossip gossip gossip jealous jealous jealous jealous

Sec. 6.3 In a space of Jealous and Gossip The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Why Euclidean Distance is a Bad Idea for Query-Document Similarity: Euclidean distance is large for vectors of different lengths, so a short query and a long document will always have a large Euclidean distance. We cannot rank documents fairly against one another this way; it is not possible to get a universal ranking.

How can we do better?

What matters is the content similarity. Here, the angle between the vectors captures this similarity better than the distance metric does.

Sec. 6.3 Use angle instead of distance. Key idea: rank documents according to their angle with the query. The angle between similar vectors is small; between dissimilar vectors it is large. This is exactly what we need to score a query-document pair. This is equivalent to performing a document length normalization.

Sec. 6.3 Document Length Normalization: A vector can be (length-)normalized by dividing each of its components by its length (the L2 norm, ||v|| = sqrt(sum_i v_i^2)). Dividing a vector this way makes it a unit (length) vector (on the surface of the unit hypersphere). Long and short documents now have comparable weights.
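A minimal sketch of length normalization (illustrative; the function name normalize is an assumption):

import math

def normalize(vector):
    length = math.sqrt(sum(x * x for x in vector))
    # dividing every component by the vector's length yields a unit vector
    return [x / length for x in vector]

print(normalize([3.0, 4.0]))  # [0.6, 0.8]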

Cosine similarity illustrated (figure)

Sec. 6.3 Cosine Similarity: cos(q, d) = (q · d) / (|q| |d|) = (sum_i q_i d_i) / (sqrt(sum_i q_i^2) * sqrt(sum_i d_i^2)), where q_i is the tf-idf weight of term i in the query and d_i is the tf-idf weight of term i in the document. cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.

Exercise - Please go to Piazza: Consider two documents D1, D2 and a query Q. D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)

Results: Consider two documents D1, D2 and a query Q. D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
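The numeric results are not reproduced in this transcript, but they follow directly from the cosine formula above; a minimal Python check (illustrative):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q = (1.5, 1.0, 0)
print(cosine_similarity(Q, D1))  # ~0.87
print(cosine_similarity(Q, D2))  # ~0.97, so D2 ranks above D1 for this query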

What are the numbers in a vector? They are term weights to indicate the importance of a term in a document

Term Weighting

Sec. 1.1 Recall: Term-Document Matrix (1 if the document contains the term, 0 otherwise)

Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 1 | 1 | 0 | 0 | 0 | 1
Brutus    | 1 | 1 | 0 | 1 | 0 | 0
Caesar    | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy     | 1 | 0 | 1 | 1 | 1 | 1
worser    | 1 | 0 | 1 | 1 | 1 | 0

The numbers in the matrix could be made more meaningful, to indicate the importance of each term in a document.

Term Frequency How many times a term appears in a document

Sec. 6.2 Term-Document Matrix with Term Frequency. Consider the number of occurrences of a term in a document: each document is a word count vector in N^|V| (a column below).

Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 157 | 73 | 0 | 0 | 0 | 0
Brutus    | 4 | 157 | 0 | 1 | 0 | 0
Caesar    | 232 | 227 | 0 | 2 | 1 | 1
Calpurnia | 0 | 10 | 0 | 0 | 0 | 0
Cleopatra | 57 | 0 | 0 | 0 | 0 | 0
mercy     | 2 | 0 | 3 | 5 | 5 | 1
worser    | 2 | 0 | 1 | 1 | 1 | 0
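A minimal sketch of computing raw term frequencies for one document (assuming simple lower-cased whitespace tokenization; illustrative only):

from collections import Counter

def term_frequencies(text):
    # raw tf: how many times each term appears in the document
    return Counter(text.lower().split())

print(term_frequencies("Brutus killed Caesar and Brutus fled"))
# Counter({'brutus': 2, 'killed': 1, 'caesar': 1, 'and': 1, 'fled': 1})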

but,

Some terms are common (less common than the stop words, but still common). E.g., Georgetown is uniquely important on NBA.com, but Georgetown appears on too many pages of our university web site, so it is not a very important term in those pages. How do we discount such terms?

Inverse Document Frequency. Document Frequency: in how many documents a term appears. Inverse Document Frequency: the inverse of the above; our way of discounting the common terms.

Sec. 6.2.1 Inverse Document Frequency (idf): df_t is the document frequency of t, i.e. the number of documents that contain t. df_t is an inverse measure of the informativeness of t, and df_t <= N, where N is the total number of documents. We define the idf (inverse document frequency) of t by idf_t = log10(N/df_t). We use log10(N/df_t) instead of N/df_t to dampen the effect of idf.

Sec. 6.2.1 Exercise: Calculate IDF (suppose N = 1 million), using idf_t = log10(N/df_t). There is one idf value for each term t in a collection.

term      | df_t      | idf_t
calpurnia | 1         |
animal    | 100       |
sunday    | 1,000     |
fly       | 10,000    |
under     | 100,000   |
the       | 1,000,000 |
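The idf column can be filled in straight from the definition above; a quick illustrative check in Python (log base 10, N = 1,000,000):

import math

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, math.log10(N / df))
# calpurnia 6.0, animal 4.0, sunday 3.0, fly 2.0, under 1.0, the 0.0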

TF-IDF Term Weighting

Sec. 6.2.2 tf-idf weighting: The tf-idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = log10(1 + tf_{t,d}) * log10(N/df_t). It is the best known weighting scheme in information retrieval. It increases with the number of occurrences within a document and increases with the rarity of the term in the collection.
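A minimal sketch of this weighting (illustrative; the function name tf_idf_weight is an assumption):

import math

def tf_idf_weight(tf, df, n_docs):
    # w_{t,d} = log10(1 + tf) * log10(N / df); zero if the term is absent or df is 0
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(n_docs / df)

print(tf_idf_weight(10, 100, 1_000_000))  # log10(11) * 4 ≈ 4.17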

Binary → count → weight matrix: each document is now represented by a real-valued vector of tf-idf weights in R^|V|.

Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 5.25 | 3.18 | 0 | 0 | 0 | 0.35
Brutus    | 1.21 | 6.1 | 0 | 1 | 0 | 0
Caesar    | 8.59 | 2.54 | 0 | 1.51 | 0.25 | 0
Calpurnia | 0 | 1.54 | 0 | 0 | 0 | 0
Cleopatra | 2.85 | 0 | 0 | 0 | 0 | 0
mercy     | 1.51 | 0 | 1.9 | 0.12 | 5.25 | 0.88
worser    | 1.37 | 0 | 0.11 | 4.15 | 0.25 | 1.95

Sec. 6.4 tf-idf weighting has many variants

Sec. 6.4 Weighting may differ in queries vs. documents: Many search engines allow different weightings for queries vs. documents. A very standard weighting scheme is lnc.ltc. Document: logarithmic tf (l as first character), no idf, and cosine normalization. Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.

Vector Space Model. Advantages: simple computational framework for ranking documents given a query; any similarity measure or term weighting scheme can be used. Disadvantages: assumption of term independence.

Summary of Today: Vector Space Model, Term Weighting, TF-IDF Term Weighting. Assignment 6 due today. Midterm next Wednesday.