YOU ARE WHAT YOU LIKE! INFORMATION LEAKAGE THROUGH USERS' INTERESTS

NDSS Symposium 2012
Abdelberi (Beri) Chaabane, Gergely Acs, Mohamed Ali Kaafar

Internet = Online Social Networks?
Most visited websites: Facebook (2nd), YouTube (3rd), Twitter (10th).
Facebook (1): more than 800M users; more than 350M users access it through their mobile devices; more than 250M photos uploaded per day; more than 20M application installations per day.
And privacy?
(1) https://www.facebook.com/press/info.php?statistics

Identifying the threat
Users' private/public data. Hmmm, Mark Z. is a bad guy! Privacy policies.
User public profile + inference technique ~ private profiles.

Goal (*)
Infer missing or hidden information from a public user profile.
Existing approaches: using friendship or link information [2,3].
Here: using only the user's revealed data.
(*) http://13thfloorgrowingold.wordpress.com/

What do people reveal?
Attribute       Revealed   Missing
Friendship      75%        25%
Gender          79%        21%
Likes           57%        43%
Current City    23%        77%
Looking For     22%        78%
Hometown        22%        78%
Relationship    17%        83%
Interested In   16%        84%
Birthday        6%         94%
Religion        1%         99%

Homophily or not homophily?
[Figure: a friendship graph around Mme Michou (Age?), with neighbors labeled Age = 23, Age = 25, Age = 20, and two labeled Age = Hidden.]

Quiz
Who is this guy? Who likes his music?

Music? Why would that work?
In real life, an individual's interests (or lifestyle) may reveal many aspects of their personal information, such as demographic or geopolitical aspects.
Availability: seemingly harmless ;-) by default settings?

Not that easy
Heterogeneity: interests may be too general ("I like Jazz Music") or too specific ("Angus Young").
Interests are difficult to link semantically: what is the link between Angus Young, Brian Johnson, and High Voltage?

Likes
One of the most available kinds of data.
Describe users' tastes.
Can be used to derive user information: gender, location, age, marital status, religion, etc.
Very sparse (millions of likes).
User-generated (no defined pattern).
No standard granularity.

A toy example
Likes: Mohammad-Reza Shajarian, Nazeri, Gogosh.
What does it mean (lack of semantics)? What can we infer?

Semantics: a naïve example
Shajarian: 1940 births; Living people; Iranian classical vocalists; Iranian humanitarians; Iranian male singers; Iranian musicians.
Nazeri: Grammy Award winners; Iranian Kurdish people; Living people; Iranian classical vocalists; Iranian humanitarians; Iranian Légion d'honneur recipients; Iranian male singers.
Gogosh: people of Azerbaijani descent; Iranian female; Persian-language singers; Iranian pop singers; Iranian Shi'a Muslims; People from Tehran.
By the way, it belongs to http://facebook.com/kave.salamatian

Semantics: a naïve example II
Grouping the category keywords of the three likes:
Topic about Iran: Iranian classical vocalists; Iranian humanitarians; Iranian Kurdish people; people of Azerbaijani descent; Persian-language.
Topic about Islam (religion): Iranian Shi'a Muslims.
Topic about classical music: Iranian classical vocalists.

The Algorithm
Step 1: Extract semantics

Tree of Wikipedia
[Figure: the Wikipedia category tree. Root: Fundamental Concepts (Life, Matter, Society), followed by their children and grandchildren. Example path: Communication, Mass Media, Social networks, Social network services, Facebook.]

Extract semantics (description)
Ontologized version of Wikipedia: use the structured knowledge of Wikipedia and extract keywords at a fixed granularity.
Each like is an article: extract the parent categories of the like's article, using the same granularity.

Extract semantics (description)
Using the same granularity allows us to semantically link similar concepts.
AC/DC: Australian heavy metal musical groups; Australian hard rock musical groups; Blues rock groups; Musical groups established in 1973.
Angus Young: AC/DC members; Australian blues guitarists; Australian rock guitarists; Australian heavy metal guitarists.
High Voltage: AC/DC songs; Songs written by Angus Young; 1970s rock song stubs.
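The description step can be sketched as follows. This is only an illustration under stated assumptions: `PARENT_CATEGORIES` is a hypothetical in-memory stand-in for a Wikipedia category lookup (its entries are the categories listed on the slide), and the stopword list and tokenization are choices made for this example, not details taken from the talk.

```python
import re
from collections import Counter

# Hypothetical stand-in for a Wikipedia category lookup; the entries are the
# parent categories listed on the slide above.
PARENT_CATEGORIES = {
    "AC/DC": [
        "Australian heavy metal musical groups",
        "Australian hard rock musical groups",
        "Blues rock groups",
        "Musical groups established in 1973",
    ],
    "Angus Young": [
        "AC/DC members",
        "Australian blues guitarists",
        "Australian rock guitarists",
        "Australian heavy metal guitarists",
    ],
    "High Voltage": [
        "AC/DC songs",
        "Songs written by Angus Young",
        "1970s rock song stubs",
    ],
}

# Assumed stopword list, chosen for this example only.
STOPWORDS = {"by", "in", "of", "the", "written", "established"}


def describe_like(like):
    """Return a bag of lowercase keywords from the like's parent categories."""
    words = Counter()
    for category in PARENT_CATEGORIES.get(like, []):
        for token in re.findall(r"[a-z0-9/]+", category.lower()):
            if token not in STOPWORDS:
                words[token] += 1
    return words


def shared_keywords(like_a, like_b):
    """Keywords that appear in the descriptions of both likes."""
    return set(describe_like(like_a)) & set(describe_like(like_b))
```

Calling `shared_keywords("Angus Young", "High Voltage")` returns the keywords that connect the guitarist to the song, which is the kind of semantic link the raw like names alone do not expose.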

The Algorithm
Step 2: Extract topics (LDA)

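LDA intuition
Input: all available interests, each described by its keywords (Interest1: w1, w2, w3, ...), and the number of topics K.
LDA (K topics) gives, for each topic, the probabilities Prob(I1 | T1), Prob(I2 | T1), ...
These probabilities are used to classify the interests.
I1: Interest 1. T1: Topic 1.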

LDA as a probabilistic model
1. Treat data as observations that arise from a generative probabilistic process that includes hidden variables. For documents, the hidden variables reflect the thematic structure of the collection.
2. Infer the hidden structure using posterior inference. What are the topics that describe this collection?
3. Situate new data into the estimated model. How does this new document fit into the estimated topic structure?
D. Blei (MLSS 09)

LDA
Words are collected into documents.
Each document is a mixture of a small number of topics.
Each word's creation is attributable to one of the document's topics.
Topics are not named.
Input: documents (word frequencies) and the number of topics (K).
Output: the word distribution per topic and the probability that each document belongs to each topic.
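To make the input and output concrete, here is a minimal sketch of LDA fitted with a collapsed Gibbs sampler in plain Python. The talk does not say which inference method or library the authors used, so the sampler, the function name `lda_gibbs`, and the hyperparameters `alpha`, `beta`, and `iterations` are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with a collapsed Gibbs sampler (illustrative, not optimized).

    docs: list of documents, each a list of word strings.
    Returns (doc_topic, topic_word, vocab): per-document topic probabilities,
    per-topic word probabilities, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic_counts = [[0] * K for _ in docs]
    topic_word_counts = [[0] * V for _ in range(K)]
    topic_totals = [0] * K
    assignments = []

    # Give every word occurrence a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(K)
            z_doc.append(z)
            doc_topic_counts[d][z] += 1
            topic_word_counts[z][word_id[w]] += 1
            topic_totals[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, z = word_id[w], assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic_counts[d][z] -= 1
                topic_word_counts[z][wid] -= 1
                topic_totals[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][wid] + beta)
                    / (topic_totals[k] + V * beta)
                    for k in range(K)
                ]
                z = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = z
                doc_topic_counts[d][z] += 1
                topic_word_counts[z][wid] += 1
                topic_totals[z] += 1

    # Smoothed estimates; each row sums to 1.
    doc_topic = [
        [(c + alpha) / (len(doc) + K * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + V * beta) for c in topic_word_counts[k]]
        for k in range(K)
    ]
    return doc_topic, topic_word, vocab
```

In the setting of the talk, each "document" would be the keyword description of one interest, so a row of `doc_topic` gives the probability that the interest belongs to each topic.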

Topic example

The Algorithm
Step 3: Infer hidden attributes

Inferring hidden attributes
The Interest Feature Vector (IFV) uniquely quantifies each user's interests along topics.
Classify users based on their IFVs.
Simple approach: use the nearest neighbors (k-NN).
Similar users are grouped together: users sharing the same tastes should share the same attributes.

Nearest Friend Neighbor
We define an appropriate distance measure in this space: the chi-squared distance.
A k-d tree is used to reduce the computation cost.
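The slide names the chi-squared distance without giving its formula. The sketch below assumes one common symmetric form, 0.5 * sum((p_i - q_i)^2 / (p_i + q_i)); the authors may have used a different variant, and the function name is chosen for this example.

```python
def chi_squared_distance(p, q):
    """Symmetric chi-squared distance between two non-negative vectors.

    Assumed form: 0.5 * sum((p_i - q_i)^2 / (p_i + q_i)). Components where
    both entries are zero contribute nothing and are skipped.
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total
```

Compared with the Euclidean distance, this form weights a difference more heavily when both users put little mass on a topic, which suits vectors of topic probabilities.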

Example
User     IFV                     Attribute (to infer)
user1    0.1   0.2   0.6   0.1   ?
user2    0.01  0.3   0.6   0.7   ?
user3    0.1   0.2   0.4   0.1   Val
...
user n   0.1   0.1   0.1   0.1   Val
The n nearest users to user1 are S = {user3, userm, ...}.
The inferred attribute equals the majority value of the attribute in S (majority voting).
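The majority-voting step can be sketched as below. This is a brute-force illustration, not the k-d tree search mentioned in the talk; the function names, the assumed chi-squared form, and the attribute labels used in the usage note are hypothetical.

```python
from collections import Counter


def chi_squared_distance(p, q):
    """Assumed symmetric form: 0.5 * sum((p_i - q_i)^2 / (p_i + q_i))."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def infer_attribute(target_ifv, labeled_users, k=3):
    """Infer a hidden attribute by majority vote among the k nearest users.

    labeled_users: list of (ifv, attribute_value) pairs for users whose
    attribute is known. Ties go to the value seen first among the nearest
    users, because Counter orders equal counts by first occurrence.
    """
    if not labeled_users:
        raise ValueError("need at least one labeled user")
    nearest = sorted(
        labeled_users,
        key=lambda user: chi_squared_distance(target_ifv, user[0]),
    )[:k]
    votes = Counter(value for _, value in nearest)
    return votes.most_common(1)[0][0]
```

For example, with user1's vector `[0.1, 0.2, 0.6, 0.1]` as the target and a few labeled users such as `([0.1, 0.2, 0.4, 0.1], "Single")`, the function returns the label held by most of the closest vectors.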

Datasets
Public profiles: crawled more than 400k profiles (RawProfiles); more than 100k Latin-written profiles with music interests (PubProfiles).
Private profiles: collected using a Facebook app; more than 4,000 private profiles (2.5k used, VolunteerProfiles).

Attribute inference
We infer the following attributes.
Binary: Gender {Male, Female}; Relationship {Single, Married}.
Multi-valued: Country {US, PH, IN, ID, GB, GR, FR, MX, IT, BR} (top 10); Age group {13-17, 18-24, 25-34, 35-44, 45-54, >54}.

Baseline inference
Rely on marginal distributions: maximum likelihood of attributes.
$P(u.x = val \mid U) = \frac{|\{v \mid v.x = val \wedge v \in U\}|}{|U|}$
Guess the value of attribute x as its most likely value over all users.
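The baseline amounts to always guessing the most frequent disclosed value. A minimal sketch, where the function name `baseline_guess` and the flat list input are assumptions made for this example:

```python
from collections import Counter


def baseline_guess(known_values):
    """Return the most frequent attribute value and its empirical probability.

    known_values: attribute values disclosed by the users in U.
    """
    if not known_values:
        raise ValueError("need at least one known value")
    counts = Counter(known_values)
    value, count = counts.most_common(1)[0]
    return value, count / len(known_values)
```

With a population that is 62% female, as reported for the dataset in this talk, the baseline always guesses "Female" and is right 62% of the time.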

Inference accuracy for PubProfiles
A gain of more than 20% over the baseline in most cases.

Deeper view: Gender
The results show that music interests predict Female with high probability.
This may be explained by the share of female profiles in our dataset (62%).

Deeper view: Relationship
This attribute is challenging, since less than 17% of crawled users disclose it.
Single users are more distinguishable: single users share 9 music interests on average, whereas married users share only 5.7.

Deeper view: Country
80% of users belong to the top 10 countries.
Countries with specific (regional) music have better accuracy: we clearly see the role of semantics.

Accuracy for VolunteerProfiles
The results are roughly the same as for PubProfiles.
Our method is independent of the source of information.

Discussion
No need for frequent model updates.
The approach is rather general.
OSN-independent: many other sources of information (Deezer, Last.fm, blogs, forums, etc.).
Uses a free, open, and up-to-date encyclopedia.

Discussion
Augment the model by analyzing more interest categories: movies, books, sports.
Multilingual Wikipedia to handle foreign languages.
More aggressive stemming.

Conclusion
Wikipedia ontology to extract semantics.
LDA to extract topics: socio-demographic and geopolitical aspects, virtual communities.
k-NN to infer attributes.
The approach is general.
Uses seemingly harmless information.
Efficient, inconspicuous profiling.

Crawling Facebook
Crawling Facebook was challenging.
Protection using JavaScript rendering: we used a homemade lightweight browser.
Protection using a threshold on the maximum number of requests: we used multiple machines.
Avoiding biased sampling: we crawled the Facebook public directory (100 million users), randomly chose a user, and crawled his/her profile.
Parsing HTML pages: it is just a mess.

Availability of attributes
Attribute       Raw (%)   Pub (%)   Volunteer (%)
Gender          79        84        96
Interests       57        100       62
Current City    23        29        48
Looking For     22        34        -
Home Town       22        31        48
Relationship    17        24        43
Interested In   16        26        -
Birth Date      6         11        72
Religion        1         2         0