Advanced GDA and Software: Multivariate approaches, Interactive Graphics, Mondrian, iplots and R


University of Augsburg
unwin@math.uniaugsburg.de
PolBeRG/ELECDEM Workshop, Budapest, 28th April, 2012

Why visualize data?
- Looking for global trends
- overall structure
- Looking for local features
- data quality
- groups or clusters
- outliers, tail distributions and extremes
- patterns of all kinds

Possible examples
- German Election 2005 (results + demographics)
- German Reichstag Election 1930
- Irish Presidential Election 1990 (last opinion poll)
- Irish Referenda in the 1980s
- Bowling Alone: US Lifestyle survey over 20 years
- Movies (120,404 films rated on imdb.com)
- Oscar nominees 1928-2000 (age, gender, ...)
- Shipman's victims (UK doctor murdered patients)

German Bundestagswahl 2005
- Votes and % party support for 299 constituencies
- CDU/CSU, SPD, FDP, Grüne, Linke, Rest
- Accompanying polygon map of the constituencies
- Population demographics: gender, age, housing, education, employment
- www.bundeswahlleiter.de/de/bundestagswahlen/BTW_BUND_05/strukturdaten/

Germany 2005 questions
- Where are the different parties strongest?
- What associations are there between the parties?
- Are age distributions and unemployment figures associated with party strength?
- Which constituencies stand out locally as being different from their neighbours?

Parallel coordinate plots
- Each variable has its own vertical axis.
- Each case is represented by line segments joining its points on the axes.
[Figure: parallel coordinate plot; axis labels Bunter15, B1518, B1825, B2535, B3560, B60mehr]

PCP options
- Choice of variables
- Scaling and order of the axes (affect the display a lot)
- Rescale axes: inversion, common scaling
- Display as boxplots
- Reorder variables by hand
- Sorting by statistics (max, median, IQrange, ...)
- Interactive tools are important

Irish Presidential Election 1990
- Three candidates:
- Lenihan (FF Foreign Minister, 1982 Phone Scandal)
- Currie (FG Opposition, from Northern Ireland)
- Robinson (Labour, first female candidate)
- MRBI Opinion Poll just before the election
- 1000 people asked
- demographics, preferences and influences
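The parallel coordinate plot options listed above (rescaling, inversion, common scaling, sorting the axes by a statistic) can be sketched in a few lines. This is a minimal illustration in plain Python, not the Mondrian or iplots implementation; the function names `rescale`, `pcp_lines` and `order_by`, and the toy values in `shares`, are invented for the example.

```python
from statistics import median


def rescale(values, lo=None, hi=None, invert=False):
    """Map values linearly onto [0, 1]; lo/hi default to the variable's own range."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = hi - lo
    scaled = [0.5 if span == 0 else (v - lo) / span for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


def pcp_lines(data, order, common_scale=False, inverted=()):
    """Return one polyline per case: its height on each axis, in axis order."""
    if common_scale:
        # Common scaling: every axis shares the overall minimum and maximum.
        lo = min(min(data[name]) for name in order)
        hi = max(max(data[name]) for name in order)
    else:
        # Individual scaling: each axis spans its own minimum to maximum.
        lo = hi = None
    axes = [rescale(data[name], lo, hi, invert=name in inverted) for name in order]
    return [list(case) for case in zip(*axes)]


def order_by(data, stat):
    """Sort variable names by a statistic of their values, largest first."""
    return sorted(data, key=lambda name: stat(data[name]), reverse=True)


# Hypothetical toy values (three cases, three variables), not taken from the slides.
shares = {
    "A": [10.0, 20.0, 30.0],
    "B": [40.0, 50.0, 60.0],
    "C": [5.0, 5.0, 25.0],
}

order = order_by(shares, median)
own = pcp_lines(shares, order)
common = pcp_lines(shares, order, common_scale=True)
flipped = pcp_lines(shares, order, inverted=("C",))
print(order)
print(own[1], common[1], flipped[1])
```

Comparing `own` with `common` for the same case shows why the slides stress that scaling affects the display a lot: with individual scaling every axis uses its full height, while common scaling preserves the relative sizes of the variables.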

Ireland 1990 questions
- Was the survey balanced by sex, age and social class?
- Which groups were strongly for Mary Robinson?
- How influential was party affiliation?
- How crucial for Mary Robinson were the second preferences she got from Austin Currie supporters?
- Four factors were rumoured to be important in determining people's votes. Were they?

[Figure: Multiple barchart for Area by Sex by Age; areas Co Dub, Conn/Ulster, LeinsterRest, Muns Co, MunsRest; panels Female, Male]
[Figure: Doubledecker plot: Rural/Urban by Sex by Age for Robinson]

Mosaicplots: structure
- Variable category combinations are represented by rectangles
- Rectangle area is (usually) proportional to frequency
- Rectangles may have equal width (height), so that height (width) is proportional to frequency
- Rectangles may be aligned in various ways
- Variables may be rotated
- Rectangles may be coloured

Mosaicplots: variants
- Classical (Observed)
- Expected (based on a model)
- Fluctuation diagram
- Same binsize
- Multiple barcharts
- Doubledecker
- rmb
- Weighted mosaics
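The statement that rectangle area is proportional to frequency can be made concrete for a two-way table. The sketch below splits the unit square horizontally by the first variable and then vertically by the second within each column; it ignores the spacing between levels that real mosaicplots add. The function name `mosaic_rectangles` and the counts are invented for illustration and are not the MRBI poll figures.

```python
def mosaic_rectangles(table):
    """Lay out a two-way table in the unit square.

    table: {first_category: {second_category: count}}
    returns: {(first_category, second_category): (x, y, width, height)}
    """
    total = sum(sum(cells.values()) for cells in table.values())
    if total == 0:
        raise ValueError("table has no observations")
    rects = {}
    x = 0.0
    for first, cells in table.items():
        margin = sum(cells.values())
        width = margin / total              # column width: marginal frequency
        y = 0.0
        for second, count in cells.items():
            # Tile height: conditional frequency within this column.
            height = count / margin if margin else 0.0
            rects[(first, second)] = (x, y, width, height)
            y += height
        x += width
    return rects


# Hypothetical counts, invented for the example.
counts = {
    "Urban": {"Yes": 30, "No": 10},
    "Rural": {"Yes": 20, "No": 40},
}

for key, (x, y, w, h) in mosaic_rectangles(counts).items():
    print(key, round(w * h, 3))   # area equals count / total
```

With the response as the last (vertical) split and all predictors split horizontally, this layout has the doubledecker form: every column has the same total height, so the heights can be compared directly as proportions while the widths show how many cases each combination has.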

Mosaicplots: options
- Choice of variables (in effect aggregation)
- Type of mosaicplot
- Order of variables
- Whether variables are plotted horizontally or vertically
- Order of categories within variables
- Aggregation of categories within variables
- Formatting: plot size, aspect ratio, spacing between levels
- Interactive tools are important
- But: check the probabilities and implicit comparisons

What is Interactive Graphics?
- Querying
- Selection, highlighting and linking
- Reformatting (rescaling, sorting, colouring, resizing)
- Zooming
- Multiple views

Case study: Movies dataset
- Movies data downloaded from the web (imdb.com)
- Just over 120,000 films
- Information on Year and Length
- Type (23 binary variables)
- Average rating and number of votes

Movie questions
- What is the distribution of ratings?
- Do modern films get more votes and higher ratings?
- What sort of ratings do action films get?
- Are short films less often rated than non-shorts?
- What kinds of film get high ratings based on few votes?
- What combinations of film types are there?
- Which film titles occur most often and are these films all from different years?
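Selection, highlighting and linking, as listed under "What is Interactive Graphics?", amount to keeping one shared set of selected cases and redrawing every view with that subset highlighted. The sketch below shows the idea for one of the movie questions (high ratings based on few votes) linked to a barchart of another variable. The records, field names and thresholds are invented for illustration; they are not the imdb.com data and not the Mondrian or iplots API.

```python
from collections import Counter

# Hypothetical toy records, invented for the example.
films = [
    {"title": "F1", "rating": 9.1, "votes": 6, "short": True},
    {"title": "F2", "rating": 8.8, "votes": 12000, "short": False},
    {"title": "F3", "rating": 4.2, "votes": 8, "short": True},
    {"title": "F4", "rating": 9.4, "votes": 5, "short": False},
    {"title": "F5", "rating": 6.0, "votes": 300, "short": False},
]


def select(cases, predicate):
    """Selection: the set of case indices satisfying the predicate."""
    return {i for i, case in enumerate(cases) if predicate(case)}


def linked_barchart(cases, selected, variable):
    """Linking: per category, (total count, highlighted count) for the selection."""
    totals = Counter(case[variable] for case in cases)
    highlighted = Counter(cases[i][variable] for i in selected)
    return {category: (n, highlighted[category]) for category, n in totals.items()}


# Select in a (notional) rating-by-votes scatterplot ...
selected = select(films, lambda f: f["rating"] >= 9.0 and f["votes"] < 10)
# ... and see the same cases highlighted in a barchart of the "short" flag.
print(linked_barchart(films, selected, "short"))
```

Because the selection is stored as case indices rather than as a filtered copy, any number of views can be highlighted from the same set, which is what makes multivariate information available via linking.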

Case study: Oscars
- Redelmeier and Singh (2001) studied the mortality of actors and actresses who had been nominated for the Oscars compared to controls.
- There are 1670 cases and 15 variables:
- 7 demographic variables including Gender, Year of birth, Year of death
- 8 film career variables including # films, # four-star films, First nomination year

Oscars: Questions
- Are there any data quality issues?
- How many males and females should there be in the dataset?
- What kinds of stars won when they were young?
- What relationships are there between the numbers of nominations and wins and the numbers of films and four-star films?
- How have the winners changed over time, if at all?

IG Advantages
- Direct querying
- Multivariate information via linking
- Fast, flexible analyses (including sensitivity analyses)
- Running through alternatives quickly
- Experimental reformatting, versatile reordering
- Generate ideas/hypotheses
- Check implications for other variables
- and don't let computing get in the way of thinking

IG Disadvantages
- Not mathematically defined
- Difficult to record the process
- Cannot replicate analyses
- Difficult to save results of analyses
- Can often not test results statistically
- Not presentation graphics quality
- Data dredging: you always find something

Mondrian
- Mondrian for interactive graphical analysis
- one of the Augsburg Impressionists
- stats.math.uniaugsburg.de/mondrian/
- for Windows, Unix, MacOS
- by Martin Theus
- Information: http://rosuda.org/mondrian/
- Further help: the reference card
- Plots: missing value, histogram, boxplot, barchart, scatterplot, splom, mosaicplots, parallel coordinate plot, map
- Rserve is necessary to use R from Mondrian: density estimation, CDPlot, smoothers

Mondrian features
- Querying
- Selection and highlighting (incl. selection sequences)
- Zooming
- Rescaling
- Resizing points
- Sorting
- Colouring
- alphablending
- Printing

R and Graphics
- Base graphics v. grid
- Packages include (cf. also the Graphics Task View):
- vcd: for displaying categorical data
- ggplot2: implementation of Grammar of Graphics, including qplot
- lattice: for drawing trellis plots
- iplots: for interactive graphics

iplots and R
- Uses the JGR interface to R: http://stats.math.uniaugsburg.de/jgr/
- iplots is an interactive graphics R package: http://stats.math.uniaugsburg.de/iplots/
- ibar, ihist, iplot, imosaic, ipcp
- Query with the ctrl key
- Options available from the View menu
- Developed primarily by Simon Urbanek

Summary
- Parallel coordinate plots are for multivariate continuous data
- Mosaicplots are for multivariate categorical data
- both have to be interactive
- Interactive graphics are for exploring data
- becoming used for web presentations in a limited way
- Graphics require interactive thought!

Case study: Shipman dataset
From the Appendix
In 2000 the British doctor, Harold Shipman, was convicted of murdering 15 of his patients. The official report (www.theshipmaninquiry.org.uk/), which examined the deaths of all patients under his care over twenty years, concluded that he had probably murdered over 200. Details of the deaths of 508 of his patients where there was doubt about the cause of death have been taken from Appendix F of the report.

Variable   Description
ID         patient number
Day        day of death
Month      month of death
Year       year of death
Weekday    day of week of death
Date       days since 1/1/1904
Name       full name of patient
Surname    surname of patient
Sex        gender of patient
Age        age at death
Location   place of death
Decision   official view on Dr.'s guilt

PolBeRG/ELECDEM Budapest, 27th April, 2012
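The Date variable is described as days since 1/1/1904, so it can be turned back into a calendar date and cross-checked against the Day, Month, Year and Weekday columns as a data quality check. The sketch below assumes that a value of 0 means 1 January 1904 itself; the slides do not say whether counting starts at 0 or 1, so this offset should be verified against a known record. The function names and the record layout are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1904, 1, 1)  # assumption: Date == 0 corresponds to 1/1/1904
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def from_day_count(days):
    """Convert a 'days since 1/1/1904' value to a calendar date."""
    return EPOCH + timedelta(days=days)


def weekday_name(d):
    """English weekday name, independent of the system locale."""
    return WEEKDAYS[d.weekday()]


def is_consistent(record):
    """Check that Date agrees with the Day, Month, Year and Weekday columns."""
    d = from_day_count(record["Date"])
    return ((d.day, d.month, d.year) == (record["Day"], record["Month"], record["Year"])
            and weekday_name(d) == record["Weekday"])


# A made-up record used only to exercise the check (not a row from the dataset).
example = {"Date": (date(2000, 1, 1) - EPOCH).days,
           "Day": 1, "Month": 1, "Year": 2000, "Weekday": "Saturday"}
print(from_day_count(example["Date"]), is_consistent(example))
```

If every record fails the check by exactly one day, the offset assumption is the likely cause rather than the data.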

Shipman questions
- Which variables might be most useful?
- Draw plots to look for patterns.
- Were there periods when there were no deaths/murders?
- Was there a pattern in the deaths by day of the week?
- Is there any pattern in the age and gender of the victims?
- Was the place of death relevant?
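Two of these questions, the pattern by day of the week and the periods without deaths, reduce to simple tabulations that could back up the plots. The sketch below is one possible way to compute them; the dates are invented toy values, not records from the Shipman dataset, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_counts(dates):
    """Number of events per day of the week, in Monday-to-Sunday order."""
    counts = Counter(d.weekday() for d in dates)
    return {name: counts[i] for i, name in enumerate(WEEKDAYS)}


def longest_gap(dates):
    """Longest period between consecutive events: (days, start, end), or None."""
    ordered = sorted(dates)
    gaps = [((b - a).days, a, b) for a, b in zip(ordered, ordered[1:])]
    return max(gaps) if gaps else None


# Hypothetical toy dates, invented for the example.
toy_dates = [date(1995, 1, 2), date(1995, 1, 9),
             date(1995, 1, 11), date(1995, 3, 1)]

print(weekday_counts(toy_dates))
print(longest_gap(toy_dates))
```

A barchart of the weekday counts and a time series of the dates would show the same information graphically; the tabulation is useful mainly for confirming what the plots suggest.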