An analyst usually does not concentrate on each individual data value but would like a whole picture of how the variables are distributed. In this chapter, we introduce some tools to tabulate and summarize the data. Moreover, graphs and charts present statistical data in visual form.

2.1. Tabulation of data

Example 1: Frequency table

A frequency table shows the frequency of all the possible responses. The distribution of a variable in the data file survey.sav can be obtained by tabulating the data into a frequency table.

Analyze > Descriptive Statistics > Frequencies
    Variables: age

Frequencies

Statistics
Age
N    Valid      40
     Missing     0

Age (frequency table for the variable age)
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   12-19        3          7.5         7.5                7.5
        20-29       18         45.0        45.0               52.5
        30-39       16         40.0        40.0               92.5
        40-49        3          7.5         7.5              100.0
        Total       40        100.0       100.0

Marjorie Chiu, 2009
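The percent and cumulative percent columns of the frequency table follow directly from the counts. The snippet below is a minimal sketch in Python rather than SPSS; the counts are copied from the table for age, and the function name `frequency_table` is only an illustrative choice.

```python
from itertools import accumulate

# Age-group counts copied from the frequency table for the variable age.
age_counts = {"12-19": 3, "20-29": 18, "30-39": 16, "40-49": 3}


def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    running = accumulate(counts.values())  # running totals of the frequencies
    return [
        (category, freq, 100 * freq / total, 100 * cum / total)
        for (category, freq), cum in zip(counts.items(), running)
    ]


rows = frequency_table(age_counts)
for category, freq, pct, cum_pct in rows:
    print(f"{category:<6} {freq:>4} {pct:>7.1f} {cum_pct:>7.1f}")
```

Each percent is the frequency divided by the total of 40 cases, and each cumulative percent is the running total of frequencies divided by the same total.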
Department of Applied Mathematics

Counting responses for combinations of variables

A cross tabulation is a table that contains counts of the number of times the various combinations of values of two categorical variables occur.

Analyze > Descriptive Statistics > Crosstabs
    Row: sex
    Column: age

Percentages within row, within column and of the total can also be obtained by changing the options under "Cells".

Crosstabs

Case Processing Summary
                          Cases
               Valid          Missing          Total
             N   Percent     N   Percent     N   Percent
sex * Age   40   100.0%      0     0.0%     40   100.0%

sex * Age Crosstabulation (row variable: sex; column variable: Age)
                                      Age
sex                        12-19    20-29    30-39    40-49    Total
Male     Count               -        12        3       -        15
         % within sex        -      80.0%    20.0%      -     100.0%
         % within Age        -      66.7%    18.8%      -      37.5%
         % of Total          -      30.0%     7.5%      -      37.5%
Female   Count               3         6       13       3        25
         % within sex      12.0%    24.0%    52.0%    12.0%   100.0%
         % within Age     100.0%    33.3%    81.3%   100.0%    62.5%
         % of Total         7.5%    15.0%    32.5%     7.5%    62.5%
Total    Count               3        18       16       3        40
         % within sex       7.5%    45.0%    40.0%     7.5%   100.0%
         % within Age     100.0%   100.0%   100.0%   100.0%   100.0%
         % of Total         7.5%    45.0%    40.0%     7.5%   100.0%

The row percentages (% within sex) sum to 100% across each row. The column percentages (% within Age) sum to 100% down each column.
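The three kinds of percentage in the crosstabulation differ only in the denominator: the row total, the column total, or the grand total. The following Python sketch recomputes them from the counts in the table. It assumes that the two empty cells for males are zero counts, which is what the column totals imply, and it rounds halves upward so that the figures match the displayed output (for example 81.3%).

```python
from decimal import Decimal, ROUND_HALF_UP

# Counts from the sex * Age crosstabulation (rows: sex, columns: age group).
# Assumption: the empty male cells are zero counts.
age_groups = ["12-19", "20-29", "30-39", "40-49"]
counts = {
    "Male": [0, 12, 3, 0],
    "Female": [3, 6, 13, 3],
}

row_totals = {sex: sum(row) for sex, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage to one decimal place, rounding halves upward."""
    value = Decimal(100 * part) / Decimal(whole)
    return float(value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


within_sex = {s: [pct(n, row_totals[s]) for n in row] for s, row in counts.items()}
within_age = {s: [pct(n, t) for n, t in zip(row, col_totals)] for s, row in counts.items()}
of_total = {s: [pct(n, grand_total) for n in row] for s, row in counts.items()}

for sex in counts:
    print(sex, within_sex[sex], within_age[sex], of_total[sex])
```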
2.2. Summarize the data by means of central tendency and dispersion

When we work with numerical data, the observed values in most data sets tend to group themselves about some interior value; some central value seems to be characteristic of the data. This phenomenon is referred to as central tendency. The arithmetic mean, the median and the mode are the three commonly used measures of central tendency.

In addition, it is necessary to have some measure of how the data are scattered, that is, of the dispersion or variability in a set of data. The range, deciles, percentiles, fractiles, quartiles, mean absolute deviation, variance, standard deviation and coefficient of variation are used to describe the dispersion of the data.

Formulas:

Mean
\[ \bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i} \]

Variance
\[ s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{\sum_i f_i - 1}, \]
where $f_i$ is the frequency of the $i$th item, $x_i$ is the value of the $i$th item or class mark, and $\bar{x}$ is the sample mean.

Standard deviation
\[ s = \sqrt{s^2} \]

Categorical data (survey.sav)

Analyze > Descriptive Statistics > Frequencies
    Variable: age
    Statistics: median, mode

Remark: The frequency table is also displayed.
Frequencies

Statistics
Age
N    Valid      40
     Missing     0
Median        2.00
Mode             2

The median and mode are reported as category codes; code 2 appears to correspond to the 20-29 age group, which is also the most frequent group in the frequency table.

Quantitative data

Example 2: (rings)

Lot-for-lot ordering is the simplest deterministic inventory model. In this model, items are purchased from a supplier (say, a wholesaler) in the exact amounts required for each time period. It is well suited to inventory items of high value or with a discontinuous demand. One hundred consecutive weekly purchases of diamond rings are made by a retail jeweler from a wholesaler to replenish the inventory sold to customers during the preceding week. The numbers of rings purchased are shown below.

44 35 34 25 41 66 50 38 45 41
40 43 49 31 44 52 55 45 51 63
33 68 27 30 58 62 45 52 12 72
49 38 66 64 60 41 30 65 46 35
70 54 43 64 24 25 52 42 53 22
23 35 51 43 11 58 75 50 67 51
32 57 24 43 35 37 42 58 42 59
25 37 40 28 60 31 64 72 48 16
26 57 33 18 46 69 74 39 26 55
78 40 50 46 47 36 29 47 63 55

The data were saved in SPSS format with the file name rings.sav. Open this file and summarize the number of rings sold.

Analyze > Descriptive Statistics > Frequencies
    Variable: rings

Statistics of the selected variable, such as the mean, median, mode, standard deviation, variance, range and quartiles, can be evaluated and used to describe the characteristics of the data.

Remark: The Charts option can be changed to display a histogram.
Frequencies

Statistics (summary of descriptive statistics for the variable RINGS)
RINGS
N                        Valid       100
                         Missing       0
Mean                               45.42
Std. Error of Mean                  1.52
Median                             45.00
Mode                                  35 (a)
Std. Deviation                     15.20
Variance                          231.18
Skewness                            .000
Std. Error of Skewness              .241
Kurtosis                           -.579
Std. Error of Kurtosis              .478
Range                                 67
Minimum                               11
Maximum                               78
Sum                                 4542
Percentiles              10        25.00
                         20        31.20
                         25        35.00
                         30        37.00
                         40        41.40
                         50        45.00
                         60        49.60
                         70        53.70
                         75        57.00
                         80        59.80
                         90        66.00
a. Multiple modes exist. The smallest value is shown.
RINGS
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   11           1          1.0         1.0                1.0
        12           1          1.0         1.0                2.0
        16           1          1.0         1.0                3.0
        18           1          1.0         1.0                4.0
        22           1          1.0         1.0                5.0
        23           1          1.0         1.0                6.0
        24           2          2.0         2.0                8.0
        25           3          3.0         3.0               11.0
        26           2          2.0         2.0               13.0
        27           1          1.0         1.0               14.0
        28           1          1.0         1.0               15.0
        29           1          1.0         1.0               16.0
        30           2          2.0         2.0               18.0
        31           2          2.0         2.0               20.0
        32           1          1.0         1.0               21.0
        33           2          2.0         2.0               23.0
        34           1          1.0         1.0               24.0
        35           4          4.0         4.0               28.0
        36           1          1.0         1.0               29.0
        37           2          2.0         2.0               31.0
        38           2          2.0         2.0               33.0
        39           1          1.0         1.0               34.0
        40           3          3.0         3.0               37.0
        41           3          3.0         3.0               40.0
        42           3          3.0         3.0               43.0
        43           4          4.0         4.0               47.0
        44           2          2.0         2.0               49.0
        45           3          3.0         3.0               52.0
        46           3          3.0         3.0               55.0
        47           2          2.0         2.0               57.0
        48           1          1.0         1.0               58.0
        49           2          2.0         2.0               60.0
        50           3          3.0         3.0               63.0
        51           3          3.0         3.0               66.0
        52           3          3.0         3.0               69.0
        53           1          1.0         1.0               70.0
        54           1          1.0         1.0               71.0
        55           3          3.0         3.0               74.0
        57           2          2.0         2.0               76.0
        58           3          3.0         3.0               79.0
        59           1          1.0         1.0               80.0
        60           2          2.0         2.0               82.0
        62           1          1.0         1.0               83.0
        63           2          2.0         2.0               85.0
        64           3          3.0         3.0               88.0
        65           1          1.0         1.0               89.0
        66           2          2.0         2.0               91.0
        67           1          1.0         1.0               92.0
        68           1          1.0         1.0               93.0
        69           1          1.0         1.0               94.0
        70           1          1.0         1.0               95.0
        72           2          2.0         2.0               97.0
        74           1          1.0         1.0               98.0
        75           1          1.0         1.0               99.0
        78           1          1.0         1.0              100.0
        Total      100        100.0       100.0
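The formulas of Section 2.2 can be checked against the SPSS summary for the rings data. The sketch below uses Python's standard library, not SPSS: it builds the frequency table from the 100 raw values, applies the frequency-weighted formulas for the mean and variance, and obtains the remaining statistics from the `statistics` module. The quartiles rely on the module's default "exclusive" method, which is assumed here to agree with the percentiles SPSS reports for this data set.

```python
import statistics
from collections import Counter

# The 100 weekly ring purchases listed in Example 2.
rings = [
    44, 35, 34, 25, 41, 66, 50, 38, 45, 41,
    40, 43, 49, 31, 44, 52, 55, 45, 51, 63,
    33, 68, 27, 30, 58, 62, 45, 52, 12, 72,
    49, 38, 66, 64, 60, 41, 30, 65, 46, 35,
    70, 54, 43, 64, 24, 25, 52, 42, 53, 22,
    23, 35, 51, 43, 11, 58, 75, 50, 67, 51,
    32, 57, 24, 43, 35, 37, 42, 58, 42, 59,
    25, 37, 40, 28, 60, 31, 64, 72, 48, 16,
    26, 57, 33, 18, 46, 69, 74, 39, 26, 55,
    78, 40, 50, 46, 47, 36, 29, 47, 63, 55,
]

# Frequency table: value x_i -> frequency f_i.
freq = Counter(rings)

# Frequency-weighted formulas from Section 2.2.
n = sum(freq.values())
mean = sum(f * x for x, f in freq.items()) / n
variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / (n - 1)
std_dev = variance ** 0.5

# Remaining summary statistics from the raw data.
median = statistics.median(rings)
modes = sorted(statistics.multimode(rings))   # all values tied for most frequent
data_range = max(rings) - min(rings)
quartiles = statistics.quantiles(rings, n=4)  # default method is "exclusive"

print(f"mean={mean:.2f} variance={variance:.2f} std_dev={std_dev:.2f}")
print(f"median={median} modes={modes} range={data_range} quartiles={quartiles}")
```

The two modes explain footnote a in the SPSS output: several values share the highest frequency, and SPSS reports only the smallest of them.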
2.3. Histogram (rings.sav)

A histogram is composed of a number of bars and is used to show the distribution of a variable; the skewness of the distribution can also be observed. Each bar represents a range of values, and the area of the bar is directly proportional to the frequency of that range.

Graphs > Legacy Dialogs > Histogram
    Variable: rings

[Graph: histogram of rings]

The distribution of the variable rings is quite symmetric.

2.4. Stem and leaf display (rings.sav)
A stem and leaf display shows the distribution of a variable, like a histogram. In addition, it shows the actual values of the data points.

Analyze > Descriptive Statistics > Explore
    Dependent List: rings

Remark: Descriptive statistics and a histogram can also be obtained.

RINGS Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2.00        1 .  12
     2.00        1 .  68
     4.00        2 .  2344
     8.00        2 .  55566789
     8.00        3 .  00112334
    10.00        3 .  5555677889
    15.00        4 .  000111222333344
    11.00        4 .  55566677899
    11.00        5 .  00011122234
     9.00        5 .  555778889
     8.00        6 .  00233444
     6.00        6 .  566789
     4.00        7 .  0224
     2.00        7 .  58

 Stem width:    10
 Each leaf:     1 case(s)

The stem together with each individual digit in the leaf forms a data value, scaled according to the stem width. For example, the first row of the stem-and-leaf display consists of two data values, 11 and 12.

2.5. Boxplot (rings.sav)

A boxplot helps us to visualize the distribution of a variable. It simultaneously displays the median, the interquartile range, and the smallest and largest values.
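The quantities that a boxplot displays can be computed directly from the rings data. The Python sketch below is an illustration, not the SPSS procedure: it takes the quartiles from `statistics.quantiles` as the ends of the box (an assumption, since SPSS may compute the box ends slightly differently) and applies the usual rule that the whiskers reach the most extreme observations within 1.5 box lengths of the box.

```python
import statistics

# The 100 weekly ring purchases listed in Example 2.
rings = [
    44, 35, 34, 25, 41, 66, 50, 38, 45, 41,
    40, 43, 49, 31, 44, 52, 55, 45, 51, 63,
    33, 68, 27, 30, 58, 62, 45, 52, 12, 72,
    49, 38, 66, 64, 60, 41, 30, 65, 46, 35,
    70, 54, 43, 64, 24, 25, 52, 42, 53, 22,
    23, 35, 51, 43, 11, 58, 75, 50, 67, 51,
    32, 57, 24, 43, 35, 37, 42, 58, 42, 59,
    25, 37, 40, 28, 60, 31, 64, 72, 48, 16,
    26, 57, 33, 18, 46, 69, 74, 39, 26, 55,
    78, 40, 50, 46, 47, 36, 29, 47, 63, 55,
]

# Box: 25th percentile, median, 75th percentile.
q1, median, q3 = statistics.quantiles(rings, n=4)
iqr = q3 - q1  # box length (interquartile range)

# Fences at 1.5 box lengths beyond each end of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
inside = [x for x in rings if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Observations beyond the fences would be drawn as separate points.
outliers = sorted(x for x in rings if x < lower_fence or x > upper_fence)

print(f"box: {q1} to {q3}, median {median}, box length {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

For this data set every observation lies inside the fences, so the whiskers run from the minimum to the maximum.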
[Graph: boxplot of the variable RINGS (N = 100). The box extends from the 25th percentile to the 75th percentile, with the median marked inside the box. The whiskers extend to the largest and smallest observed values within 1.5 box lengths.]

2.6. Bar chart (survey.sav)

The number of cases in each category can be shown by a bar chart, in which the length (height) of each bar is directly proportional to the frequency.

Analyze > Descriptive Statistics > Frequencies
    Variable: age
    Charts: bar chart
[Graph: simple bar chart of Age. Category axis: Age (12-19, 20-29, 30-39, 40-49); vertical axis: Frequency.]

The simple bar chart shows the frequency of the different age groups.

OR

Graphs > Legacy Dialogs > Bar
    Summaries for groups of cases
    Define simple bar chart
    Category axis: age
    Bars represent: N of cases

OR

Graphs > Legacy Dialogs > Interactive > Bar
    Drag the variable "age" to the x-axis.
    The options can be adjusted if necessary, for example, to include empty categories.
[Interactive graph: simple bar chart]

Note: The interactive bar chart also shows the classes in which there is no occurrence.

A multiple bar chart is particularly useful for making quick comparisons between different sets of data.

Graphs > Legacy Dialogs > Bar
    Summaries for groups of cases
    Define clustered bar chart
    Bars represent: N of cases
    Category axis: age
    Define clusters by: sex
[Graph: multiple bar chart of age by sex, composed of 2 clusters, with a legend and the category axis.]

A component bar chart shows how different components make up the total, using distinctive shadings or colours.

Graphs > Legacy Dialogs > Bar
    Summaries for groups of cases
    Define stacked bar chart
    Category axis: age
    Define stacks by: sex
[Graph: component bar chart of age by sex, with each bar composed of 2 parts, shown along the category axis.]

2.7. Pie chart

Pie charts are widely used to show the component parts of a total. They are popular because of their simplicity. In constructing a pie chart, the angle of each slice at the center must be proportional to that component's percentage of the total.

Analyze > Descriptive Statistics > Frequencies
    Variable: age
    Charts: pie chart
OR

Graphs > Legacy Dialogs > Pie
    Data in chart are: Summaries for groups of cases
    Slices represent: N of cases
    Define slices by: age

OR

Graphs > Legacy Dialogs > Interactive > Pie > Simple
    Slice by: age (in style)
    Options: including empty categories
    Pies: Slice Labels (count, percent)
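The proportionality rule for the slice angles can be illustrated with the age data. This is a small Python sketch, not SPSS output; the counts are those of the frequency table in Section 2.1, and the function name `slice_angles` is only illustrative.

```python
# Age-group counts from the frequency table in Section 2.1.
age_counts = {"12-19": 3, "20-29": 18, "30-39": 16, "40-49": 3}


def slice_angles(counts):
    """Map each category to (percent of total, central angle in degrees)."""
    total = sum(counts.values())
    return {
        category: (100 * n / total, 360 * n / total)
        for category, n in counts.items()
    }


angles = slice_angles(age_counts)
for category, (percent, degrees) in angles.items():
    print(f"{category:<6} {percent:>5.1f}% {degrees:>6.1f} degrees")
```

Because each angle is the same fraction of 360 degrees as the count is of the total, the angles sum to a full circle.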
[Interactive graph: pie chart of age]

2.8. Scatter plot

A two-dimensional scatter plot gives a general picture of how two quantitative variables relate to each other.

Example 3: (car)

An equation is to be developed from which we can predict the gasoline mileage of an automobile based on its weight and the temperature at the time of operation. The ASCII data are available in the file car.dat. The three columns represent miles per gallon (miles; columns 1-4; 1 d.p.), weight in tons (weight; columns 6-9; 2 d.p.) and temperature in Fahrenheit (temperature; columns 11-12). Read in the ASCII data first and then save the file in SPSS format. Give a scatter plot of miles against temperature and then of miles against weight.

Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter
    Y axis: miles
    X axis: temperature

Then put miles on the y-axis and weight on the x-axis to produce another scatter plot.
[Graph: scatter plot of miles per gallon (15.0 to 19.0) against temperature in Fahrenheit (20 to 100)]

[Graph: scatter plot of miles per gallon (15.0 to 19.0) against weight in tons (1.0 to 2.2)]

The gasoline mileage of an automobile does not seem to be related to the temperature, but the mileage and the weight appear to have a negative association.
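The fixed-column layout described for car.dat in Example 3 can also be read outside SPSS. The Python sketch below slices each record by the stated column positions. It assumes that decimal points are written explicitly in the file, and the two sample records are hypothetical values invented for illustration; they are not taken from car.dat. The names `CarRecord` and `parse_line` are likewise illustrative.

```python
from typing import NamedTuple


class CarRecord(NamedTuple):
    miles: float       # miles per gallon, columns 1-4
    weight: float      # weight in tons, columns 6-9
    temperature: int   # temperature in Fahrenheit, columns 11-12


def parse_line(line: str) -> CarRecord:
    """Split one fixed-width record by the column positions given in the text."""
    return CarRecord(
        miles=float(line[0:4]),          # columns 1-4
        weight=float(line[5:9]),         # columns 6-9
        temperature=int(line[10:12]),    # columns 11-12
    )


# Hypothetical records for illustration only; NOT the contents of car.dat.
sample_lines = [
    "17.9 1.35 90",
    "16.5 1.90 30",
]

records = [parse_line(line) for line in sample_lines]
for record in records:
    print(record)
```

To read the actual file, the same function could be applied to each line of car.dat, provided the file follows the assumed layout.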
2.9. Time series plot

A time series plot is used to show how data vary as time advances.

Example 4: (miles.sav)

Plot the time series data of the aircraft miles of ABC airlines from 1986 to 1990.

Analyze > Forecasting > Sequence Charts
    Variable: miles
    Time axis labels: date

Sequence Plot

Model Description
Model Name                        MOD_1
Series or Sequence      1         miles
Transformation                    None
Non-Seasonal Differencing         0
Seasonal Differencing             0
Length of Seasonal Period         4
Horizontal Axis Labels            Date_
Intervention Onsets               None
Reference Lines                   None
Area Below the Curve              Not filled
Applying the model specifications from MOD_1

Case Processing Summary
                                                          miles
Series or Sequence Length                                   20
Number of Missing Values in the Plot   User-Missing          0
                                       System-Missing        0
[Graph: sequence plot of the variable miles, with date as the time axis label]

The time series plot shows an upward trend with seasonal variation.
Exercise 2

Question 1.
Use the 1993 popular car data to construct a cross tabulation of the number of cars by car type and number of cylinders. Also calculate the cell percentages within subgroups and of the overall total.

Question 2.
Give a time series plot for the cod catch data. Briefly describe the plot.

Question 3. (hotel)
A hotel is concerned about the number of people who book rooms by telephone but do not actually turn up. Over the past few weeks it has kept records of the number of people who do this, as shown below. How can these data be summarized? Describe the distribution briefly.

Day         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
No-shows    4   5   2   3   3   2   1   4   7   2   0   3   1   4   5

Day        16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
No-shows    2   6   2   3   3   4   2   5   5   2   4   3   3   1   4

Day        31  32  33  34  35  36  37  38  39  40  41  42  43  44  45
No-shows    5   3   6   4   3   1   4   5   6   3   3   2   4   3   4