MATH& 146 Lesson 11
Section 1.6 Categorical Data
Frequency
The first step in organizing categorical data is to count the number of data values in each category of interest. We can organize these counts (or frequencies) into a frequency table, which records the category names (called levels) and the count for each.
Frequency
A class with 20 students had the following distribution of grades:
A, A, A, B, B, B, B, B, C, C, C, D, D, D, D, D, D, F, F, F

GRADE   FREQUENCY
A       3
B       5
C       3
D       6
F       3
Relative Frequency
A relative frequency is the proportion of times a category occurs: its frequency divided by the total number of data values (here, 20). Relative frequencies can be written as fractions, decimals, or percents.

GRADE   FREQUENCY   RELATIVE FREQUENCY
A       3           0.15
B       5           0.25
C       3           0.15
D       6           0.30
F       3           0.15
Cumulative Relative Frequency
The cumulative relative frequency of a category is the sum of its relative frequency and the relative frequencies of all categories listed before it.

GRADE   FREQUENCY   RELATIVE FREQUENCY   CUMULATIVE RELATIVE FREQUENCY
A       3           0.15                 0.15
B       5           0.25                 0.40
C       3           0.15                 0.55
D       6           0.30                 0.85
F       3           0.15                 1.00
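The three columns above can be reproduced with a few lines of Python. This is only a minimal sketch using the grade list from the example; the function name frequency_table is a hypothetical helper introduced for illustration, not part of the lesson.

```python
from collections import Counter

# Grades for the 20 students listed in the example.
grades = list("AAABBBBBCCCDDDDDDFFF")

counts = Counter(grades)   # frequency of each level
n = len(grades)            # total number of data values


def frequency_table(counts, n, levels):
    """Return rows of (level, frequency, relative, cumulative relative)."""
    rows = []
    running = 0  # running count of all values seen so far
    for level in levels:
        running += counts[level]
        rows.append((level, counts[level], counts[level] / n, running / n))
    return rows


table = frequency_table(counts, n, levels="ABCDF")
for level, freq, rel, cum in table:
    print(f"{level}  {freq:2d}  {rel:.2f}  {cum:.2f}")
```

The cumulative column is built from a running count rather than by adding rounded decimals, so rounding error does not pile up from row to row.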
Example 1
Fifty part-time students were asked how many courses they were taking this term. The (incomplete) results are shown below:

# of Courses   Frequency   Relative Frequency   Cumulative Relative Frequency
1              30          0.6                  ____
2              15          ____                 ____
3              ____        ____                 ____

a. Fill in the blanks in the table above.
b. What percent of students take exactly two courses?
c. What percent of students take at most two courses?
Graphs of Categorical Data
There are two simple visual summaries that are used for categorical data:
Circle graphs (pie charts) show the amount of data that belongs to each category as a proportional part of the whole.
Bar graphs consist of bars that are separated from each other. The bars can be rectangles or rectangular boxes, and they can be vertical or horizontal.
Graphs of Categorical Data
To get a better sense of graphing categorical data, consider the following table about the Titanic. The table lists the numbers and percentages in each class on the Titanic's voyage.

CLASS    FREQUENCY   RELATIVE FREQUENCY
First    325         15%
Second   285         13%
Third    706         32%
Crew     885         40%
Total    2201        100%
Pie Charts
When you are interested in relative frequencies, a pie chart might be your display of choice. A pie chart slices a circle into pieces whose sizes are proportional to the fraction of the whole in each category.
Pie Charts
There are two rules to follow when creating a pie chart:
1) The pieces have to add up to 100%.
2) No person can be represented in more than one piece.
[Figure: a bad pie chart whose pieces total 271%, even without an "Other" category.]
Bar Charts
A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison. Notice that the bars are separated from one another.
Pie Charts vs. Bar Charts
While pie charts are well known, they are not typically as useful as other charts. It is generally more difficult to compare group sizes in a pie chart than in a bar chart, especially when categories have nearly identical counts or proportions.
Example 2
Which category is largest? Which is smallest?
The Titanic
Here is part of a data matrix about the passengers and crew aboard the Titanic. Each case (row) of the data table represents a person on board the ship.

Survived   Age     Sex      Class
Died       Adult   Male     Third
Survived   Adult   Male     Crew
Died       Child   Male     Third
Survived   Child   Female   First
Died       Adult   Male     Third
Died       Adult   Female   Crew
The Titanic
The problem with data matrices is that you can't see what's going on. And seeing is just what we want to do. We need ways to show the data so that we can see patterns, relationships, trends, and exceptions.
The Titanic
To look at two categorical variables together, we often arrange the counts in a two-way table. Here is a two-way table of those aboard the Titanic, classified according to class of ticket and whether or not they survived.

                      Class
Survival   First   Second   Third   Crew   Total
Survived   203     118      178     212    711
Died       122     167      528     673    1490
Total      325     285      706     885    2201
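To see how a data matrix turns into a two-way table, here is a minimal Python sketch that cross-tabulates only the six cases shown in the data matrix excerpt earlier (not the full 2201 records, which are not reproduced in this lesson). The variable names are illustrative choices.

```python
from collections import Counter

# The six cases from the data matrix excerpt: (Survived, Age, Sex, Class).
cases = [
    ("Died", "Adult", "Male", "Third"),
    ("Survived", "Adult", "Male", "Crew"),
    ("Died", "Child", "Male", "Third"),
    ("Survived", "Child", "Female", "First"),
    ("Died", "Adult", "Male", "Third"),
    ("Died", "Adult", "Female", "Crew"),
]

# One cell per (survival, class) combination; Age and Sex are ignored here.
cells = Counter((outcome, ticket_class) for outcome, _, _, ticket_class in cases)

classes = ["First", "Second", "Third", "Crew"]
for outcome in ["Survived", "Died"]:
    row = [cells[(outcome, c)] for c in classes]
    print(f"{outcome:<9}", row, "total:", sum(row))
```

Combinations that never occur in the excerpt (for example, a surviving second-class passenger) simply show a count of 0.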
The Titanic
Because the table shows how the individuals are distributed along each variable, contingent on the value of the other variable, such a table is called a contingency table.
Contingency Tables
The margins of the two-way table, both on the right and at the bottom, give totals. The bottom row is just the frequency table of the variable Class.

Class    Frequency
First    325
Second   285
Third    706
Crew     885
Total    2201
Contingency Tables
The right column of the two-way table is the frequency table of the variable Survival.

Survival   Frequency
Survived   711
Died       1490
Total      2201
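The margins can be recomputed directly from the eight cell counts. The following is a minimal Python sketch, with the cell counts copied from the Titanic two-way table and the variable names chosen only for illustration.

```python
# Cell counts from the two-way table (rows: survival, columns: class).
classes = ["First", "Second", "Third", "Crew"]
counts = {
    "Survived": [203, 118, 178, 212],
    "Died": [122, 167, 528, 673],
}

# Right margin: one total per row (the frequency table of Survival).
row_totals = {outcome: sum(row) for outcome, row in counts.items()}

# Bottom margin: one total per column (the frequency table of Class).
col_totals = {
    name: sum(row[j] for row in counts.values())
    for j, name in enumerate(classes)
}

# Both margins must add up to the same grand total.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

Checking that the row totals and the column totals give the same grand total is a quick way to catch a copying error in a contingency table.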
Contingency Tables
Each cell of the two-way table gives the count for a combination of values of the two variables. For example, the highlighted cell shows that 118 second-class passengers survived. So what does the green highlighted cell show?
Row Proportions
The table below shows the row proportions for the Titanic data set. The row proportions are computed as the counts divided by their row totals.

                                           Class
Survival   First             Second            Third             Crew              Total
Survived   203/711 = .286    118/711 = .166    178/711 = .250    212/711 = .298    711/711 = 1.000
Died       122/1490 = .082   167/1490 = .112   528/1490 = .354   673/1490 = .452   1490/1490 = 1.000
Total      325/2201 = .148   285/2201 = .129   706/2201 = .321   885/2201 = .402   2201/2201 = 1.000
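As a rough illustration of the rule "count divided by row total", here is a short Python sketch. The helper name row_proportions is hypothetical; the counts are the Titanic cell counts from the two-way table.

```python
classes = ["First", "Second", "Third", "Crew"]
counts = {
    "Survived": [203, 118, 178, 212],
    "Died": [122, 167, 528, 673],
}


def row_proportions(counts):
    """Divide every count by the total of its own row."""
    return {
        outcome: [cell / sum(row) for cell in row]
        for outcome, row in counts.items()
    }


row_props = row_proportions(counts)
for outcome, props in row_props.items():
    print(f"{outcome:<9}", [f"{p:.3f}" for p in props])
```

Each row of proportions adds up to 1, because every person in that row belongs to exactly one class.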
Example 3
a) What does 167/1490 = .112 (second column, second row) represent in the row proportions table?
b) What does 885/2201 = .402 (fourth column, third row) represent in the row proportions table?
Column Proportions
A contingency table of the column proportions is computed in a similar way, where each column proportion is computed as the count divided by the corresponding column total.

                                         Class
Survival   First            Second           Third            Crew             Total
Survived   203/325 = .625   118/285 = .414   178/706 = .252   212/885 = .240   711/2201 = .323
Died       122/325 = .375   167/285 = .586   528/706 = .748   673/885 = .760   1490/2201 = .677
Total      325/325 = 1.000  285/285 = 1.000  706/706 = 1.000  885/885 = 1.000  2201/2201 = 1.000
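The same idea, applied down the columns instead of across the rows, can be sketched in Python as follows. The helper name column_proportions is an assumption made for this illustration; the counts come from the Titanic two-way table.

```python
classes = ["First", "Second", "Third", "Crew"]
survived = [203, 118, 178, 212]
died = [122, 167, 528, 673]


def column_proportions(survived, died):
    """Return (survived share, died share) for each column."""
    props = []
    for s, d in zip(survived, died):
        total = s + d  # column total
        props.append((s / total, d / total))
    return props


col_props = column_proportions(survived, died)
for name, (p_survived, p_died) in zip(classes, col_props):
    print(f"{name:<7} survived {p_survived:.3f}  died {p_died:.3f}")
```

Here each column, rather than each row, adds up to 1, so the proportions describe survival within a single class.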
Example 4
a) What does 167/285 = .586 (second column, second row) represent in the column proportions table?
b) What does 711/2201 = .323 (fifth column, first row) represent in the column proportions table?
Column Proportions
In the column proportions table, the value 0.625 indicates that 62.5% of first-class passengers survived. This survival rate is much higher than that of second-class passengers (41.4%), third-class passengers (25.2%), or crew members (24.0%).
Column Proportions
Because these differences in survival rates between the classes are unlikely to arise from random chance alone, they provide evidence that the class and survival variables are associated. We say the two variables are dependent.
Mosaic Plots
Mosaic plots are graphical displays of contingency tables. The width of each bar matches the proportion of cases in that level, while the heights of the boxes within a bar match the column proportions.
Independent
When the variables are independent, the column proportions are the same in every column, so the boxes line up in a grid.

        Column 1   Column 2   Column 3
Row 1   5          10         15
Row 2   8          16         24
Dependent
When the variables are dependent, the column proportions differ from column to column, so the boxes do not line up.

        Column 1   Column 2   Column 3
Row 1   5          16         18
Row 2   8          10         18
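The two small tables above can be compared numerically. The sketch below, written under the assumption of a table with exactly two rows, uses exact fractions to check whether the Row 1 share is identical in every column. The function names are hypothetical, and this is a description of these toy tables rather than a formal test: real data rarely match exactly even when the variables are unrelated.

```python
from fractions import Fraction


def first_row_shares(table):
    """For each column, the exact share of the column that falls in Row 1."""
    row1, row2 = table
    return [Fraction(a, a + b) for a, b in zip(row1, row2)]


def columns_match(table):
    """True when every column has the same Row 1 share (boxes line up)."""
    shares = first_row_shares(table)
    return all(share == shares[0] for share in shares)


independent_example = [[5, 10, 15], [8, 16, 24]]
dependent_example = [[5, 16, 18], [8, 10, 18]]

print(first_row_shares(independent_example))
print(first_row_shares(dependent_example))
print(columns_match(independent_example), columns_match(dependent_example))
```

In the first table every column has a Row 1 share of 5/13, while in the second the shares are 5/13, 8/13, and 1/2, which is why the boxes in its mosaic plot would not line up.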
Example 5
The mosaic plot below compares class and survival on the Titanic. Based on the plot, are the variables independent?
Example 6
A random sample of 100 pet owners was polled to see whether there is an association between gender and preference for a dog or a cat. The results of the survey are below.

         Dog   Cat   Total
Male     40    10    50
Female   20    30    50
Total    60    40    100
Example 6 continued
a) Compute and interpret the column proportions.
b) Does there appear to be an association between gender and type of pet? Explain.
Example 7
The mosaic plot below compares gender and type of pet. Based on the plot, are the variables independent?
Example 8
There are 10 boys and 12 girls in Mr. Fleck's fourth grade class and 15 boys and 18 girls in Mrs. Parker's fourth grade class. One student is randomly selected to be hall monitor.
a) Use this information to complete the contingency table below.

                      Gender
Teacher       Boy    Girl   Total
Mr. Fleck     ____   ____   ____
Mrs. Parker   ____   ____   ____
Total         ____   ____   ____
Example 8 continued
b) Compute and interpret the row proportions.
c) Does there appear to be an association between teacher and student's gender? Explain.

                      Gender
Teacher       Boy    Girl   Total
Mr. Fleck     10     12     22
Mrs. Parker   15     18     33
Total         25     30     55
Example 9
The mosaic plot below compares teacher and student gender. Based on the plot, are the variables independent?