
Reliability
4/8/2003 PSY 721 Reliability

What We Will Cover
- What reliability is.
- How a test's reliability is estimated.
- How to interpret and use reliability estimates.
- How to enhance reliability.

What Is It?
- An estimate of the consistency of a test score.
- Permits an estimate of the amount of error in a score.
- The more error, the less stable (reliable) the score is.

Definitions
True Score: The stable characteristics of the individual being tested or the attribute being measured.
Error: Features of the individual, test content, and situation that influence a score but which have nothing to do with the attribute being measured.

Sources of Variability #1
1. Person.
- True level of the trait or construct being measured.
- Variability in the person not connected with the trait (error).
2. Test.
- Content that is related to the trait being measured.
- Errors in content sampling (error).
- Errors in item construction (error).

Sources of Variability #2
3. Test administration.
- Inconsistent test administration (error).
4. Scorer error.
- One scorer is inconsistent (error).
- Two scorers don't give the same assessment to the same behavior (error).

Classical Test Theory
OBSERVED SCORE = TRUE SCORE + ERROR
X = T + e

Types of Reliability Estimates

Test-Retest Reliability (Coefficient of Stability)
Administer Test A -> Wait -> Readminister Test A
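A minimal sketch of how a coefficient of stability could be computed: the test-retest estimate is the Pearson correlation between scores from the two administrations. The helper name `pearson_r` and the six pairs of scores are hypothetical and serve only as illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for six examinees on Test A at two points in time
time_1 = [3, 5, 6, 8, 9, 11]
time_2 = [4, 5, 7, 7, 10, 11]

r_tt = pearson_r(time_1, time_2)
print(round(r_tt, 3))
```

A value near 1 means examinees keep roughly the same rank order across administrations; how long to wait between them remains a judgment call.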

Test-Retest Reliability
[Scatterplot: Time 2 scores (vertical axis, 0 to 12) plotted against Time 1 scores (horizontal axis, 0 to 11)]

What good is it?
What sources of error are detected with test-retest reliability?

Issues with $r_{tt}$
- The interval between time 1 and time 2 is important.
- Subject reactivity.
- Carry-over effect.
- Time consuming.
- Assumes no change in the individual.

Alternate Forms Reliability
Administer Test A -> No significant wait -> Administer Test B

What good is it?
What sources of error are detected with alternate forms reliability?

Issues
- Practice effect.
- Fatigue effect.
- Time delay.
- Back-to-back.
- Interval.

Sweeney's Measure of Verbal Fluency
Use each of the following words correctly in a sentence.
1. Cat
2. House
3. Automobile
4. Phrenologize
5. Coat
6. Marble
7. Dog-flogger
8. Variance
9. Beetle
10. Crayon

3. Internal Consistency
1. Split-half reliability.
2. Kuder-Richardson (KR-20).
3. Coefficient Alpha.

Split-Half Reliability
1. Divide the test into two sub-tests.
2. Correlate the scores on the subtests.
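The two split-half steps can be sketched as follows. This is only one possible approach: it assumes an odd/even split of the items, and it applies the Spearman-Brown formula, the usual correction for the fact that each half is only half as long as the full test. The function names and the item scores are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(item_scores):
    """item_scores: one row per examinee, each row a list of item scores.

    Returns (correlation between the halves, Spearman-Brown corrected value).
    """
    # Step 1: divide the test into two sub-tests (odd vs. even items, an assumption)
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    # Step 2: correlate the scores on the sub-tests
    r_half = pearson_r(odd_half, even_half)
    # Spearman-Brown correction to full test length
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Hypothetical data: five examinees, four items each
scores = [
    [4, 5, 4, 5],
    [2, 3, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
]
r_half, r_full = split_half_reliability(scores)
print(round(r_half, 3), round(r_full, 3))
```

A different split (for example, first half vs. second half) would generally give a different coefficient.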

Issues With Split Half
- Which halves?
- Correcting for length.
- Speeded tests.
- Only a single analysis.

Kuder-Richardson (KR-20)
- Used with test items that can be scored pass-fail.
- Represents the mean of all possible split-half coefficients.
- Expressed in terms of a correlation coefficient.

Coefficient Alpha
Used when:
- There is no pass-fail.
- Multiple responses to an item.
Represents the mean of all possible split-half coefficients.
Expressed in terms of a correlation coefficient.
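As a rough illustration of the two coefficients, the sketch below uses the standard textbook formulas: alpha = k/(k-1) × (1 − Σ item variances / total-score variance), and KR-20 with p × q in place of the item variances. These formulas are not spelled out on the slides, and the function names and pass-fail data are hypothetical. With items scored 0/1 the two functions return the same value.

```python
from statistics import pvariance

def coefficient_alpha(item_scores):
    """Coefficient alpha for rows of item scores (one row per examinee)."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of the total scores
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr20(item_scores):
    """KR-20 for items scored 1 (pass) or 0 (fail)."""
    n = len(item_scores)
    k = len(item_scores[0])
    # p = proportion passing each item; q = 1 - p
    p = [sum(row[j] for row in item_scores) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Hypothetical pass-fail data: six examinees, four items
pass_fail = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(pass_fail), 3), round(coefficient_alpha(pass_fail), 3))
```

Population variances (`pvariance`) are used throughout so that p × q equals each item's variance exactly; this is a choice made for the sketch.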

Extroversion Scale (α = .66)

Item          Corrected Item-Total Correlation   Alpha if Deleted
Quiet         .238                               .657
Shy           .508                               .587
Bold          .451                               .603
Energetic     .541                               .580
Bashful       .327                               .644
Withdrawn     .263                               .655
Talkative     .307                               .643
Extraverted   .251                               .664

What you can learn
- Errors due to content sampling.
- Errors due to heterogeneity of the content domain.
- Scoring errors.

Interrater (Scorer) Reliability
1. Do different scorers give the same evaluation of the same test?
2. Does the same scorer give the same evaluation of the same test?

How to Interpret and Use Reliability Estimates

Interpretation of $r_{tt}$
- Can be interpreted as the % of variance attributable to TRUE SCORE.
- $r_{tt}$ = percent of TRUE SCORE variability in a score.
- $1 - r_{tt}$ = percent of ERROR variability in a score.

Mathematically Speaking
$\sigma^2_{\text{Total}} = \sigma^2_{\text{True Score}} + \sigma^2_{\text{Error}}$
$r_{tt} = \dfrac{\sigma^2_{\text{True Score}}}{\sigma^2_{\text{True Score}} + \sigma^2_{\text{Error}}}$
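The variance ratio can be checked with a small simulation. This is a sketch under stated assumptions: true scores and errors are drawn as independent normal values, and the mean of 50, the standard deviations, the sample size, and the seed are arbitrary illustrative choices. The function names are hypothetical.

```python
import random
from statistics import pvariance

def theoretical_reliability(true_sd, error_sd):
    """r_tt = true-score variance / (true-score variance + error variance)."""
    return true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)

def simulate_reliability(true_sd, error_sd, n=20000, seed=0):
    """Simulate X = T + e and return var(T) / var(X) for the simulated sample."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, true_sd) for _ in range(n)]
    # Observed score = true score + independent random error
    observed = [t + rng.gauss(0, error_sd) for t in true_scores]
    return pvariance(true_scores) / pvariance(observed)

print(theoretical_reliability(3, 1))
print(round(simulate_reliability(3, 1), 3))
```

With a true-score SD of 3 and an error SD of 1, the theoretical value is 9 / (9 + 1) = .90, and the simulated ratio should land close to it.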

[Figure: True Score (vertical axis, 0 to 12) plotted against Obtained Score (horizontal axis, 0 to 11); legend: True Score, Error]

Standard Error of Measurement (SEM)
- An index of the amount of error (inconsistency) in an individual's test score.
- An estimate of the standard deviation of the error in a test.

How to Calculate SEM
$SEM = SD_{test}\sqrt{1 - r_{tt}}$

Example Calculation
Mean = 50, SD = 4, $r_{tt} = .89$
$SEM = 4\sqrt{1 - .89}$
$SEM = 4\sqrt{.11}$
$SEM = 4 \times .33$
$SEM = 1.32$

Confidence Intervals
How to account for the fact that we never measure something exactly.

[Figure: score scale marked at 46.04, 47.36, 48.68, 50, 51.32, 52.64, and 53.96]
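The SEM calculation and a confidence interval around a score can be sketched as below, using the example's SD of 4 and $r_{tt}$ of .89. The function names are hypothetical, and taking the z limit from the standard normal distribution is an assumption of this sketch. Because the slide rounds $\sqrt{.11}$ to .33 before multiplying, it arrives at 1.32; without that intermediate rounding the SEM is about 1.33.

```python
from math import sqrt
from statistics import NormalDist

def standard_error_of_measurement(sd, r_tt):
    """SEM = SD * sqrt(1 - r_tt)."""
    if not 0 <= r_tt <= 1:
        raise ValueError("r_tt must be between 0 and 1")
    return sd * sqrt(1 - r_tt)

def confidence_interval(score, sem, confidence=0.95):
    """Two-sided interval: score +/- z * SEM, with z from the standard normal."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be a proportion between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return score - z * sem, score + z * sem

sem = standard_error_of_measurement(sd=4, r_tt=0.89)
low, high = confidence_interval(50, sem, confidence=0.95)
print(round(sem, 2), round(low, 2), round(high, 2))
```

Passing `confidence=0.90` or `confidence=0.99` widens or narrows the band accordingly.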

Confidence Intervals
[Figure: the central 90% of the distribution lies between z = -1.64 and z = +1.64; the central 95% lies between z = -1.96 and z = +1.96]

Magic Numbers

Confidence Interval   z-score limits
90%                   +/- 1.64
95%                   +/- 1.96
99%                   +/- 2.58

Reliability of Difference Scores
Issue
- Both tests have random error.
- The difference between the two test scores does not take into account the SEM for each test.
- The Standard Error of Difference ($SE_{diff}$) is the estimate of error in difference scores.
- $SE_{diff}$ is greater than either SEM.
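A minimal sketch of the standard error of difference, assuming both tests are expressed on the same score scale (same SD), in which case $SE_{diff} = SD\sqrt{2 - r_{11} - r_{22}}$. The function name is hypothetical; the inputs (SD = 8, both reliabilities .90) are the values used in the example that follows.

```python
from math import sqrt

def standard_error_of_difference(sd, r_11, r_22):
    """SE_diff = SD * sqrt(2 - r_11 - r_22), for two tests with the same SD."""
    return sd * sqrt(2 - r_11 - r_22)

def standard_error_of_measurement(sd, r_tt):
    """SEM = SD * sqrt(1 - r_tt), shown for comparison."""
    return sd * sqrt(1 - r_tt)

se_diff = standard_error_of_difference(sd=8, r_11=0.90, r_22=0.90)
sem = standard_error_of_measurement(sd=8, r_tt=0.90)
print(round(se_diff, 2), round(sem, 2))
```

The comparison makes the last point concrete: the error in a difference score combines the error from both tests, so it is larger than the SEM of either test alone.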

$SE_{diff}$ Example
PDQ Test of Conceptual Flexibility
Mean = 40, SD = 8, $r_{tt} = .90$
$SE_{diff} = SD\sqrt{2 - r_{11} - r_{22}}$
$SE_{diff} = 8\sqrt{.2}$
$SE_{diff} = 3.58$
$SEM = 2.53$

How to Enhance Reliability
- Increase test length.
- Remove inconsistent items.
- Correct for attenuation.
- Standardize the scoring system.
- Live with it.

THE END