Overview of ITU-R BS.1534 (The MUSHRA Method)
Dr. Gilbert Soulodre
Advanced Audio Systems
Communications Research Centre
Ottawa, Canada
gilbert.soulodre@crc.ca
Recommendation ITU-R BS.1534
Method for the subjective assessment of intermediate quality level of coding systems
Quick Intro
MUSHRA - MUlti-Stimulus Hidden Reference and Anchor
- Multi-Stimulus: Listeners have instant random access to each of the test items and the reference signal.
- Hidden Reference: One of the test items is a copy of the reference signal.
- Anchor: One of the test items must be a version of the reference signal low-pass filtered at 3.5 kHz.
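As a rough illustration of the anchor described above, the following sketch low-pass filters a signal at 3.5 kHz using only the Python standard library. The filter design (a 101-tap Hamming-windowed sinc), the 48 kHz sample rate, and the helper names `lowpass_fir`, `apply_fir`, `tone`, and `rms` are all assumptions made for this example; the slides do not say how the anchor filter should be implemented.

```python
import math

def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Design a Hamming-windowed-sinc low-pass FIR filter (illustrative choice)."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response, with the x == 0 limit handled explicitly
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]          # normalize to unity gain at DC

def apply_fir(signal, taps):
    """Direct-form convolution; output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * signal[i - k]
        out.append(acc)
    return out

def tone(freq_hz, sample_rate_hz, num_samples):
    """Unit-amplitude sine tone, used here only as a stand-in for a reference signal."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(num_samples)]

def rms(samples):
    return math.sqrt(sum(v * v for v in samples) / len(samples))

# Hypothetical usage: build the anchor from a reference sampled at 48 kHz
sample_rate = 48000
anchor_taps = lowpass_fir(3500, sample_rate)
reference = tone(1000, sample_rate, 2000)    # stand-in for real programme material
anchor = apply_fir(reference, anchor_taps)
```

In practice a signal-processing library would normally be used for the filtering; the pure-Python version is only meant to make the idea of a bandlimited anchor concrete.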
Background (yet another standard...)
- ITU-R BS.1116 was developed to assess the performance of high-quality perceptual audio codecs. Artifacts were expected to be small and difficult to hear.
- Internet-quality codecs create clearly audible impairments. A new method for evaluating their performance in a rigorous fashion was required.
- ITU-R BS.1534 was developed based on a multi-stimulus method devised by Soulodre for comparing signals and systems with clearly audible differences.
BS.1534 versus BS.1116
Both methods try to estimate worst-case performance.
BS.1534 tries to keep the parts of BS.1116 that are most effective, while dealing with clearly audible differences.
BS.1534 is the same as BS.1116 for
- listening environment (noise and reverberation)
- reproduction system
- training of listeners
- selection and use of critical audio material
- high-quality reference signal
Main Differences
BS.1116
- Double-blind triple-stimulus with hidden reference
- 5-point impairment scale
- Detection and grading process
BS.1534
- Multi-stimulus with hidden reference
- Continuous quality scale
- Sorting and grading process
TEST: Rank According to Size
Results
Item          Rank   Diameter
Golf ball     1      4.3 cm
Baseball      2      7.3
Soccer ball   3      22
Basketball    4      24
The Moon      5      3.5 Million
- Ranking does not provide any information about the size of the relative differences.
- Want to get as much information from subjects as possible.
- Also, what if the actual Moon was used as the reference? Would differences between the balls be noticeable?
- Consider how you just performed the ranking task.
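The point about ranking can be made concrete with a small sketch. It uses the four ball diameters from the table (taken to be in centimetres) and leaves out the Moon; the function names `rank_by_size` and `adjacent_ratios` are made up for this illustration.

```python
# Diameters of the four balls from the table, assumed to be in cm
diameters_cm = {
    "Golf ball": 4.3,
    "Baseball": 7.3,
    "Soccer ball": 22.0,
    "Basketball": 24.0,
}

def rank_by_size(sizes):
    """Return a rank (1 = smallest) for each item."""
    ordered = sorted(sizes, key=sizes.get)
    return {name: position + 1 for position, name in enumerate(ordered)}

def adjacent_ratios(sizes):
    """Size ratio between each item and the next larger one."""
    ordered = sorted(sizes.values())
    return [larger / smaller for smaller, larger in zip(ordered, ordered[1:])]

ranks = rank_by_size(diameters_cm)      # ranks are always one step apart
ratios = adjacent_ratios(diameters_cm)  # the underlying size steps are not
```

Every step in rank is exactly 1, while the size ratios between neighbours range from about 1.1 to about 3. That difference is the information a pure ranking task throws away and a grading task keeps.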
Choice of Reference Signal
- Subjects need a reference in order to know the best-case or benchmark performance.
- The choice of reference signal is critical to the outcome of the experiment.
- When evaluating impairments (distortions) it is important to use a high-quality reference signal.
- Using a degraded reference (e.g. bandlimited) will introduce biases.
- The difficulty for the listeners is that they must compare apples and oranges.
Perceptual Distance
- Large perceptual distance between the reference and the test items.
- Small perceptual distance between the various test items.
[Figure: quality axis showing the reference (Ref) and test items A and B, with distances d_A, d_B and d_AB]
Solution: Double-blind Multi-Stimulus Method
Controls for the multi-stimulus method
- Provides subjects with random access to the test items.
- Subjects tend to rank (sort) and then grade the test items.
- Get the benefits of both paired comparisons and grading!
Evaluating Methodologies
- In a BS.1116 test the test items are presented to the subjects sequentially.
- In a BS.1534 test the subjects have random access to the test items.
- A formal subjective test was conducted to compare the performance of the two methods, to evaluate the consistency of the grades given by subjects, and to evaluate the degree of resolution provided.
- Highly controlled impairments were applied to a source signal.
- Broadband random (white) noise was increased systematically in 2 dB increments (9 levels of noise).
- The systematic increase of the noise level allows the relative qualities of the signals to be measured objectively and indisputably.
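The controlled impairment described above (white noise raised in 2 dB steps over 9 levels) could be generated along the following lines. This is a sketch only: the base noise level, the use of Gaussian noise, and the function names are assumptions, since the slides give no detail about how the test stimuli were produced.

```python
import math
import random

def noise_gains(num_levels=9, step_db=2.0):
    """Linear amplitude gains for noise raised in equal dB steps, starting at 0 dB."""
    return [10 ** (level * step_db / 20) for level in range(num_levels)]

def relative_level_db(gain):
    """Convert a linear amplitude gain back to dB."""
    return 20 * math.log10(gain)

def add_white_noise(signal, base_noise_rms, gain, rng):
    """Add Gaussian white noise with RMS equal to base_noise_rms * gain."""
    sigma = base_noise_rms * gain
    return [sample + rng.gauss(0.0, sigma) for sample in signal]

# Hypothetical usage: nine impaired versions of one source signal
rng = random.Random(0)                      # fixed seed so the stimuli are repeatable
source = [0.0] * 1000                       # stand-in for the real source signal
base_noise_rms = 0.001                      # assumed lowest noise level
stimuli = [add_white_noise(source, base_noise_rms, g, rng) for g in noise_gains()]
```

With 9 levels and a 2 dB step, the relative noise level runs from 0 dB to 16 dB.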
Results of Noise Impairment Tests
[Figure: mean subjective grade (5 = Imperceptible down to 1 = Very annoying / Unacceptable) versus relative noise level (0 to 16 dB), for Random Access and Sequential Access]
- 5 subjects performed the test using both methods.
- Button assignment was randomized for each subject.
- Both methods give a monotonic decrease in grades with increasing noise level.
Results of Noise Impairment Tests
[Figure: two panels, Sequential Access and Random Access, each showing mean subjective grade (5 = Imperceptible down to 1 = Very annoying / Unacceptable) versus relative noise level (0 to 16 dB), with error bars]
- Error bars indicate critical differences.
- The Random Access method gives finer resolution.
- Its error bars are half the size of those for the Sequential Access method.
Subject Consistency
[Figure: two panels, Sequential Access and Random Access, each showing subjective grade (5 = Imperceptible down to 1 = Very annoying / Unacceptable) versus relative noise level (0 to 16 dB)]
- Sequential Access method: grades are not monotonic.
- Random Access method: grades decrease monotonically with increasing noise levels.
- The Random Access method provides greater consistency.
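One simple way to express the consistency check described above is to test whether a subject's grades ever rise as the noise level rises. The sketch below is an assumption about how such a check might be coded, not the analysis used in the study, and the grades in the usage lines are invented purely to show the two outcomes.

```python
def is_monotonic_decreasing(grades):
    """True if grades never increase from one noise level to the next."""
    return all(earlier >= later for earlier, later in zip(grades, grades[1:]))

def count_reversals(grades):
    """Number of places where a noisier item received a higher grade."""
    return sum(1 for earlier, later in zip(grades, grades[1:]) if later > earlier)

# Invented grades for nine noise levels (lowest noise first), for illustration only
consistent_subject = [5.0, 4.8, 4.5, 4.0, 3.6, 3.0, 2.5, 2.0, 1.4]
inconsistent_subject = [5.0, 4.6, 4.9, 4.0, 3.2, 3.5, 2.4, 2.0, 1.5]

consistent_ok = is_monotonic_decreasing(consistent_subject)
inconsistent_ok = is_monotonic_decreasing(inconsistent_subject)
reversals = count_reversals(inconsistent_subject)
```

Because the noise level is increased systematically, any reversal is an inconsistency in the subject's grading rather than a property of the stimuli.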
Anchors
- A true BS.1534 test requires that a 3.5 kHz low-pass filtered version of the reference signal be included as a test item.
- Additional anchors can be included.
- Intended to allow the results from different experiments to be scaled and compared.
A bad idea!
- Scaling between experiments introduces bias.
- Limits flexibility in experimental design.
- Also, the anchors probably introduce bias by drawing the subject's attention to a specific type of impairment (bandlimiting).
- Anchors make no sense when testing systems where bandlimiting is not an issue.
Scales
[Figure: the BS.1116 and BS.1534 scales side by side. BS.1116: impairment scale from 1.0 to 5.0 in steps of 0.1, labeled Imperceptible, Perceptible but Not Annoying, Slightly Annoying, Annoying, Very Annoying. BS.1534: continuous quality scale labeled Excellent, Good, Fair, Poor, Bad.]
- BS.1534 uses a relative scale, so grades depend on context.
- Results cannot be compared between experiments.
Taking Care of Details
It is very important to take care of the details when conducting a subjective test:
a) conduct a pilot test
b) randomize the buttons and trials for each subject
c) training must be done properly
- subjects must hear ALL of the test sequences,
- subjects should hear the full range of impairments,
- there should be no surprises during the blind rating phase.
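Item b) above, randomizing the buttons and trials for each subject, might be sketched as follows. The function name `build_session`, the trial and condition names, and the per-subject seeding scheme are all hypothetical; they are not taken from any particular test software.

```python
import random

def build_session(subject_id, trials, seed=0):
    """Return a per-subject trial order with a shuffled button assignment.

    trials maps a trial name to its list of conditions (including the
    hidden reference). The result is a list of (trial name, {button: condition}).
    """
    # Seed from the subject ID so each subject gets a different but repeatable order
    rng = random.Random(f"{seed}-{subject_id}")
    trial_order = list(trials)
    rng.shuffle(trial_order)
    session = []
    for name in trial_order:
        conditions = list(trials[name])
        rng.shuffle(conditions)
        buttons = {chr(ord("A") + i): cond for i, cond in enumerate(conditions)}
        session.append((name, buttons))
    return session

# Hypothetical test plan with two trials and four conditions each
test_plan = {
    "speech": ["hidden_reference", "anchor", "codec_X", "codec_Y"],
    "music": ["hidden_reference", "anchor", "codec_X", "codec_Y"],
}
session_s01 = build_session("s01", test_plan)
```

Keeping the mapping from buttons to conditions in the returned structure matters, because the grades collected per button have to be mapped back to conditions before analysis.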
Why not always use BS.1534?
- The multi-stimulus method of BS.1534 has many advantages when conducting subjective tests.
- It tends to provide more consistent inter-subject data due to the sorting and grading process.
- So why not use the multi-stimulus method for all subjective tests (instead of BS.1116)?
ANSWER: With the multi-stimulus method, you can't tell if a subject is guessing!
- Need to use BS.1116 if differences between the reference and the test items are hard to hear.
When to use the multi-stimulus method?
- When the differences between the reference signal and the test items are clearly audible.
- Not limited to clearly audible impairments.
- When comparing systems/sounds that are very different from each other (i.e. when comparing apples and oranges).
- We've used it to evaluate inverse filtering algorithms, and the perception of envelopment in multichannel surround.
- You may already be using it (solo buttons on a mixer)!
- Choose a scale that makes sense for your test.
- Leave out the anchors.