Performance evaluation of I3S on whale shark data

Date: 25 January 2012
Version: 0.3
Authors: Jurgen den Hartog & Renate Reijns (i3s@reijns.com)

Introduction

I3S (Classic, v2.0) has been used since 2007 by many whale shark researchers and is also part of the recognition strategy used by Ecocean. I3S was initially evaluated on a ragged tooth shark data set [1]; a first evaluation on a whale shark database was carried out by Conrad Speed [2]. As the research community relies more and more on (semi-)automatic identification, it is important to validate I3S on larger amounts of whale shark data from various sources. This short study had the following goals:

- establishing the actual recognition performance of I3S on whale shark data;
- establishing the effect of the number of reference images per shark on performance;
- analysing the causes of poor match quality;
- making recommendations for improvement of I3S, the manual, or the annotation process.

This document describes the experiments that were carried out, the results and analysis, and the main conclusions and recommendations.

Data sets

For the analysis the following annotated and identified data sets were used:

Area         Nr. of individuals   Annotated images (left and right sides combined)
Maldives     168                  536
Seychelles   64                   501
Djibouti     258                  604
Totals       490                  1641

Experiments

The set-up of the experiments is identical to that of the experiments for the ragged tooth sharks; see [1] for details. First, we randomly divided the data set into a reference set and a test set. The left and right sides of each shark were treated as two different sharks (I3S can technically compare a left side with a right side and vice versa), effectively doubling the number of individuals in the experiment to 980. The advantage is that this gives a performance indication for a much larger database. The assumption is that there are no overlaps between the data sets; in earlier experiments such overlaps were unfortunately never found, so this seems valid [3] (assuming those data sets are similar or identical to the ones used here).

Prior to each experiment the number of reference images per shark was fixed. For example, with two reference images, two images were randomly selected from each shark directory for the reference set, and all remaining images (if any) were placed in the test set. If a directory contained only one image, that single image was put in the reference set.
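
For illustration, the sketch below shows how such a per-shark split into reference and test images could be implemented in Python. The directory layout (one directory per shark side, each holding that side's annotated images), the file handling, and all names are assumptions made for this sketch, not part of I3S itself.

import random
from pathlib import Path

def split_reference_test(root: Path, n_ref: int, seed: int = 0):
    """Randomly select n_ref reference images per shark directory; the rest become test images."""
    rng = random.Random(seed)
    reference, test = {}, {}
    for shark_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images = sorted(p for p in shark_dir.iterdir() if p.is_file())
        rng.shuffle(images)
        if len(images) <= n_ref:
            # Too few images: everything goes into the reference set, nothing can be tested.
            reference[shark_dir.name], test[shark_dir.name] = images, []
        else:
            reference[shark_dir.name] = images[:n_ref]
            test[shark_dir.name] = images[n_ref:]
    return reference, test

# Example: two reference images per shark side, as in Table 2.
# reference, test = split_reference_test(Path("maldives_left"), n_ref=2)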

Next, each image from the test set was compared against the entire reference set, and we kept track of whether the top 1, top 3, top 5, top 10, or top 20 ranked reference images contained at least one image of the corresponding shark. After comparing all test images we obtain the overall performance statistics. Since this experiment is sensitive to the selection of images in the reference set, each experiment was typically repeated 100 times to average out random effects.

Results

Below are three tables, one for each number of reference images per shark (1, 2, or 3). The columns correspond to the databases used in the experiments: each database separately, the combination of Maldives and Seychelles (Mal+Sey), and all three databases combined (Mal+Sey+Dji). The first three rows give the number of times the experiment was repeated (iterations, typically 100), the ratio between the number of reference images and the number of test images, and the total number of tests carried out. For example, in the first table the number of tests for the Maldives database is 23800; as the experiment was repeated 100 times, there were 238 images in the test set, and with a ref/test ratio of 1.23 the number of reference images was 238 x 1.23 ≈ 293. The remaining five rows contain the actual recognition performance statistics. For example, for the Seychelles with one reference image, the top 5 contained the corresponding reference image in 31038 out of 37300 tests, i.e. 31038 / 37300 = 83.2%.

1 ref            Maldives         Seychelles       Djibouti         Mal+Sey          Mal+Sey+Dji
Iterations       100              100              50               100              100
Ref/test ratio   1.23             0.34             0.37             0.69             0.48
# Tests          23800            37300            59450            61100            180000
Top 1            18870 (79.3%)    27995 (75.1%)    40894 (68.8%)    45177 (73.9%)    123799 (68.8%)
Top 3            20262 (85.1%)    30275 (81.2%)    45087 (75.8%)    49172 (80.5%)    136585 (75.9%)
Top 5            20769 (87.3%)    31038 (83.2%)    46383 (78.0%)    50543 (82.7%)    140614 (78.1%)
Top 10           21315 (89.6%)    32196 (86.3%)    48061 (80.8%)    52302 (85.6%)    145554 (80.9%)
Top 20           21873 (91.9%)    33452 (89.7%)    49822 (83.8%)    53950 (88.3%)    150294 (83.5%)

Table 1: Recognition performance with 1 reference image.

2 refs           Maldives         Seychelles       Djibouti         Mal+Sey          Mal+Sey+Dji
Iterations       100              100              50               100              100
Ref/test ratio   4.53             0.99             0.81             1.97             1.14
# Tests          9600             25200            44800            34800            124400
Top 1            8730 (90.9%)     22048 (87.5%)    37957 (84.7%)    30430 (87.4%)    104387 (83.9%)
Top 3            9134 (95.1%)     23153 (91.9%)    40312 (90.0%)    32095 (92.2%)    111022 (89.2%)
Top 5            9245 (96.3%)     23393 (92.8%)    40864 (91.2%)    32588 (93.6%)    112803 (90.7%)
Top 10           9267 (96.5%)     23735 (94.2%)    41597 (92.9%)    33222 (95.5%)    114762 (92.3%)
Top 20           9364 (97.5%)     24157 (95.9%)    42277 (94.4%)    33614 (96.6%)    116510 (93.7%)

Table 2: Recognition performance with 2 reference images.

3 refs           Maldives         Seychelles       Djibouti         Mal+Sey          Mal+Sey+Dji
Iterations       100              100              50               100              100
Ref/test ratio   n/a              2.00             1.39             5.18             2.13
# Tests          n/a              16700            34050            16700            84800
Top 1            n/a              15433 (92.4%)    30823 (90.5%)    15176 (90.9%)    75824 (89.4%)
Top 3            n/a              15974 (95.7%)    32087 (94.2%)    15846 (94.9%)    79241 (93.4%)
Top 5            n/a              16056 (96.1%)    32450 (95.3%)    15981 (95.7%)    80177 (94.5%)
Top 10           n/a              16142 (96.7%)    32763 (96.2%)    16061 (96.2%)    81106 (95.6%)
Top 20           n/a              16238 (97.2%)    33036 (97.0%)    16187 (96.9%)    81807 (96.5%)

Table 3: Recognition performance with 3 reference images. There are no results for the Maldives database because there were no individuals with four or more images per side.

Analysis

Individual databases

First we look at the individual databases; there are considerable differences. With only one reference image, the probability that the proper reference image is not in the top 20 is 8.1% (= 100% - 91.9%) for the Maldives database; for the Djibouti database it is 16.2%, twice as much. With two reference images the figures are 2.5% and 5.6% respectively, again about a factor of two. In general the Djibouti database has about twice the non-match rate of the Maldives database. The differences between the Maldives and the Seychelles databases are a little less prominent but still significant: by the same comparison, the Seychelles database has about 60% more non-matches than the Maldives database. From the analysis, the best explanation seems to be a difference in annotation quality; this issue is addressed further on.

Number of reference images

Looking at the differences between the three tables, it is very clear that recognition performance improves dramatically with more images in the reference database. Compared to a single reference image, the probability of having the right shark at rank 1 increases from 69% to 89% with three reference images; for the top 20, the rate increases from 83.5% to 96.5%. Unfortunately we did not have enough data to test four reference images.

Database size

I3S does not seem to be very sensitive to database size. Extending the Seychelles and Maldives databases with the Djibouti database has a very limited effect on performance, and this effect is more likely explained by annotation quality than by database size.

Problematic images

Next, we looked at the problematic images. During the experiments we stored which test images did not have a correct reference image in the top 20. Since each experiment was typically repeated 100 times, we could also count how often the same test image failed to have its reference image in the top 20 (a short sketch of this bookkeeping is given at the end of this subsection); the images with the highest counts were considered problematic. Almost 50% of the problematic images were caused by incorrect annotation or other human errors. Examples of human errors, in order of importance, are:

1. an insufficient number of spots annotated;
2. spot annotation in the wrong area;
3. placing the second control point ("edge pectoral") at the wrong position (Figure 1);
4. putting files in the wrong directory;
5. switching the second and the third control points.

As the Djibouti images dominated the list of test images, and therefore also the list of problematic images, the analysis of problematic images was repeated for only the Seychelles and Maldives databases. It turned out that 40% of the images analysed still contained errors.
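
For illustration, the bookkeeping behind the tables above and behind this per-image failure count could look as follows in Python. The ranking input (for every test image, the reference shark identities ordered by I3S match score) and all names are assumptions made for this sketch, not I3S's actual implementation.

from collections import Counter
from typing import Dict, List

TOP_KS = (1, 3, 5, 10, 20)

def evaluate_iteration(rankings: Dict[str, List[str]],
                       true_id: Dict[str, str],
                       hits: Counter,
                       top20_misses: Counter) -> int:
    """Update top-k hit counts and the per-image top-20 failure tally for one iteration.

    rankings: for each test image, reference shark IDs ordered from best to worst match score.
    true_id:  the known shark ID of each test image.
    Returns the number of tests performed in this iteration.
    """
    for image, ranked_ids in rankings.items():
        try:
            rank = ranked_ids.index(true_id[image]) + 1   # rank of the correct shark, 1-based
        except ValueError:
            rank = None                                   # correct shark not returned at all
        for k in TOP_KS:
            if rank is not None and rank <= k:
                hits[k] += 1
        if rank is None or rank > 20:
            top20_misses[image] += 1                      # tally used to spot problematic images
    return len(rankings)

# After accumulating over (typically) 100 random reference/test splits,
# hits[k] / total_tests gives the percentages shown in Tables 1-3, and the
# images with the highest counts in top20_misses are the problematic ones.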

In this subset, too, the dominant errors were the location of the second control point and the number of spots annotated; these are discussed individually below. The other errors were caused mainly by a poor angle (large deviations from a perpendicular view) or by murky water, which make it very hard to distinguish the location of the control points or even the spots. Furthermore, I3S is quite sensitive to the location of the control points: even small deviations in their location may have a significant negative impact on the match score.

Inconsistent annotation

The probability of inconsistent annotation increases strongly with the number of people annotating. When multiple people are involved, it makes sense to ensure they annotate in the same way. For example, new researchers could first annotate a known test collection, which is then compared with a known reference set; if the scores on the test collection fall below a certain threshold, the annotation evidently does not conform to the standard. The same principle applies to exchange between research groups. The Maldives research group appears to select (to some extent) different spots than the Seychelles and Djibouti researchers, and the location of the second control point also varies somewhat between groups. A recommended test is to annotate each other's known images and check whether the scores are similar. For example, the Seychelles research group annotates 20 Maldivian images; these annotated images are then compared with the Maldivian database to see whether the scores are similar. If there are large deviations, further standardization is clearly required before I3S can be used to find overlaps between the various databases.

Second control point

It turns out that the second control point ("edge pectoral") is in many cases hard to pinpoint consistently, especially if the edge of the pectoral fin is not touching the body (e.g. Figure 1). In many cases the control point is placed too far back (Figure 2). Special attention should be paid to the location of this control point.

Number of spots annotated

In some cases it is hard to find sufficient spots (I3S has a built-in minimum of 12) on a whale shark. The best advice is to use the entire rectangle spanned by the control points. The analysis showed that in many cases a problematic image only contained annotated spots in the triangle spanned by the control points; enlarging the area to the full rectangle doubles the annotated area and in general makes recognition much easier. If this means that the rectangle extends a bit behind the second control point, this is not a problem. See Figure 3 and Figure 4.

Figure 1: Example of a photo where the "edge pectoral" is hard to locate.
Figure 2: Control point 2 is placed inaccurately.

Figure 3: Example with not enough spots annotated.
Figure 4: Improved annotation. Arrows indicate the changes.
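
Images such as the one in Figure 3 could in principle be flagged automatically. As a rough heuristic only (the point-in-triangle test and the 90% threshold below are assumptions for illustration, not how I3S defines the annotation area), one could check whether nearly all annotated spots fall inside the triangle spanned by the three control points:

from typing import List, Tuple

Point = Tuple[float, float]

def _sign(p: Point, a: Point, b: Point) -> float:
    """Twice the signed area of triangle (a, b, p); its sign tells on which side of line a-b point p lies."""
    return (p[0] - b[0]) * (a[1] - b[1]) - (a[0] - b[0]) * (p[1] - b[1])

def in_triangle(p: Point, a: Point, b: Point, c: Point) -> bool:
    """True if point p lies inside (or on the edge of) triangle a-b-c."""
    d1, d2, d3 = _sign(p, a, b), _sign(p, b, c), _sign(p, c, a)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def spots_only_in_triangle(spots: List[Point], controls: List[Point],
                           threshold: float = 0.9) -> bool:
    """Flag an image whose annotated spots almost all fall inside the triangle spanned by
    the three control points, suggesting the full rectangle was not used for annotation."""
    a, b, c = controls
    inside = sum(in_triangle(p, a, b, c) for p in spots)
    return inside / max(len(spots), 1) >= threshold

Such a check could be one ingredient of the screening tool mentioned in the recommendations below.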

I3S still assumes that a shark is a 2D object. For this reason it is best to limit the spot area to the part of the body that best fits a 2D model. In practice this means not selecting spots above the horizontal line marked by the first control point (top of the fifth gill) or below the second control point; beyond these lines the shark's body starts to curve strongly, resulting in large matching errors. See Figure 5 for an example. Unfortunately, it is sometimes necessary to use such spots to reach the minimum of 12 spots. A next experiment could test whether the minimum number of spots can be reduced to 9 or 10; we have improved the recognition to better handle fewer annotated spots.

Figure 5: Arrows indicate spots above the horizontal line which are better not used for annotation.

Conclusions and recommendations

Conclusions

Provided that there are three reference images, a top 1 score has a probability of at least 91%. With more consistent annotation it could become significantly higher. The probability of a top 20 match is 97% (with a database size of almost 1000 individuals). Performance will most likely deteriorate only slowly as the database size increases.

Recommendations

- Try to have the best three images of each shark side in the reference database.
- Reduce the minimum number of spots in I3S to 10 (a trivial software change, directly available on request for testing).
- If multiple people annotate the images, have all of them annotate a standard set and compare the results against the standard.
- Compare annotation standards between research groups in the same way before relying on I3S to find overlaps.
- Investigate improvement of the algorithm to make it less sensitive to small changes in the location of the control points. Inconsistent annotation explains 40 to 50% of all mismatches; specifically, the location of the second control point and the area used for annotation appear to be error prone.
- Use the entire rectangle spanned by the control points for annotation. Extending the area a bit to the back should not pose problems.
- Do not annotate spots above the horizontal line through the first control point or below the horizontal line through the second control point, where the body curvature increases.

A final recommendation: please do not hesitate to involve us in your research. For example, if the research community recognizes the need for more annotation standardization, we could provide a tool that indicates which images are most likely to require changes to their annotation. If databases need to be compared with each other, it is fairly easy to develop a tool that does this automatically and provides a ranked list of the best possible matches.

References

[1] Van Tienhoven, A.M., Den Hartog, J.E., Reijns, R.A. & Peddemors, V.M. (2007). A computer-aided program for pattern-matching natural marks on the spotted raggedtooth shark Carcharias taurus (Rafinesque, 1810). Journal of Applied Ecology 44, 273-280.
[2] Speed, C.W., Meekan, M.G. & Bradshaw, C.J.A. (2007). Spot the match - wildlife photo-identification using information theory. Frontiers in Zoology 4:2. DOI: 10.1186/1742-9994-4-2.
[3] Brooks, K., Rowat, D., Pierce, S.J., Jouannet, D. & Vely, M. (2010). Seeing Spots: Photo-identification as a Regional Tool for Whale Shark Identification. Western Indian Ocean Journal of Marine Science 9(2), 185-194.