Performance evaluation of I3S on whale shark data

Date: 25 January 2012
Version: 0.3
Authors: Jurgen den Hartog & Renate Reijns (i3s@reijns.com)

Introduction

I3S (Classic, v2.0) has been used since 2007 by many whale shark researchers. It is also part of the recognition strategy used by Ecocean. I3S was initially evaluated on a ragged tooth shark data set [1]. A first evaluation on a whale shark database was carried out by Conrad Speed [2]. As the research community relies more and more on (semi-)automatic identification, it is important to validate I3S on larger amounts of whale shark data from various sources. This short study had the following goals:

1. Establishing the actual recognition performance of I3S on whale shark data.
2. Establishing the effect of the number of reference images per shark on performance.
3. Analysing the causes of poor match quality.
4. Making recommendations for improvement of I3S, the manual, or the annotation process.

This document describes the experiments which were carried out, the results and analysis, and the main conclusions and recommendations.

Data sets

For the analysis the following annotated and identified data sets were used:

Area         Nr. of individuals   Nr. of annotated images (left and right side combined)
Maldives     168                  536
Seychelles    64                  501
Djibouti     258                  604
Totals       490                  1641

Experiments

The set-up of the experiments is identical to that used for the ragged tooth sharks; see [1] for details. Basically, we randomly divided the data set into a reference set and a test set. The left and the right side of each shark were regarded as two different sharks (I3S can technically compare a left side with a right side and vice versa), effectively doubling the number of individuals in the experiment to 980. The advantage is that this gives a performance indication for a much larger database. The assumption is that there are no overlaps between the data sets.
In earlier experiments such overlaps were never found, so this assumption seems valid [3] (we assume the data sets are similar or identical to those used there). Prior to each experiment the number of reference images per shark was set. For example, with two reference images per shark, two images were randomly selected from each shark directory for the reference set. All remaining images (if any) were then assigned to the test set. If a directory contained only one image, that image was assigned to the reference set.
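The reference/test split described above can be sketched as follows. This is a minimal illustration, not I3S code; the dictionary-of-lists representation of the per-shark directories and the fixed random seed are assumptions made for the example.

```python
import random

def split_reference_test(shark_images, n_ref, seed=None):
    """Randomly split each shark's images into a reference set and a test set.

    shark_images: dict mapping a shark (side) identifier to its list of
    annotated image names, mirroring the per-shark directories used in the
    experiments. n_ref: number of reference images per shark.
    """
    rng = random.Random(seed)
    reference, test = {}, {}
    for shark, images in shark_images.items():
        if len(images) <= n_ref:
            # Too few images: all go to the reference set and the shark
            # contributes no test images (as with single-image directories).
            reference[shark] = list(images)
            test[shark] = []
        else:
            refs = rng.sample(images, n_ref)
            reference[shark] = refs
            test[shark] = [img for img in images if img not in refs]
    return reference, test

# Toy example with two reference images per shark:
data = {"shark_A_left": ["a1.jpg", "a2.jpg", "a3.jpg"],
        "shark_B_left": ["b1.jpg"]}
ref, tst = split_reference_test(data, n_ref=2, seed=42)
```

Repeating this split with different seeds gives the independent iterations used to average out random selection effects.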
Next, each image from the test set was compared against the entire reference set, and we kept track of whether the top 1, top 3, top 5, top 10, or top 20 results contained at least one of the corresponding images from the reference set. After comparing all test images we obtain the overall performance statistics. Since this experiment is sensitive to the selection of images in the reference set, we repeated each experiment (typically 100 times) to average out random effects.

Results

Below are three tables. The top row contains the number of reference images (1, 2 or 3) and the databases used in the experiments. We tested all databases separately, the combination of Maldives + Seychelles, and all three databases combined. The first column gives the number of times the experiment was repeated (iterations, typically 100), the ratio between the number of reference images and the number of test images, and the total number of tests carried out. For example, in the first table the Maldives database has # tests = 23800. As the experiment was repeated 100 times, there were 238 images in the test set. As the ref/test ratio is 1.23, the number of reference images was 238 x 1.23 = 293. The remaining five rows contain the actual recognition performance statistics. For example, for the Seychelles with 1 reference image, in 31038 out of 37300 tests the top 5 contained the corresponding reference image: 31038 / 37300 = 83.2%.
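The top-k bookkeeping described above can be sketched as follows. This is illustrative only: the score function is a stand-in for I3S's actual matcher, and the assumption that a lower score means a better match is ours, not taken from the I3S documentation.

```python
def topk_hit_rates(test_items, reference_ids, score_fn, ks=(1, 3, 5, 10, 20)):
    """For each test image, rank all reference sharks by match score
    (lower = better, by assumption) and record for each k whether the
    true identity appears in the top k."""
    hits = {k: 0 for k in ks}
    total = 0
    for true_id, test_image in test_items:
        # Rank every reference shark by its score against this test image.
        ranked = sorted(reference_ids, key=lambda rid: score_fn(test_image, rid))
        total += 1
        for k in ks:
            if true_id in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / total for k in ks}, total

# Toy scores: imgA matches shark "a" best; imgC's true shark "c" ranks second.
scores = {
    ("imgA", "a"): 0.1, ("imgA", "b"): 0.5, ("imgA", "c"): 0.9,
    ("imgC", "a"): 0.2, ("imgC", "b"): 0.8, ("imgC", "c"): 0.4,
}
rates, total = topk_hit_rates(
    test_items=[("a", "imgA"), ("c", "imgC")],
    reference_ids=["a", "b", "c"],
    score_fn=lambda img, rid: scores[(img, rid)],
    ks=(1, 3),
)
```

Summing the hit counts over all iterations and dividing by the total number of tests gives exactly the percentages reported in the tables below.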
1 ref           Maldives        Seychelles      Djibouti        Combined Mal+Sey  Combined Mal+Sey+Dji
Iterations      100             100             50              100               100
Ref/test ratio  1.23            0.34            0.37            0.69              0.48
# Tests         23800           37300           59450           61100             180000
Top 1           18870  79.3%    27995  75.1%    40894  68.8%    45177  73.9%      123799  68.8%
Top 3           20262  85.1%    30275  81.2%    45087  75.8%    49172  80.5%      136585  75.9%
Top 5           20769  87.3%    31038  83.2%    46383  78.0%    50543  82.7%      140614  78.1%
Top 10          21315  89.6%    32196  86.3%    48061  80.8%    52302  85.6%      145554  80.9%
Top 20          21873  91.9%    33452  89.7%    49822  83.8%    53950  88.3%      150294  83.5%

Table 1: Recognition performance with 1 reference image.

2 refs          Maldives        Seychelles      Djibouti        Combined Mal+Sey  Combined Mal+Sey+Dji
Iterations      100             100             50              100               100
Ref/test ratio  4.53            0.99            0.81            1.97              1.14
# Tests         9600            25200           44800           34800             124400
Top 1           8730   90.9%    22048  87.5%    37957  84.7%    30430  87.4%      104387  83.9%
Top 3           9134   95.1%    23153  91.9%    40312  90.0%    32095  92.2%      111022  89.2%
Top 5           9245   96.3%    23393  92.8%    40864  91.2%    32588  93.6%      112803  90.7%
Top 10          9267   96.5%    23735  94.2%    41597  92.9%    33222  95.5%      114762  92.3%
Top 20          9364   97.5%    24157  95.9%    42277  94.4%    33614  96.6%      116510  93.7%

Table 2: Recognition performance with 2 reference images.

3 refs          Maldives        Seychelles      Djibouti        Combined Mal+Sey  Combined Mal+Sey+Dji
Iterations      100             100             50              100               100
Ref/test ratio  N/A             2.00            1.39            5.18              2.13
# Tests         N/A             16700           34050           16700             84800
Top 1           N/A             15433  92.4%    30823  90.5%    15176  90.9%      75824   89.4%
Top 3           N/A             15974  95.7%    32087  94.2%    15846  94.9%      79241   93.4%
Top 5           N/A             16056  96.1%    32450  95.3%    15981  95.7%      80177   94.5%
Top 10          N/A             16142  96.7%    32763  96.2%    16061  96.2%      81106   95.6%
Top 20          N/A             16238  97.2%    33036  97.0%    16187  96.9%      81807   96.5%

Table 3: Recognition performance with 3 reference images. There are no results for the Maldives database because no individual had 4 or more images per side.

Analysis

Individual databases

First we look at the individual databases; there are considerable differences. With only one reference image, the Maldives database has a probability of 8.1% (= 100% - 91.9%) that the proper reference image is not in the top 20. For the Djibouti database this is 16.2%, twice as much. With two reference images the figures are 2.5% and 5.6% respectively, again about twice as much. In general the Djibouti database has twice the non-match rate of the Maldives database. Differences between the Maldives and the Seychelles databases are a little less prominent but still significant: by the same comparison, the Seychelles database has about 60% more non-matches than the Maldives database. The best explanation appears to be a difference in annotation quality; this issue is addressed further on.

Number of reference images

Comparing the three tables makes it very clear that recognition performance improves dramatically with more images in the reference database. Going from 1 to 3 reference images increases the probability of having the right shark at number 1 from 69% to 89%. For the top 20, the rate increases from 83.5% to 96.5%. Unfortunately we did not have enough data to test 4 reference images.

Database size

I3S does not seem to be very sensitive to database size.
Extending the Seychelles and Maldives databases with the Djibouti database has a very limited effect on performance. This effect is much more likely explained by annotation quality than by database size.

Problematic images

Next, we had a look at the problematic images. During the experiments we stored which test images did not have a correct reference image in the top 20. Since each experiment was typically repeated 100 times, we could also count how often the same test image missed the top 20. The images with the highest counts were considered problematic. Almost 50% of the problematic images were caused by incorrect annotation or other human errors. Examples of human errors (in order of importance) are:

1. An insufficient number of spots annotated.
2. Spot annotation in the wrong area.
3. Placing the second control point ("edge pectoral") at the wrong position (Figure 1).
4. Putting files in the wrong directory.
5. Switching the second and the third control points.

As the Djibouti images dominated the list of test images, and therefore also the list of problematic images, the analysis of problematic images was repeated for only the Seychelles and Maldives databases. It turned out that still 40% of the analysed images contained errors. Again, the dominant errors
were the location of the second control point and the number of spots annotated; these are discussed individually below. The remaining errors were caused mainly by either a poor angle (large deviations from a perpendicular view) or murky water, making it very hard to distinguish the location of the control points or even the spots. Further, I3S is quite sensitive to the location of the control points: even small deviations may have a significant negative impact on the match score.

Inconsistent annotation

The probability of inconsistent annotation increases strongly with the number of people annotating. When multiple people are involved, it makes sense to ensure they annotate in the same way. For example, (new) researchers could first annotate a known test collection and compare the result with a known reference set. If the scores on the test collection are below a certain threshold, the annotation evidently does not conform to the standard. The same principle applies to exchange between research groups. The Maldives research group seems to select (to some extent) different spots than the Seychelles and Djibouti researchers, and the location of the second control point also varies somewhat between groups. A recommended test is to annotate each other's known images and see whether the scores are similar. For example, the Seychelles research group annotates 20 Maldivian images; the annotated images are then compared with the Maldivian database to see whether the scores are similar. If there are large deviations, further standardization is clearly required before I3S can be used to find overlaps between the various databases.

Second control point

It turns out that the second control point ("edge pectoral") is in many cases hard to pinpoint consistently, especially if the edge of the pectoral fin is not touching the body (e.g. Figure 1). In many cases the control point is placed too far back (Figure 2).
Special attention should be paid to the location of this control point.

Number of spots annotated

In some cases it is hard to find sufficient spots (I3S has a built-in minimum of 12) on a whale shark. The best advice is to use the entire rectangle spanned by the control points. Analysis showed that in many cases a problematic image only contained annotated spots in the triangle spanned by the control points. Enlarging the area to the rectangle doubles it and in general makes recognition much easier. If this means that the rectangle extends a bit behind the second control point, this is not a problem. See Figure 3 and Figure 4.
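The claim that moving from the triangle to the rectangle spanned by the control points doubles the annotation area follows from elementary geometry: a triangle whose vertices are three corners of a rectangle covers exactly half of it. A toy check, with made-up control-point coordinates (not taken from any real annotation):

```python
def triangle_area(p1, p2, p3):
    """Area of a triangle via the shoelace formula."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# Hypothetical control points forming three corners of an axis-aligned
# rectangle (units are arbitrary image coordinates).
gill_top = (0.0, 4.0)       # e.g. top of the 5th gill
pectoral_edge = (6.0, 0.0)  # e.g. edge of the pectoral fin
corner = (0.0, 0.0)         # corner completing the right angle

tri = triangle_area(gill_top, pectoral_edge, corner)
rect = abs(pectoral_edge[0] - corner[0]) * abs(gill_top[1] - corner[1])
```

Whatever the actual coordinates, the rectangle is twice the triangle, so annotating the full rectangle doubles the usable spot area.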
Figure 1: Example of a photo where the "edge pectoral" is hard to locate.

Figure 2: Control point 2 is placed inaccurately.
Figure 3: Example with not enough spots annotated.

Figure 4: Improved annotation. Arrows indicate changes.
I3S still assumes that a shark is a 2D creature. For this reason it is best to limit the spot area to the body part which best fits a 2D model. In practice this means not selecting spots above the horizontal line marked by the first control point (top of the 5th gill) or below the second control point. Outside these lines the shark's body starts to curve strongly, resulting in large matching errors. See Figure 5 for an example. Unfortunately, it is sometimes necessary to use these spots to reach the minimum of 12. A next experiment could test a minimum reduced to 9 or 10; we have improved the recognition to handle fewer annotated spots better.

Figure 5: Arrows indicate spots above the horizontal line which are better not used for annotation.

Conclusions and recommendations

Conclusions

- Provided that there are 3 reference images, a top 1 match has a probability of at least 91%. With more consistent annotation this could become significantly higher. The probability of a top 20 match is 97% (with a database of almost 1000 individuals).
- Performance will most likely deteriorate only slowly as the database grows.

Recommendations

- Try to have the best three images of each shark side in the reference database.
- Reduce the minimum number of spots in I3S to 10 (a trivial software change, directly available on request for testing).
- If multiple people annotate the images, have all of them annotate a standard set and compare the results against the standard.
- Compare annotation standards between research groups in the same way before relying on I3S to find overlaps. Inconsistent annotation explains 40 to 50% of all mismatches; specifically, the location of the 2nd control point and the area used for annotation appear to be error-prone.
- Investigate improving the algorithm to make it less sensitive to small changes in the location of the control points.
- Use the entire rectangle spanned by the control points for annotation. Extending the area a bit towards the back should not pose problems.
- Do not annotate spots above the horizontal line through the first control point or below the horizontal line through the second control point, where the body curvature increases.

A final recommendation: please do not hesitate to involve us in your research. For example, if the research community recognizes the need for more annotation standardization, we could provide a tool that indicates which images are likely to require changes to their annotation. If databases need to be compared with each other, it is fairly easy to develop a tool which does this automatically and provides a ranked list of the best possible matches.

References

[1] Van Tienhoven, A.M., Den Hartog, J.E., Reijns, R.A., & Peddemors, V.M. (2007). A computer-aided program for pattern-matching of natural marks on the spotted raggedtooth shark Carcharias taurus (Rafinesque, 1810). Journal of Applied Ecology 44, 273-280.
[2] Speed, C.W., Meekan, M.G., & Bradshaw, C.J.A. (2007). Spot the match: wildlife photo-identification using information theory. Frontiers in Zoology 4:2. DOI: 10.1186/1742-9994-4-2.
[3] Brooks, K., Rowat, D., Pierce, S.J., Jouannet, D., & Vely, M. (2010). Seeing Spots: Photo-identification as a Regional Tool for Whale Shark Identification. Western Indian Ocean Journal of Marine Science 9(2), 185-194.