Standard Databases for Recognition of Handwritten Digits, Numerical Strings, Legal Amounts, Letters and Dates in Farsi Language

To cite this version: Farshid Solimanpour, Javad Sadri, Ching Y. Suen. Standard Databases for Recognition of Handwritten Digits, Numerical Strings, Legal Amounts, Letters and Dates in Farsi Language. Guy Lorette. Tenth International Workshop on Frontiers in Handwriting Recognition, Oct 2006, La Baule (France), Suvisoft, 2006. <inria-00098>
HAL Id: inria-00098 https://hal.inria.fr/inria-00098 Submitted on 5 Oct 2006

Farshid Solimanpour, Javad Sadri, Ching Y. Suen
CENPARMI (Center for Pattern Recognition and Machine Intelligence), Computer Science Department, Concordia University, 1455 de Maisonneuve Blvd. West, Montreal, Quebec, Canada, H3G 1M8
Tel: (514)-848-2424 Ext: 7950, Fax: (514)-848-2830
Emails: {f_solim, j_sadri, suen}@cs.concordia.ca

Abstract

This paper describes an important step towards the standardization of research on Optical Character Recognition (OCR) in the Farsi language. It describes the formation of novel, standard handwritten databases of isolated digits, letters, numerical strings, legal amounts (used for cheques), and dates. Neither a conventional literature search nor an Internet search turned up a publicly accessible Farsi database. We therefore judged it a worthwhile academic effort to create several Farsi databases that could stand on their own merit as useful tools for OCR researchers. To show the potential uses of the new databases, we also conducted some experiments on the recognition of handwritten isolated Farsi digits.

Keywords: Farsi OCR, Farsi Handwritten Databases, Arabic Handwritten Databases, Indian Database.

1. Introduction

An essential part of the development and evaluation of every offline character recognition technique is the comparison of results obtained on the same standard database as other researchers [1]. There are many examples of widely used databases in the field of handwriting recognition, such as NIST [2], CEDAR [3], CENPARMI [4], UNIPEN [5], CENPARMI Arabic Cheques [6], ETL9 (Japan) [7], and PE92 (Korea) [8]. To the best of our knowledge, however, no standard database for the Farsi language is available. Farsi is spoken by many tens of millions of people, mainly in Iran, Afghanistan, and Tajikistan, and partly in some other countries. Other languages, such as Arabic, Urdu, and Pashto, use the same alphabet and digits or subsets of them.
In Farsi, words, sentences and dates are written from right to left, but numbers are written from left to right, which matches the way numbers are written in English. Farsi has 32 letters in its alphabet and is a cursive language, which means that letters within a word can be connected. Because of this connectivity, the shape of a Farsi letter may change significantly depending on its position in the word, the identity of the neighboring letters, the font, or the way the writer connects successive letters. Considering these facts, it is crucial to have standard databases in order to improve research on Farsi handwriting recognition. (Footnote: Authors have the same contribution.)

In this paper, we describe the details of the formation of the following databases: Farsi isolated digits, numerical strings, isolated letters, legal amounts, Farsi dates (called Hijri Shamsi), and a small set of English digits (written by native Farsi speakers). In order to show the usefulness of our database, we also report the results of some of our experiments on the recognition of isolated handwritten Farsi digits taken from this database.

The rest of this paper is organized as follows: Section 2 describes our steps towards collecting the data. Section 3 covers the data extraction methods, which include the pre-processing of the images. Section 4 details our experiments on the recognition of Farsi isolated digits. In Section 5, we discuss the output of our work and compare it with some other works. Finally, in Section 6 we present some concluding remarks and suggestions for future research.

2. Data Collection

Two data entry forms were designed for our data collection process. The first form contained Farsi numerical strings, isolated letters, the date, and English digits. The Farsi digits database was formed by segmenting the numerical strings in this form. The second form was completely dedicated to cursive legal amounts. In order to automate the process of cutting the fields out of the scanned forms, two types of anchoring marks were added to the forms: the form identifiers and the edge identifiers.
The form identifiers consisted of 8 squares, each of which can be in one of two states: empty or blackened. Together they can therefore represent 255 binary numbers (the 2^8 - 1 patterns with at least one blackened square) and can serve as the identity of 255 different forms. In our case, three squares were blackened on each form: a set including squares 5 and 8 on form 1, and a set including squares 4 and 7 on form 2. By detecting these squares, our program could automatically identify the form it was working on. The edge identifier marks consisted of four squares, one located at each corner of the form; detecting them enabled the program to correctly determine the coordinates of the region that contained the actual data. Two samples of the data entry forms are shown in Figure 1 and Figure 2.
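The identifier scheme described above amounts to reading the eight squares as an 8-bit number. The following is only a minimal sketch of that idea: it assumes (the paper does not say) that square 1 is the most significant bit and that the all-empty pattern is reserved, which leaves 2^8 - 1 = 255 usable identifiers. The function names are hypothetical.

```python
from typing import List, Sequence


def form_id_from_squares(blackened: Sequence[bool]) -> int:
    """Read 8 identifier squares as a binary number.

    Assumption: the first square is the most significant bit.
    """
    if len(blackened) != 8:
        raise ValueError("expected exactly 8 identifier squares")
    value = 0
    for state in blackened:
        value = (value << 1) | int(bool(state))
    if value == 0:
        # Assumption: an all-empty pattern does not identify any form.
        raise ValueError("no square is blackened")
    return value


def squares_from_form_id(form_id: int) -> List[bool]:
    """Inverse mapping: which squares to blacken for a given identifier."""
    if not 1 <= form_id <= 255:
        raise ValueError("form identifier must be in 1..255")
    return [bool((form_id >> (7 - i)) & 1) for i in range(8)]
```

Under these assumptions every identifier from 1 to 255 maps to a distinct pattern of squares, so a detector only needs the empty/blackened state of each square to tell the forms apart.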

The data entry forms were filled out by 175 writers selected from different ages, genders, and jobs. Of these, 105 writers were randomly assigned to our training set, 50 writers to the testing set, and 20 writers to the verifying set. We ensured that the data in each set was completely genuine and that there would be no relation between the sets. Our final work includes these databases: numerical strings, isolated digits, Farsi letters, cursive legal amounts, and a small set of English isolated digits. In the following subsections we give details on each database.

2.1. Farsi numerical strings database

Each participant wrote 42 numerical strings in form 1, which were used to form our database of Farsi numerical strings. In Farsi, the normal height of the numeral 0 is approximately one fifth of that of the other characters, and it is written differently depending on its location in a numerical string or on its repetition in the string. To cover all forms, we had to repeat it more times than the other numerals. Our databases contain samples of the numeral zero at the beginning, middle or end of a numerical string, as well as samples in which it is repeated two, three or six times in a string. Samples of two different writing styles of repeated zeros can be viewed in Figure 3.

Figure 1. Sample of filled form 1.

Figure 3. Different styles of writing zeros in the numerical string: 7000.

Table 1. Statistics of the numerical strings database.
Classes   Writers   Training   Verifying   Testing
42        175       4,410      840         2,100

2.2. Farsi isolated digits database

A simple segmentation algorithm was developed to separate the digits in the numerical strings and to create the Farsi digits database. When designing the data entry form for the numerical strings, we fixed the number of times that each of the digits 1 to 9, the digit 0, and the decimal point was repeated across all the strings, with the digit 0 repeated more often than the others. This way we could control the number of isolated digits that we could extract from the numerical strings. Samples of Farsi isolated digits are shown in Figure 4, and the statistics of this database are included in Table 2.

Figure 4. Samples of Farsi isolated digits.
Because it was not possible to separate all the digits, the writers did not contribute equally to the database for each digit. Therefore, some of the digits written by the writers with the highest participation were randomly removed from the database in order to normalize the participation. The algorithm is shown in Figure 5. Note that every time a digit image is removed, the most participating writer may change. This procedure was executed for each digit. Table 2 shows the final statistics for this database.

Figure 2. Sample of filled form 2.

Table 2. Statistics of the isolated digits database.
Classes   Writers   Training   Verifying   Testing
10        175       11,000     2,000       5,000
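The participation-normalizing procedure described above can be sketched as follows. This is an illustrative reading of the description, not the authors' program: the function name, the (writer, image) data layout, and the random tie-breaking between equally participating writers are assumptions.

```python
import random
from collections import defaultdict


def normalize_participation(images, target_count, rng=None):
    """Trim the images of ONE digit class down to target_count.

    images: list of (writer_id, image_id) pairs for a single digit.
    Repeatedly finds the writer with the most remaining images and deletes
    one of that writer's images at random, until the count is reached.
    """
    rng = rng or random.Random()
    by_writer = defaultdict(list)
    for writer_id, image_id in images:
        by_writer[writer_id].append(image_id)

    total = len(images)
    while total > target_count:
        # The most participating writer is re-determined after every removal.
        max_count = max(len(imgs) for imgs in by_writer.values())
        # Assumption: ties between writers are broken at random.
        candidates = sorted(w for w, imgs in by_writer.items() if len(imgs) == max_count)
        writer = rng.choice(candidates)
        by_writer[writer].pop(rng.randrange(len(by_writer[writer])))
        total -= 1

    return [(w, i) for w, imgs in by_writer.items() for i in imgs]
```

For example, with one writer contributing five images, another two, and a third one, trimming to six images removes two images from the first writer only, so the remaining contributions are three, two, and one.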

Figure 5. Algorithm of normalizing the participation. [Flowchart: include all the images; while the designated count has not been reached, determine the most participating writer and randomly delete one image from that writer; once the count is reached, finish.]

2.3. Farsi isolated letters database

Although the Farsi alphabet consists of 32 letters, people filling out data entry forms use two different styles for the letter ه (pronounced: Heh) and for the letter ا (pronounced: Alef). Samples of those styles are shown in Figure 6 and Figure 7. With these styles, the number of isolated letters that we included in the form reached 34. Each writer wrote each isolated letter included in the form twice. The statistics of this database are included in Table 3.

Figure 6. Two styles of writing the letter ه.

Figure 7. Two styles of writing the letter ا.

Table 3. Statistics of the Farsi isolated letters database.
Classes   Writers   Training   Verifying   Testing
34        175       7,140      1,360       3,400

2.4. Farsi legal amounts database

Two types of data were included in our second data entry form. The first type consisted of the words that are normally used for writing the legal amount on bank cheques, plus four additional words: the currency units and the words "Over" and "Equal to" (in Farsi). The second type consisted of four worded number strings, of which three were pre-determined fields and one was a free field. In the free field, writers could write a worded number of their own. When these images were included in the database, the free field was labeled manually. A sample of a worded number can be seen in Figure 8. Table 4 shows the statistics of this database.

Figure 8. Example of a cursive worded number which reads: One Hundred and Forty Toumans Over.

Table 4. Statistics of the cursive worded number database (175 writers; for the free field, writers = classes).
             Classes   Training   Verifying   Testing
Fields       48        5,040      960         2,400
Free Field   175       105        20          50
Total        8         5,145      980         2,450

2.5. Farsi dates database

Countries with Farsi speakers use a type of date called Hijri Shamsi. The format for writing the date in Farsi is year/month/day. A sample date is shown in Figure 9. The statistics of this database are included in Table 5.

Figure 9. Example of a Farsi date.

Table 5. Statistics of the Farsi dates database.
Classes   Writers   Training   Verifying   Testing
175       175       105        20          50

2.6. English digits

English digits have already been collected and included in different databases; however, a small set was included in the first form (each digit from 0 to 9 was repeated twice in each form) in order to capture the style in which non-native English speakers (Iranians) write English digits. Table 6 shows the statistics of this database.

Table 6. Statistics of the English isolated digits database.
Classes   Writers   Training   Verifying   Testing
10        175       2,100      400         1,000

3. Data Extraction

3.1. Preprocessing

Each form was scanned in full with a Lexmark-P80 scanner at a fixed resolution and a grey level of 8 bits. The images were saved as PNG (Portable Network Graphics) indexed-color files. PNG provides a patent-free replacement for GIF and also replaces many common uses of TIFF [9]. All the databases consist of grayscale and binary versions of the images, and each version is stored in a separate folder. First, the grayscale images were extracted; then all of them were converted to binary in a separate folder, keeping the same filenames and the same folder structure. To convert a file to binary, a threshold is calculated from the gray-level histogram of the grayscale image [10]; all the pixels with brightness less than that value are then set to black, and the rest to white. Before the images were extracted from the scanned forms, their salt-and-pepper noise was removed using the algorithm presented in [11].
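The binarization step can be illustrated with a small sketch of histogram-based threshold selection in the spirit of the method cited as [10], which picks the threshold that maximizes the between-class variance of the two pixel groups. The flat list of 8-bit pixel values and the function names are assumptions made for the example; this is not the authors' code.

```python
def otsu_threshold(pixels):
    """Return a threshold t such that pixels < t form the dark class.

    pixels: iterable of integer gray levels in 0..255.
    Maximizes the between-class variance computed from the histogram.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels with gray level <= current level
    sum0 = 0    # sum of their gray levels
    for level in range(256):
        w0 += hist[level]
        sum0 += level * hist[level]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no valid split here
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        score = w0 * w1 * (mean0 - mean1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_t = score, level + 1
    return best_t


def binarize(pixels):
    """Set pixels darker than the threshold to black (0), the rest to white (255)."""
    pixels = list(pixels)
    t = otsu_threshold(pixels)
    return [0 if p < t else 255 for p in pixels]
```

If every pixel has the same gray level, no valid split exists and the sketch returns a threshold of 0, so the whole image becomes white; a production version would need to decide how to treat such blank fields.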

3.2. Data Preparation

A computer program was developed to automatically extract images of the fields from the pre-processed scanned forms, using a manually designed template that locates the data entry fields relative to the anchoring marks at the corners of the forms. The program first recognized the edge identifier anchor marks on the scanned image with a simple template matching technique. It then matched the template coordinates to the anchor marks of the image, scaling and/or rotating the template if necessary. After that, all the fields were cut from the image based on the boundaries in the matched template. The fields were saved as individual image files according to the set they belonged to and the naming convention of the database. To determine the set to which an image belongs, the writers were selected from different ages, genders, and jobs to serve in the training, testing, or verifying set. All the images extracted from a particular writer's forms were saved to the same set, to make sure that the data sets are totally unrelated.

For each image, a record was inserted into a Microsoft Access database. The record includes the path to the image file relative to the base folder, the label of the image, the number of characters in the image, the number of words in the image, the type of the contents (numerical, date, cursive worded number or letter), and some other information. By querying this detailed information, future researchers will be able to find the proper set of images more easily.

4. Experimental Results

In order to show an application of our databases, we conducted some experiments on the recognition of handwritten isolated Farsi digits. We used our isolated digits database, which contains 1,100 training, 200 verifying, and 500 testing samples per digit.

4.1. Feature Extraction

In order to compare our results with some previous works, we used the features presented in [12]. Eight sets of features were used to represent the images of digits: the outer profiles from four directions, and the crossing counts and projection histograms from each of two directions. Each set produced an array that was later normalized to an array of size eight. The normalization used linear interpolation for up-sampling and averaging for down-sampling the array. The combination of all the feature sets produced a 64-member array that was used as our feature vector. A sample of the features used in our experiment can be viewed in Figure 10.

Figure 10. A sample of the features. a: outer profiles, b: crossing counts, c: projection histogram.

4.2. Classification

For classification we used Support Vector Machines (SVM) [13] with a Radial Basis Function (RBF) kernel. The kernel parameter σ was set to 0.05, and the parameter C was likewise fixed. To find the best parameter values, we adjusted the parameters on the training set and tested them on the verifying set. The parameters that gave the best results on the verifying set were used for classifying the testing set. We used LIBSVM [14] for the implementation of our SVM classifier. Table 7 shows the overall results of our classifier compared to the results of [12] and [15]. The confusion matrix of the testing set is shown in Table 8. Each row of this table shows how the isolated digits in the testing set were classified or misclassified.

Table 7. Our results compared with [12] and [15].
                 Our Results   Results of [12]   Results of [15]
Training Set     11,000        4500              790
Verifying Set    2,000         -                 -
Testing Set      5,000         600               05
nsv*             577           69                -
Training Error   0.85%         0.00%             0
RR**             97.%          99.44%            94
* Number of Support Vectors, ** Recognition Rate

Table 8. The confusion matrix of the testing set using SVM with polynomial kernel. [Matrix entries not recoverable from the source.]

5. Discussion

This research effort has produced six databases. Each database is divided into training, verifying, and testing sets, which include approximately 60%, 12%, and 28% of the available data respectively. All the databases are available in grayscale and binary versions. Table 9 and Table 10 compare our two most important databases (Farsi isolated digits and Farsi isolated letters) with other similar available databases.

Although the recognition rate reported in Section 4 is slightly lower than that of [12], the databases were not the same, and our isolated digits database has more samples. Also, we tested our classifier on unseen data, whereas in [12] the testing set was used for adjusting the parameters of the classifier. As our database is available to the research community, we hope that it can function as a standard basis of comparison for research on Farsi handwriting recognition.

Table 9. Comparison of the number of samples in our Farsi isolated digits database with other databases.
Database       Language   Training   Verifying   Testing
MNIST          English    60,000     0           10,000
CEDAR          English    5,80       0           707
CENPARMI       English    4,000      0           2,000
CENPARMI       Arabic     0,56       0           4,4
Our Database   Farsi      11,000     2,000       5,000

Table 10. Comparison of the number of samples in our Farsi isolated letters database with other databases.
Database       Contents          Training   Verifying   Testing
CEDAR          English Letters   9,45       0           ,8
Our Database   Farsi Letters     7,140      1,360       3,400

6. Conclusion and Future Works

We have presented six new standard databases consisting of handwritten Farsi numerical strings, digits, letters, legal amounts and dates, which can serve as a basis for future research in offline Farsi handwriting recognition. These databases are available to the research community upon request to the Center for Pattern Recognition and Machine Intelligence (CENPARMI) of Concordia University. Our database contains binary and grayscale versions of the images, allowing for experimentation and comparison with both grayscale and binary preprocessing and recognition techniques. In the future, the databases may be expanded by collecting more data entry forms and by adding more sets, such as Farsi words, sub-words and sentences. Furthermore, the sets may easily be adopted for Farsi-based cheque-processing systems.
Later, we would like to develop sophisticated segmentation and recognition algorithms for processing samples of these databases.

7. References

[1] I. Guyon, R. Haralick, J. Hull, and I. Phillips, Database and benchmarking, In H. Bunke and P. Wang, editors, Handbook of Character Recognition and Document Image Analysis, World Scientific, 1997, Chapter 30, pp. 779-799.
[2] R. Wilkinson, J. Geist, S. Janet, P. Grother, C. Burges, R. Creecy, B. Hammond, J. Hull, N. Larsen, T. Vogl, and C. Wilson, The first census optical character recognition systems conf. #NISTIR 4912, The U.S. Bureau of Census and the National Institute of Standards and Technology, Gaithersburg, MD, 1992.
[3] J. Hull, A database for handwritten text recognition research, IEEE Trans. on Pattern Analysis and Machine Intelligence, May 1994, Volume 16, Issue 5, pp. 550-554.
[4] C. Y. Suen, C. Nadal, R. Legault, T. Mai, and L. Lam, Computer recognition of unconstrained handwritten numerals, Proc. of the IEEE, 1992, Volume 80, Issue 7, pp. 1162-1180.
[5] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet, Unipen project of on-line data exchange and benchmarks, Proc. of the 12th IAPR Int. Conf. on Pattern Recognition, Jerusalem, Israel, Oct. 1994, pp. 29-33.
[6] Yousef Al-Ohali, Mohamed Cheriet, and C. Y. Suen, Databases for recognition of handwritten Arabic cheques, Proceedings of the Seventh Int. Workshop on Frontiers in Handwriting Recognition, Sep 2000, pp. 601-606.
[7] F. Jelinek, Self-organized language modeling for speech recognition, In A. Waibel and K.-F. Lee, editors, Readings in Speech Recognition, Morgan Kaufmann Publishers, Inc., 1990, pp. 450-506.
[8] D. Kim, Y. Hwang, S. Park, E. Kim, S. Paek, and S. Bang, Handwritten Korean character image database PE92, In Proceedings of the Second Int. Conference on Document Analysis and Recognition, 1993, pp. 470-473.
[9] Chris Lilley, PNG (Portable Network Graphics), The World Wide Web Consortium (W3C). Details available at http://www.w3.org/graphics/png/
[10] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics, 1979, Volume 9, pp. 62-66.
[11] Jae S. Lim, Two-Dimensional Signal and Image Processing, Prentice Hall, Englewood Cliffs, USA, 1990, pp. 469-476.
[12] H. Soltanzadeh and M. Rahmati, Recognition of Persian handwritten digits using image profiles of multiple orientations, Pattern Recognition Letters, 2004, Volume 25, pp. 1569-1576.
[13] C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 1998, Volume 2, pp. 121-167.
[14] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[15] J. Sadri, C. Y. Suen, and T. D. Bui, Application of Support Vector Machines for Recognition of Handwritten Arabic/Persian Digits, Proceedings of the Second Conference on Machine Vision and Image Processing & Applications (MVIP2003), Vol. 1, pp. 300-307, Feb. 2003, Tehran, Iran.