In document image analysis, and especially in handwritten document image recognition, standard datasets play a vital role in evaluating the performance of algorithms and comparing results obtained by different groups of researchers. In this paper, an unconstrained Persian handwritten text dataset (PHTD) is introduced. The PHTD contains 140 handwritten documents in three different categories, written by 40 individuals. The total numbers of text-lines and words/subwords in the dataset are 1787 and 27073, respectively. Most of the PHTD documents contain either overlapping or touching text-lines. The average number of text-lines per document in the PHTD is 13. Two types of ground truth, based on pixel information and content information, are generated for the dataset. With these two types of ground truth, the PHTD can be utilized in many areas of document image processing, such as sentence recognition/understanding, text-line segmentation, word segmentation, word recognition, and character segmentation. To provide a framework for other researchers, recent text-line segmentation results on this dataset are also reported.
Conference proceeding
A new dataset of Persian handwritten documents and its segmentation
2011 7th Iranian Conference on Machine Vision and Image Processing : proceedings, pp.35-39
2011 7th Iranian Conference on Machine Vision and Image Processing (Tehran, Iran, 16/11/2011 - 17/11/2011)
2011
Details
- Title
- A new dataset of Persian handwritten documents and its segmentation
- Creators
- Ali Reza Alaei (Author) - University of Mysore; P Nagabhushan (Author) - University of Mysore; Umapada Pal (Author) - Indian Statistical Institute
- Publication Details
- 2011 7th Iranian Conference on Machine Vision and Image Processing : proceedings, pp.35-39
- Conference
- 2011 7th Iranian Conference on Machine Vision and Image Processing (Tehran, Iran, 16/11/2011 - 17/11/2011)
- Publisher
- IEEE; Piscataway, NJ
- Number of pages
- 35-39
- Identifiers
- 2010; 991012822180602368
- Academic Unit
- Information Technology; Faculty of Science and Engineering; School of Business and Tourism; Faculty of Business, Law and Arts
- Language
- English
- Resource Type
- Conference proceeding