CALAM: Linguistic Structure to Annotate Handwritten Text Image Corpus

Choudhary, Prakash; Nain, Neeta

doi:10.1007/978-81-322-2208-8_41



Conference paper
First Online: 11 December 2014


Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 32)

Abstract

In this paper, we report our effort to build a multilingual structure, the Cursive and Language Adaptive Methodology (CALAM), for creating, annotating, and validating linguistic datasets. CALAM provides a scientific and systematic way to fetch and retrieve information through the design and development of an annotated corpus of handwritten text images. It is a tool for annotating multilingual handwritten image datasets (Hindi, English, Urdu, etc.). The annotation is not limited to grammatical tagging; structural markup is also applied. Handwritten text images are annotated hierarchically, starting from the handwritten form and proceeding to segmented lines, words, and components. Component-level markup is useful for finding the strokes and the list of ligatures in Urdu. Along with a hierarchical access structure, CALAM provides indexing, insertion, searching, and deletion of words and phrases in handwritten form. Apart from dataset fetching and retrieval, it also automatically generates an XML-tagged file for each annotated handwritten text image in the dataset.
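The hierarchy described in the abstract (form, lines, words, components), the per-image XML file, and word-level searching can be illustrated with a minimal Python sketch. This is not the CALAM implementation or its schema: the class name `Node`, the element and attribute names (`form`, `line`, `word`, `component`, `id`, `bbox`, `text`), and the helper functions are all hypothetical, chosen only to show one way such a structure could be represented.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of the annotation hierarchy (names are hypothetical)."""
    level: str                        # "form", "line", "word", or "component"
    node_id: str
    bbox: tuple[int, int, int, int]   # (x, y, width, height) in image pixels
    text: str = ""                    # ground-truth transcription, if any
    children: list[Node] = field(default_factory=list)


def to_xml(node: Node) -> ET.Element:
    """Recursively convert a node and its descendants into nested XML elements."""
    elem = ET.Element(node.level, id=node.node_id,
                      bbox=",".join(map(str, node.bbox)))
    if node.text:
        elem.set("text", node.text)
    for child in node.children:
        elem.append(to_xml(child))
    return elem


def build_word_index(root: Node) -> dict[str, list[str]]:
    """Map each word transcription to the ids of the word nodes that carry it."""
    index: dict[str, list[str]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.level == "word":
            index.setdefault(node.text, []).append(node.node_id)
        stack.extend(node.children)
    return index


# A made-up one-line form with two words; all coordinates are illustrative.
form = Node("form", "f1", (0, 0, 2480, 3508), children=[
    Node("line", "f1-l1", (100, 200, 2200, 120), children=[
        Node("word", "f1-l1-w1", (100, 200, 300, 120), text="hello", children=[
            Node("component", "f1-l1-w1-c1", (100, 200, 140, 120)),
        ]),
        Node("word", "f1-l1-w2", (450, 200, 320, 120), text="world"),
    ]),
])

xml_text = ET.tostring(to_xml(form), encoding="unicode")
index = build_word_index(form)
```

Here `index` supports a simple lookup of where a word occurs, and `xml_text` holds the nested markup for the single form. A real tool would also need insertion and deletion on the index and a validated schema, which this sketch does not attempt.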



Acknowledgments

This work is financially supported by the Department of Science and Technology, Government of Rajasthan.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology, Manipur, Imphal, India
Prakash Choudhary
Department of Computer Engineering, Malaviya National Institute of Technology, Jaipur, Jaipur, India
Neeta Nain


Corresponding author

Correspondence to Prakash Choudhary.

Editor information

Editors and Affiliations

University of Canberra, Canberra, Australia and University of South Australia, Adelaide, South Australia, Australia
Lakhmi C. Jain
Department of Computer Science and Engineering, Veer Surendra Sai University of Technology, Sambalpur, Odisha, India
Himansu Sekhar Behera
Computer Science & Engineering, Kalyani University, Nadia, West Bengal, India
Jyotsna Kumar Mandal
Dept. of Computer Science and Engineering, National Institute of Technology Rourkela, Rourkela, India
Durga Prasad Mohapatra


About this paper

Cite this paper

Choudhary, P., Nain, N. (2015). CALAM: Linguistic Structure to Annotate Handwritten Text Image Corpus. In: Jain, L., Behera, H., Mandal, J., Mohapatra, D. (eds) Computational Intelligence in Data Mining - Volume 2. Smart Innovation, Systems and Technologies, vol 32. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2208-8_41


DOI: https://doi.org/10.1007/978-81-322-2208-8_41
Published: 11 December 2014
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2207-1
Online ISBN: 978-81-322-2208-8
eBook Packages: Engineering, Engineering (R0)
