CALAM: Linguistic Structure to Annotate Handwritten Text Image Corpus

  • Conference paper

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 32)

Abstract

In this paper, we report our effort to build the Cursive and Language Adaptive Methodology (CALAM), a multilingual structure for creating, annotating, and validating linguistic datasets. CALAM provides a scientific and systematic way to fetch and retrieve information through the design and development of an annotated corpus of handwritten text images. It is a useful tool for annotating multilingual handwritten image datasets (Hindi, English, Urdu, etc.). The annotation is not limited to grammatical tagging; structural markup is also applied. Handwritten text images are annotated hierarchically, from the handwritten form down to segmented lines, words, and components. Component-level markup is useful for finding strokes and listing ligatures in Urdu. Along with a hierarchical access structure, CALAM provides indexing, insertion, searching, and deletion of words and phrases in handwritten forms. Apart from dataset fetching and retrieval, it also automatically generates an XML-tagged file for each annotated handwritten text image in the dataset.
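
The hierarchy described in the abstract (form, then lines, words, and components) and the per-image XML file can be illustrated with a minimal sketch. The element names, the `bbox` and `text` attributes, and the helper names `Node`, `to_xml`, and `find_words` are all assumptions made for illustration; the abstract does not specify CALAM's actual schema or API.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """One annotated region (hypothetical structure, not CALAM's own)."""
    level: str                       # "form", "line", "word", or "component"
    text: str = ""                   # transcription, empty where not applicable
    bbox: tuple = (0, 0, 0, 0)       # assumed (x, y, width, height) in pixels
    children: list = field(default_factory=list)


def to_xml(node: Node) -> ET.Element:
    """Recursively convert an annotation tree into nested XML elements."""
    elem = ET.Element(node.level, {"bbox": ",".join(map(str, node.bbox))})
    if node.text:
        elem.set("text", node.text)
    for child in node.children:
        elem.append(to_xml(child))
    return elem


def find_words(root: ET.Element, query: str) -> list:
    """Return every <word> element whose transcription equals the query."""
    return [w for w in root.iter("word") if w.get("text") == query]


# A tiny made-up form: one line, two words, one component under the first word.
form = Node("form", bbox=(0, 0, 2480, 3508), children=[
    Node("line", "handwritten text", (100, 200, 900, 80), [
        Node("word", "handwritten", (100, 200, 520, 80), [
            Node("component", bbox=(100, 200, 60, 80)),
        ]),
        Node("word", "text", (650, 200, 350, 80)),
    ]),
])

root = to_xml(form)
xml_string = ET.tostring(root, encoding="unicode")   # the XML-tagged output
matches = find_words(root, "text")                   # simple word search
```

Nesting each level inside its parent keeps the path from a component back to its word, line, and form implicit in the XML tree, which is one plausible way to support the hierarchical access the abstract mentions.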


Acknowledgments

This work is financially supported by the Department of Science and Technology, Government of Rajasthan.

Author information

Corresponding author

Correspondence to Prakash Choudhary.

Copyright information

© 2015 Springer India

About this paper

Cite this paper

Choudhary, P., Nain, N. (2015). CALAM: Linguistic Structure to Annotate Handwritten Text Image Corpus. In: Jain, L., Behera, H., Mandal, J., Mohapatra, D. (eds) Computational Intelligence in Data Mining - Volume 2. Smart Innovation, Systems and Technologies, vol 32. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2208-8_41

  • DOI: https://doi.org/10.1007/978-81-322-2208-8_41

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2207-1

  • Online ISBN: 978-81-322-2208-8

  • eBook Packages: Engineering, Engineering (R0)
