
1 Introduction

A corpus is a large, structured collection of real-life language samples, chosen to be as varied as possible so that it covers a wide range of distinct texts. Annotated corpora play a significant role in computer-based linguistic research. Application areas span grammatical information, writing style, syntax, lexicography, statistical analysis and testing, and checking occurrences or validating linguistic rules across all branches of applied and theoretical linguistics. Machine-readable corpora facilitate this research, since they dramatically reduce the time needed to find a particular piece of information in a corpus.

In principle, corpus linguistics is an approach that investigates the nature of language and its properties by analyzing large collections of text samples. This approach has long been used in a number of research areas, from the descriptive study of a language to language education and lexicography. It broadly refers to the exhaustive analysis of any substantial amount of authentic spoken or written text. In general, it covers large amounts of machine-readable data of actual language, including collections of literary and non-literary text samples that reflect both the synchronic and diachronic aspects of a language. What distinguishes corpus linguistics is its use of modern computer technology to collect language data, process language databases, retrieve information, and support language-related research and application development.

This paper describes the structure of CALAM, as embodied in the file names of its texts. It describes the design and development of a multilingual corpus comprising a large volume of handwritten text and a corresponding Unicode dataset. The corpus is intended to capture features of a natural language such as writing style, grammatical category information, machine translation, and translation aligned sentence by sentence, phrase by phrase, or word by word. The corpus is completely labelled for content information as well as content detection, and it supports the evaluation of systems such as handwriting recognition and writer identification. The database was also used to benchmark handwritten text recognition algorithms by generating an XML file for each annotated handwritten text image.

The paper first introduces the experimental setup for the systematic collection and distribution of data, and then reports on the process of entering and retrieving information for the handwritten text image and the corresponding Unicode text simultaneously on the same screen. The paper is organized as follows. Section 2 reviews related work on corpora and their structure. Section 3 describes the collection and distribution of raw text samples. Section 4 covers the annotation of the dataset and the methodology used to develop the corpus. Section 5 describes the validation of the dataset, and Sect. 6 presents a comparative study of the structure. Finally, conclusions are presented in Sect. 7.

2 Related Work

Although corpus methodology dates back to 1990, it remains a thrust area in linguistics. In the era of computational linguistics, corpora have transformed research in all branches of the field. Demand for standard datasets has increased in recent years across research areas, because such datasets give researchers a platform for evaluating different linguistic techniques on the same data.

One of the most popular handwritten databases used for linguistic research is IRESTE [1], a handwritten image database of French and English containing isolated words and characters without labelling. IAM [2, 3] is the first available annotated dataset of full-length English sentences. It is available in both printed and handwritten image formats, and it is annotated at the line and word levels. MNIST [4] is a database of handwritten digits. CENPARMI [5] is the first Urdu handwritten corpus and includes isolated digits and characters of the Urdu language. CMATER [6] is a database of unconstrained handwritten document images in mixed Bangla and English script. CENIP-UCCP [7] is the only available corpus of Urdu handwritten text images with full sentences. Other widely used databases in the field of handwriting recognition are CEDAR [8], NIST [9], ETL9 (Japan) [10], and PE92 (Korea) [11].

The survey shows that fewer annotated handwritten datasets are available than printed ones. We did not find any handwritten dataset for Indian languages. CALAM provides a way to develop a large dataset of handwritten text images in Indic scripts together with their corresponding labelled Unicode texts.

3 CALAM: Design of Experiment

Our proposed methodology is to design and develop a corpus consisting of full-length text sentences. To be representative of all the phenomena of a language, the corpus should contain a large variety of text samples. To keep the resources balanced while building the corpus, we considered the salient features of a corpus suggested by Dash [12] in a general introduction to corpus linguistics.

3.1 Category Wise Distribution of Data

To cater to a large vocabulary and keep the resources balanced throughout the database, the corpus draws its data from six categories. The categories are further divided into subcategories to capture maximum variance in the words collected. The categories and subcategories, with the keyword that denotes each, are as follows:

  1. History (H)
     (a) Indian History (IH)
     (b) World History (WH)

  2. Literature (L)
     (a) Poetry/Religion (PR)
     (b) Gazals/Shyari (GS)
     (c) Biography (BI)

  3. Science (S)
     (a) Medical (ME)
     (b) Physics (PH)
     (c) Chemistry (CH)

  4. News (N)
     (a) International (IN)
     (b) National (NA)
     (c) Sports (SP)

  5. Architecture (A)
     (a) Rural Architecture (RA)
     (b) Urban Architecture (UA)

  6. Politics (P)
     (a) Central Government (CG)
     (b) State Government (SG)
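The category scheme above can be held in a small lookup table. The following is a minimal sketch rather than CALAM's actual implementation; the names `CATEGORIES`, `is_valid_pair` and `all_pairs` are hypothetical and only the codes themselves come from the list.

```python
# Category and subcategory codes as listed in Sect. 3.1.
# The dictionary layout and helper names are illustrative assumptions.
CATEGORIES = {
    "H": ("History", {"IH": "Indian History", "WH": "World History"}),
    "L": ("Literature", {"PR": "Poetry/Religion", "GS": "Gazals/Shyari",
                         "BI": "Biography"}),
    "S": ("Science", {"ME": "Medical", "PH": "Physics", "CH": "Chemistry"}),
    "N": ("News", {"IN": "International", "NA": "National", "SP": "Sports"}),
    "A": ("Architecture", {"RA": "Rural Architecture",
                           "UA": "Urban Architecture"}),
    "P": ("Politics", {"CG": "Central Government", "SG": "State Government"}),
}


def is_valid_pair(category: str, subcategory: str) -> bool:
    """Return True if the subcategory code belongs to the category code."""
    entry = CATEGORIES.get(category)
    return entry is not None and subcategory in entry[1]


def all_pairs() -> list:
    """List every (category, subcategory) code pair in the scheme."""
    return [(cat, sub) for cat, (_, subs) in CATEGORIES.items() for sub in subs]
```

A table like this lets a data-entry form reject a subcategory that does not belong to the selected category before an ID is generated.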

Corpus development starts with the raw collection of data and ends with appropriate tagging and labelling of the collected text in the database. A form was designed for the systematic collection of handwritten text images. The form is split into four parts, as shown in Fig. 1, each separated from the next by a horizontal line.

Fig. 1 Layout of a handwritten text form

We use Urdu as an example to explain the corpus. The same steps apply to the Hindi and English scripts. The parts of the form are organized as follows:

  • Part 1: The first part comprises the title of a language in the dataset and a unique identification number. For example, form 1 for the Urdu language and Indian History has the ID URD-H-IH-001. The ID of the form is generated or updated automatically once a language and category/subcategory are selected.

  • Part 2: The second part consists of 3–5 lines of printed text collected from various sources, where a line can contain around 60–70 words.

  • Part 3: The third part is left blank so that writers can replicate the printed text in their natural handwriting.

  • Part 4: The fourth part has six attributes: name, education, address, source of information, signature and date of form filling. Writers can optionally provide this information.

The filled forms were scanned in grey-level at a resolution of 600 dpi and saved in PNG format. Each form was scanned in full, including both the printed and the handwritten text, which can be useful for experiments on separating machine-printed from handwritten text. Transcriptions are stored in UTF-8 Unicode (utf8_general_ci collation), which fully supports Urdu.

4 Annotation

Annotation (labelling) of text is a prerequisite for any corpus development process. It is a time-consuming and error-prone task, so it requires the utmost care. Annotation makes a corpus useful for machine learning and computational linguistics research. Apart from pure text annotation, CALAM provides additional linguistic features about the nature of the language, such as transcription of the corpus into other languages. Transcription across languages yields richer resources for cross-linguistic research, enables comparative study and helps in discovering cross-linguistic variants.

The Bureau of Indian Standards common tagset framework is used for grammatical tagging of Hindi, the British National Corpus tagset for English, and the Center for Language Engineering Urdu tagset for Urdu. To create the annotation and mark-up of a handwritten image, the text of the handwritten form is replicated in Unicode so that the corpus is machine-readable.

CALAM provides a platform for a multilingual corpus suited to many kinds of linguistic research, offering large-scale, fine-grained, systematic data across languages in both handwritten and machine-readable form.

CALAM: Graphical User Interface Description

This section describes the step-by-step process of building the corpus once the scanned handwritten text forms are available, including the generation of a meta-information XML file for each form.

4.1 Insertion

This function inserts a new image into the database together with its associated information, such as the number of handwritten lines, skew, transcription, and the dates of creation and update. The images are stored in a separate folder, while the data goes into the respective fields of the database.

4.1.1 Auto-Indexing

The ID of each inserted image is indexed automatically according to the selected language, category and subcategory. The user selects the language of the form, and the ID field is filled in accordingly.

For example, in the form ID URD-L-PR-001 (see Fig. 1), URD denotes the language Urdu, L is the category, PR is the subcategory, and the last three digits are the form number within that subcategory.
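The ID layout in this example can be sketched as a pair of helper functions. This is an illustrative sketch only: the function names and the regular expression are assumptions, and the pattern (three-letter language code, one-letter category, two-letter subcategory, three-digit form number) is inferred from the sample IDs in the text.

```python
import re

# Pattern inferred from sample IDs such as URD-L-PR-001 (an assumption,
# not a published CALAM specification).
FORM_ID_RE = re.compile(r"([A-Z]{3})-([A-Z])-([A-Z]{2})-(\d{3})")


def build_form_id(language: str, category: str, subcategory: str,
                  form_no: int) -> str:
    """Chain the codes with hyphens and zero-pad the form number."""
    return f"{language}-{category}-{subcategory}-{form_no:03d}"


def parse_form_id(form_id: str) -> dict:
    """Split a form ID back into its components."""
    match = FORM_ID_RE.fullmatch(form_id)
    if match is None:
        raise ValueError(f"not a valid form ID: {form_id!r}")
    language, category, subcategory, form_no = match.groups()
    return {
        "language": language,
        "category": category,
        "subcategory": subcategory,
        "form_no": int(form_no),
    }
```

For instance, `build_form_id("URD", "L", "PR", 1)` yields the ID used in the example above, and `parse_form_id` recovers the four components from it.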

Auto-indexing also applies to the IDs of the segmented lines and words of a handwritten image.

Each handwritten form gets a unique ID, structured as follows:

  (a) The file name is composed of language (2 bits), category (3 bits), subcategory (3 bits) and an 8-bit form number (xxxxxxxx). The index structure is shown in Fig. 3.

  (b) The form ID index is therefore 16 bits long, so the maximum total number of forms is 2^16 = 65,536.

  (c) There can be at most 8 categories, giving 2,048 forms per category, and at most 8 subcategories per category, giving 256 forms per subcategory.

  (d) CALAM can hold at most 4 languages, with 16,384 handwritten forms per language.
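The capacity figures above follow directly from the field widths. The sketch below packs the four fields into one 16-bit integer and recomputes the limits; it assumes each field is stored as a zero-based integer code, which the text does not specify, and all names are hypothetical.

```python
# Field widths from the index description: 2 + 3 + 3 + 8 = 16 bits.
LANG_BITS, CAT_BITS, SUBCAT_BITS, FORM_BITS = 2, 3, 3, 8


def pack_index(lang: int, cat: int, subcat: int, form: int) -> int:
    """Pack zero-based field codes into a single 16-bit index."""
    fields = ((lang, LANG_BITS), (cat, CAT_BITS),
              (subcat, SUBCAT_BITS), (form, FORM_BITS))
    index = 0
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"value {value} does not fit in {bits} bits")
        index = (index << bits) | value
    return index


def unpack_index(index: int) -> tuple:
    """Recover (lang, cat, subcat, form) from a packed index."""
    form = index & ((1 << FORM_BITS) - 1)
    index >>= FORM_BITS
    subcat = index & ((1 << SUBCAT_BITS) - 1)
    index >>= SUBCAT_BITS
    cat = index & ((1 << CAT_BITS) - 1)
    lang = index >> CAT_BITS
    return lang, cat, subcat, form


# Capacities implied by the field widths.
TOTAL_FORMS = 1 << (LANG_BITS + CAT_BITS + SUBCAT_BITS + FORM_BITS)
FORMS_PER_LANGUAGE = 1 << (CAT_BITS + SUBCAT_BITS + FORM_BITS)
FORMS_PER_CATEGORY = 1 << (SUBCAT_BITS + FORM_BITS)
FORMS_PER_SUBCATEGORY = 1 << FORM_BITS
```

The four constants evaluate to 65,536, 16,384, 2,048 and 256, matching items (b) to (d).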

4.1.2 Handwritten Image Storage in Database

To keep the database consistent, every handwritten text image stored in it carries the same unique ID that was generated during auto-indexing. The file name consists of a series of codes chained together with hyphens. The language codes used in the database are drawn from ISO-639, for example URD for Urdu, ENG for English and HIN for Hindi.

  • Auto-indexing is unique down to the word level.

  • The name of the image file in the database has the following format:
    [Language code]-[Subcategory-id]-[FormNo].png

  • The name of a line file generally has the format:
    [Language Code]-[Cat]-[Subcat]-[FormNo]-[Line No].png

  • The name of a word file generally has the format:
    [Language Code]-[Cat]-[Subcat]-[FormNo]-[Line No]-[WordNo].png
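The line and word naming patterns above can be sketched as simple string builders. This is a hedged illustration: the function names are hypothetical, and the absence of zero-padding on line and word numbers is an assumption, since the text does not state how those numbers are written.

```python
def line_filename(form_id: str, line_no: int) -> str:
    """Name of a segmented line image: [form ID]-[Line No].png."""
    return f"{form_id}-{line_no}.png"


def word_filename(form_id: str, line_no: int, word_no: int) -> str:
    """Name of a segmented word image: [form ID]-[Line No]-[WordNo].png."""
    return f"{form_id}-{line_no}-{word_no}.png"


def parent_line(word_file: str) -> str:
    """Derive the line file that a word file belongs to by dropping
    the trailing word number."""
    stem, extension = word_file.rsplit(".", 1)
    return stem.rsplit("-", 1)[0] + "." + extension
```

Because each name extends its parent's name, the line of a word (and the form of a line) can be recovered from the file name alone, without a database lookup.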

4.1.3 Searching

This function lets the user find any text or image in any category of the corpus by keyword, string, image ID, line ID or word ID.

A search returns all database entries whose text fields contain the query string entered in the search box. For example, when an ID is entered, the search engine looks up that ID in the database and displays the result, with a link to the matching image. This gives direct access to the required attributes and annotated information. The query is also highlighted on the result page.
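A search of this kind can be sketched with an in-memory SQLite database. The table name, column names and demo rows below are invented for illustration, and the bracket highlighting merely stands in for whatever highlighting the interface applies; none of this is taken from CALAM's actual schema.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with made-up demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE forms (form_id TEXT PRIMARY KEY, transcription TEXT)")
    conn.executemany(
        "INSERT INTO forms VALUES (?, ?)",
        [("URD-H-IH-001", "sample text about history"),
         ("ENG-N-SP-001", "sample text about sports")])
    return conn


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return rows whose ID or transcription contains the query,
    with exact occurrences in the transcription wrapped in brackets.
    Note: % and _ in the query act as LIKE wildcards in this sketch."""
    pattern = f"%{query}%"
    rows = conn.execute(
        "SELECT form_id, transcription FROM forms "
        "WHERE form_id LIKE ? OR transcription LIKE ? ORDER BY form_id",
        (pattern, pattern)).fetchall()
    return [(form_id, text.replace(query, f"[{query}]"))
            for form_id, text in rows]
```

Passing the query as a bound parameter rather than formatting it into the SQL string keeps the search safe against injection from the search box.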

4.1.4 Deletion or Modification in Existing Form

This function deletes or updates an existing image in the database. Deleting an image automatically deletes the lines, words and components that belong to it.
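One common way to obtain this automatic clean-up is a cascading foreign key. The sketch below uses SQLite with a hypothetical three-table schema (forms, lines, words); the paper does not say how CALAM implements the cascade, so this is one possible approach rather than the authors' method.

```python
import sqlite3

# Hypothetical schema: each line references its form, each word its line.
SCHEMA = """
CREATE TABLE forms (form_id TEXT PRIMARY KEY);
CREATE TABLE lines (
    line_id TEXT PRIMARY KEY,
    form_id TEXT NOT NULL REFERENCES forms(form_id) ON DELETE CASCADE
);
CREATE TABLE words (
    word_id TEXT PRIMARY KEY,
    line_id TEXT NOT NULL REFERENCES lines(line_id) ON DELETE CASCADE
);
"""


def open_db() -> sqlite3.Connection:
    """Open an in-memory database with foreign-key enforcement on."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def delete_form(conn: sqlite3.Connection, form_id: str) -> None:
    """Delete a form; its lines and words are removed by the cascade."""
    with conn:
        conn.execute("DELETE FROM forms WHERE form_id = ?", (form_id,))
```

With the cascade declared in the schema, the application only issues one delete statement and the database keeps lines and words consistent.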

4.1.5 Bounding Box

Structural mark-up is done for lines, words and ligatures. This is required for proper benchmarking of segmentation techniques for handwritten text recognition. A bounding box is displayed over the selected component for better visibility, so that its position within the image is easy to recognize. A mapping is maintained between the window screen and the view port. When the cursor points at the unique ID of a line, word or ligature, a rectangular bounding box appears over the corresponding line, word or ligature of the image in the view port. A sample of the graphical user interface is shown in Fig. 2.

Fig. 2 A graphical interface of handwritten text image and transcription information
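The mapping between image coordinates and the view port can be illustrated with a plain scaling transform. This sketch assumes the image fills the view port with no offset or cropping, which the text does not state; `Box` and `to_viewport` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle: top-left corner plus width and height."""
    x: float
    y: float
    width: float
    height: float


def to_viewport(box: Box, image_size: Tuple[float, float],
                viewport_size: Tuple[float, float]) -> Box:
    """Scale a box from image pixel coordinates to view-port coordinates.

    Assumes the whole image is stretched to fill the view port.
    """
    scale_x = viewport_size[0] / image_size[0]
    scale_y = viewport_size[1] / image_size[1]
    return Box(box.x * scale_x, box.y * scale_y,
               box.width * scale_x, box.height * scale_y)
```

Storing boxes in image coordinates and scaling them only at display time keeps the ground truth independent of the screen size.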

4.1.6 XML Mark-Up of Handwritten Image

Each image in the corpus is marked up with meta-information, as shown in Table 1. XML is the most widely used file format for ground-truth annotation of a corpus. CALAM can generate an XML file for each image in the database from the data-entry description. The user selects an image, generates the corresponding XML file, and can then download the file or view it directly.

Table 1 Meta-information of the XML-formatted file

All the meta-information of a handwritten text image, its segmented lines, words and components, and the writer's name is encoded in XML together with the ground-truth data [13]. A sample of the XML meta-information, which is extracted automatically from the data-entry procedure, is presented in Fig. 3.

Fig. 3 XML mark-up file of handwritten image

The Standard Character Encoding Scheme (CES), under the guidelines of the Text Encoding Initiative (TEI), is used for electronic data encoding and for the meta-information in the XML files.
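Generating such a file from database entries can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names below (`form`, `writer`, `line`, `word`, `id`) are placeholders chosen for illustration; they are not the schema of Table 1 or Fig. 3.

```python
import xml.etree.ElementTree as ET


def form_to_xml(form_id: str, writer: str, lines: list) -> str:
    """Serialize one form as XML.

    `lines` is a list of lines, each a list of transcribed words.
    Element names are illustrative assumptions, not CALAM's schema.
    """
    root = ET.Element("form", id=form_id)
    ET.SubElement(root, "writer").text = writer
    for line_no, words in enumerate(lines, start=1):
        line_id = f"{form_id}-{line_no}"
        line_el = ET.SubElement(root, "line", id=line_id)
        for word_no, text in enumerate(words, start=1):
            word_el = ET.SubElement(line_el, "word",
                                    id=f"{line_id}-{word_no}")
            word_el.text = text
    # encoding="unicode" returns a str, so Urdu or Hindi text is kept as-is.
    return ET.tostring(root, encoding="unicode")
```

Each line and word element carries the same hierarchical ID used for the image files, so the XML and the segmented images can be cross-referenced.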

5 Validation

Data validation is the process of ensuring that a program operates on clean, correct and useful data. Validation checks are essential to maintaining the integrity of any database structure. In our corpus they are implemented through auto-indexing and cross-indexing routines, often called validation rules, validation constraints or check routines, which verify the correctness, meaningfulness and security of the data input to the system. In short, data should be validated at the stage where it is most likely to be erroneous. The validation techniques applied are form-level validation, search-criteria validation, field-level validation and range validation.
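Three of these checks can be sketched in one function. The field names, the ID pattern and the form-number range of 1 to 256 (derived from the 8-bit form number, assuming numbering starts at 001) are assumptions for illustration, not CALAM's actual rules.

```python
import re

# ID pattern inferred from sample IDs such as URD-H-IH-001 (an assumption).
FORM_ID_RE = re.compile(r"[A-Z]{3}-[A-Z]-[A-Z]{2}-(\d{3})")
# An 8-bit form number allows 256 forms per subcategory.
MAX_FORMS_PER_SUBCATEGORY = 256


def validate_record(record: dict) -> list:
    """Return a list of validation errors (empty if the record is valid)."""
    errors = []
    # Form-level check: required fields must be present and non-empty.
    for field in ("form_id", "transcription"):
        if not record.get(field):
            errors.append(f"missing field: {field}")
    # Field-level check: the ID must follow the expected pattern.
    form_id = record.get("form_id") or ""
    match = FORM_ID_RE.fullmatch(form_id)
    if form_id and match is None:
        errors.append("malformed form_id")
    # Range check: the form number must fit the 8-bit capacity.
    if match is not None:
        form_no = int(match.group(1))
        if not 1 <= form_no <= MAX_FORMS_PER_SUBCATEGORY:
            errors.append("form number out of range")
    return errors
```

Returning all errors at once, rather than stopping at the first, lets a data-entry screen flag every problem in a single pass.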

6 Comparative Study of CALAM

A comparative analysis of CALAM against existing structures for handwritten text image corpora (Pix Labeler [14], GTLC [15], Truthing Tool [16], APTI [17]) is shown in Table 2.

Table 2 Comparative analysis of CALAM

Compared with the above structures, CALAM can display a handwritten text image and the transcription of that image on the same screen in a collaborative context. An XML file of meta-information can be generated automatically on the basis of the database entries. CALAM also provides structural mark-up information for benchmarking handwritten text segmentation and OCR techniques.

7 Conclusions

In this paper we have presented a structure for developing a standard corpus for various languages. So far it has been tested with three languages: Urdu, Hindi and English. The uniformity of the structure provides an appropriate way to annotate handwritten text images. We have described the data collection methodology and the characteristics of the structure that allow the data to be manipulated for benchmarking tests.

The structure has the potential to give researchers the facilities they need for linguistic research on a single platform. Its aim is to build a resource that provides ground-truth annotation for handwritten text images. The structure would be a rich basis for designing large datasets for research related to natural language processing. Every step, from data collection to validation and the generation of XML meta-information, has been carried out with the utmost care, following standard procedures and rules. The fourth part of the handwritten text form, which holds demographic information about the writers, could be used to train a system for automatic data extraction from handwritten forms.