Impact of Named Entity Recognition on Kannada Documents Classification

  • R. Jayashree
  • Basavaraj S. Anami
  • S. Teju
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 801)

Abstract

Natural language processing for the Kannada language is a promising research field, owing to the unavailability of tools and to challenges such as the lack of an annotated Kannada corpus. The main objective of this paper is to study the impact of named entity recognition (NER) on the classification of Kannada text documents. A rule-based Kannada named entity recognition system is implemented and integrated with a Naïve Bayes classifier using a tool for this purpose. The rule-based approach is chosen for the experiments. The work also attempts to improve classifier performance for Kannada documents through NER. A comprehensive study is conducted to investigate the impact of Kannada named entity recognition on Kannada text document classification using the Naïve Bayes classifier. Experimental results show that the classification algorithm produces better results for Kannada text documents in which named entities have already been recognized.

Keywords

Natural language processing; Kannada named entity recognition; Rule based approach; Text classification

1 Introduction

Natural Language Processing (NLP) is a field of computer science that aims to enable interaction between humans and machines. It is widely used in a variety of applications such as email, news articles, web portals and social media, to name a few. For decades, natural language processing has been part of everyday life in tasks such as spell checking, detecting whether mail is spam, language translation, grammar correction and question answering.

Information Retrieval (IR) is an application of NLP, and Named Entity Recognition (NER) is a subarea of information retrieval. NER is a higher-level NLP application that involves identifying proper names in text and classifying them into predefined categories such as person names, names of organizations, names of locations and numeric expressions (date and time).

Kannada is spoken predominantly in the state of Karnataka, India. For Kannada, resources and preprocessing tools such as annotated corpora, name dictionaries, good morphological analyzers and Part-of-Speech (POS) taggers are not yet available in the required measure, and little NLP work has been done for the language.

Here, we attempt to develop a rule-based system for Kannada NER. A rule-based system needs extensive grammatical and linguistic analysis to formulate its rules. Rule-based approaches can give good results when sufficient gazetteer lists, language-dependent features and rules tailored to the particular language are available. Named entities are open-class words: new words are added to languages every day, so a gazetteer that stores every word is not possible. The gazetteers therefore need to be reduced to a finite set of tests on suffixes, prefixes, context words and so on. All rule-based approaches are language dependent.

2 Literature Survey

The study of approaches to named entity recognition in Indian languages [1] surveys the various approaches for finding named entities (NEs) in Indian languages, the challenges these languages pose, and the results obtained when identifying named entities in text documents with these approaches.

The rule-based method for finding named entities in Kannada [2] recognizes named entities (names of persons, locations, time, organizations, numbers and measurements) in the Kannada language. It uses manually collected suffix and prefix lists and a proper noun list of 5000 words. English has NER features such as capitalization that help in finding named entities; the lack of capitalization in Kannada makes the NER task more complex than in English. The proposed system achieves 86% precision and 90% recall over 20 files.

The work on Kannada named entity recognition and classification [3] provides a comprehensive study of implementing NLP models based on noun taggers. The authors proposed and implemented a Hidden Markov Model (HMM), a supervised learning technique. Unannotated Kannada text files are given as input to the system, which recognizes the named entities and outputs an annotated text file. A suitable cryptographic algorithm is applied to the output of the classification system in order to secure the corpus.

The rule-based approach to named entity recognition for the Malay language [4] gives a detailed description of a system for recognizing named entities in Malay, which contains rules for identifying persons, locations, organizations, etc. The algorithm implements both rule-based and machine learning methods to obtain better results. The performance of the system can be improved by updating the dictionaries and lists.

The challenges of recognizing named entities in Indian languages are discussed in detail in the rule-based approach to identifying named entities in Urdu [5], which proposes a rule-based algorithm for Urdu. The system uses n-grams and obtains promising results. 6-grams are used to implement the model, and authority files are created that list common person names, location names, organization names, person name prefixes, suffixes, etc. The work suggests that 3-grams would be ideal for any named entity recognition system.

For recognizing named entities in Telugu [6], the authors combined a CRF approach with a rule-based approach. Little work had been done for Telugu because of the lack of an annotated corpus, so they manually created a corpus of 13,425 words, each manually tagged as noun or not-noun, and extracted a few features that help in identifying nouns in a given Telugu text. The work indicates that the performance of the system can be increased by enlarging the gazetteer list and that a hybrid approach can achieve better accuracy than any other approach.

A brief review of the basic approaches to named entity recognition is given in [7]. The authors tried to improve the precision and portability of the developed system, since portability is the most difficult problem in NER. A rule-based approach improves precision by adding more rules, but this decreases portability because the rules are fixed. To overcome this problem, the paper proposed fuzzy NER, which produced better results.

An algorithm is proposed to classify Punjabi text documents [8] by creating a domain-based ontology in Punjabi that consists of domain-related terms. The main advantage of classification with the help of an ontology is that it does not need a training dataset.

The work on sentence-level text classification [9] highlights the importance of sentence-level text classification in the Kannada language and examines the performance of the Naïve Bayes classifier. The literature survey highlights the importance of named entities and their impact on text classification. No considerable work has been recorded on NER in the Kannada language. This motivated us to take up NER and study its impact on text classification.

3 Proposed System

This paper has two main parts: first, identification of named entities in the given text document, and second, classification of the Kannada text document. For named entity recognition, a rule-based approach is used because it is expected to produce better results for Indian languages (Fig. 1).

Fig. 1. Architecture of proposed system

3.1 Named Entity Recognition

Rules are handcrafted regular expressions designed from language-dependent features and context features. One authority file is created that consists of names of people, organizations and locations. Another authority file is created that consists of lists of common person name prefixes and suffixes, location suffixes and organization suffixes. 3-grams are considered to obtain context features and identify named entities.
  3.1.1 Person names: This file consists of 500 person names such as Kuvempu, Chandrasheker Kambar, Goruru Ramaswamy Iyenger etc.
  3.1.2 Person prefix: This file consists of 51 person name prefixes such as Pradhani, Sri, Pro, Rastrapathi etc.
  3.1.3 Person suffix: This file consists of 60 person name suffixes such as Singh, Kapoor, Gowda etc.
  3.1.4 Location names: This file consists of 500 names of locations such as Karnataka, Andhra Pradesh, America etc.
  3.1.5 Location suffix: This file consists of 15 location name suffixes such as Uru, Halli etc.
  3.1.6 Organization names: This file consists of 200 names of organizations such as SCC, TATA Steel, Google, Yahoo etc.
  3.1.7 Organization suffix: This file consists of 34 organization name suffixes such as limited, Nigama, Samsthe etc.
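
The 3-gram context mentioned above can be pictured as a sliding window of (previous, current, next) tokens. The following is a minimal sketch, not the paper's implementation: the function name `context_trigrams` and the use of `None` to mark sentence boundaries are assumptions, and romanized tokens stand in for Kannada script.

```python
from typing import List, Optional, Tuple

# (previous token, current token, next token); None marks a sentence boundary.
Trigram = Tuple[Optional[str], str, Optional[str]]


def context_trigrams(tokens: List[str]) -> List[Trigram]:
    """Return one (previous, current, next) window per token in the sentence."""
    padded: List[Optional[str]] = [None] + list(tokens) + [None]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical romanized sentence built from names listed in the authority files.
windows = context_trigrams(["Sri", "Kuvempu", "Karnataka"])
for window in windows:
    print(window)
```

Each window gives a rule access to the neighbouring tokens, which is what the prefix and suffix lists need in order to act as context features.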

     
Table 1 describes the tags used in this system, with examples.

Table 1. Named entity tags description

Figure 2 gives a detailed view of the module that identifies Kannada named entities. The input to this module is a set of Kannada text documents; a sample document is shown in Fig. 3. The output is the same documents with named entities marked by appropriate tags, as in the sample document in Fig. 4. Each Kannada text document is divided into a sequence of sentences, and each sentence is divided into a sequence of tokens. The program takes each token and matches it against the available dictionary. In this work, the dictionary means the created files listing person names, location names and organization names. If a token matches a dictionary entry directly, it is output as a named entity with the appropriate tag. For a token that has no match in the dictionary, NER features are checked: person name prefixes, person name suffixes, organization suffixes and location suffixes. The previous and next tokens around the current token are considered at this stage. If any NER feature matches, the token is recognized as a named entity and given the appropriate tag. In this way the rule-based approach produces tagged named entities from plain Kannada text, and every step in this module is expressed as a rule.
Fig. 2. Kannada named entity recognition module

Fig. 3. Sample input Kannada text

Fig. 4. Sample output Kannada text
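
The dictionary-then-features procedure of the NER module can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the tag names (`PERSON`, `LOCATION`, `ORGANIZATION`), the function names, the tiny romanized word lists and the exact order in which the feature rules fire are all hypothetical, since the paper does not list its rules or tag set here.

```python
from typing import Dict, List, Optional, Set, Tuple

# Tiny illustrative authority files (romanized). The paper's files are far larger.
NAMES: Dict[str, Set[str]] = {
    "PERSON": {"kuvempu"},
    "LOCATION": {"karnataka", "america"},
    "ORGANIZATION": {"google", "yahoo"},
}
PERSON_PREFIXES = {"sri", "pradhani", "pro", "rastrapathi"}
PERSON_SUFFIXES = {"singh", "kapoor", "gowda"}
LOCATION_SUFFIXES = ("uru", "halli")  # assumed to be word endings
ORG_SUFFIXES = {"limited", "nigama", "samsthe"}


def tag_token(prev: Optional[str], tok: str, nxt: Optional[str]) -> Optional[str]:
    """Return a tag for tok, or None if no dictionary entry or feature matches."""
    word = tok.lower()
    # Step 1: direct match against the name dictionaries.
    for tag, names in NAMES.items():
        if word in names:
            return tag
    # Step 2: NER features that look at the previous and next tokens.
    if prev is not None and prev.lower() in PERSON_PREFIXES:
        return "PERSON"
    if nxt is not None and nxt.lower() in PERSON_SUFFIXES:
        return "PERSON"
    if nxt is not None and nxt.lower() in ORG_SUFFIXES:
        return "ORGANIZATION"
    # Step 3: suffix test on the token itself.
    if word.endswith(LOCATION_SUFFIXES):
        return "LOCATION"
    return None


def tag_sentence(tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Tag every token of one sentence using its neighbours as context."""
    tagged = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        tagged.append((tok, tag_token(prev, tok, nxt)))
    return tagged


print(tag_sentence(["Sri", "Ramesh", "Mysuru"]))
```

Checking the dictionary first keeps known names from being overridden by a weaker affix rule; the order of the remaining rules is a design choice that a real system would tune against annotated text.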

3.2 Text Classification

See Fig. 5.

Fig. 5. Kannada text document classification module
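
The paper runs the Naïve Bayes classifier through a tool and does not describe its internals. Purely as an illustration of how tagged documents could feed such a classifier, the sketch below implements a small multinomial Naïve Bayes with add-one smoothing over bags of tokens. Everything in it is an assumption: the function names, the idea of treating tags such as `<PERSON>` as ordinary tokens, and the toy training documents.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_nb(docs: List[Tuple[List[str], str]]) -> Model:
    """Count documents per class and token occurrences per class."""
    class_docs: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(model: Model, tokens: List[str]) -> str:
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = "", -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy documents in the paper's three categories; tag tokens act as extra features.
train_docs = [
    (["match", "<PERSON>", "runs"], "sports"),
    (["election", "<PERSON>", "<ORGANIZATION>"], "politics"),
    (["film", "<PERSON>", "song"], "entertainment"),
]
model = train_nb(train_docs)
print(predict_nb(model, ["election", "<ORGANIZATION>"]))
```

Treating a tag as a token lets documents that mention different entities of the same type share evidence, which is one plausible way that recognized entities could help a bag-of-words classifier.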

4 Tests and Results

The training dataset for this system comprises 155 Kannada text documents in 3 categories: Entertainment, Sports and Politics.

A slight increase in the precision, recall and F-measure of the classifier is observed, as recorded in Tables 2 and 3. The gain appears for 20 and 33 documents; the results for 10 documents are identical in both settings. Classifier results show increased accuracy with named entities. A noticeable decrease in mean absolute error and root mean squared error is also recorded. This indicates that classifying documents after recognizing named entities increases the accuracy of the classifier.
Table 2. Classification results of Kannada text documents (without using named entities)

No. of Kannada documents   Precision   Recall   F-measure
10                         0.75        0.7      0.711
20                         0.846       0.75     0.711
33                         0.795       0.727    0.714

Table 3. Classification results of named entity recognized Kannada text documents (using named entities)

No. of Kannada documents   Precision   Recall   F-measure
10                         0.75        0.7      0.711
20                         0.867       0.8      0.786
33                         0.808       0.758    0.752

Table 4 summarizes the impact of named entities on Kannada document classification, using the 33-document results from Tables 2 and 3: precision, recall and F-measure all increase.
Table 4. Comparison of Precision, Recall and F-score with and without named entities.

Results     Kannada text documents   Kannada text documents with named entities identified
Precision   0.795                    0.808
Recall      0.727                    0.758
F-measure   0.714                    0.752
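
To make the size of the improvement in Table 4 concrete, the short calculation below derives the absolute and relative gains from the reported values. Only the six numbers come from the paper; the variable names are illustrative.

```python
# Values copied from Table 4 (33 documents).
without_ne = {"precision": 0.795, "recall": 0.727, "f_measure": 0.714}
with_ne = {"precision": 0.808, "recall": 0.758, "f_measure": 0.752}

gains = {}
for metric, baseline in without_ne.items():
    gain = with_ne[metric] - baseline          # absolute improvement
    relative = 100 * gain / baseline           # improvement relative to baseline, in %
    gains[metric] = (round(gain, 3), round(relative, 1))
    print(f"{metric}: +{gain:.3f} absolute, +{relative:.1f}% relative")
```

The absolute gains are 0.013 for precision, 0.031 for recall and 0.038 for F-measure, so the largest effect is on F-measure.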

5 Conclusion

Named entity recognition in the Kannada language is a promising research field, owing to the lack of prior work in the language and to challenges such as the lack of an annotated Kannada corpus.

In this work, a rule-based named entity recognition system is implemented with manually created dictionaries (lists of proper names). The NER system is integrated with a machine learning and data mining tool (WEKA). Using the collected corpus and the proposed system, a comprehensive study is conducted to investigate the impact of Kannada named entity recognition on Kannada text document classification using the Naïve Bayes classifier. Named entities have a direct impact on document classification: the Naïve Bayes classifier gives better classification accuracy when they are identified.

6 Future Work

Future enhancements to this work include adding more proper names to the dictionary, identifying more Kannada NER features such as person name suffixes, person name prefixes, location suffixes and organization suffixes, identifying more context features, and creating rules for the corresponding features.

References

  1. Hiremath, P., Shambhavi, B.R.: Approaches to named entity recognition in Indian languages: a study. Int. J. Eng. Adv. Technol. (IJEAT) 3(6), 191–194 (2014)
  2. Bhuvaneshwari, C.M.: Rule based methodology for recognition of Kannada named entities. IJLTET 3, 50–59 (2014)
  3. Amarappa, S., Sathyanarayana, S.V.: Named entity recognition and classification in Kannada language. Int. J. Electron. Comput. Sci. Eng. Trans. Mach. Learn. Artif. Intell. 2, 281–289 (2012)
  4. Alfred, R., Leong, L.C., On, C.K., Anthony, P.: Malay named entity recognition based on rule-based approach. Int. J. Mach. Learn. Comput. 4(3), 300–306 (2014)
  5. Riaz, K.: Rule-based named entity recognition in Urdu. In: Proceedings of the Named Entities Workshop, pp. 126–135 (2010)
  6. Srikanth, P., Murthy, K.N.: Named entity recognition for Telugu. In: Proceedings of the IJCNLP-2008 Workshop on NER for South and South East Asian Languages, Hyderabad (2008)
  7. Mansouri, A., Affendey, L.S., Mamat, A.: Named entity recognition approaches. IJCSNS Int. J. Comput. Sci. Netw. Secur. 8(2), 339–344 (2008)
  8. Kaur, K., Gupta, V.: Named entity recognition for Punjabi language. Int. J. Comput. Sci. Inf. Technol. Secur. (IJCSITS) 2(3) (2012)
  9. Jayashree, R., Srikanta, M.K., Anami, B.S.: An analysis of sentence level text classification in the Kannada language. In: International Conference of Soft Computing and Pattern Recognition (SoCPaR), pp. 147–151. IEEE (2011)

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. PES Institute of Technology, Bangalore, India