Integer Representation and B-Tree for Classification of Text Documents: An Integrated Approach

  • S. N. Bharath Bhushan
  • Ajit Danti
  • Steven Lawrence Fernandes
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 701)

Abstract

Text document classification is attracting growing interest because of the increasing availability of information in textual or electronic form. Conventional approaches generally treat the representation of text data and the classification of text documents as independent issues. In this article, we take the view that the overall efficiency of a text classification system depends on both an effective representation of the text data and an efficient methodology for classifying the documents. We propose an effective compressed representation for text documents, followed by a B-Tree-based methodology adapted for classification. The proposed compressed representation and B-Tree methodology are evaluated on a publicly available large corpus to validate their effectiveness.

Keywords

Text representation, B-Tree, Text classification

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • S. N. Bharath Bhushan (1)
  • Ajit Danti (2)
  • Steven Lawrence Fernandes (3)
  1. Department of Computer Science & Engineering, Sahyadri College of Engineering & Management, Adyar, Mangalore, India
  2. Department of Computer Applications, JNN College of Engineering, Shimoga, India
  3. Department of Electronics and Communications, Sahyadri College of Engineering & Management, Adyar, Mangalore, India
