Nature Inspired Data Mining Algorithm for Document Clustering in Information Retrieval

  • Conference paper
Information Retrieval Technology (AIRS 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8870)

Abstract

Document clustering is an important technique that is widely employed in Information Retrieval (IR). Various clustering techniques have been reported, but the effectiveness of most of them depends on the initial value of k, the number of clusters. Such an approach may not be suitable because prior knowledge of the document collection may not be available. Various swarm-based clustering techniques have been proposed to address this problem, and this paper contributes to that line of work by exploring the adaptation of the Firefly Algorithm (FA) to document clustering. We extend earlier work on the Gravitation Firefly Algorithm (GFA) by introducing a relocate mechanism that reassigns already-assigned documents where necessary. The newly proposed clustering algorithm, known as GFA_R, is tested on a benchmark dataset obtained from 20Newsgroups. Experimental results on external and relative quality metrics for GFA_R are compared against those obtained using the standard GFA and Bisect K-means. The results indicate that extending GFA to GFA_R yields better-quality clustering.
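
The abstract does not say how the relocate mechanism works, so the following is only a minimal sketch of what one relocation pass could look like, not the authors' GFA_R procedure. It assumes documents are term-weight vectors, that each cluster is summarised by the mean of its members, and that a document moves to the cluster whose centroid is most similar under cosine similarity. The names `cosine`, `centroid`, and `relocate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def relocate(docs, labels):
    """One relocation pass (assumed rule, not taken from the paper).

    Centroids are computed once from the current assignment; each document
    is then moved to the cluster with the most similar centroid. On a tie,
    the document stays where it is.
    """
    clusters = {}
    for vec, lab in zip(docs, labels):
        clusters.setdefault(lab, []).append(vec)
    centroids = {lab: centroid(vecs) for lab, vecs in clusters.items()}

    new_labels = []
    for vec, lab in zip(docs, labels):
        best = max(
            centroids,
            key=lambda c: (cosine(vec, centroids[c]), c == lab),
        )
        new_labels.append(best)
    return new_labels


# Toy data: the last document starts in cluster 1 but resembles cluster 0.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
labels = [0, 0, 1, 1, 1]
print(relocate(docs, labels))
```

In this toy run only the last document changes cluster. A pass like this could be repeated until no label changes, although whether GFA_R iterates in that way is not stated in the abstract.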

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Mohammed, A.J., Yusof, Y., Husni, H. (2014). Nature Inspired Data Mining Algorithm for Document Clustering in Information Retrieval. In: Jaafar, A., et al. Information Retrieval Technology. AIRS 2014. Lecture Notes in Computer Science, vol 8870. Springer, Cham. https://doi.org/10.1007/978-3-319-12844-3_33

  • DOI: https://doi.org/10.1007/978-3-319-12844-3_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12843-6

  • Online ISBN: 978-3-319-12844-3

  • eBook Packages: Computer Science, Computer Science (R0)
