Text Classification Using Ensemble Features Selection and Data Mining Techniques

  • Conference paper
Swarm, Evolutionary, and Memetic Computing (SEMCCO 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8947)

Abstract

Text categorization is a text mining (text analytics) task that involves extracting useful information from unstructured resources and then categorizing the documents. In this paper, we classify the TechTC dataset, which was collected from various Web directories. We employed feature selection methods such as the Gini index, chi-square, t-statistic, and correlation, which drastically reduced the model building time. Various neural network models, such as the probabilistic neural network, group method of data handling, and multilayer perceptron, yielded higher accuracies than other techniques applied in the literature.
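To make the feature selection step more concrete, the following is a minimal sketch of scoring terms with two of the measures named in the abstract (chi-square and the Gini index) and combining the two rankings. It is not the authors' implementation. The toy corpus, the function names, the particular Gini formulation (sum of squared class proportions among documents containing the term), and the choice to combine rankings by taking the union of each method's top-k terms are all assumptions made for illustration; the abstract does not state how the ensemble is formed.

```python
from collections import Counter

# Toy labeled corpus (hypothetical; not taken from TechTC).
DOCS = [
    ("cheap flights hotel deals", "travel"),
    ("hotel booking beach flights", "travel"),
    ("beach resort travel deals", "travel"),
    ("python code compiler bug", "tech"),
    ("compiler code release deals", "tech"),
    ("python release bug fix", "tech"),
]

def term_class_counts(docs):
    """Count, per term, how many documents of each class contain it."""
    df = {}                  # term -> Counter(class -> document frequency)
    class_sizes = Counter()  # class -> number of documents
    for text, label in docs:
        class_sizes[label] += 1
        for term in set(text.split()):
            df.setdefault(term, Counter())[label] += 1
    return df, class_sizes

def chi_square_scores(df, class_sizes, positive):
    """Chi-square statistic of the 2x2 table (term present/absent vs. class)."""
    n = sum(class_sizes.values())
    n_pos = class_sizes[positive]
    scores = {}
    for term, counts in df.items():
        a = counts[positive]              # term present, positive class
        b = sum(counts.values()) - a      # term present, other classes
        c = n_pos - a                     # term absent, positive class
        d = (n - n_pos) - b               # term absent, other classes
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def gini_scores(df):
    """Assumed Gini formulation: sum over classes of P(class | term)^2.
    Higher values mean the term is concentrated in fewer classes."""
    scores = {}
    for term, counts in df.items():
        total = sum(counts.values())
        scores[term] = sum((k / total) ** 2 for k in counts.values())
    return scores

def top_k(scores, k):
    """Top-k terms by score; ties are broken alphabetically for determinism."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])

df, class_sizes = term_class_counts(DOCS)
chi = chi_square_scores(df, class_sizes, positive="travel")
gini = gini_scores(df)

# Assumed ensemble rule: keep a term if either method ranks it in its top k.
selected = top_k(chi, 4) | top_k(gini, 4)
print(sorted(selected))
```

On this toy data the term "deals" appears in both classes, so it receives the lowest score under both measures and is dropped, while terms confined to one class are retained. Shrinking the vocabulary in this way before training is what reduces model building time.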

Author information

Correspondence to Vadlamani Ravi.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Shravankumar, B., Ravi, V. (2015). Text Classification Using Ensemble Features Selection and Data Mining Techniques. In: Panigrahi, B., Suganthan, P., Das, S. (eds) Swarm, Evolutionary, and Memetic Computing. SEMCCO 2014. Lecture Notes in Computer Science, vol 8947. Springer, Cham. https://doi.org/10.1007/978-3-319-20294-5_16

  • DOI: https://doi.org/10.1007/978-3-319-20294-5_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20293-8

  • Online ISBN: 978-3-319-20294-5

  • eBook Packages: Computer Science, Computer Science (R0)
