Text Classification Using Ensemble Features Selection and Data Mining Techniques

Shravankumar, B.; Ravi, Vadlamani

doi:10.1007/978-3-319-20294-5_16

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8947)

Included in the following conference series:

International Conference on Swarm, Evolutionary, and Memetic Computing

Abstract

Text categorization is a text mining/analytics task that involves extracting useful information from unstructured resources and then assigning the documents to categories. In this paper, we classify the TechTC dataset, collected from various Web directories. We employed feature selection methods such as the Gini index, chi-square, t-statistic, and correlation, which drastically reduced the model building time. Neural network models such as the probabilistic neural network, group method of data handling, and multilayer perceptron yielded higher accuracies than other techniques applied in the literature.
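
As an illustration of one of the feature selection methods named above, the following is a minimal sketch of chi-square term scoring for a two-class problem. It is not the authors' implementation: the toy documents, the binary term-presence representation, and the helper names `chi_square` and `top_k` are all assumptions made for this example.

```python
def chi_square(docs, labels, term):
    """Chi-square statistic of the 2x2 table: term presence vs. binary class.

    docs   -- list of sets of terms (hypothetical toy representation)
    labels -- list of 0/1 class labels, one per document
    """
    n = len(docs)
    pairs = list(zip(docs, labels))
    present_pos = sum(1 for doc, y in pairs if term in doc and y == 1)
    present_neg = sum(1 for doc, y in pairs if term in doc and y == 0)
    absent_pos = sum(1 for doc, y in pairs if term not in doc and y == 1)
    absent_neg = sum(1 for doc, y in pairs if term not in doc and y == 0)
    denom = ((present_pos + present_neg) * (absent_pos + absent_neg)
             * (present_pos + absent_pos) * (present_neg + absent_neg))
    if denom == 0:
        return 0.0  # term occurs in all or no documents, or one class is empty
    diff = present_pos * absent_neg - present_neg * absent_pos
    return n * diff ** 2 / denom


def top_k(docs, labels, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: (-chi_square(docs, labels, t), t))
    return ranked[:k]


# Toy corpus (made up for illustration): two documents per class.
docs = [
    {"neural", "network", "training"},
    {"neural", "layer", "training"},
    {"recipe", "oven", "training"},
    {"recipe", "flour", "oven"},
]
labels = [1, 1, 0, 0]

selected = top_k(docs, labels, 2)
print(selected)
```

Terms that occur in only one class ("neural", "oven", "recipe") receive the highest score here, while "training", which appears in both classes, ranks lower. Keeping only the top-scoring terms shrinks the feature space before a classifier is trained, which is the kind of reduction in model building time the abstract describes.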

Author information

Authors and Affiliations

Centre of Excellence in CRM and Analytics, Institute for Development and Research in Banking Technology, Castle Hills Road no 1, Masab Tank, Hyderabad, 500 057, Andhra Pradesh, India
B. Shravankumar & Vadlamani Ravi
SCIS, University of Hyderabad, Hyderabad, 500 046, Andhra Pradesh, India
B. Shravankumar

Corresponding author

Correspondence to Vadlamani Ravi.

Editor information

Editors and Affiliations

Department of Electrical Engineering, IIT, New Delhi, India
Bijaya Ketan Panigrahi
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Ponnuthurai Nagaratnam Suganthan
Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, India
Swagatam Das

About this paper

Cite this paper

Shravankumar, B., Ravi, V. (2015). Text Classification Using Ensemble Features Selection and Data Mining Techniques. In: Panigrahi, B., Suganthan, P., Das, S. (eds) Swarm, Evolutionary, and Memetic Computing. SEMCCO 2014. Lecture Notes in Computer Science, vol 8947. Springer, Cham. https://doi.org/10.1007/978-3-319-20294-5_16

DOI: https://doi.org/10.1007/978-3-319-20294-5_16
Published: 16 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20293-8
Online ISBN: 978-3-319-20294-5
eBook Packages: Computer Science, Computer Science (R0)