One-Class Text Document Classification with OCSVM and LSI

Shravan Kumar, B.; Ravi, Vadlamani

doi:10.1007/978-981-10-3174-8_50

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 517)

Abstract

In this paper, we propose a novel one-class classification approach for text documents that uses a One-Class Support Vector Machine (OCSVM) and Latent Semantic Indexing (LSI) in tandem. We first apply t-statistic-based feature selection to the text corpus. We then apply the OCSVM to the rows of the document-term matrix that correspond to the negative class and extract the Support Vectors (SVs). In the test phase, we employ LSI to compare query documents from the positive class with the SVs extracted from the negative class, and a match score is computed using the cosine similarity measure. Finally, based on a prespecified threshold on the match score, we classify the positive category of the text corpus. Using the SVs for comparison reduces the computational load, which is the main contribution of the paper. We demonstrate the effectiveness of our approach on datasets pertaining to phishing and to sentiment analysis in a bank.
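
The test phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the LSI space is built or which side of the threshold counts as positive, so the sketch makes several assumptions: the support vectors are taken as given (for instance, from a trained OCSVM), the LSI basis is the truncated SVD of the support-vector matrix, a query is labeled positive when its best cosine match falls below the threshold, and the function names, the toy data, `k=2`, and `threshold=0.5` are all hypothetical.

```python
import numpy as np


def lsi_basis(reference_docs, k):
    """Top-k right singular vectors (terms x k) of a document-term matrix."""
    _, _, vt = np.linalg.svd(reference_docs, full_matrices=False)
    return vt[:k].T


def match_scores(support_vectors, queries, k=2):
    """Best cosine similarity of each query to any support vector in LSI space.

    Assumption: the LSI space is derived from the support vectors themselves.
    """
    basis = lsi_basis(support_vectors, k)
    sv_latent = support_vectors @ basis
    q_latent = queries @ basis
    # Normalise rows to unit length; the epsilon guards against zero vectors.
    sv_unit = sv_latent / (np.linalg.norm(sv_latent, axis=1, keepdims=True) + 1e-12)
    q_unit = q_latent / (np.linalg.norm(q_latent, axis=1, keepdims=True) + 1e-12)
    return (q_unit @ sv_unit.T).max(axis=1)


def classify(support_vectors, queries, threshold=0.5, k=2):
    """Assumed rule: a query is positive (True) if its match score is below threshold."""
    return match_scores(support_vectors, queries, k) < threshold


# Hypothetical toy data: 4 terms; negative-class support vectors use terms 0 and 1.
support_vectors = np.array([[2.0, 1.0, 0.0, 0.0],
                            [1.0, 2.0, 0.0, 0.0],
                            [3.0, 1.0, 0.0, 0.0]])
queries = np.array([[2.0, 2.0, 0.0, 0.0],   # resembles the negative class
                    [0.0, 0.0, 3.0, 1.0]])  # shares no terms with it

scores = match_scores(support_vectors, queries)
labels = classify(support_vectors, queries)
print(scores, labels)
```

Because each query is compared only with the support vectors rather than with every training document, the number of cosine computations per query grows with the number of support vectors, which is the computational saving the abstract refers to.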

Author information

Authors and Affiliations

Center of Excellence in Analytics, Institute for Development and Research in Banking Technology, Castle Hills, Road no. 1, Masab Tank, Hyderabad, 500057, India
B. Shravan Kumar & Vadlamani Ravi
School of Computer and Information Sciences, University of Hyderabad, Hyderabad, 500046, India
B. Shravan Kumar

Corresponding author

Correspondence to Vadlamani Ravi.

Editor information

Editors and Affiliations

Department of Electrical and Electronics Engineering, SRM University, Chennai, Tamil Nadu, India
Subhransu Sekhar Dash
Department of Electrical and Electronics Engineering, SRM University, Chennai, Tamil Nadu, India
K. Vijayakumar
Department of Electrical and Electronics Engineering, Indian Institute of Technology Delhi, New Delhi, India
Bijaya Ketan Panigrahi
Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Swagatam Das

About this paper

Cite this paper

Shravan Kumar, B., Ravi, V. (2017). One-Class Text Document Classification with OCSVM and LSI. In: Dash, S., Vijayakumar, K., Panigrahi, B., Das, S. (eds) Artificial Intelligence and Evolutionary Computations in Engineering Systems. Advances in Intelligent Systems and Computing, vol 517. Springer, Singapore. https://doi.org/10.1007/978-981-10-3174-8_50

DOI: https://doi.org/10.1007/978-981-10-3174-8_50
Published: 13 July 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3173-1
Online ISBN: 978-981-10-3174-8
eBook Packages: Engineering, Engineering (R0)
