Abstract
In this paper, we propose a novel one-class classification approach for text documents that uses a One-Class Support Vector Machine (OCSVM) and Latent Semantic Indexing (LSI) in tandem. We first apply t-statistic-based feature selection to the text corpus. We then train the OCSVM on the rows of the document-term matrix that correspond to the negative class and extract the Support Vectors (SVs). In the test phase, we employ LSI to compare query documents from the positive class with the SVs extracted from the negative class, and compute a match score using the cosine similarity measure. A prespecified threshold on the match score then determines which documents are assigned to the positive category. Comparing against the SVs only reduces the computational load, which is the main contribution of the paper. We demonstrate the effectiveness of our approach on datasets pertaining to phishing and to sentiment analysis in a bank.
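The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the LSI space is built or in which direction the threshold is applied, so the sketch assumes the space is fitted on the support vectors stacked with the query documents and that a query is labelled positive when its best match score falls below the threshold. The toy corpus, the function names, `nu=0.2`, `rank=2`, and the threshold of 0.5 are all hypothetical, and scikit-learn's `OneClassSVM` stands in for whatever OCSVM implementation was actually used.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def select_terms_by_t_statistic(X, y, k):
    """Indices of the k terms with the largest absolute Welch t-statistic."""
    neg, pos = X[y == 0], X[y == 1]
    spread = neg.var(axis=0, ddof=1) / len(neg) + pos.var(axis=0, ddof=1) / len(pos)
    t = (neg.mean(axis=0) - pos.mean(axis=0)) / np.sqrt(spread + 1e-12)
    return np.argsort(-np.abs(t))[:k]


def extract_support_vectors(X_neg, nu=0.2):
    """Fit a one-class SVM on negative-class rows and return its support vectors."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_neg)
    return model.support_vectors_


def lsi_match_scores(support_vectors, queries, rank=2):
    """Best cosine similarity of each query to any support vector in LSI space."""
    # Assumption: the latent space is fitted on support vectors and queries together.
    docs = np.vstack([support_vectors, queries])
    # SVD of the term-by-document matrix; the leading columns of U span the latent space.
    U, _, _ = np.linalg.svd(docs.T, full_matrices=False)
    latent = docs @ U[:, :rank]
    latent /= np.linalg.norm(latent, axis=1, keepdims=True) + 1e-12
    n_sv = len(support_vectors)
    return (latent[n_sv:] @ latent[:n_sv].T).max(axis=1)


rng = np.random.default_rng(0)


def toy_documents(n, positive):
    """Synthetic term counts: 10 negative-class terms, 10 positive-class terms, 20 noise terms."""
    own, other = rng.poisson(5.0, (n, 10)), rng.poisson(0.2, (n, 10))
    topical = np.hstack([other, own] if positive else [own, other])
    return np.hstack([topical, rng.poisson(1.0, (n, 20))]).astype(float)


# Training corpus: label 0 is the negative class, label 1 the positive class.
X_train = np.vstack([toy_documents(40, False), toy_documents(40, True)])
y_train = np.array([0] * 40 + [1] * 40)
selected = select_terms_by_t_statistic(X_train, y_train, k=20)

# Only negative-class rows are shown to the one-class SVM.
support_vectors = extract_support_vectors(X_train[y_train == 0][:, selected])

# Queries: 8 unseen negative documents followed by 8 unseen positive documents.
queries = np.vstack([toy_documents(8, False), toy_documents(8, True)])[:, selected]
scores = lsi_match_scores(support_vectors, queries)

THRESHOLD = 0.5  # hypothetical value; the abstract only says "prespecified"
predicted_positive = scores < THRESHOLD
```

Because each query is compared only against the support vectors rather than against every negative training document, the number of cosine computations per query scales with the number of support vectors, which is the saving the abstract points to.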
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Shravan Kumar, B., Ravi, V. (2017). One-Class Text Document Classification with OCSVM and LSI. In: Dash, S., Vijayakumar, K., Panigrahi, B., Das, S. (eds) Artificial Intelligence and Evolutionary Computations in Engineering Systems. Advances in Intelligent Systems and Computing, vol 517. Springer, Singapore. https://doi.org/10.1007/978-981-10-3174-8_50
DOI: https://doi.org/10.1007/978-981-10-3174-8_50
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3173-1
Online ISBN: 978-981-10-3174-8
eBook Packages: Engineering, Engineering (R0)