One-Class Text Document Classification with OCSVM and LSI

  • Conference paper
Artificial Intelligence and Evolutionary Computations in Engineering Systems

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 517)

Abstract

In this paper, we propose a novel one-class classification approach for text documents that uses a One-Class Support Vector Machine (OCSVM) and Latent Semantic Indexing (LSI) in tandem. We first apply t-statistic-based feature selection to the text corpus. We then train the OCSVM on the rows of the document-term matrix that correspond to the negative class and extract the support vectors (SVs). In the test phase, we employ LSI to compare query documents from the positive class with the SVs extracted from the negative class, computing a match score with the cosine similarity measure. Documents are then assigned to the positive category according to a prespecified threshold on the match score. Comparing queries against the SVs alone, rather than against every training document, reduces the computational load, which is the main contribution of the paper. We demonstrate the effectiveness of the approach on datasets for phishing detection and for sentiment analysis in a bank.
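
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the function names, the synthetic Poisson term counts, the RBF kernel, `nu=0.2`, two latent dimensions, and the 0.5 threshold are all assumptions made for the example. It further assumes that the t-statistic is computed between labelled samples of the two classes, that LSI is fitted on the support vectors, and that a query is called positive when its best cosine match against the negative-class SVs falls below the threshold; the abstract does not spell out these details.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import OneClassSVM


def t_statistic_select(X_a, X_b, top_k):
    """Return indices of the top_k terms ranked by absolute Welch t-statistic."""
    mean_gap = np.abs(X_a.mean(axis=0) - X_b.mean(axis=0))
    std_err = np.sqrt(X_a.var(axis=0, ddof=1) / len(X_a)
                      + X_b.var(axis=0, ddof=1) / len(X_b))
    # Simplification: terms with zero variance in both groups get a score of 0.
    t = mean_gap / np.where(std_err > 0, std_err, np.inf)
    return np.argsort(t)[::-1][:top_k]


def fit_support_vectors(X_neg, nu=0.2):
    """Train a one-class SVM on negative-class rows and return its support vectors."""
    ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_neg)
    return ocsvm.support_vectors_


def match_scores(support_vectors, X_query, n_components=2):
    """Best cosine similarity of each query to any support vector in LSI space."""
    # Assumption: the latent space is fitted on the support vectors only.
    lsi = TruncatedSVD(n_components=n_components, random_state=0)
    lsi.fit(support_vectors)
    sv_latent = lsi.transform(support_vectors)
    query_latent = lsi.transform(X_query)
    return cosine_similarity(query_latent, sv_latent).max(axis=1)


def classify_positive(scores, threshold=0.5):
    """Assumed rule: a weak match to the negative-class SVs means 'positive'."""
    return scores < threshold


# Synthetic document-term counts (hypothetical data, not the paper's corpora).
rng = np.random.default_rng(0)
neg_rates = np.array([5, 5, 5, 5, 5, 0.2, 0.2, 0.2, 0.2, 0.2])
X_neg = rng.poisson(neg_rates, size=(40, 10)).astype(float)
X_pos = rng.poisson(neg_rates[::-1], size=(10, 10)).astype(float)

selected = t_statistic_select(X_neg, X_pos, top_k=8)
support_vectors = fit_support_vectors(X_neg[:, selected])
scores = match_scores(support_vectors, X_pos[:, selected])
is_positive = classify_positive(scores)
print(len(support_vectors), "support vectors out of", len(X_neg), "training rows")
```

In this sketch each query is compared only with the support vectors, which are typically a subset of the negative-class rows, so the similarity matrix has fewer columns than it would if every training document were used. That is the source of the reduced computational load the abstract refers to.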

Author information

Corresponding author

Correspondence to Vadlamani Ravi.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Shravan Kumar, B., Ravi, V. (2017). One-Class Text Document Classification with OCSVM and LSI. In: Dash, S., Vijayakumar, K., Panigrahi, B., Das, S. (eds) Artificial Intelligence and Evolutionary Computations in Engineering Systems. Advances in Intelligent Systems and Computing, vol 517. Springer, Singapore. https://doi.org/10.1007/978-981-10-3174-8_50

  • DOI: https://doi.org/10.1007/978-981-10-3174-8_50

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3173-1

  • Online ISBN: 978-981-10-3174-8

  • eBook Packages: Engineering, Engineering (R0)
