Progress in Artificial Intelligence, Volume 8, Issue 1, pp 123–132

Fuzzy clustering-based semi-supervised approach for outlier detection in big text data

  • Farek Lazhar
Regular Paper


Text data is often polluted by outlier documents, which can significantly influence the performance of classification techniques. In this paper, we propose an approach based on fuzzy clustering to detect outlier documents. The approach rests on the assumption that documents assigned to different clusters with very close degrees of membership are candidate outliers. First, a semantic data model is built using the Doc2Vec framework. Second, fuzzy clustering is performed. Third, candidate outlier documents are detected based on their degrees of membership. Finally, for each candidate outlier, the objective function is recomputed, and the candidate is considered an outlier when it leads to a considerable increase in the objective function score. To show the effectiveness of our approach, two classification tests are applied: one on the original datasets and one on the datasets with outliers removed. Experimental results show that discarding outliers from the datasets improves the performance of classifiers.
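The pipeline summarised in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a plain fuzzy c-means with Bezdek-style updates, uses small hand-made 2-D vectors as stand-ins for Doc2Vec embeddings, and introduces hypothetical parameters (`gap`, `factor`) and helper names that do not come from the source. The confirmation step is also a simplification: with centers and memberships held fixed, the increase in the objective caused by a document equals that document's own term, which is compared here against the median term.

```python
import statistics

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, c):
    """Deterministic initial centers: first point, then farthest remaining points."""
    centers = [list(points[0])]
    while len(centers) < c:
        nxt = max(points, key=lambda p: min(sq_dist(p, ctr) for ctr in centers))
        centers.append(list(nxt))
    return centers

def memberships(points, centers, m):
    """Membership degrees u[i][j]; each row sums to 1."""
    u = []
    for p in points:
        inv = [max(sq_dist(p, ctr), 1e-12) ** (-1.0 / (m - 1.0)) for ctr in centers]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u

def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = farthest_first(points, c)
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        centers = []
        for j in range(c):
            w = [row[j] ** m for row in u]
            total = sum(w)
            centers.append([sum(wi * p[d] for wi, p in zip(w, points)) / total
                            for d in range(dim)])
    return centers, memberships(points, centers, m)

def candidate_outliers(u, gap=0.1):
    """Documents whose two highest membership degrees differ by less than `gap`
    (hypothetical threshold)."""
    found = []
    for i, row in enumerate(u):
        top = sorted(row, reverse=True)
        if top[0] - top[1] < gap:
            found.append(i)
    return found

def confirm_outliers(points, centers, u, candidates, m=2.0, factor=3.0):
    """Keep candidates whose term in the objective J = sum_i sum_j u_ij^m * d_ij^2
    exceeds `factor` times the median per-document term (assumed criterion)."""
    terms = [sum(u_ij ** m * sq_dist(p, ctr) for u_ij, ctr in zip(row, centers))
             for p, row in zip(points, u)]
    typical = statistics.median(terms)
    return [i for i in candidates if terms[i] > factor * typical]

# Toy stand-ins for document embeddings: two tight groups plus one stray vector.
vectors = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [4.0, 0.0], [4.1, 0.0], [4.0, 0.1], [4.1, 0.1],
    [2.05, 3.0],
]
centers, u = fuzzy_c_means(vectors, c=2)
candidates = candidate_outliers(u)
outliers = confirm_outliers(vectors, centers, u, candidates)
print(candidates, outliers)
```

In this toy setting the stray vector sits roughly midway between the two groups, so its membership degrees are nearly equal and its contribution to the objective is large. Real document embeddings would need the thresholds tuned to the data.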


Outlier detection, Fuzzy clustering, Big text data, Doc2Vec modeling, Sparsity, High dimensionality, Classification


Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. University of Guelma, Guelma, Algeria
