Advertisement

Methods to Investigate Concept Drift in Big Data Streams

Chapter

Abstract

The explosion of information from various social networking sites, Web clickstream, information retrieval, customers’ records, users’ reviews, business transactions, network event logs, etc. Results in generating a continuous deluge of data at different rates, called streaming data. Organizing, indexing, analyzing, or mining hidden knowledge from such a data deluge becomes a critical functionality for a broad range of content analysis tasks that includes emerging topic detection, interesting content identification, user interest profiling, and real-time Web search. But managing such ‘Big Data’ becomes even more challenging when streaming data is taken for analyzing and producing results in real time. The streaming data may include numeric, categorical, or mixed value. Most of the current research has been done on numeric data streams by exploiting the statistical properties of the numeric data. But now categorical/textual data streams have also gained researchers’ interest due to the high availability of data in textual format on the Internet. Applying classification for managing data streams is an unrealistic approach as not every incoming data has a class label. So, in such a case, for managing unlabeled data streams, a clustering technique is applied. One property that can affect the results of any clustering algorithm is concept drift. Therefore, detecting and managing concept drift over a period imposes a great challenge to better cluster analysis. This chapter provides an in-depth critique of various algorithms that have been introduced to handle concept drift in a real environment. A framework for examining concept drift in big data streams is also proposed.

Keywords

Big data Social networks Social media Clustering Concept drifting 

References

  1. 1.
    Zhang, B., Qin, S., Wang, W., Wang, D., & Xue, L. (2016). Data stream clustering based on fuzzy C-mean algorithm and entropy theory. Journal of Signal Processing, 126, 111–116.CrossRefGoogle Scholar
  2. 2.
    Lifna, C., & Vijaylakshmi, M. (2015). Identifying concept drifts in Twitter streams. In International Conference on Advanced Computing Technologies and Applications (pp. 86–94).Google Scholar
  3. 3.
    Xhafa, F., Naranjo, V., Barolli, L., & Takizawa, M. (2015). On Streaming Consistency of Big Data Stream Processing in Heterogenous Clusters. In 18th IEEE International Conference on Network-Based Information Systems (pp. 476–482).Google Scholar
  4. 4.
    Schnitzler, K., Davies, N., Ross, F., & Harris, R. (2016). Using Twitter™ to drive research impact: A discussion of strategies, opportunities and challenges. International Journal of Nursing Studies, 59, 15–26.CrossRefGoogle Scholar
  5. 5.
    Costa, J., Silva, C., Antunes, M., & Ribeiro, B. (2014). Concept Drift Awareness in Twitter Streams. In 13th IEEE International Conference on Machine Learning and Applications (pp. 294–299).Google Scholar
  6. 6.
    Wang, Y., Liu, J., Huang, Y., & Feng, X. (2016). Using Hashtag graph-based topic model to connect semantically-related words without co-occurrence in microblogs. IEEE Transactions on Knowledge and Data Engineering, 28(7), 1919–1933.CrossRefGoogle Scholar
  7. 7.
    Entities—Twitter Developers. (2017). In Dev.twitter.com. https://dev.twitter.com/overview/api/entities. Accessed May 8, 2017.
  8. 8.
    Eskandari, S., & Javidi, M. (2016). Online streaming feature selection using rough sets. International Journal of Approximate Reasoning, 69, 35–57.MathSciNetCrossRefMATHGoogle Scholar
  9. 9.
    Li, J., Tai, Z., Zhang, R., Yu, W., & Liu, L. (2014). Online bursty event detection from microblog. In IEEE/ACM 7th International Conference on Utility and Cloud Computing (pp. 865–870).Google Scholar
  10. 10.
    Adedoyin-Olowe, M., Gaber, M., Dancausa, C., Stahl, F., & Gomes, J. (2016). A rule dynamics approach to event detection in Twitter with its application to sports and politics. Expert Systems with Applications, 55, 351–360.CrossRefGoogle Scholar
  11. 11.
    Villanueva, D., González-Carrasco, I., López-Cuadrado, J., & Lado, N. (2016). SMORE: Towards a semantic modeling for knowledge representation on social media. Science of Computer Programming, 121, 16–33.CrossRefGoogle Scholar
  12. 12.
    Li, H. (2014). Detecting campaign promoters on Twitter using Markov random fields. In IEEE International Conference on Data Mining (pp. 290–299).Google Scholar
  13. 13.
    Kuo, R., Mei, C., Zulvia, F., & Tsai, C. (2016). An application of a meta-heuristic algorithm-based clustering ensemble method to APP customer segmentation. Neurocomputing, 205, 116–129.CrossRefGoogle Scholar
  14. 14.
    Wang, B., Miao, Y., Zhao, H., Jin, J., & Chen, Y. (2016). A biclustering-based method for market segmentation using customer pain points. Journal of Engineering Applications of Artificial Intelligence, 47, 101–109.CrossRefGoogle Scholar
  15. 15.
    Giannitsioti, E., Athanasia, S., Plachouras, D., Kanellaki, S., Bobota, F., Tzepetzi, G., et al. (2016). Impact of patients’ professional and educational status on perception of an antibiotic policy campaign: A pilot study at a university hospital. Journal of Global Antimicrobial Resistance, 6, 123–127.CrossRefGoogle Scholar
  16. 16.
    Han, J., Kamber, M., & Pei, J. (2011). Data mining (3rd ed.). Amsterdam: Elsevier/Morgan Kaufmann.MATHGoogle Scholar
  17. 17.
    He, Z., Xu, X., & Deng, S. (2011). Clustering categorical data streams. Journal of Computational Methods in Sciences and Engineering, 11(4), 185–192.Google Scholar
  18. 18.
    Wu, Q., & Ma, S. (2011). Detecting outliers in sliding window over categorical data streams. In Eighth International Conference on Fuzzy Systems and Knowledge Discovery (pp. 1663–1667).Google Scholar
  19. 19.
    Sora, M., Roy, S., & Singh, I. (2011). FLoMSqueezer: An effective approach for clustering categorical data stream. International Journal of Computer Science Issues, 8(6), 1.Google Scholar
  20. 20.
    Carbonera, J., & Abel, M. (2014). An entropy-based subspace clustering algorithm for categorical data. In IEEE 26th International Conference on Tools with Artificial Intelligence (pp. 272–277).Google Scholar
  21. 21.
    Qin, H., Ma, X., Herawan, T., & Zain, J. (2014). MGR: An information theory based hierarchical divisive clustering algorithm for categorical data. Knowledge-Based Systems, 67, 401–411.  https://doi.org/10.1016/j.knosys.2014.03.013.CrossRefGoogle Scholar
  22. 22.
    Lenco, D., Bifet, A., Pfahringer, B., & Poncelet, P. (2014). Change detection in categorical evolving data streams. In 29th Annual ACM Symposium on Applied Computing (pp. 792–797).Google Scholar
  23. 23.
    Cao, F., & Huang, J. Z. (2013). A concept-drifting detection algorithm for categorical evolving data. In J. Pei, V. S. Tseng, L. Cao, H. Motoda & G. Xu (Eds.), Advances in knowledge discovery and data mining. PAKDD 2013. Lecture notes in computer science (vol. 7819). Berlin, Heidelberg: Springer.Google Scholar
  24. 24.
    Li, Y., Li, D., Wang, S., & Zhai, Y. (2014). Incremental entropy-based clustering on categorical data streams with concept drift. Knowledge-Based Systems, 59, 33–47.  https://doi.org/10.1016/j.knosys.2014.02.004.CrossRefGoogle Scholar
  25. 25.
    Chen, H. L., Chen, M. S., & Lin, S. C. (2009). Catching the trend: A framework for clustering concept-drifting categorical data. IEEE Transactions on Knowledge Data Engineering, 21(5), 652–665.CrossRefGoogle Scholar
  26. 26.
    Cao, F., Liang, J., Bai, L., Zhao, X., & Dang, C. (2010). A framework for clustering categoricaltime-evolving data. IEEE Transactions on Fuzzy System, 18(5), 872–882.CrossRefGoogle Scholar
  27. 27.
    Cao, F., & Liang, J. (2011). A data labeling method for clustering categorical data. Expert Systems with Applications, 38, 2381–2385.  https://doi.org/10.1016/j.eswa.2010.08.026.CrossRefGoogle Scholar
  28. 28.
    Talistu, M., Moh, T. S., & Moh, M. (2015). Gossip-based spectral clustering of distributed data streams. In International Conference on High Performance Computing and Simulation (pp. 325–333).  https://doi.org/10.1109/HPCSim.2015.7237058.
  29. 29.
    Xhafa, F., Naranjo, V., Barolli, L., & Takizawa, M. (2015). On streaming consistency of big data stream processing in heterogenous clusters. In 18th International Conference on Network-Based Information Systems (pp. 476–482).Google Scholar
  30. 30.
    Martinez-Gil, J. (2016). CoTO: A novel approach for fuzzy aggregation of semantic similarity measures. Cognitive Systems Research, 40, 8–17.  https://doi.org/10.1016/j.cogsys.2016.01.001.CrossRefGoogle Scholar
  31. 31.
    Rehioui, H., Idrissi, A., Abourezq, M., & Zegrari F. (2016). DENCLUE-IM: A new approach for big data clustering. In 7th International Conference on Ambient Systems, Networks and Technologies (pp. 560–567).Google Scholar
  32. 32.
    Laohakiat, S., Phimoltares, S., & Lursinsap, C. (2017). A clustering algorithm for stream data with LDA-based unsupervised localized dimension reduction. Information Sciences, 381, 104–123.  https://doi.org/10.1016/j.ins.2016.11.018.CrossRefGoogle Scholar
  33. 33.
    Barddal, J., Gomes, H., Enembreck, F., & Pfahringer, B. (2017). A survey on feature drift adaptation: Definition, benchmark, challenges and future directions. Journal of Systems and Software, 127, 278–294.  https://doi.org/10.1016/j.jss.2016.07.005.CrossRefGoogle Scholar
  34. 34.
    Andrade Silva, J., Hruschka, E., & Gama, J. (2017). An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Systems with Applications, 67, 228–238.  https://doi.org/10.1016/j.eswa.2016.09.020.CrossRefGoogle Scholar
  35. 35.
    Liu, J., & Zio, E. (2016). A SVR-based ensemble approach for drifting data streams with recurring patterns. Applied Soft Computing, 47, 553–564.  https://doi.org/10.1016/j.asoc.2016.06.030.CrossRefGoogle Scholar
  36. 36.
    Bai, L., Cheng, X., Liang, J., & Shen, H. (2016). An optimization model for clustering categorical data streams with drifting concepts. IEEE Transactions on Knowledge and Data Engineering, 28, 2871–2883.  https://doi.org/10.1109/tkde.2016.2594068.CrossRefGoogle Scholar
  37. 37.
    Chen, H.-L., Chen, M.-S., & Lin, S.-C. (2009). Catching the trend: A framework for clustering concept-drifting categorical data. IEEE Transactions on Knowledge and Data Engineering, 21, 652–665.  https://doi.org/10.1109/tkde.2008.192.CrossRefGoogle Scholar
  38. 38.
    Cao, F., Liang, J., Bai, L., Zhao, X., & Dang, C. (2010). A framework for clustering categorical time-evolving data. IEEE Transactions on Fuzzy Systems, 18, 872–882.  https://doi.org/10.1109/tfuzz.2010.2050891.CrossRefGoogle Scholar
  39. 39.
    Koh, Y. S. (2016). CD-TDS: Change detection in transactional data streams for frequent pattern mining. In International Joint Conference on Neural Networks (pp. 1554–1561).  https://doi.org/10.1109/IJCNN.2016.7727383.
  40. 40.
    Song, G., Ye, Y., Zhang, H., Xu, X., Lau, R., & Liu, F. (2016). Dynamic clustering forest: An ensemble framework to efficiently classify textual data stream with concept drift. Information Sciences, 357, 125–143.  https://doi.org/10.1016/j.ins.2016.03.043.CrossRefGoogle Scholar
  41. 41.
    Haque, A., Khan, L., Baron, M., Thuraisingham, B., & Aggarwal, C. (2016). Efficient handling of concept drift and concept evolution over Stream Data. In IEEE 32nd International Conference on Data Engineering (pp. 481–492).Google Scholar
  42. 42.
    Sethi, T. S., Kantardzic, M., & Arabmakki, E. (2016). Monitoring classification blindspots to detect drifts from unlabeled data. In IEEE 17th International Conference on Information Reuse and Integration (pp. 142–151).Google Scholar
  43. 43.
    da Costa, F., Rios, R., & de Mello, R. (2016). Using dynamical systems tools to detect concept drift in data streams. Expert Systems with Applications, 60, 39–50.  https://doi.org/10.1016/j.eswa.2016.04.026.CrossRefGoogle Scholar
  44. 44.
    Lughofer, E., & Mouchaweh, M. S. (2015). Autonomous data stream clustering implementing split-and-merge concepts—Towards a plug-and-play approach. Information Sciences, 304, 54–79.  https://doi.org/10.1016/j.ins.2015.01.010.CrossRefGoogle Scholar
  45. 45.
    Yang, H., & Fong, S. (2015). Countering the concept-drift problems in big data by an incrementally optimized stream mining model. Journal of Systems and Software, 102, 158–166.  https://doi.org/10.1016/j.jss.2014.07.010.CrossRefGoogle Scholar
  46. 46.
    Wu, X., Li, P., & Hu, X. (2012). Learning from concept drifting data streams with unlabeled data. Neurocomputing, 92, 145–155.  https://doi.org/10.1016/j.neucom.2011.08.041.CrossRefGoogle Scholar
  47. 47.
    Hong, L., Dan, O., & Davison, B. D. (2011). Predicting popular messages in Twitter. In ACM International Conference on World Wide Web(WWW).Google Scholar
  48. 48.
    Li, C., Shan, M., Jheng, S., & Chou, K. (2016). Exploiting concept drift to predict popularity of social multimedia in microblogs. Information Sciences, 339, 310–331.  https://doi.org/10.1016/j.ins.2016.01.009.CrossRefGoogle Scholar
  49. 49.
    Shang, K., Yan, W., & Small, M. (2016). Evolving networks—Using past structure to predict the future. Physica A: Statistical Mechanics and its Applications, 455, 120–135.  https://doi.org/10.1016/j.physa.2016.02.067.CrossRefGoogle Scholar
  50. 50.
    Lipizzi, C., Dessavre, D., Iandoli, L., & Marquez, J. (2016). Social media conversation monitoring: Visualize information contents of Twitter messages using conversational metrics. Procedia Computer Science, 80, 2216–2220.  https://doi.org/10.1016/j.procs.2016.05.384.CrossRefGoogle Scholar
  51. 51.
    Miller, Z., Dickinson, B., Deitrick, W., Hu, W., & Wang, A. (2014). Twitter spammer detection using data stream clustering. Information Sciences, 260, 64–73.  https://doi.org/10.1016/j.ins.2013.11.016.CrossRefGoogle Scholar
  52. 52.
    Karunasekera, S., Harwood, A., Samarawickrama, S., Ramamohanrao, K., & Robins, G. (2014). Topic-specific post identification in microblog streams. In IEEE International Conference on Big Data (pp. 7–13).  https://doi.org/10.1109/BigData.2014.7004416.
  53. 53.
    Malik, S., Smith, A., Hawes, T., Papadatos, P., Li, J., Dunne, C., et al. (2013). TopicFlow: Visualizing topic alignment of Twitter data over time. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (pp. 720–726).  https://doi.org/10.1145/2492517.2492639.
  54. 54.
    Jiang, W., & Brice, P. (2009). Data stream clustering and modeling using context-trees. In 6th IEEE International Conference on Service Systems and Service Management (pp. 932–937).Google Scholar
  55. 55.
    Li, Wen J., Tai, Z., Zhang, R., & Yu, W. (2015). Bursty event detection from microblog: A distributed and incremental approach. Concurrency and Computation: Practice and Experience, 28(11), 3115–3130.CrossRefGoogle Scholar
  56. 56.
    Kalloubi, F., Nfaoui, E. H., & Beqqali, O. El. (2014). Named entity linking in microblog posts using graph-based centrality scoring. In 9th International Conference on Intelligent Systems: Theories and Application (pp. 501–506).  https://doi.org/10.1109/SITA.2014.6847286.
  57. 57.
    Gaglio, S., Re, G., & Morana, M. (2015). Real-time detection of Twitter social events from the user’s perspective. In IEEE International Conference on Communications (ICC) (pp. 1207–1212).Google Scholar
  58. 58.
    Kalloubi, F., Nfaoui, E., & Beqqali, O. (2014). Graph-based tweet entity linking using DBpedia. In IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA) (pp. 501–506).  https://doi.org/10.1109/AICCSA.2014.7073240.
  59. 59.
    Kumar, N., & Muruganantham, D. (2016). Disambiguating the Twitter stream entities and enhancing the search operation using DBpedia ontology. International Journal of Information Technology and Web Engineering, 11(2), 51–62.  https://doi.org/10.4018/IJITWE.2016040104.CrossRefGoogle Scholar

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. 1.UIETPanjab UniversityChandigarhIndia

Personalised recommendations