PAS3-HSID: a Dynamic Bio-Inspired Approach for Real-Time Hot Spot Identification in Data Streams
Hot spot identification is a relevant problem in a wide variety of areas such as health care, energy and transportation. A hot spot is defined as a region with a high likelihood of occurrence of a particular event. Identifying hot spots requires location data for those events, which is typically collected by telematics devices. These sensors gather information constantly, generating very large volumes of data. Current state-of-the-art solutions can identify hot spots from large static batches of data by means of variations of clustering or instance selection techniques that pre-process the original input data and return the most relevant locations. However, these approaches do not address changes in hot spots over time. This paper presents a dynamic bio-inspired approach to detect hot spots in big data streams. This computational intelligence method is designed for and applied to the transportation sector as a case study, to identify incidents on the roads caused by heavy goods vehicles. Inspired by the idea of pheromones, we adapt an immune-based algorithm to account for the temporal aspect of hot spots, and implement it using Apache Spark Streaming. Experimental results on real datasets with up to 4.5 million data points, provided by a telematics company, show that the algorithm is capable of quickly processing large streaming batches of data, as well as successfully adapting over time to detect hot spots. The outcome of this method is twofold: it reduces data storage requirements and demonstrates resilience to sudden changes in the input data (concept drift).
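To make the pheromone idea concrete, the following is a minimal single-machine sketch in Python. It is not the PAS3-HSID algorithm and does not use Apache Spark Streaming. It only illustrates, under stated assumptions, how a set of hot spots could be reinforced by new events and allowed to fade over successive batches. The names `HotSpot` and `update_hot_spots`, the parameters `radius`, `evaporation`, `deposit` and `min_pheromone`, and the nearest-neighbour matching rule are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class HotSpot:
    """A candidate hot spot: a location plus a pheromone level (illustrative)."""
    x: float
    y: float
    pheromone: float = 1.0


def update_hot_spots(hot_spots, batch, radius=1.0, evaporation=0.5,
                     deposit=1.0, min_pheromone=0.1):
    """Process one batch of (x, y) events and return the surviving hot spots.

    All parameters are assumptions made for this sketch:
      radius        - maximum distance at which an event reinforces a hot spot
      evaporation   - fraction of pheromone lost at every batch
      deposit       - pheromone added by each matching event
      min_pheromone - hot spots below this level are discarded
    """
    spots = list(hot_spots)  # do not modify the caller's list

    # Step 1: evaporation. Every existing hot spot loses some pheromone.
    for spot in spots:
        spot.pheromone *= (1.0 - evaporation)

    # Step 2: each event reinforces its nearest hot spot if it is close
    # enough; otherwise the event itself becomes a new hot spot.
    for x, y in batch:
        nearest = min(spots, key=lambda s: hypot(s.x - x, s.y - y), default=None)
        if nearest is not None and hypot(nearest.x - x, nearest.y - y) <= radius:
            nearest.pheromone += deposit
        else:
            spots.append(HotSpot(x, y, deposit))

    # Step 3: forget hot spots whose pheromone has faded away.
    return [s for s in spots if s.pheromone >= min_pheromone]


if __name__ == "__main__":
    spots = update_hot_spots([], [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)])
    for _ in range(5):
        # Only the location near (5, 5) keeps receiving events.
        spots = update_hot_spots(spots, [(5.0, 5.0)])
    print(spots)
```

In this sketch, a location that stops receiving events halves its pheromone at each batch and is eventually dropped, while an active location is kept. This mirrors, in a simplified way, the two outcomes described above: only a compact set of locations is stored, and the set follows changes in the input stream.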
Keywords: Hot spots, Road incidents, Instance selection, Telematics data, Big data streams, Computational intelligence
The authors would like to thank the Soft Computing and Intelligent Information Systems research group from the University of Granada, for allowing us to use their big data infrastructure to carry out the experiments.
Compliance with Ethical Standards
Conflict of Interests
The authors declare that they have no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.