Evaluation of distributed stream processing frameworks for IoT applications in Smart Cities
The widespread growth of Big Data and the evolution of Internet of Things (IoT) technologies enable cities to obtain valuable intelligence from large amounts of real-time data. In a Smart City, various IoT devices continuously generate streams of data that need to be analyzed within a short period of time using Big Data techniques. Distributed stream processing frameworks (DSPFs) have the capacity to handle real-time data processing for Smart Cities. In this paper, we examine the applicability of employing distributed stream processing frameworks at the data processing layer of a Smart City and appraise the current state of their adoption and maturity among IoT applications. Our experiments focus on evaluating the performance of three DSPFs, namely Apache Storm, Apache Spark Streaming, and Apache Flink. According to our results, choosing a proper framework at the data analytics layer of a Smart City requires sufficient knowledge about the characteristics of the target applications. Finally, we conclude that each of the frameworks studied here has its advantages and disadvantages. Our experiments show that Storm and Flink have very similar performance, while Spark Streaming has much higher latency but provides higher throughput.
Keywords: Distributed stream processing, Smart City, IoT applications, Latency, Throughput
Abbreviations
IoT: Internet of Things
DSPF: distributed stream processing framework
RFID: Radio Frequency IDentification
HDFS: Hadoop Distributed File System
API: application programming interface
DAG: directed acyclic graph
ETL: extract, transform, load
CEP: Complex Event Processing
IVP: Image and Video Processing
A large number of embedded devices, massive volumes of data, and growing numbers of users and applications are driving the digital world to move faster than ever. To be competitive in today's digital economy, companies have to process large volumes of dynamically changing data in real time. Many industries, from healthcare and e-commerce to insurance and telecommunications, with use cases such as DNA sequencing, capturing customer insights, real-time offers, high-frequency trading, and real-time intrusion detection, have taken Big Data analytics into account to make critical decisions that impact their business .
On the other hand, the Internet of Things (IoT) is becoming the primary ground for data mining and Big Data analytics . With the rapid growth of IoT and its use cases in different domains such as Smart City, Mobile e-Health, and Smart Grid, streaming applications are driving a new wave of data revolutions. In most IoT applications, the resulting analytics provide feedback to the system to improve it . Compared to other Big Data domains, there is a low-latency cycle between events and system responses, which makes it necessary to process events in real time to achieve acceptable responsiveness. In all of these domains, one of the most fundamental challenges is to explore the large volumes of data and extract useful information for future actions. In particular, this real-time exploration has to be done at massive scale.
Large volumes of data: the amount of data generated in real time by different applications in a Smart City can be on the order of zettabytes.
Heterogeneous data sources: in Smart Cities, the data sources are diverse. For example, there are sensor data, RFID data, camera data, human-generated data, and so on.
Heterogeneous data types: the data collected by different devices differ in format, packet size, required precision, and arrival time.
About 68% of the population is expected to live in cities by 2050 . Using technology to improve citizens' lives is what makes a city smart. In other words, Smart Cities offer their citizens a high quality of life by improving processes and reducing the environmental footprint with technology . These cities capture data using sensors and use Big Data analytics techniques to improve environmental, financial, and social aspects of urban life. According to , Smart City technology spending reached $80 billion in 2016 and is predicted to grow to $135 billion by 2021.
An intelligent combination of technological advances such as IoT, Cloud Computing, Smart Grid and Smart Building allows tracking huge amounts of information; this combination creates an intelligent system known as Smart City. A Smart City uses IoT sensors and many other technologies to connect components through a city to derive data and improve the lives of citizens and visitors. For example, in a Smart City, buildings have interconnected electrical, cooling, and heating systems monitored by an intelligent management system and a Smart Grid can be used to control on-demand generation of electricity.
System architecture for Smart City
Device layer: Set of sensors, RFID, cameras, and other devices that continuously capture physical measures such as temperature, humidity, light, vibration, location, movement, etc.
Data collection layer: Continuous streams of unstructured data captured by the device layer are transferred to the data nodes. In the data collection layer, all of the data collected by the data nodes is aggregated. Aggregating these raw data makes it possible to offer usable services.
Application service layer: Here many applications such as intelligent traffic management, water and electricity monitoring, disaster discovery, fraud detection and web display analysis are provided. In this layer, people and machines directly interact with each other.
The third layer of this architecture imposes the main requirements for Big Data processing. Therefore, choosing efficient Big Data tools and frameworks plays an important role in utilizing the processing resources.
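The three-layer flow described above can be sketched as a toy Python pipeline. All names and values here are illustrative, not part of any real Smart City platform:

```python
# Toy sketch of the three-layer Smart City architecture described above.
# Sensor names, thresholds, and field names are invented for illustration.

def device_layer():
    """Device layer: sensors emit raw physical measurements."""
    return [
        {"sensor": "temp-1", "kind": "temperature", "value": 21.5},
        {"sensor": "temp-2", "kind": "temperature", "value": 38.0},
        {"sensor": "hum-1", "kind": "humidity", "value": 55.0},
    ]

def data_collection_layer(readings):
    """Data collection layer: aggregate raw readings per measurement kind."""
    aggregated = {}
    for r in readings:
        aggregated.setdefault(r["kind"], []).append(r["value"])
    return aggregated

def application_service_layer(aggregated, temp_alert_threshold=35.0):
    """Application service layer: derive a usable service (heat alerts)."""
    temps = aggregated.get("temperature", [])
    return [v for v in temps if v > temp_alert_threshold]

alerts = application_service_layer(data_collection_layer(device_layer()))
```

In a real deployment, each stage would be distributed and connected through a message broker rather than plain function calls; the sketch only shows how raw readings become aggregated data and finally a service-level result.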
The goal of this paper is to examine the applicability of employing distributed stream processing frameworks at the data processing layer of Smart Cities and to appraise the current state of their adoption and maturity among IoT applications. Regardless of where the data processing happens (at the edge or the center of the processing layer), our experimental results can be taken into account to determine the best framework depending on the processing requirements of streaming applications. The main contribution of our work is providing a fair comparison among three well-known DSPFs in terms of performance, throughput, and scalability.
In comparison with our previously published conference paper , we have added complementary information to "Programming model of stream processing" section and studied more frameworks in "Distributed stream processing frameworks" section. The discussed frameworks are also explained in more detail, and some figures are used to better show their architectures. A new section is added to cover prior works that evaluate stream processing systems from different aspects. Moreover, to show the powerful cooperation of DSPFs and message brokers (which makes real-time stream processing of Big Data possible), we have added some information about the integration of DSPFs with message brokers.
The evaluations are extended, and we have also taken into account the resource utilization (including CPU and network usage) in our experiments. Another real-world application, the "Training application", was chosen from among several benchmark applications and developed for all intended frameworks. We have also added a new section, namely "Results and discussion", to discuss the experimental results in more detail. There, all discussions rely on our observations from two different benchmark applications, and we have tried to cover enough considerations. At the end of that section, we provide a recommendation on what kind of framework may be followed for certain IoT workloads.
The rest of this paper is organized as follows. “Related work” section presents the related works in this topic. “Programming model of stream processing” section explains some primary information about the stream processing models. “Distributed stream processing frameworks” section shortly introduces the most popular distributed stream processing frameworks. The next section describes the technological infrastructure and the application benchmarks used in this research process. “Evaluation” section presents the experiment scenarios and reports the obtained results, highlighting the performed benchmarks and the needed resources, in terms of performance metrics and resource usage. The obtained results are discussed in “Results and discussion” section. Finally, “Conclusion and future works” section presents the main conclusions and applicability of DSPFs in Smart Cities.
To understand the mechanisms, performance, and efficiency of different distributed stream processing frameworks, many survey papers have been published. They study the characteristics of these Big Data platforms and compare them under different conditions. In some papers [19, 20, 21, 22, 23], the authors present a conceptual overview of several stream data processing systems such as Apache Storm, Apache Spark, etc. These works discuss general information about these frameworks, such as their mechanisms, programming languages, and characteristics such as fault tolerance, but no experimental evaluation is provided. So, there is no metric to assess the efficiency of such frameworks in the case of IoT applications.
The authors in  consider some important aspects of Big Data processing frameworks such as scalability, fault tolerance, and availability, and use k-means as the case study of their evaluations. However, they do not consider specific features of IoT applications such as throughput, latency, and resource utilization. Paper  presents some information about Big Data analytics along with some important open issues in this area. The lack of benchmark application executions, and hence of attention to important metrics such as throughput and latency, is the weakness of this work; these parameters are taken into account in our work.
Researchers in  provide a valuable overview on Big Data processing frameworks and their efficiency in Smart City domain. They categorize Big Data processing frameworks according to their programming model, type of data sources, and supported programming languages. They have focused on hardware aspects of the processing cluster while some important considerations of Big Data processing in IoT applications (such as throughput and processing latency) are not covered.
Paper  provides some information about the Big Data processing frameworks Apache Hadoop, Apache Spark, and Apache Flink. For a quantitative evaluation, it uses some well-known applications such as TeraSort, WordCount, Grep, PageRank, and K-means, and provides comparative information about the performance of these frameworks in terms of latency. The first drawback of this work is the different programming models of the evaluated frameworks, which dramatically affect the definition of comparison metrics. In addition, in such comparative works latency and throughput cannot be reported separately, yet no throughput is reported in the experiments. Moreover, other important metrics of large-scale distributed systems, like resource usage efficiency, are not discussed. The data sources are not clarified, and the adaptation of the applications and their input data for both batch and stream processing in different frameworks is not mentioned.
Programming model of stream processing
The most important aspect of a processing system is probably its programming model, because it defines the future limitations, costs, and available operations. In stream processing, one of the fundamental parts of a programming model is whether each piece of new data is processed individually as it arrives, or later as part of a set of new data. That distinction divides stream processing into two categories: native and micro-batch.
Both methods have their own pros and cons. The great advantage of native streaming is its expressiveness: because it takes the stream as it is, it is not limited by any unnatural abstraction. Also, in comparison with micro-batching systems, the achievable execution time in native streaming is always much better, because tuples are processed immediately upon arrival. On the other hand, native streaming systems usually have lower throughput. Furthermore, fault tolerance and load balancing are more expensive in native streaming than in micro-batch systems [20, 28].
In micro-batching systems, splitting data streams into micro-batches reduces the system's costs . Some operations, such as state management, are harder to implement, because the system has to deal with the whole batch . Finally, it is worth mentioning that micro-batching systems can be built on top of a native streaming engine. In the following section, we take a quick look at some well-known stream processing systems.
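The latency difference between the two models can be illustrated with a minimal simulation: in the native model a tuple is processed as it arrives, while in the micro-batch model it must wait until its batch closes. The arrival times, batch size, and processing cost below are toy values, not measurements from any framework:

```python
# Toy comparison of native streaming vs micro-batching latency.
# Tuples arrive one per time unit; proc_cost is an assumed fixed processing time.

def native_latencies(arrivals, proc_cost=0.1):
    # Each tuple is processed immediately upon arrival.
    return [proc_cost for _ in arrivals]

def micro_batch_latencies(arrivals, batch_size=5, proc_cost=0.1):
    # A tuple waits until the last tuple of its batch has arrived.
    latencies = []
    for i, t in enumerate(arrivals):
        last_idx = min((i // batch_size + 1) * batch_size - 1, len(arrivals) - 1)
        batch_close = arrivals[last_idx]
        latencies.append((batch_close - t) + proc_cost)
    return latencies

arrivals = list(range(10))  # tuples arriving at t = 0..9
native_avg = sum(native_latencies(arrivals)) / len(arrivals)
batch_avg = sum(micro_batch_latencies(arrivals)) / len(arrivals)
```

The micro-batch average is dominated by the waiting time of early tuples in each batch, which is exactly the trade the text describes: higher per-tuple latency in exchange for amortized per-batch costs.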
Distributed stream processing frameworks
Apache Storm is the most popular and widely adopted open-source distributed real-time computation platform, introduced by Twitter . Similar to what Hadoop does for batch processing, Apache Storm operates on unbounded streams of data in a reliable manner. A Storm cluster consists of two types of nodes: master nodes and worker nodes.
The master node runs a daemon called Nimbus, which is the central component of Apache Storm. The responsibility of Nimbus is distributing code and assigning tasks to the worker nodes. It also monitors the health of the cluster by listening to the heartbeats produced by worker nodes, and re-assigns failed tasks if needed. Worker nodes do the actual execution of the streaming application. Each worker node runs a daemon called Supervisor, which cooperates with Nimbus and starts and stops the worker processes when required.
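The heartbeat-based failure handling can be sketched as follows. This is a simplified toy model of the idea, not Storm's actual scheduling code; worker names, the timeout, and the round-robin policy are assumptions:

```python
# Toy model of Nimbus-style health monitoring: tasks of a worker whose
# heartbeat is stale are reassigned to the surviving workers.

def reassign_failed_tasks(assignments, heartbeats, now, timeout=30):
    """assignments: worker -> list of task ids; heartbeats: worker -> last beat time."""
    live = [w for w, t in heartbeats.items() if now - t <= timeout]
    result = {w: list(tasks) for w, tasks in assignments.items() if w in live}
    for w, tasks in assignments.items():
        if w not in live:
            for i, task in enumerate(tasks):
                # Round-robin the orphaned tasks over the surviving workers.
                result[live[i % len(live)]].append(task)
    return result

assignments = {"worker-1": [1, 2], "worker-2": [3, 4]}
heartbeats = {"worker-1": 100, "worker-2": 10}   # worker-2 stopped beating
new_assignments = reassign_failed_tasks(assignments, heartbeats, now=100)
```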
Apache Flink is an open-source streaming platform that provides the capability to run real-time data processing pipelines in a fault-tolerant way at a scale of millions of tuples per second . Flink is based on native stream processing rather than processing micro-batches. Flink processes the user-defined function code through the system stack. It has a master-slave architecture, consisting of a Job Manager and one or more Task Managers .
Topology Master: it is responsible for managing a topology from the moment it is submitted until it is killed.
Stream Manager: its duty is managing the routing of tuples between topology components.
Heron Instance: it is a process that executes a single task and allows for easy debugging and profiling.
Akka is an advanced toolkit and message-driven runtime based on the Actor Model that helps development teams build the right foundation for successful microservices architectures and streaming data pipelines . In Akka, the communication between services uses messaging primitives that are optimized for CPU utilization, low latency, high throughput, and scalability.
Akka Streams is a streaming dataflow abstraction on top of Akka Actors, giving developers a better way to define their workflows. As a founding implementation of the Reactive Streams specification, Akka Streams adds benefits of back-pressure and type-safety to the development of streaming applications . Akka Streams and the underlying Akka Actor model are ideal for low-latency processing of data streams.
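The back-pressure idea from the Reactive Streams specification (demand signaling from the downstream consumer) can be illustrated in a few lines of plain Python. This is a conceptual sketch, not the Akka Streams API; the class and method names are invented:

```python
# Sketch of Reactive-Streams-style back-pressure: the publisher may only
# emit as many elements as the subscriber has requested.

class Subscriber:
    def __init__(self, batch=2):
        self.batch = batch       # how many elements to request at a time
        self.demand = 0
        self.received = []

    def request(self, n):
        self.demand += n         # signal demand upstream

    def on_next(self, item):
        assert self.demand > 0, "publisher violated back-pressure"
        self.received.append(item)
        self.demand -= 1

def publish(items, sub):
    pending = list(items)
    while pending:
        if sub.demand == 0:
            sub.request(sub.batch)   # downstream pulls when it is ready
        sub.on_next(pending.pop(0))  # upstream emits only against demand
    return sub.received

sub = Subscriber(batch=2)
received = publish(range(5), sub)
```

Because the publisher never emits without outstanding demand, a slow consumer automatically throttles a fast producer instead of being overwhelmed.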
Integration with message brokers
DSPFs and message brokers naturally complement each other, and their powerful cooperation enables real-time stream processing of Big Data. In most streaming applications, message brokers such as Apache Kafka, RabbitMQ, ActiveMQ, and Kestrel prepare the data streams instead of the application connecting directly to the data source. They employ a pull-based consumption model that allows an application to consume data at its own rate and rewind the consumption whenever needed. They usually provide integrated distributed support and can scale out . Furthermore, message brokers allow data replication to make it available to several systems and persist data coming from various sources.
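The pull-based model with rewind can be sketched with a Kafka-like log in plain Python. This is not the Kafka client API; `poll` and `seek` here are simplified stand-ins for the real consumer operations:

```python
# Sketch of pull-based consumption from a broker log: the application polls
# at its own rate and can rewind by seeking to an older offset.

class BrokerLog:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)   # broker persists every record

class PullConsumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records):
        batch = self.log.records[self.offset:self.offset + max_records]
        self.offset += len(batch)     # advance at the consumer's own rate
        return batch

    def seek(self, offset):
        self.offset = offset          # rewind (or skip ahead) for reprocessing

log = BrokerLog()
for i in range(5):
    log.append(f"event-{i}")

consumer = PullConsumer(log)
first = consumer.poll(3)
consumer.seek(0)                      # rewind the consumption
replayed = consumer.poll(3)
```

Because records stay in the log after being read, several independent systems can consume the same data, which is the replication and persistence benefit mentioned above.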
In all experiments, the reported values for all metrics are measured under the same conditions and a specific input rate. For each run, 5 min of warm-up execution is allowed before the metrics are captured from the Web UI of each framework, because the throughput of the streaming application being evaluated should have stabilized and converged. According to our experiments, after 5 min all applications reach a stable state in the target frameworks. All of the benchmarks were performed using default settings for Storm, Flink, and Spark Streaming, and we avoided configuration tweaks to tune them.
Number of cores
Moreover, since Apache Storm supports two message processing guarantee semantics, we have run the benchmark applications under both at-least-once and at-most-once (No-Ack) semantics. At-most-once semantics applies when the acknowledgment mechanism is disabled.
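The difference between the two semantics can be shown with a toy delivery loop: with acking enabled, a tuple that is lost before being acked is replayed by the spout; with acking disabled, it is simply gone. This is an illustration of the concept only, not Storm's acker implementation:

```python
# Toy illustration of at-least-once vs at-most-once delivery semantics.
# A tuple in fail_first_attempt_for is "lost" on its first delivery attempt.

def deliver(tuples, fail_first_attempt_for, ack_enabled):
    delivered, pending = [], list(tuples)
    failed_once = set()
    while pending:
        t = pending.pop(0)
        if t in fail_first_attempt_for and t not in failed_once:
            failed_once.add(t)          # tuple lost in flight, no ack arrives
            if ack_enabled:
                pending.append(t)       # spout replays the unacked tuple
            continue
        delivered.append(t)             # processed (and acked, if enabled)
    return delivered

at_least_once = deliver([1, 2, 3], fail_first_attempt_for={2}, ack_enabled=True)
at_most_once = deliver([1, 2, 3], fail_first_attempt_for={2}, ack_enabled=False)
```

With acking enabled every tuple is eventually delivered (possibly out of order, possibly more than once in the general case); without it, a lost tuple never reaches the sink, which is why disabling acks trades reliability for speed.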
Source: This component reads the tuples from Kafka message broker and prepares them as standard data units according to the data processing model of intended framework.
Deserialize: Divides the input JSON string to some meaningful fields.
Filter: Filters out irrelevant tuples based on their type.
Projection: Removes unnecessary fields.
Join: Joins tuples on a specific field with the associated information from another field.
Count: Takes a windowed count of tuples per joined field and stores them.
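The operator chain of this advertising application (deserialize, filter, projection, join, windowed count) can be sketched in plain Python. The field names, event types, and the campaign join table are assumptions made for illustration; they are not the benchmark's actual schema:

```python
import json

# Pure-Python sketch of the operator chain listed above. All field names
# and the campaign table are illustrative.

campaign_of_ad = {"ad-1": "camp-A", "ad-2": "camp-B"}  # assumed join table

def process(raw_tuples):
    counts = {}
    for raw in raw_tuples:
        event = json.loads(raw)                            # Deserialize
        if event.get("event_type") != "view":              # Filter irrelevant tuples
            continue
        projected = {"ad_id": event["ad_id"]}              # Projection
        campaign = campaign_of_ad.get(projected["ad_id"])  # Join
        if campaign is None:
            continue
        counts[campaign] = counts.get(campaign, 0) + 1     # windowed Count
    return counts

raw = [
    '{"ad_id": "ad-1", "event_type": "view"}',
    '{"ad_id": "ad-1", "event_type": "click"}',
    '{"ad_id": "ad-2", "event_type": "view"}',
]
window_counts = process(raw)
```

In the actual benchmark each operator runs as a separate, parallel component of the dataflow graph; the linear function above only shows the logical transformation each tuple undergoes.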
Timer Source: This component simulates the model training trigger.
Table Read: At each run, it fetches the data from the stored table that has become available since the last run.
Multi-Var Linear Reg. train: It uses the fetched data to train a linear regression model.
Annotation and Decision Tree: Fetched tuples from the Table-Read operator are also annotated to allow a decision tree classifier to be trained.
Table Write: Stores the trained model files.
MQTT Publish: Publishes all updates to the database records to the MQTT broker.
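The training step of this application can be illustrated with a closed-form least-squares fit. The real application trains a multivariate model; this single-variable sketch with invented table contents only shows the idea behind the "Table Read" and "Linear Reg. train" operators:

```python
# Toy version of the Table Read -> Linear Reg. train step: fit y = a*x + b
# with the closed-form least-squares solution over rows fetched from a table.

def train_linear_model(rows):
    """rows: list of (x, y) pairs fetched from the stored table."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

table = [(1, 3.0), (2, 5.0), (3, 7.0)]   # sample rows lying exactly on y = 2x + 1
slope, intercept = train_linear_model(table)
```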
As explained in "Distributed stream processing frameworks" section, there are many distributed stream processing frameworks that can be employed as the data analytics layer of IoT applications in Smart Cities . However, Apache Storm , Apache Spark Streaming , and Apache Flink  are the most popular DSPFs for real-time processing . So, we evaluate these three open-source, community-driven frameworks in terms of performance and scalability, using two streaming applications.
The most commonly used quantitative performance measures for evaluating the efficiency of DSPFs are latency and throughput . In our experiments, we consider these metrics to compare the performance and scalability of the intended DSPFs. Considering each streaming application as a dataflow graph, we use the following definitions for latency and throughput. We also measure CPU and network utilization to see how efficiently each DSPF works.
Latency: For a tuple consumed by an application graph, latency is the time it takes for that tuple to be processed by the intermediate vertices and transferred between them (including the network and queuing time). This latency is known as end-to-end latency. As the end-to-end latency of each tuple may vary depending on the tuple size and type, resource allocation, and input rate, we consider the average latency in our evaluations.
Throughput: Depending on the operation of each vertex of the application graph, a different number of output tuples may be emitted for each incoming tuple. Normally, the aggregated output tuple rate of all vertices is considered the overall throughput. But when comparing different frameworks, we face different implementations of the application graph, so we cannot use the overall throughput as a unique metric. In our evaluation, we consider the total number of tuples received by the sink operator of the application in 1 s as the processing throughput (all implementations of the applications have a sink operator that collects the output tuples).
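Given these two definitions, both metrics can be computed from timestamped sink records. The sketch below assumes each record carries its source emit time and sink arrival time in seconds; the sample timestamps are invented:

```python
# Computing the paper's two metrics from (emit_time, sink_time) records.

def avg_end_to_end_latency(records):
    """Average time from source emission to sink arrival, over all tuples."""
    return sum(sink - emit for emit, sink in records) / len(records)

def throughput_per_second(records):
    """Number of tuples received by the sink in each 1-second bucket."""
    buckets = {}
    for _, sink in records:
        second = int(sink)
        buckets[second] = buckets.get(second, 0) + 1
    return buckets

records = [(0.00, 0.20), (0.10, 0.40), (0.90, 1.30), (1.00, 1.25)]
latency = avg_end_to_end_latency(records)
rates = throughput_per_second(records)
```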
In our experiments, Apache Kafka  is used to produce the desired data streams, so we can observe the behavior of the target DSPF under real-world conditions. There are classes and APIs to integrate Kafka with DSPFs. The classes are used to track the Kafka brokers dynamically by maintaining their details in ZooKeeper, or to set the Kafka brokers and their details statically. The provided APIs can be used to define configuration settings, fetch messages from a Kafka topic, and emit them into the DSPF as tuples.
As the data partitioning of a message broker is often not chosen wisely, the raw data may be partitioned differently from what the streaming system requires. Some techniques are used to handle this issue, but they trade off some overhead and performance loss. So, when benchmarking a distributed stream processing system, the data exchange between the message broker and the streaming system may become the performance bottleneck. We ran Kafka on a separate machine and instantiated enough spout instances to make sure that the message broker had no effect on our experimental results and never became a bottleneck.
According to the definition mentioned above, we consider the total number of tuples received by the sink operator in 1 s as the processing throughput, to have a fair comparison among the different frameworks in terms of throughput.
Considering these results, we see that on the smaller cluster, Flink is the winner in terms of throughput, and its throughput values have the least variance over time. By scaling the cluster out through adding more worker nodes, the throughput of Storm and Spark Streaming increases at an almost linear ratio; however, the throughput of Flink increases only slightly.
To appraise the ability of each framework to efficiently expand and contract its resource pool to accommodate heavier or lighter loads, we increase the input rate and measure the corresponding latency. For the Advertising application, the input rate is varied from 20,000 to 560,000 tuples/s, and for Model Training it is varied from 200 to 2000 tuples/s. For each input rate, the benchmark applications are executed for 100 min and the end-to-end latencies are measured.
From the results in Fig. 15, we also observe that the average latency of Flink has the least dependency on the load, because of its internal message handling mechanism, and works well as long as the network is not over-utilized; Storm shows somewhat more reaction to load variances, while Spark Streaming performs worst among them due to the reason mentioned above.
Looking at these graphs, we can see that Flink has the worst scalability: its latency and throughput improve only slightly when the number of worker nodes is increased. In large-scale clusters, Storm and Spark Streaming beat Flink in terms of latency and throughput. Both Storm and Spark Streaming show near-linear scalability, but Storm scales even better than Spark Streaming and behaves almost linearly, especially when no acking mechanism is applied.
Results and discussion
Load scalability: Throughout the results from "Load scalability" section, we observe that (1) Storm and Flink consistently outperform Spark Streaming in terms of latency, and (2) increasing the input rate makes their advantage more significant. In Spark Streaming, multiple tuples are processed in a micro-batch, which leads to higher throughput but causes deterioration of the end-to-end latency of individual tuples, since they have to wait briefly before being batched by the Spark Streaming component.
Resource utilization: By placing the resource utilization of all frameworks under scrutiny, we realized that Flink is more network-intensive than the other frameworks, while its CPU usage is less than that of both Storm and Spark Streaming. As we can see in Fig. 15, Flink has stable performance while the input rate is increasing, but at a certain rate the network becomes the bottleneck and its latency becomes worse than that of both Storm and Storm no-Ack.
Ability to scale out: According to our observations, when the cluster has a small number of worker nodes, Flink provides the lowest latency and hence the best performance in both applications. The reason behind this excellence is the message passing mechanism of Flink, which trades more network usage for better CPU utilization. For a small-scale cluster, the extra CPU power saved by efficient message passing is consumed by the processing components, and hence the latency is decreased. By increasing the number of worker nodes, Storm's latency is reduced almost linearly, but Flink's latency decrease is smaller and does not scale as expected; insofar as, for a cluster of 8 worker nodes, Storm and Flink deliver almost equal performance. In larger clusters there are more inter-node communications, and network utilization becomes a more important parameter. Although the message passing of Flink benefits from better CPU utilization, it worsens the network utilization, which increases the processing latency. Spark Streaming always has much higher latency, but its latency scales better than Flink's as the cluster size increases, although not as well as Storm's.
From these results, we conclude that as long as the network resources do not become the bottleneck, Flink provides a more stable response time, and its 99th-percentile latency value indicates it is the better solution for real-time applications. On the other hand, Spark Streaming's latency strongly depends on the input rate. It would not be a good choice for applications with variable data load, like network monitoring, but is a good selection for achieving high throughput when latency is not as important as throughput. With acking disabled, Storm has better performance and provides a more reliable response time at high throughput. However, in this condition the ability to handle failures is lost.
Table 2 Framework recommendation for different streaming application categories
Extract, transform and load: Storm, Spark Streaming
Stream machine learning
Complex event processing
Image and video processing
As our experiments show, there are trade-offs in choosing the proper framework for each category of applications. For example, an ETL application may have the lowest latency on Storm, while Spark Streaming provides more processing throughput. In another case, Flink may process a single tuple much faster than Storm, but when the data arrival rate or the cluster size changes, Storm surpasses Flink.
So, regardless of the category of applications, there are several parameters such as data arrival rate, cluster size, tuple size, and data type  which severely affect the different performance metrics. Further, the recommended framework for each category can vary depending on the desired metric. Nevertheless, we give a recommendation for a specific framework per category in Table 2, based on the main characteristics of the applications in each category. In this table, for each category, we specify which framework provides the best results in general terms.
Overall, each of the frameworks studied here has its advantages and disadvantages. Our experiments show that Storm and Flink have very similar performance, while Spark Streaming has much higher latency but provides higher throughput. Flink behaves very well on small-scale clusters, but it has poor scalability and loses the competition on large-scale clusters.
Conclusion and future works
Collections of large numbers of IoT devices and objects are producing huge amounts of data in Smart Cities, which needs to be processed immediately. Big Data analytics tools have the capacity to handle the large volumes of data generated by IoT devices that create a continuous stream of information. There are plenty of Big Data processing platforms designed for special purposes. In the age of IoT and Smart Cities, it is interesting to compare the behavior of the available distributed stream processing frameworks and examine the applicability of employing them to process the high volumes of data generated in Smart Cities.
In this paper, we made a deep comparison among the three most popular frameworks, namely Apache Storm, Apache Flink, and Spark Streaming, to show which platform is suitable for which kind of streaming application. In our experiments, we focused on evaluating the performance of the intended frameworks in terms of latency and throughput. We also evaluated the scalability (in terms of data arrival rate and the number of cluster nodes) and resource utilization of these frameworks using two real-world benchmark applications.
In addition to further consideration of resource utilization, we would like to take into account the processing guarantees and fault tolerance. We would also like to include other stream processing frameworks such as Apache Heron and Apache Samza.
The authors thank the reviewers for their valuable feedback on an early draft of this manuscript.
HN performed the primary work and analysis of this manuscript and designed the experiments. HN and SN performed experiments and analyzed the observed results and co-wrote the paper. MG supervised the research. All authors read and approved the final manuscript.
This work is partially supported by the Iran National Science Foundation (INSF), under Grant Number 96015834.
The authors declare that they have no competing interests.
- 1.Agarwal S. 2016 state of fast data and streaming applications survey. https://www.opsclarity.com/2016-state-fast-data-streaming-applications-survey/. Accessed 12 Oct 2017.
- 8.Toshniwal A, Taneja S, Shukla A, Ramasamy K, Patel JM, Kulkarni S, Jackson J, Gade K, Fu M, Donham J, et al. Storm@twitter. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data. New York: ACM; 2014. p. 147–56.
- 9.Zaharia M, Das T, Li H, Shenker S, Stoica I. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. HotCloud. 2012;12:10.
- 10.Katsifodimos A, Schelter S. Apache flink: stream analytics at scale. In: 2016 IEEE international conference on cloud engineering workshop (IC2EW). New York: IEEE; 2016. p. 193.
- 11.Wilmoth J. 2018 revision of the world urbanization prospects. https://population.un.org/wup/Publications/Files/WUP2018-PressRelease.pdf. Accessed 02 Mar 2019.
- 13.Shirer M, Rold SD. Worldwide semiannual smart cities spending guide. https://www.idc.com/getdoc.jsp?containerId=prUS43576718. Accessed 11 Feb 2018.
- 14.Apache hadoop. https://hadoop.apache.org/. Accessed 02 June 2018.
- 15.Apache spark: Lightning-fast unified analytics engine. https://spark.apache.org/. Accessed 02 June 2018.
- 16.Apache storm. http://storm.apache.org/. Accessed 02 June 2018.
- 17.Apache flink: Stateful computations over data streams. https://flink.apache.org. Accessed 02 June 2018.
- 18.Nasiri H, Nasehi S, Goudarzi M. A survey of distributed stream processing systems for smart city data analytics. In: Proceedings of the international conference on smart cities and internet of things. New York: ACM; 2018. p. 12.
- 19.Hesse G, Lorenz M. Conceptual survey on data stream processing systems. In: 2015 IEEE 21st international conference on parallel and distributed systems (ICPADS). New York: IEEE; 2015. p. 797–802.
- 20.Singh MP, Hoque MA, Tarkoma S. A survey of systems for massive stream analytics; 2016. arXiv preprint arXiv:1605.09021.
- 21.Kamburugamuve S, Fox G, Leake D, Qiu J. Survey of distributed stream processing for large stream sources; 2013.
- 22.Kamburugamuve S, Fox G. Survey of distributed stream processing. Bloomington: Indiana University; 2016.
- 27.Veiga J, Expósito RR, Pardo XC, Taboada GL, Touriño J. Performance evaluation of big data frameworks for large-scale data analytics. In: 2016 IEEE international conference on Big Data (Big Data). New York: IEEE; 2016. p. 424–31.
- 28.Hirzel M, Soulé R, Schneider S, Gedik B, Grimm R. A catalog of stream processing optimizations. ACM Comput Surv. 2014;46(4):46–50.
- 30.Oliver AC. Storm or spark: choose your real-time weapon. http://www.infoworld.com/article/2854894/application-development/spark-and-storm-for-real-time-computation.html. Accessed 01 Feb 2018.
- 31.Hunt P, Konar M, Junqueira FP, Reed B. Zookeeper: wait-free coordination for internet-scale systems. In: USENIX annual technical conference, vol. 8, Boston, MA, USA; 2010.
- 32.Introduction to heron. https://streaml.io/blog/intro-to-heron. Accessed 10 Apr 2018.
- 33.Kulkarni S, Bhagat N, Fu M, Kedigehalli V, Kellogg C, Mittal S, Patel JM, Ramasamy K, Taneja S. Twitter heron: stream processing at scale. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. New York: ACM; 2015. p. 239–50.
- 34.Apache kafka: a distributed streaming platform. http://kafka.apache.org/. Accessed 02 June 2018.
- 35.Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, et al. Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th annual symposium on cloud computing. New York: ACM; 2013. p. 5.
- 36.Apache samza: A distributed stream processing framework. http://samza.apache.org/. Accessed 11 Aug 2018.
- 37.Gorawski M, Gorawska A, Pasterak K. A survey of data stream processing tools. In: Czachórski T, Gelenbe E, Lent R, editors. Information sciences and systems 2014. Cham: Springer; 2014. p. 295–303.
- 39.Zapletal P. Comparison of apache stream processing frameworks. Cakesolutions. http://www.cakesolutions.net/teamblogs/comparison-of-apache-streamprocessing-frameworks-part-1. Accessed 12 Feb 2018.
- 40.Kreps J, Narkhede N, Rao J, et al. Kafka: a distributed messaging system for log processing. In: Proceedings of the NetDB; 2011. p. 1–7.
- 41.Yehuda G. Yahoo streaming benchmarks. https://github.com/yahoo/streaming-benchmarks. Accessed 08 Oct 2017.
- 44.Brian D, Dan W. New york city taxi trip data. https://databank.illinois.edu/datasets/IDB-9610843. Accessed 12 Apr 2018.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.