Comparative Analysis of Time Series Databases in the Context of Edge Computing for Low Power Sensor Networks
- 153 Downloads
Selection of an appropriate database system for edge IoT devices is one of the essential elements that determine efficient edge-based data analysis in low power wireless sensor networks. This paper presents a comparative analysis of time series databases in the context of edge computing for IoT and Smart Systems. The research focuses on the performance comparison between three time-series databases: TimescaleDB, InfluxDB, Riak TS, as well as two relational databases, PostgreSQL and SQLite. All selected solutions were tested while being deployed on a single-board computer, Raspberry Pi. For each of them, the database schema was designed, based on a data model representing sensor readings and their corresponding timestamps. For performance testing, we developed a small application that was able to simulate insertion and querying operations. The results of the experiments showed that for presented scenarios of reading data, PostgreSQL and InfluxDB emerged as the most performing solutions. For tested insertion scenarios, PostgreSQL turned out to be the fastest. Carried out experiments also proved that low-cost, single-board computers such as Raspberry Pi can be used as small-scale data aggregation nodes on edge device in low power wireless sensor networks, that often serve as a base for IoT-based smart systems.
KeywordsTime series PostgreSQL TimescaleDB InfluxDB Edge computing Edge analytics Raspberry Pi Riak TS SQLite
In the recent years we have been observing IoT systems being applied for multiple use cases such as water monitoring , air quality monitoring , and health monitoring , generating a massive amount of data that is being sent to the cloud for storing and further processing. This is becoming a more significant challenge due to the need for sending the data over the Internet. Due to that, a new computing paradigm called edge computing started to emerge . The main idea behind edge computing is to move data processing from the cloud to the devices that are closer to the source of data in order to reduce the volume of data that needs to be send to the cloud, improve reaction time to the changing state of the system, provide resilience and prevent data loss in situations where Internet connection is not reliable or even not available most of the time. To achieve that, edge computing devices need to be able to ingest data from sensors, analyze them, aggregate metrics, and send them to the cloud for further processing if required. For example, while collecting and processing environmental data on air quality, the edge device can be responsible for aggregating data and computing Air Quality Index (AQI) , instead of sending raw sensor readings to the environmental monitoring center. In systems with multiple sensors generating data at a fast rate, efficient storage and analytical system running on edge device becomes a crucial part. Due to the time-series nature of sensor data, dedicated time series databases seem like a natural fit for this type of workload. This paper aims to evaluate several time series databases in the context of using them in edge computing, low-cost, constrained device in form of Raspberry Pi that is processing data from environmental sensors. The paper is organized as follows. In Sect. 2, we review the related works. In Sect. 3, we describe databases selected for comparison. Section 4 describes testing environment, used data model as well as testing methodology. Section 5 contains a description of the performance experiments that we carried out. Finally, Sect. 6 concludes the results of the paper.
2 Related Works
In the literature, there is a few research concerning the comparison of various time-series databases. In the paper , Tulasi Priyanka Sanaboyina compared two time-series databases, InfluxDB and OpenTSDB, based on the energy consumption of the physical servers on which the databases are running under several reading and writing scenarios. The author concludes the research with claims that InfluxDB consumes less energy than OpenTSDB in comparable situations.
Bader et al.  focused on open source time-series databases, examined 83 different solutions during their research, and focused on the comparison of twelve selected databases, including InfluxDB, PostgreSQL and OpenTSDB among others. All selected solutions were compared based on their scalability, supported functions, granularity, available interfaces, and extensions as well as licensing and support.
In his research , Goldschmidt et al. benchmarked three open-source time-series databases, OpenTSDB, KariosDB and Databus in the cloud environment with up to 36 nodes in the context of industrial workloads. The main objective of the research was to evaluate selected databases to determine their scalability and reliability features. Out of the three technologies, KairosDB emerged as the one that meets the initial hypotheses about scalability and reliability.
Wlodarczyk, in his article , provides an overview and comparison of four offerings, Chukwa, OpenTSDB, TempoDB, and Squwk. The analysis focused on feature differences between selected technologies, without any performance benchmarks. The author identified OpenTSDB as a most popular choice for the time series storage.
Pungilă et al.  compared the databases to use them in the system that stores large volumes of sensor data from smart meters. During the research, they compared three relational databases, SQLite3, MySQL, PostgreSQL, one time-series database, IBM Informix with DataBlade module, as well as three NoSQL databases, MonetDB, Hypertable and Oracle BerkeleyDB. During the experiments, it was determined that Hypertable offers the most significant number of insert operations per second, but is slower when it comes to scanning operations. The authors suggested that BerkeleyDB offers a compromise when there is a need for a workload that has a balanced number of both insert and scan operations.
Fadhel et al. presented research  concerning the evaluation of the databases for a low-cost water quality sensing system. Authors identified InfluxDB as the most suitable solution, listing the ease of installation and maintenance, support for multiple interface formats, and HTTP GUI as the deciding factors. In the second part of the research, they conducted performance experiments and determined that InfluxDB can handle the load from 450 sensors.
In his article , Kiefer provided a performance comparison between PostgreSQL and TimescaleDB for storage and analytics of large scale, time-series data. The author presented that at the scale of millions of rows, TimescaleDB offers up to 20\(\times \) higher ingest rates than PostgreSQL, at the same time offering time-based queries to be even 14,000\(\times \) faster. The author also mentions that for simple queries, e.g., indexed lookups, TimescaleDB will be slower than PostgreSQL due to more considerable planning time.
Boule, in his work , described a performance comparison for insert and read operations between InfluxDB and TimescaleDB. It is based on a simulated dataset of metrics for a fleet of trucks. According to results obtained during the experiments, TimescaleDB offers a better read performance than InfluxDB in tested scenarios.
Based on the above, it can be concluded that most of the current research focuses on the use of time-series databases for large-scale systems, running in cloud environments. One exception to that is the research , where authors evaluate several databases in the context of a low-cost system; however, presenting performance tests only for one of them, InfluxDB. In contrast to the mentioned works, this paper focuses on the comparison of the performance of several database systems for storing sensor data at the edge devices that have limited storage and compute capabilities.
3 Time-Series Databases
TimescaleDB is an open-source, time-series database, written in C programming language and is distributed as an extension of the relational database, PostgreSQL. It is developed by Timescale Inc., which also offers enterprise support and cloud hosting in the form of Timescale Cloud offering. TimescaleDB is optimized for fast ingest and complex queries . Thanks to the support for all SQL operations available in PostgreSQL, it can be used as a drop-in replacement of a traditional relational database, while also offering significant performance improvements for storing and processing time-series data. By taking advantage of automatic space-time partitioning, it enables horizontal scaling, which in turn can further improve the ingestion capabilities of the system. It stores data in structures called hypertables, which serve as an abstraction for a single, continuous table. Internally, TimescaleDB splits hypertables into chunks that correspond to a specific time interval and partition keys. Chunks are implemented by using regular PostgreSQL tables . Thanks to being an extension of PostgreSQL DBMS, it supports the same client libraries that support PostgreSQL. According to the DB Engines ranking , it is the 8th most popular time-series database.
InfluxDB is an open-source, time-series database, written in Go programming language, developed and maintained by InfluxDB Inc., which also offers enterprise support and a cloud-hosted version of the database. Internally, it uses a custom-build storage engine called Time-Structured Merge (TSM) Tree, which is optimized for time series data. It has no external dependencies, is distributed as a single binary, which in turn allows for easy deployment process on all major operating systems and platforms. InfluxDB supports InfluxQL, which is a custom, SQL-like query language with support for aggregation functions over time series data. It supports advanced data retention policies as well as continuous queries, which allow for automatic computations of aggregate data to speed up frequently used queries . It uses shards to partition data and organizes them into shards groups, based on the retention policy and timestamps. InfluxDB is also a part of TICK stack , which is a data processing platform that consists of a time-series database in form of InfluxDB, Kapacitor, which is a real-time streaming data processing engine, Telegraf, the data collection agent and Chronograf, a graphical user interface to the platform. Client libraries in the programming languages like Go, Python, Java, Ruby, and others are available, as well as command-line client “influx”. According to DB Engines ranking , it is the most popular time-series database management system.
3.3 Riak TS
Riak TS is an open-source, distributed NoSQL database, optimized for the time series data and built on top of Riak KV database , created and maintained by Basho Technologies. Riak TS is written in Erlang programming language, supports masterless, multi-node architecture to ensure resiliency to network and hardware failures. This type of architecture also allows for efficient scalability with near-linear performance increase . It supports a SQL-like query language with aggregation operations over time series data. It offers both HTTP and PBC APIs as well as dedicated client libraries in Java, Python, Ruby, Erlang, and Node.js. Besides, it has a native Apache Spark  connector for the in-memory analytics. According to DB Engines ranking , it is the 15th most popular time-series database.
PostgreSQL is an open-source relational database management system written in C language and currently maintained by PostgreSQL Global Development Group. PostgreSQL runs on all major operating systems, is ACID  compliant and supports various extensions, namely TimescaleDB. It supports a major part of the SQL standard and offers many features, including but not limited to, triggers, views, transactions, streaming replication. It uses multi-version concurrency control, MVCC . In addition to being a relational database, it also offers support for storing and querying document data thanks to JSON, JSONB, XML, and Key-value data types . There are client libraries available in programming languages like Python, C, C++, Java, Go, Erlang, Rust, and others. According to DB Engines ranking , it is the 4th most popular database overall. It does not offer any dedicated support and optimizations for time-series data.
SQLite is an open-source relational database, written in C language. The SQLite source code is currently available in the public domain. It is a lightweight, single file, and unlike most databases, it is implemented only as a library and does not require a separate server process. SQLite provides all functionalities directly by the function calls. Its simplicity makes it one of the most widely used databases, especially popular in embedded systems. SQLite has a full-featured SQL standard implementation with support for functionalities such as triggers, views, indexes, and many more . Similar to PostgreSQL, it does not offer any specific support for time series data. Besides, it does not provide a data type for storing time, and it requires users to save it as numerical timestamps or strings. According to DB Engines ranking , it is the 7th most popular relational database and 10th most popular database overall.
4 Testing Environment and Data Model
CPU - Broadcom BCM2711, Quad core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5 GHz
Memory - 4 GB LPDDR4-3200 SDRAM
Storage - SDHC card (16 GB, class 10)
OS - Raspbian GNU/Linux 10 (buster) with kernel version 4.19.50-v7l+
Data model used for the performance experiments
4.1 Data Model
Each data point sent by the sensor consists of air quality metrics in the form of NO2 and dust particle size metrics – PM2.5 and PM10. Besides, it also carries information about weather conditions such as ambient temperature, pressure, and humidity. Each reading is timestamped and tagged with the location of the sensor and the unique sensor identifier. Table 1 shows the structure of a single data point with corresponding data types. For the experiments, we generated data from 10 simulated sensors, where each sensor sends reading every 15 s over 24 h. It resulted in 28,800 data points used for performance testing.
4.2 Testing Methodology
Database and client library versions
Client library version
5 Performance Experiments
To evaluate the insertion and querying performance, we conducted several experiments. Firstly, we ran the test to assess the writing capabilities of all selected databases by simulating the insertion of data points in two ways: one-by-one and in batches of 10 points. The reason for that was to accommodate the fact that databases can offer better performance for batch insertions, and it is possible to buffer data before saving it to the database. In this step, for each database, we ran the simulation 50 times (except for SQLite where simulations were run 20 times due to relatively long simulation time). Secondly, we ran the experiments to evaluate the query performance of all selected solutions in three scenarios. In the first scenario, we evaluated a query for average temperature in the chosen period, grouped by location. In the second query, we tested a query for minimum and maximum values of NO2, PM2.5, and PM10 in the selected period, once again grouped by location. In the last, third scenario, we evaluated the performance of a query that counts data points grouped by sensor ID in the selected period for which NO2 was larger than selected value and location was equal to a specific one. Each query was executed 5000 times. The query scenarios were selected in order to test the performance of the databases for most common aggregation queries that can be used in scenarios where the analysis has to be performed directly on the edge device or when the data needs to be aggregated before sending to the cloud in order to reduce the volume of transferred data.
Next, a comparison was made for the results obtained during the evaluation of second query computing minimum and maximum aggregations of air quality metrics. The recorded results and queries are shown in Fig. 5. In this example, PostgreSQL turned out to be the fastest solution with average query execution time of 48 ms, next was InfluxDB with 70 ms and TimescaleDB with 72 ms. Tested query took the longest time to execute on SQLite, taking on average 81 ms. We can observe a general trend of increased query execution time with more aggregations performed in comparison to the first testing scenario.
The last experiment was performed for the third tested query, evaluating the number of times the NO2 was higher than the predefined threshold. Figure 6 presents the query used and the results obtained during that simulation. Once again, PostgreSQL was the fastest solution with an average query execution time of 15 ms, followed by InfluxDB with 29 ms. The two slowest databases were TimescaleDB and SQLite, with 39 and 40 ms per execution on average.
5.3 Results Summary
6 Concluding Remarks
The selection of a proper storage system with declarative querying capabilities is an essential element of building efficient systems with edge-based analytics. This research aimed to compare the performance of several databases in the context of edge computing in wireless sensor networks for IoT-based smart systems. We believe that experiments and analysis of the results presented in the paper complement the performance evaluation of InfluxDB presented in  by showcasing performance results for multiple databases and can serve as a reference when selecting an appropriate database for low-cost, edge analytics applications. As it turned out, for a smaller scale, it might make sense to choose a more traditional, relational database like PostgreSQL, which offers the best performance in all but one tested case. However, when features such as data retention policies, time bucketing, automatic aggregations are crucial for the developed solution, dedicated time-series databases such as TimescaleDB and InfluxDB become a better choice.
The research was supported by the Polish Ministry of Science and Higher Education as a part of the CyPhiS program at the Silesian University of Technology, Gliwice, Poland (Contract No. POWR.03.02.00-00-I007/17-00), by Statuatory Research funds of the Silesian University of Technology, Gliwice, Poland (Grant BKM-576/RAU2/2019 ZAD.1), and partially, by the professorship grant (02/020/RGPL9 /0184) of the Rector of the Silesian University of Technology, Gliwice, Poland.
- 1.Apache Spark. https://spark.apache.org/. Accessed 9 Jan 2020
- 2.DBMS popularity broken down by database model. https://db-engines.com/en/ranking_categories. Accessed 2 Feb 2020
- 3.InfluxDB on DB-engines ranking. https://db-engines.com/en/system/InfluxDB. Accessed 1 Feb 2020
- 4.InfluxDB overview. https://www.influxdata.com/products/influxdb-overview/. Accessed 2 Feb 2020
- 5.InfluxDB overview. https://www.influxdata.com/products/influxdb-overview/. Accessed 9 Jan 2020
- 6.PostgreSQL documentation. https://www.postgresql.org/about/. Accessed 9 Jan 2020
- 7.PostgreSQL on DB-engines ranking. https://db-engines.com/en/system/PostgreSQL. Accessed 1 Feb 2020
- 8.Raspberry Pi 4 datasheet. https://www.raspberrypi.org/documentation/hardware/raspberrypi/bcm2711/rpi_DATA_2711_1p0_preliminary.pdf. Accessed 4 Feb 2020
- 9.Riak KV documentation. https://riak.com/products/riak-kv/index.html. Accessed 9 Jan 2020
- 10.Riak TS datasheet. https://riak.com/content/uploads/2016/05/Riak-Riak-TS-Datasheet.pdf. Accessed 9 Jan 2020
- 11.Riak TS on DB-engines ranking. https://db-engines.com/en/system/Riak+TS. Accessed 1 Feb 2020
- 12.SQLite documentation. https://www.sqlite.org/about.html. Accessed 9 Jan 2020
- 13.SQLite on DB-engines ranking. https://db-engines.com/en/system/SQLite. Accessed 1 Feb 2020
- 14.TimescaleDB documentation. https://docs.timescale.com/latest/introduction. Accessed 9 Jan 2020
- 15.TimescaleDB on DB-engines ranking. https://db-engines.com/en/system/TimescaleDB. Accessed 1 Feb 2020
- 16.TimescaleDB: SQL made scalable for time-series data (2017). https://pdfs.semanticscholar.org/049a/af11fa98525b663da18f39d5dcc5d345eb9a.pdf
- 17.Bader, A., Kopp, O., Falkenthal, M.: Survey and comparison of open source time series databases. In: Mitschang, B., et al. (eds.) Datenbanksysteme für Business, Technologie und Web (BTW 2017) - Workshopband, pp. 249–268. Gesellschaft für Informatik e.V., Bonn (2017)Google Scholar
- 19.Boule, B.: How to benchmark IoT time-series workloads in a production environment. https://blog.timescale.com/blog/how-to-benchmark-iot-time-series-workloads-in-a-production-environment/. Accessed 9 Jan 2020
- 21.Goldschmidt, T., Jansen, A., Koziolek, H., Doppelhamer, J., Breivold, H.P.: Scalability and robustness of time-series databases for cloud-native monitoring of industrial processes. In: 2014 IEEE 7th International Conference on Cloud Computing, pp. 602–609, June 2014Google Scholar
- 23.Kiefer, R.: TimescaleDB vs. PostgreSQL for time-series: 20x higher inserts, 2000x faster deletes, 1.2x-14,000x faster queries. https://blog.timescale.com/blog/timescaledb-vs-6a696248104e/. Accessed 9 Jan 2020
- 27.Sanaboyina, T.P.: Performance evaluation of time series databases based on energy consumption. Master’s thesis, Department of Communication Systems (2016)Google Scholar
- 28.Singh, S.: Optimize cloud computations using edge computing. In: 2017 International Conference on Big Data, IoT and Data Science (BID), pp. 49–53, December 2017Google Scholar
- 29.Wlodarczyk, T.W.: Overview of time series storage and processing in a cloud environment. In: 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, pp. 625–628, December 2012Google Scholar
- 30.Yu, S.: ACID properties in distributed databases. Advanced eBusiness Transactions for B2B-Collaborations (2009)Google Scholar