The VLDB Journal, Volume 28, Issue 6, pp 847–869

Coconut: sortable summarizations for scalable indexes over static and streaming data series

  • Haridimos Kondylakis
  • Niv Dayan
  • Kostas Zoumpatianos
  • Themis Palpanas
Regular Paper

Abstract

Many modern applications produce massive streams of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance or storage costs. We trace the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. To address this problem, we present Coconut, the first data series index based on sortable summarizations and the first efficient solution for indexing and querying streaming series. The first innovation in Coconut is an inverted, sortable data series summarization that organizes data series based on a z-order curve, keeping similar series close to each other in the sorted order. As a result, Coconut is able to use bulk loading and updating techniques that rely on sorting to quickly build and maintain a contiguous index using large sequential disk I/Os. We then explore prefix-based and median-based splitting policies for bottom-up bulk loading, showing that median-based splitting outperforms the state of the art, ensuring that all nodes are densely populated. Finally, we explore the impact of sortable summarizations on variable-sized window queries, showing that they can be supported in the presence of updates through efficient merging of temporal partitions. Overall, we show analytically and empirically that Coconut dominates the state-of-the-art data series indexes in terms of construction speed, query speed, and storage costs.
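
To make the idea of a sortable summarization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a piecewise-mean summary with uniform quantization (a simplification; the function names `paa`, `quantize`, `z_order_key`, `sortable_key`, and `pack_leaves`, as well as the value range and bit widths, are all hypothetical choices made for illustration). The bits of the per-segment symbols are interleaved most-significant-first, which is the usual way to build a z-order key, so that sorting by the key tends to keep series with similar coarse shapes adjacent. Sorted keys are then packed into full, contiguous leaves as a stand-in for bottom-up bulk loading.

```python
from typing import List, Sequence


def paa(series: Sequence[float], segments: int) -> List[float]:
    """Summarize a series by the mean of each of `segments` equal-length chunks."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("series length must be divisible by the segment count")
    step = n // segments
    return [sum(series[i * step:(i + 1) * step]) / step for i in range(segments)]


def quantize(value: float, bits: int, lo: float = -2.0, hi: float = 2.0) -> int:
    """Map a value to an integer symbol in [0, 2**bits - 1].

    Uniform buckets over [lo, hi] are an assumption made for brevity.
    """
    levels = 1 << bits
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * levels)
    return min(idx, levels - 1)


def z_order_key(symbols: Sequence[int], bits: int) -> int:
    """Interleave the bits of all symbols, most significant bits first."""
    key = 0
    for b in range(bits - 1, -1, -1):
        for s in symbols:
            key = (key << 1) | ((s >> b) & 1)
    return key


def sortable_key(series: Sequence[float], segments: int = 4, bits: int = 4) -> int:
    """Compute a single integer key under which series can be sorted."""
    symbols = [quantize(v, bits) for v in paa(series, segments)]
    return z_order_key(symbols, bits)


def pack_leaves(sorted_items: Sequence, leaf_size: int) -> List[list]:
    """Pack already-sorted items into contiguous leaves, each full except possibly the last."""
    return [list(sorted_items[i:i + leaf_size])
            for i in range(0, len(sorted_items), leaf_size)]
```

In this sketch, sorting a collection with `sorted(data, key=sortable_key)` and passing the result to `pack_leaves` yields densely populated leaves written in key order, which is the property that makes sort-based bulk loading possible.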

Keywords

Data series · Indexing structures · Streaming data series

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. FORTH-ICS, Heraklion, Greece
  2. Harvard University, Cambridge, USA
  3. Paris Descartes University, Paris, France
