Big Sequence Management: A glimpse of the Past, the Present, and the Future

Palpanas, Themis

doi:10.1007/978-3-662-49192-8_6

Big Sequence Management: A glimpse of the Past, the Present, and the Future

Themis Palpanas¹⁶

Conference paper
First Online: 08 January 2016

1114 Accesses
26 Citations

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9587))

Abstract

There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of sequences, or data series. Examples of such applications come from biology, astronomy, entomology, the web, and other domains. It is not unusual for these applications to involve numbers of data series in the order of hundreds of millions to billions, which are often times not analyzed in their full detail due to their sheer size. In this work, we describe recent efforts in designing techniques for indexing and mining truly massive collections of data series that will enable scientists to easily analyze their data. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce solutions to this problem. Furthermore, we discuss novel techniques that adaptively create data series indexes, allowing users to correctly answer queries before the indexing task is finished. We also show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments using one billion data series. Finally, we present our vision for the future in big sequence management research.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
For the rest of this paper, we are going to use the terms data series and sequence interchangeably.
2.
A more detailed analysis of the topics discussed in this paper can be found in our previous studies [12, 13, 17, 18, 29, 36–38, 57–59].

References

Adhd-200 (2011). http://fcon\_1000.projects.nitrc.org/indi/adhd200/
Google Scholar
Sloan digital sky survey (2015). https://www.sdss3.org/dr10/data_access/volume.php
Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: Lomet, D.B. (ed.) FODO 1993. LNCS, vol. 730, pp. 69–84. Springer, Heidelberg (1993)
Chapter Google Scholar
An, N., Kanth, R., Kothuri, V., Ravada, S.: Improving performance with bulk-inserts in oracle r-trees. In: VLDB, pp. 948–951. VLDB Endowment (2003)
Google Scholar
Assent, L., Krieger, R., Afschari, F., Seidl, T.: The TS-tree: efficient time series search and retrieval. In EDBT (2008)
Google Scholar
Aßfalg, J., Kriegel, H.-P., Kröger, P., Renz, M.: Probabilistic similarity search for uncertain time series. In: Winslett, M. (ed.) SSDBM 2009. LNCS, vol. 5566, pp. 435–443. Springer, Heidelberg (2009)
Chapter Google Scholar
Astrahan, M.M., Blasgen, M.W., Chamberlin, D.D., Eswaran, K.P., Gray, J., Griffiths, P.P., King, W.F., Lorie, R.A., McJones, P.R., Mehl, J.W., Putzolu, G.R., Traiger, I.L., Wade, B.W., Watson, V.: System R: relational approach to database management. TODS 1(2), 97–137 (1976)
Article Google Scholar
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Article MATH MathSciNet Google Scholar
Berchtold, S., Keim, D.A., Kriegel, H.P.: The X-tree: an index structure for high-dimensional data. In: VLDB, pp. 28–39 (1996)
Google Scholar
Bernstein, P., Bykov, S., Geller, A., Kliot, G., Thelin, J.: Orleans: distributed virtual actors for programmability and scalability. MSR-TR-2014-41 (2014)
Google Scholar
Bu, Y., wing Leung, T., chee Fu, A.W., Keogh, E., Pei, J., Meshkin, S.: Wat: finding top-k discords in time series database. In: SDM, pp. 449–454 (2007)
Google Scholar
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: indexing and mining one billion time series. In: ICDM (2010)
Google Scholar
Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.J.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS 39(1), 123–151 (2014)
Google Scholar
Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD (2002)
Google Scholar
Chan, K.-P., Fu. A.-C.: Efficient time series matching by wavelets. In: ICDE (1999)
Google Scholar
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–58 (2009)
Article Google Scholar
Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: return to the basics. PVLDB 5(11), 1662–1673 (2012)
Google Scholar
Dallachiesa, M., Palpanas, T., Ilyas, I.F.: Top-k nearest neighbor search in uncertain data series. PVLDB 8(1), 13–24 (2014)
Google Scholar
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1, 1542–1552 (2008)
Google Scholar
Soisalon-Soininen, E., Widmayer, P.: Single and bulk updates in stratified trees: an amortized and worst-case analysis. In: Klein, R., Six, H.-W., Wegner, L. (eds.) Computer Science in Perspective. LNCS, vol. 2598, pp. 278–292. Springer, Heidelberg (2003)
Chapter Google Scholar
Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: SIGMOD (1984)
Google Scholar
Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comp. Int. Mag. 9(3), 27–39 (2014)
Article Google Scholar
Van den Bercken, J., Seeger, B.: An evaluation of generic bulk loading techniques. In: VLDB, pp. 461–470 (2001)
Google Scholar
Van den Bercken, J., Widmayer, P., Seeger, B.: A generic approach to bulk loading multidimensional index structures. In: VLDB (1997)
Google Scholar
Kadiyala, S., Shiri, N.: A compact multi-resolution index for variable length queries in time series databases. KAIS 15(2), 131–147 (2008)
Google Scholar
Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP (1999)
Google Scholar
Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: KDD (2011)
Google Scholar
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3(3), 263–286 (2000)
Google Scholar
Keogh, E.J., Palpanas, T., Zordan, V.B., Gunopulos, D., Cardle, M.: Indexing large human-motion databases. In: VLDB, pp. 780–791 (2004)
Google Scholar
Arge, L., Hinrichs, K.H., Vahrenhold, J., Vitter, J.V.: Efficient bulk operations on dynamic R-trees. Algorithmica 33(1), 104–128 (2002)
Article MATH MathSciNet Google Scholar
Lerner, A., Shasha, D.: Aquery: query language for ordered data, optimization techniques, and experiments. In: VLDB (2003)
Google Scholar
Li, C.S., Yu, P., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE (1996)
Google Scholar
Liao, H., Han, J., Fang, J.: Multi-dimensional index on hadoop distributed file system. In: NAS (2010)
Google Scholar
Lin, J., Keogh, E., Lonardi, S.: A symbolic representation of time series, with implications for streaming algorithms. In: DMKD (2003)
Google Scholar
Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39(2), 287–315 (2012)
Article Google Scholar
Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Rec. 44(2), 47–52 (2015)
Article Google Scholar
Palpanas, T., Vlachos, M., Keogh, E.J., Gunopulos, D.: Streaming time series summarization using user-defined amnesic functions. IEEE Trans. Knowl. Data Eng. 20(7), 992–1006 (2008)
Article Google Scholar
Palpanas, T., Vlachos, M., Keogh, E.J., Gunopulos, D., Truppel, W.: Online amnesic approximation of streaming time series. In: ICDE, pp. 339–349 (2004)
Google Scholar
Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. In: SIGMOD (1997)
Google Scholar
Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD (2012)
Google Scholar
Raman, V., Attaluri, G.K., Barber, R., Chainani, N., Kalmuk, D., KulandaiSamy, V., Leenstra, J., Lightstone, S., S. Liu, S., Lohman, G.M., Malkemus, T., Müller, R., Pandis, I., Schiefer, B., Sharpe, D., Sidle, R., Storm, A.J., Zhang, L.: DB2 with BLU acceleration: so much more than just a column store. PVLDB 6(11), 1080–1091 (2013)
Google Scholar
Raza, U., Camerra, A., Murphy, A.L., Palpanas, T., Picco, G.P.: Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng. 27(8), 2231–2244 (2015)
Article Google Scholar
Choubey, R., Chen, L., Rundensteiner, E.A.: GBI: a generalized R-tree bulk-insertion strategy. In: Güting, R.H., Papadias, D., Lochovsky, F.H. (eds.) SSD 1999. LNCS, vol. 1651, pp. 91–108. Springer, Heidelberg (1999)
Chapter Google Scholar
Sadri, R., Zaniolo, C., Zarkesh, A.M., Adibi, J.: A sequential pattern query language for supporting instant data mining for e-services. In: VLDB (2001)
Google Scholar
Sarangi, S.R., Murthy, K.: DUST: a generalized notion of similarity between uncertain time series. In: KDD (2010)
Google Scholar
Schäfer, P., Högqvist, M.: SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets. In: EDBT (2012)
Google Scholar
Shasha, D.: Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)
Google Scholar
Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. DMKD 19(1), 24–57 (2009)
MathSciNet Google Scholar
Shieh, J., Keogh, E.J.: iSAX: indexing and mining terabyte sized time series. In: KDD, pp. 623–631 (2008)
Google Scholar
Stonebraker, M., Abadi, M., Batkin, D.J., Chen, J. X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E.J., O’Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: a column-oriented DBMS. In: VLDB (2005)
Google Scholar
Stonebraker, M., Brown, P., Poliakov, A., Raman, S.: The architecture of SciDB. In: Bayard Cushing, J., French, J., Bowers, S. (eds.) SSDBM 2011. LNCS, vol. 6809, pp. 1–16. Springer, Heidelberg (2011)
Chapter Google Scholar
Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10), 793–804 (2013)
Google Scholar
Warren Liao, T.: Clustering of time series data - a survey. Pattern Recogn. 38(11), 1857–1874 (2005)
Article MATH Google Scholar
Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: KDD (2009)
Google Scholar
Yeh, M., Wu, K., Yu, P.S., Chen, M.: PROUD: a probabilistic approach to processing similarity queries over uncertain data streams. In: EDBT (2009)
Google Scholar
Yi, B., Faloutsos, C.: Fast time sequence indexing for arbitrary Lp norms. In: VLDB (2000)
Google Scholar
Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD (2014)
Google Scholar
Zoumpatianos, K., Idreos, S., Palpanas, T.: RINSE: interactive data series exploration with ADS+. PVLDB 8(12), 1912–1923 (2015)
Google Scholar
Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: KDD (2015)
Google Scholar

Download references

Acknowledgements

I would like to thank my collaborators (in alphabetical order): Alessandro Camerra, Johannes Gehrke, Stratos Idreos, Eamonn Keogh, Michele Linardi, and Yin Lou. Special thanks go to Kostas Zoumpatianos, who has been the driving force behind several of the ideas discussed in this paper.

Author information

Authors and Affiliations

Paris Descartes University, Paris, France
Themis Palpanas

Authors

Themis Palpanas
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Themis Palpanas .

Editor information

Editors and Affiliations

University of Latvia, Riga, Latvia
Rūsiņš Mārtiņš Freivalds
University of Paderborn, Paderborn, Germany
Gregor Engels
University of Genoa, Genoa, Italy
Barbara Catania

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Palpanas, T. (2016). Big Sequence Management: A glimpse of the Past, the Present, and the Future. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science(), vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_6

Download citation

DOI: https://doi.org/10.1007/978-3-662-49192-8_6
Published: 08 January 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49191-1
Online ISBN: 978-3-662-49192-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics