Parallel Time Series Modeling - A Case Study of In-Database Big Data Analytics

Qian, Hai; Yang, Shengwen; Iyer, Rahul; Feng, Xixuan; Wellons, Mark; Welton, Caleb

doi:10.1007/978-3-319-13186-3_38

Hai Qian, Shengwen Yang, Rahul Iyer, Xixuan Feng, Mark Wellons & Caleb Welton

Conference paper
First Online: 26 November 2014

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)

Abstract

MADlib is an open-source library for scalable in-database analytics. In this paper, we present our parallel design of time series analysis and our implementation of ARIMA modeling in MADlib’s framework. The algorithms for fitting time series models are intrinsically sequential, since any calculation for a specific time \(t\) depends on the result from the previous time step \(t-1\). Our solution parallelizes this computation by splitting the data into \(n\) chunks. Since the model fitting involves multiple iterations, we use the results from the previous iteration as the initial values for each chunk in the current iteration. Thus the computation for each chunk of data does not depend on the results from the previous chunk in the same iteration. We further improve performance by redistributing the original data such that each chunk can be loaded into memory, minimizing communication overhead. Experiments show that our parallel implementation achieves good speed-up when compared to a sequential version of the algorithm and to R’s default implementation in the “stats” package.
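
The chunking idea in the abstract can be illustrated with a minimal sketch. It is not MADlib's implementation. It assumes an ARMA(1,1) residual recursion \(e_t = x_t - \phi x_{t-1} - \theta e_{t-1}\) with fixed coefficients, and the function names `sequential_residuals` and `chunked_residuals` are hypothetical. Each chunk seeds its recursion with the residual that the previous iteration produced just before the chunk boundary, so the chunks within one iteration have no dependence on each other.

```python
def sequential_residuals(x, phi, theta):
    """Reference recursion: every step needs the residual of the step before it."""
    resid = []
    prev_x, prev_e = 0.0, 0.0  # assumed zero pre-sample values
    for value in x:
        e = value - phi * prev_x - theta * prev_e
        resid.append(e)
        prev_x, prev_e = value, e
    return resid


def chunked_residuals(x, phi, theta, n_chunks, prev_resid):
    """One iteration over independent chunks.

    The residual just before each chunk boundary is read from prev_resid
    (the previous iteration's output), not from the neighbouring chunk,
    so the outer loop over chunks could run in parallel.
    """
    if not x:
        return []
    size = -(-len(x) // n_chunks)  # ceiling division
    resid = [0.0] * len(x)
    for start in range(0, len(x), size):
        prev_x = x[start - 1] if start > 0 else 0.0
        prev_e = prev_resid[start - 1] if start > 0 else 0.0
        for t in range(start, min(start + size, len(x))):
            e = x[t] - phi * prev_x - theta * prev_e
            resid[t] = e
            prev_x, prev_e = x[t], e
    return resid


# Illustrative data and coefficients (assumptions, not taken from the paper).
series = [1.0, 2.0, 0.5, -1.0, 3.0, 2.5, 0.0, 1.5]
estimate = [0.0] * len(series)  # crude initial guess for the first iteration
for _ in range(4):
    estimate = chunked_residuals(series, 0.5, 0.3, 4, estimate)
```

With fixed coefficients, the first chunk is exact after one iteration and each further iteration makes at least one more chunk exact, so the chunked result matches the sequential one after at most as many iterations as there are chunks. In an actual fitting procedure the coefficients change between iterations, so the boundary values are approximations that improve as the fit converges.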

The first four authors contributed equally to this work.

Mark Wellons was an intern with Pivotal Inc. during the completion of this work.

Author information

Authors and Affiliations

Predictive Analytics Team, Pivotal Inc., Palo Alto, USA
Hai Qian, Shengwen Yang, Rahul Iyer, Xixuan Feng & Caleb Welton
Amazon.Com Inc., Seattle, USA
Mark Wellons

Corresponding author

Correspondence to Hai Qian.

Editor information

Editors and Affiliations

National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Google Research, Mountain View, California, USA
Haixun Wang
University of Melbourne, Melbourne, Victoria, Australia
James Bailey
National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu Bao Ho
Nanjing University, Nanjing, China
Zhi-Hua Zhou
National Chengchi University, Taipei, Taiwan
Arbee L.P. Chen

About this paper

Cite this paper

Qian, H., Yang, S., Iyer, R., Feng, X., Wellons, M., Welton, C. (2014). Parallel Time Series Modeling - A Case Study of In-Database Big Data Analytics. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_38

DOI: https://doi.org/10.1007/978-3-319-13186-3_38
Published: 26 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science, Computer Science (R0)
