Optimizing Intelligent Reduction Techniques for Big Data

Pop, Florin; Negru, Catalin; Ciolofan, Sorin N.; Mocanu, Mariana; Cristea, Valentin

doi:10.1007/978-3-319-30265-2_3

Part of the book series: Studies in Big Data (SBD, volume 18)

Abstract

Working with large volumes of data collected by many applications and held in multiple storage locations is both challenging and rewarding. Extracting valuable information from data requires combining qualitative and quantitative analysis techniques. One of the main promises of analytics is data reduction, whose primary function is to support decision-making. This chapter is motivated by a new generation of applications (social media, smart cities, cyber-infrastructures, environment monitoring and control, healthcare, etc.) that produce big data and introduce many new mechanisms for data creation rather than new mechanisms for data storage. The goal of the chapter is to analyze existing techniques for data reduction at scale, in order to facilitate the optimization and understanding of Big Data processing. The chapter covers the following subjects: data manipulation, analytics, and Big Data reduction techniques, considering descriptive, predictive and prescriptive analytics. The CyberWater case study is presented with reference to the optimization process and to the monitoring, analysis and control of natural resources, especially water resources, in order to preserve water quality.

Notes

1.
These solutions were grouped in the Special Issue on “Modern Dimension Reduction Methods for Big Data Problems in Ecology”, edited by Wikle, Holan and Hooten, in the Journal of Agricultural, Biological, and Environmental Statistics.

Acknowledgments

The research presented in this paper is supported by the following projects: the CyberWater grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, project number 47/2012; CLUeFARM: Information system based on cloud services accessible through mobile devices, to increase product quality and business development farms (PN-II-PT-PCCA-2013-4-0870); DataWay: Real-time Data Processing Platform for Smart Cities: Making sense of Big Data (PN-II-RU-TE-2014-4-2731); and MobiWay: Mobility Beyond Individualism: an Integrated Platform for Intelligent Transportation Systems of Tomorrow (PN-II-PT-PCCA-2013-4-0321).

Author information

Authors and Affiliations

Faculty of Automatic Control and Computers, Computer Science Department, University Politehnica of Bucharest, Bucharest, Romania
Florin Pop, Catalin Negru, Sorin N. Ciolofan, Mariana Mocanu & Valentin Cristea

Corresponding author

Correspondence to Florin Pop.

Editor information

Editors and Affiliations

Aston Business School, Aston University, Birmingham, United Kingdom
Ali Emrouznejad

About this chapter

Cite this chapter

Pop, F., Negru, C., Ciolofan, S.N., Mocanu, M., Cristea, V. (2016). Optimizing Intelligent Reduction Techniques for Big Data. In: Emrouznejad, A. (eds) Big Data Optimization: Recent Developments and Challenges. Studies in Big Data, vol 18. Springer, Cham. https://doi.org/10.1007/978-3-319-30265-2_3

DOI: https://doi.org/10.1007/978-3-319-30265-2_3
Published: 27 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30263-8
Online ISBN: 978-3-319-30265-2
eBook Packages: Engineering, Engineering (R0)
