Abstract
Working with large volumes of data collected by many applications and held in multiple storage locations is both challenging and rewarding. Extracting valuable information from data requires combining qualitative and quantitative analysis techniques. One of the main promises of analytics is data reduction, whose primary function is to support decision-making. The motivation for this chapter comes from the new age of applications (social media, smart cities, cyber-infrastructures, environment monitoring and control, healthcare, etc.), which produce big data and introduce many new mechanisms for data creation rather than new mechanisms for data storage. The goal of this chapter is to analyze existing techniques for data reduction at scale, in order to facilitate the optimization and understanding of Big Data processing. The chapter covers the following subjects: data manipulation, analytics, and Big Data reduction techniques, considering descriptive, predictive, and prescriptive analytics. The CyberWater case study is presented with reference to the optimization process and to the monitoring, analysis, and control of natural resources, especially water resources, in order to preserve water quality.
Notes
- 1. These solutions were grouped in the Special Issue on “Modern Dimension Reduction Methods for Big Data Problems in Ecology”, edited by Wikle, Holan and Hooten, in the Journal of Agricultural, Biological, and Environmental Statistics.
Acknowledgments
The research presented in this paper is supported by the following projects: CyberWater, grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, project number 47/2012; CLUeFARM: Information system based on cloud services accessible through mobile devices, to increase product quality and business development farms (PN-II-PT-PCCA-2013-4-0870); DataWay: Real-time Data Processing Platform for Smart Cities: Making sense of Big Data (PN-II-RU-TE-2014-4-2731); MobiWay: Mobility Beyond Individualism: an Integrated Platform for Intelligent Transportation Systems of Tomorrow (PN-II-PT-PCCA-2013-4-0321).
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Pop, F., Negru, C., Ciolofan, S.N., Mocanu, M., Cristea, V. (2016). Optimizing Intelligent Reduction Techniques for Big Data. In: Emrouznejad, A. (eds) Big Data Optimization: Recent Developments and Challenges. Studies in Big Data, vol 18. Springer, Cham. https://doi.org/10.1007/978-3-319-30265-2_3
DOI: https://doi.org/10.1007/978-3-319-30265-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30263-8
Online ISBN: 978-3-319-30265-2
eBook Packages: Engineering, Engineering (R0)