Abstract
Procedures for sequentially updating information are important for processing “big data streams” because they avoid accumulating and storing large data sets. As a model of information accumulation, we study the Bayesian updating procedure for linear experiments. Analyzing and gradually transforming the original processing scheme to increase its efficiency leads to certain mathematical structures: information spaces. We show that processing can be simplified by introducing a special intermediate form of information representation. Thanks to the rich algebraic properties of the corresponding information space, this form unifies information updating and makes it more efficient. It also leads to various parallelization options for the inherently sequential Bayesian procedure, which are suited to distributed data processing platforms such as MapReduce. In addition, we will see how a certain formalization of the concept of information and its algebraic properties can arise simply from adapting data processing to big data demands. The approaches and concepts developed in the paper make it possible to increase the efficiency and uniformity of data processing and present a systematic approach to transforming sequential processing into parallel processing.
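The abstract does not spell out the intermediate form, so the following is only a minimal sketch under stated assumptions, not the chapter's own construction. It assumes a scalar unknown x, measurements y = a*x + noise with known noise variance, and a Gaussian prior. Each piece of information is stored as a precision-weighted pair that combines by plain addition. The names Info, EMPTY, from_measurement, from_prior, estimate and sequential_bayes are hypothetical, introduced for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Info:
    """Hypothetical intermediate form: accumulated information about scalar x."""
    precision: float  # sum of a^2 / noise_var
    weighted: float   # sum of a * y / noise_var

    def __add__(self, other: "Info") -> "Info":
        # Combining independent pieces of information is component-wise
        # addition, which is associative and commutative.
        return Info(self.precision + other.precision,
                    self.weighted + other.weighted)


EMPTY = Info(0.0, 0.0)  # neutral element: "no information"


def from_measurement(a: float, y: float, noise_var: float) -> Info:
    """Map step: turn one observation y = a*x + noise into Info."""
    return Info(a * a / noise_var, a * y / noise_var)


def from_prior(mean: float, var: float) -> Info:
    """Express a Gaussian prior N(mean, var) in the same form."""
    return Info(1.0 / var, mean / var)


def estimate(info: Info) -> tuple[float, float]:
    """Recover posterior mean and variance from accumulated information."""
    return info.weighted / info.precision, 1.0 / info.precision


def sequential_bayes(mean, var, data):
    """Classical one-observation-at-a-time Bayesian update, for comparison."""
    for a, y, noise_var in data:
        gain = var * a / (a * a * var + noise_var)
        mean = mean + gain * (y - a * mean)
        var = (1.0 - gain * a) * var
    return mean, var


# Illustrative (a, y, noise_var) triples; the values are made up.
data = [(1.0, 2.1, 0.5), (2.0, 3.9, 1.0), (0.5, 1.2, 0.25), (1.5, 2.8, 2.0)]

# Two "workers" each reduce their own chunk; the partial results are then merged.
left = reduce(lambda acc, d: acc + from_measurement(*d), data[:2], EMPTY)
right = reduce(lambda acc, d: acc + from_measurement(*d), data[2:], EMPTY)

parallel = estimate(from_prior(0.0, 10.0) + left + right)
sequential = sequential_bayes(0.0, 10.0, data)
```

Because the combination is associative and commutative with a neutral element, the chunks can be reduced in any order or grouping, which is what lets an inherently sequential update fit a MapReduce-style platform. Up to floating-point rounding, parallel and sequential hold the same posterior mean and variance.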
Acknowledgments
The reported study was supported by RFBR, research project number 19-29-09044.
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Golubtsov, P. (2020). Scalability and Parallelization of Sequential Processing: Big Data Demands and Information Algebras. In: Hu, Z., Petoukhov, S., He, M. (eds) Advances in Intelligent Systems, Computer Science and Digital Economics. CSDEIS 2019. Advances in Intelligent Systems and Computing, vol 1127. Springer, Cham. https://doi.org/10.1007/978-3-030-39216-1_25
DOI: https://doi.org/10.1007/978-3-030-39216-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39215-4
Online ISBN: 978-3-030-39216-1
eBook Packages: Intelligent Technologies and Robotics (R0)