Scalability and Parallelization of Sequential Processing: Big Data Demands and Information Algebras

Golubtsov, Peter

doi:10.1007/978-3-030-39216-1_25

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1127)

Included in the following conference series:

The International Symposium on Computer Science, Digital Economy and Intelligent Systems

Abstract

Sequential information-updating procedures are important for processing “big data streams” because they avoid accumulating and storing large data sets. As a model of information accumulation, we study the Bayesian updating procedure for linear experiments. Analyzing the original processing scheme and gradually transforming it to increase its efficiency leads to certain mathematical structures: information spaces. We show that processing can be simplified by introducing a special intermediate form of information representation. Thanks to the rich algebraic properties of the corresponding information space, this form unifies information updating and makes it more efficient. It also leads to various parallelization options for the inherently sequential Bayesian procedure, options that are suited to distributed data processing platforms such as MapReduce. In addition, we will see how a certain formalization of the concept of information and its algebraic properties can arise simply from adapting data processing to big data demands. The approaches and concepts developed in the paper make it possible to increase the efficiency and uniformity of data processing and present a systematic approach to transforming sequential processing into parallel processing.
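
The abstract does not spell out the intermediate representation, so the following is only a minimal sketch under stated assumptions, not the paper's construction. It assumes scalar measurements y = a·x + noise with known Gaussian noise variance and a Gaussian prior on a two-dimensional x, and it uses the standard canonical (information) form of a Gaussian as a stand-in for the "intermediate form": a pair (eta, Lam) that is merged by plain addition. Because addition is associative and commutative, the same posterior can be reached sequentially or by reducing chunks independently, which is the shape a MapReduce job would take. All function names and data values are hypothetical.

```python
from functools import reduce

# Information is modelled as a pair (eta, lam): eta is a 2-vector, lam a 2x2 matrix.
# This canonical-form representation is an assumption, not taken from the paper.

def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matvec(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def info_from_prior(mean, cov):
    """Canonical form of the prior N(mean, cov): (cov^-1 mean, cov^-1)."""
    lam = inv2(cov)
    return (matvec(lam, mean), lam)

def info_from_measurement(a, y, var):
    """Information in one measurement y = a.x + noise, noise ~ N(0, var)."""
    eta = [a[i] * y / var for i in range(2)]
    lam = [[a[i] * a[j] / var for j in range(2)] for i in range(2)]
    return (eta, lam)

def combine(p, q):
    """Merge two pieces of information by component-wise addition."""
    eta = [p[0][i] + q[0][i] for i in range(2)]
    lam = [[p[1][i][j] + q[1][i][j] for j in range(2)] for i in range(2)]
    return (eta, lam)

def estimate(info):
    """Recover the posterior mean and covariance from accumulated information."""
    eta, lam = info
    cov = inv2(lam)
    return matvec(cov, eta), cov

# Hypothetical data: noisy looks at an unknown x that is roughly (1, 2).
NOISE_VAR = 0.1
data = [([1.0, 0.0], 1.1), ([0.0, 1.0], 1.9),
        ([1.0, 1.0], 3.05), ([1.0, -1.0], -0.9)]
prior = info_from_prior([0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])

def to_info(record):
    """'Map' step: turn one raw record into its piece of information."""
    a, y = record
    return info_from_measurement(a, y, NOISE_VAR)

# Sequential updating: fold the measurements into the prior one at a time.
seq_mean, seq_cov = estimate(reduce(combine, map(to_info, data), prior))

# Parallel-style updating: reduce each chunk on its own, then merge the partials.
chunks = [data[:2], data[2:]]
partials = [reduce(combine, map(to_info, chunk)) for chunk in chunks]
par_mean, par_cov = estimate(reduce(combine, partials, prior))

print("sequential:", seq_mean)
print("chunked:   ", par_mean)
```

In this sketch the raw measurements never need to be stored together: each worker keeps only a fixed-size pair, and the expensive matrix inversion happens once, when an estimate is actually requested.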

Acknowledgments

The reported study was supported by RFBR, research project number 19-29-09044.

Author information

Authors and Affiliations

Lomonosov Moscow State University, Moscow, Russia
Peter Golubtsov
National Research University Higher School of Economics, Moscow, Russia
Peter Golubtsov
Russian Institute for Scientific and Technical Information (VINITI RAS), Moscow, Russia
Peter Golubtsov

Corresponding author

Correspondence to Peter Golubtsov.

Editor information

Editors and Affiliations

School of Educational Information Technology, Central China Normal University, Wuhan, Hubei, China
Zhengbing Hu
Mechanical Engineering Research Institute, Russian Academy of Sciences, Moscow, Russia
Sergey Petoukhov
Halmos College of Natural Sciences and Oceanography, Nova Southeastern University, Davie, FL, USA
Matthew He

About this paper

Cite this paper

Golubtsov, P. (2020). Scalability and Parallelization of Sequential Processing: Big Data Demands and Information Algebras. In: Hu, Z., Petoukhov, S., He, M. (eds) Advances in Intelligent Systems, Computer Science and Digital Economics. CSDEIS 2019. Advances in Intelligent Systems and Computing, vol 1127. Springer, Cham. https://doi.org/10.1007/978-3-030-39216-1_25

DOI: https://doi.org/10.1007/978-3-030-39216-1_25
Published: 24 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39215-4
Online ISBN: 978-3-030-39216-1
eBook Packages: Intelligent Technologies and Robotics (R0)