Scalability and Parallelization of Sequential Processing: Big Data Demands and Information Algebras

  • Conference paper
Advances in Intelligent Systems, Computer Science and Digital Economics (CSDEIS 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1127)

Abstract

Procedures for sequentially updating information are important for processing “big data streams” because they avoid accumulating and storing large data sets. As a model of information accumulation, we study the Bayesian updating procedure for linear experiments. Analyzing the original processing scheme and gradually transforming it to increase its efficiency leads to certain mathematical structures: information spaces. We show that processing can be simplified by introducing a special intermediate form of information representation. Thanks to the rich algebraic properties of the corresponding information space, this form unifies information updating and makes it more efficient. It also leads to various parallelization options for the inherently sequential Bayesian procedure, which are suited to distributed data processing platforms such as MapReduce. In addition, we show how a certain formalization of the concept of information and its algebraic properties can arise simply from adapting data processing to big data demands. The approaches and concepts developed in the paper make it possible to increase the efficiency and uniformity of data processing and provide a systematic approach to transforming sequential processing into parallel processing.
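
The idea of an intermediate, algebraically well-behaved form of information can be illustrated with a minimal sketch. It is not the paper's construction: it assumes the simplest possible case, a scalar unknown x observed through measurements y = a·x + noise with known Gaussian noise variance s and a Gaussian prior. All names (`Info`, `combine`, `estimate_by_chunks`, and so on) are hypothetical and chosen for illustration. Each measurement is mapped to a pair (t, v) = (a²/s, a·y/s); pairs are combined by addition, which is associative and commutative, so partial results can be computed on separate chunks and merged in any order, in the spirit of a map step followed by a reduce step.

```python
from functools import reduce
from typing import List, NamedTuple, Sequence, Tuple

Measurement = Tuple[float, float, float]  # (a, y, s): y = a*x + noise, Var(noise) = s


class Info(NamedTuple):
    """Additive summary of scalar linear-Gaussian evidence about x (illustrative)."""
    t: float  # accumulated precision: sum of a*a/s
    v: float  # accumulated precision-weighted data: sum of a*y/s


EMPTY = Info(0.0, 0.0)  # neutral element: "no information"


def from_prior(mean: float, var: float) -> Info:
    return Info(1.0 / var, mean / var)


def from_measurement(m: Measurement) -> Info:
    a, y, s = m
    return Info(a * a / s, a * y / s)


def combine(p: Info, q: Info) -> Info:
    """Merging two pieces of information is plain addition."""
    return Info(p.t + q.t, p.v + q.v)


def posterior(info: Info) -> Tuple[float, float]:
    """Recover (mean, variance) from the accumulated information."""
    return info.v / info.t, 1.0 / info.t


def estimate_sequentially(mean: float, var: float,
                          data: Sequence[Measurement]) -> Tuple[float, float]:
    """Classical one-measurement-at-a-time Bayesian update, for comparison."""
    for a, y, s in data:
        gain = var * a / (a * a * var + s)
        mean = mean + gain * (y - a * mean)
        var = (1.0 - gain * a) * var
    return mean, var


def estimate_by_chunks(mean: float, var: float, data: Sequence[Measurement],
                       n_chunks: int) -> Tuple[float, float]:
    """Map each chunk to a partial Info, then reduce the partials with the prior."""
    chunks: List[Sequence[Measurement]] = [data[i::n_chunks] for i in range(n_chunks)]
    # Each partial sum depends only on its own chunk, so it could run on a separate worker.
    partials = [reduce(combine, map(from_measurement, chunk), EMPTY) for chunk in chunks]
    return posterior(reduce(combine, partials, from_prior(mean, var)))


if __name__ == "__main__":
    data = [(1.0, 2.1, 0.5), (2.0, 3.9, 1.0), (0.5, 1.2, 0.25), (1.5, 2.8, 2.0)]
    print(estimate_sequentially(0.0, 10.0, data))
    print(estimate_by_chunks(0.0, 10.0, data, n_chunks=2))
```

Up to floating-point rounding, both functions return the same posterior mean and variance. The chunked version never needs the measurements in any particular order, which is the property that makes the sequential procedure amenable to distributed execution in this simplified setting.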

Acknowledgments

The reported study was supported by RFBR, research project number 19-29-09044.

Author information

Corresponding author

Correspondence to Peter Golubtsov.

Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Golubtsov, P. (2020). Scalability and Parallelization of Sequential Processing: Big Data Demands and Information Algebras. In: Hu, Z., Petoukhov, S., He, M. (eds) Advances in Intelligent Systems, Computer Science and Digital Economics. CSDEIS 2019. Advances in Intelligent Systems and Computing, vol 1127. Springer, Cham. https://doi.org/10.1007/978-3-030-39216-1_25
