Big Data Thinning: Knowledge Discovery from Relevant Data

  • Naji Shehab
  • Christos Anagnostopoulos
Part of the Internet of Things book series (ITTCC)


Drawing on statistical learning theory and machine learning techniques built around the principles of Rival Penalised Competitive Learning (RPCL), this chapter proposes a novel approach to Big Data Thinning, i.e., analysing only the potentially relevant data sub-spaces rather than the entire data space. Data scientists, data analysts, IoT applications and Edge-centric services all need predictive modelling and analytics. The approach meets this need by learning from previously issued analytics queries and exploiting their access patterns over large distributed data-sets, which reveal the most interesting and important sub-spaces for further exploratory analysis. User queries are analysed and mapped to relatively small-scale local regression models, which can yield higher predictive accuracy. Thinning the data space in this way frees it of irrelevant and unpopular data sub-spaces and therefore uses fewer training data instances. Experimental results and statistical analysis support the research idea proposed in this work.
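
As a rough illustration of the competitive-learning idea named above, the following minimal Python sketch applies the general RPCL update rule to a stream of query vectors: the closest prototype (the winner) is pulled toward each query, while the second closest (the rival) is pushed slightly away, so that surplus prototypes drift out of the popular sub-spaces. This is not the chapter's implementation. Representing a query as a plain numeric vector, the function names rpcl_step and fit_query_prototypes, and the learning-rate values are all assumptions made for illustration.

```python
import random


def rpcl_step(prototypes, wins, x, lr_win=0.05, lr_rival=0.002):
    """One RPCL update for query vector x (hypothetical helper).

    prototypes: list of lists of floats, updated in place (needs >= 2 entries).
    wins: per-prototype win counts, initialised to 1, updated in place.
    lr_win / lr_rival: assumed learning and de-learning rates.
    """
    total = sum(wins)

    def score(j):
        # Squared distance scaled by relative win frequency ("conscience"),
        # so prototypes that win often are handicapped.
        d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, prototypes[j]))
        return (wins[j] / total) * d2

    order = sorted(range(len(prototypes)), key=score)
    winner, rival = order[0], order[1]

    # Winner moves toward the query; rival is penalised and moves away.
    prototypes[winner] = [w + lr_win * (xi - w)
                          for xi, w in zip(x, prototypes[winner])]
    prototypes[rival] = [w - lr_rival * (xi - w)
                         for xi, w in zip(x, prototypes[rival])]
    wins[winner] += 1
    return winner, rival


def fit_query_prototypes(queries, prototypes, epochs=20, seed=0):
    """Run RPCL over past queries; returns (prototypes, wins)."""
    rng = random.Random(seed)
    wins = [1] * len(prototypes)
    for _ in range(epochs):
        batch = list(queries)
        rng.shuffle(batch)
        for q in batch:
            rpcl_step(prototypes, wins, q)
    return prototypes, wins
```

Under these assumptions, prototypes with high win counts mark the sub-spaces that queries visit most often. A local regression model could then be fitted per such prototype using only the data that falls near it, which is one way to read the thinning idea described in the abstract.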



This research is funded by the EU-H2020 GNFUV Project (Grant #645220) and the EU-H2020 MSCA INNOVATE Project (Grant #745829).



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Computing Science, University of Glasgow, Glasgow, UK
