Big data aggregation in the case of heterogeneity: a feasibility study for digital health

  • Alex Adim Obinikpo
  • Burak Kantarci
Original Article


In big data applications, missing data is an important factor that may reduce the value of the acquired data. Data can be lost either during acquisition or during storage: the former can result from faulty acquisition devices or non-responsive sensors, whereas the latter can result from hardware failures at the storage units. In this paper, we consider human activity recognition as a case study of a typical machine learning application on big datasets. We conduct a comprehensive feasibility study on the fusion of sensory data acquired from heterogeneous sources, and we present insights on aggregating heterogeneous datasets with minimal missing data values for future use. Our experiments on the accuracy, F1 score, and positive predictive value (PPV) of various key machine learning algorithms show that sensory data acquired by wearables are less vulnerable to missing data and to smaller training sets, whereas smart portable devices require larger training sets to reduce the impact of possibly missing data.
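For readers unfamiliar with the three evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, PPV (precision), and the F1 score are conventionally computed for a single positive class. The function name `binary_metrics` and the toy labels are illustrative assumptions; they are not taken from the paper or its datasets.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Accuracy, PPV (precision), recall and F1 for one positive class.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives

    accuracy = (tp + tn) / len(pairs)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0      # positive predictive value
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity
    # F1 is the harmonic mean of PPV and recall.
    f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) else 0.0
    return {"accuracy": accuracy, "ppv": ppv, "recall": recall, "f1": f1}


# Toy labels (assumed): 1 = activity of interest, 0 = any other activity.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a multi-class task such as activity recognition, the same quantities are typically computed per class and then averaged; the averaging scheme is a design choice that this sketch does not cover.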


Dedicated sensors, Non-dedicated sensors, Aggregation



This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) under RGPIN/2017-04032.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. University of Ottawa, Ottawa, Canada
