Distributed and Parallel Databases, Volume 37, Issue 3, pp 351–384

DataSynapse: A Social Data Curation Foundry

  • Amin Beheshti
  • Boualem Benatallah
  • Alireza Tabebordbar
  • Hamid Reza Motahari-Nezhad
  • Moshe Chai Barukh
  • Reza Nouri
Part of the following topical collections:
  1. Special Issue on Extending Data Warehouses to Big Data Analytics


Social data analytics has become a vital asset for organizations and governments. For example, over the last few years, governments have started to extract knowledge and derive insights from vastly growing open data to personalize advertisements in elections, improve government services, predict intelligence activities, and improve national security and public health. A key challenge in analyzing social data is to transform the raw data generated by social actors into curated data, i.e., contextualized data and knowledge that is maintained and made available for use by end-users and applications. To address this challenge, we present the notion of a knowledge lake, i.e., a contextualized Data Lake, which provides the foundation for big data analytics by automatically curating raw social data and preparing it for deriving insights. We present a social data curation foundry, namely DataSynapse, to enable analysts to engage with social data to uncover hidden patterns and generate insight. In DataSynapse, we present a scalable algorithm to transform social items (e.g., a Tweet in Twitter) into semantic items, i.e., contextualized and curated items. This algorithm offers customizable feature extraction to harness desired features from diverse data sources. To link contextualized information items to the domain knowledge, we present a scalable technique that leverages cross-document coreference resolution, assisting analysts in deriving targeted insights. DataSynapse is offered as an extensible and scalable microservice-based architecture that is publicly available on GitHub and supports networks such as Twitter, Facebook, GooglePlus and LinkedIn.
We adopt a typical scenario for analyzing urban social issues from Twitter as it relates to the government budget, to highlight how DataSynapse significantly improves the quality of extracted knowledge compared to the classical curation pipeline (in the absence of feature extraction, enrichment and domain-linking contextualization).


Keywords: Social networks analytics, Big data analytics, Knowledge lake, Data curation, Feature engineering



We acknowledge the Data to Decisions CRC (D2D CRC) and the Cooperative Research Centres Program for funding this research.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Macquarie University, Sydney, Australia
  2. University of New South Wales, Sydney, Australia
  3. EY AI Lab, Palo Alto, USA
