Journal of Intelligent Information Systems, Volume 51, Issue 2, pp 389–414

Unified domain-specific language for collecting and processing data of social media

  • Nikolay Butakov
  • Maxim Petrov
  • Ksenia Mukhina
  • Denis Nasonov
  • Sergey Kovalchuk


Data from social media are becoming increasingly important analysis material for social scientists, market analysts, and other stakeholders. This diversity of interests has led to a variety of crawling techniques and programming solutions. Nevertheless, these solutions lack the flexibility to satisfy the requirements of different users and individual crawling scenarios, which can range from a simple query to a complex multi-step workflow that requires data to be collected from different networks. To address this problem, our paper proposes an approach based on a domain-specific language (DSL) and an architecture for a distributed crawling system. The DSL has a declarative style: the user only describes the data needed, and the language is based on an ontological model of social networks and the essential crawling techniques. Thus, the crawling system can collect data from different online social networks within complex workflows, using various crawling methods implemented in a distributed computing environment.


Keywords: Social networks, Social media, Crawling, Domain-specific language, Ontology



This research was financially supported by the Ministry of Education and Science of the Russian Federation, Agreement #14.578.21.0196 (03.10.2016). Unique Identification RFMEFI57816X0196.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Nikolay Butakov (1), email author
  • Maxim Petrov (1)
  • Ksenia Mukhina (1)
  • Denis Nasonov (1)
  • Sergey Kovalchuk (1)

  1. ITMO University, Saint-Petersburg, Russian Federation
