Detection of Sociolinguistic Features in Digital Social Networks for the Detection of Communities


The emergence of digital social networks has transformed how society, social groups, and institutions communicate and express their opinions. Determining how language variation allows communities to be detected, together with the relevance of a specific vocabulary in digital assemblages, could lead to a better understanding of their dynamics and social foundations, and thus to better communication policies and intervention where necessary. The vocabulary considered here is the one proposed by the National Council of Accreditation of Colombia (Consejo Nacional de Acreditación, CNA) to define the quality evaluation parameters for universities in Colombia. The approach presented in this paper aims to determine the semantic spaces (sociolinguistic features) shared by social groups in digital social networks. It comprises five layers based on Design Science Research, integrated with techniques from Natural Language Processing (NLP), Computational Linguistics (CL), and Artificial Intelligence (AI). The approach is validated through a case study in which the semantic values of a set of institutional Twitter accounts belonging to Colombian universities are analyzed in terms of the 12 quality factors established by the CNA. The topics and the sociolect used by different actors in the university communities are also analyzed. The approach makes it possible to determine the sociolinguistic features of social groups in digital social networks and, when applied, to detect the words or concepts to which each actor of a social group (here, a university) gives the most importance in its vocabulary.
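The abstract does not state how word importance is scored. As a rough illustration only, one common way to surface the words a group favors is to compare each word's smoothed relative frequency in the group's texts against a general reference corpus. The function name `distinctive_words`, the add-one smoothing, and the toy token lists below are assumptions made for this sketch, not the authors' method.

```python
from collections import Counter


def distinctive_words(group_tokens, reference_tokens, top_n=3, smoothing=1.0):
    """Rank the group's words by the ratio of their smoothed relative
    frequency in the group to that in the reference corpus (hypothetical
    helper; the scoring choice is an assumption, not taken from the paper)."""
    group_counts = Counter(group_tokens)
    ref_counts = Counter(reference_tokens)
    vocab = set(group_counts) | set(ref_counts)

    # Additive smoothing keeps words unseen in the reference from dividing by zero.
    group_total = sum(group_counts.values()) + smoothing * len(vocab)
    ref_total = sum(ref_counts.values()) + smoothing * len(vocab)

    scores = {}
    for word in group_counts:
        group_freq = (group_counts[word] + smoothing) / group_total
        ref_freq = (ref_counts[word] + smoothing) / ref_total
        scores[word] = group_freq / ref_freq

    # Highest ratio first: words the group uses far more than the reference.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy data standing in for one actor's tweets and a general-language corpus.
student_tokens = "research research scholarship exam the the".split()
reference_tokens = "the the the the of of research exam news news".split()

ranked = distinctive_words(student_tokens, reference_tokens)
for word, score in ranked:
    print(f"{word}: {score:.2f}")
```

A ratio above 1 marks a word the group uses proportionally more than the reference corpus does; function words such as "the" tend to fall below 1 and drop out of the ranking.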






We would like to thank the Center for Excellence and Appropriation in Big Data and Data Analytics (CAOBA), Pontificia Universidad Javeriana, and the Ministry of Information Technologies and Telecommunications of the Republic of Colombia (MinTIC). The models and results presented in this challenge contributed to building the research capabilities of CAOBA. The author Edwin Puertas also thanks the Universidad Tecnológica de Bolívar.

Author information



Corresponding author

Correspondence to Edwin Puertas.

Ethics declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

About this article

Cite this article

Puertas, E., Moreno-Sandoval, L.G., Redondo, J. et al. Detection of Sociolinguistic Features in Digital Social Networks for the Detection of Communities. Cogn Comput 13, 518–537 (2021).


  • Sociolinguistic
  • Community discovery
  • Natural language processing
  • Social networks
  • Community detection