Comparative Analysis of Scientific Papers Collections via Topic Modeling and Co-authorship Networks

  • Fedor Krasnov
  • Alexander Dimentov
  • Mikhail Shvartsman
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1119)


In this paper, the authors present an approach to benchmarking collections of scientific journals based on the analysis of co-authorship graphs and a text model. The main methodological result is the Comparative Topic Modeling (CTM) technique. Applying time series to the metrics of co-authorship graphs made it possible to analyze trends in the development of author collaborations in scientific journals. A text model was created using machine learning methods. The content of journals was classified to determine the degree of authenticity both across journals and across their issues. Experiments were conducted on the archives of two journals in the field of rheumatology. The authors used public data sets from the SNAP research laboratory at Stanford University to benchmark the co-authorship network metrics. The research results can be applied to improve editorial strategies for developing co-authorship collaborations and the scientific excellence of content.
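As a rough illustration of what tracking co-authorship graph metrics over time can look like, the following minimal sketch builds one undirected co-authorship graph per publication year and reports its size and density. It is not the authors' implementation: the abstract does not say which metrics were used, so the choice of density is an assumption, and the `papers` records and the `coauthorship_metrics` helper are hypothetical names introduced only for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (publication year, list of author names).
papers = [
    (2017, ["A", "B"]),
    (2017, ["B", "C", "D"]),
    (2018, ["A", "B", "C"]),
    (2018, ["E"]),
]


def coauthorship_metrics(records):
    """Return {year: {"authors": n, "edges": m, "density": d}} per year."""
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for year, authors in records:
        unique = sorted(set(authors))
        nodes[year].update(unique)
        # Every pair of co-authors on one paper is an undirected edge;
        # sorting first keeps (A, B) and (B, A) from being counted twice.
        edges[year].update(combinations(unique, 2))
    metrics = {}
    for year in sorted(nodes):
        n, m = len(nodes[year]), len(edges[year])
        max_edges = n * (n - 1) / 2
        metrics[year] = {
            "authors": n,
            "edges": m,
            # Density: share of possible author pairs that co-authored.
            "density": m / max_edges if max_edges else 0.0,
        }
    return metrics


series = coauthorship_metrics(papers)
for year, row in series.items():
    print(year, row)
```

Reading the resulting per-year values as a time series is one simple way to look for trends in collaboration. A real study would likely use a graph library and a richer set of metrics.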


  • Comparative text mining
  • Additive regularization of topic models
  • Social network analysis
  • Comparative graphs metrics
  • Text benchmarking



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Gazpromneft STC, Saint-Petersburg, Russia
  2. NEICON, Moscow, Russia
