Probabilistic Topic Modelling for Controlled Snowball Sampling in Citation Network Collection

  • Hennadii Dobrovolskyi
  • Nataliya Keberle
  • Olga Todoriko
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 786)


This paper applies a probabilistic topic model (PTM) to citation network collection. Snowball sampling is controlled by using the PTM to select the most relevant papers. The PTM is modified to handle collections of short texts and is built from the titles of the seed papers together with the titles of papers obtained through unrestricted snowball sampling. The objective of the research is to propose and experimentally verify the use of a short-text PTM to improve citation network collection. Preliminary analysis indicates that the method is robust: variations in the seed paper collection do not affect the subset of most influential papers in the collected citation network.
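The general idea of relevance-controlled snowball sampling can be illustrated with a minimal sketch. It is not the authors' implementation: the paper scores papers with a short-text PTM built from titles, whereas the `keyword_relevance` function below is a deliberately simple keyword-overlap placeholder. The toy citation graph, the function names, the vocabulary, the threshold and the depth limit are all assumptions made for illustration.

```python
from collections import deque


def controlled_snowball(seeds, neighbours, titles, relevance, threshold, max_depth=2):
    """Breadth-first snowball sampling that only follows relevant papers.

    `relevance` stands in for a topic-model score of a paper title
    (hypothetical interface, not taken from the paper).
    """
    collected = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand papers at the depth limit
        for other in neighbours.get(paper, ()):
            if other in collected:
                continue
            # Keep a linked paper only if its title scores above the threshold.
            if relevance(titles[other]) >= threshold:
                collected.add(other)
                queue.append((other, depth + 1))
    return collected


def keyword_relevance(title, vocabulary=frozenset({"citation", "network", "topic", "sampling"})):
    """Placeholder scorer: fraction of title words found in a fixed vocabulary."""
    words = title.lower().split()
    return sum(word in vocabulary for word in words) / len(words) if words else 0.0


# Toy data (invented): paper id -> title, and paper id -> linked papers.
titles = {
    "A": "Citation network analysis",
    "B": "Topic models for citation data",
    "C": "Protein folding dynamics",
    "D": "Snowball sampling in network studies",
    "E": "Citation network sampling",
}
neighbours = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": []}

result = controlled_snowball(["A"], neighbours, titles, keyword_relevance, threshold=0.3)
print(sorted(result))
```

In this toy graph, paper C is rejected as irrelevant, so paper E, which is reachable only through C, is never visited. Unrestricted snowball sampling corresponds to a threshold of 0, which accepts every linked paper up to the depth limit.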


Citation network, Snowball sampling, Text mining, Short text document, Topic modelling



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Hennadii Dobrovolskyi (1, corresponding author)
  • Nataliya Keberle (1)
  • Olga Todoriko (1)
  1. Department of Computer Science, Zaporizhzhya National University, Zaporizhzhya, Ukraine
