On Convergence of Controlled Snowball Sampling for Scientific Abstracts Collection

  • Conference paper
  • In: Information and Communication Technologies in Education, Research, and Industrial Applications (ICTERI 2018)

Abstract

This paper presents evidence for the convergence of controlled snowball sampling iterations applied to collecting seminal papers in a selected research domain. The iterations start with seed paper selection, plain snowball sampling and probabilistic topic modelling; then greedy controlled snowball sampling and analysis of the collected citation network are performed in rotation until the list of seminal papers becomes stable. The topic model is built from word-word co-occurrence probabilities by combining sparse symmetric nonnegative matrix factorization with principal component approximation. Experiments show that the number of topics in the model is determined in a natural way and that the Kullback-Leibler (KL) divergence provides an upper bound of the cosine similarity calculated from keywords assigned by publication authors. Several citation networks are collected and analysed. The analysis shows that all of the networks are “small worlds”, and therefore the observed saturation of the controlled snowball sampling can provide the complete set of publications in the domains of interest. Experiments with KL divergence, symmetric KL divergence and Jensen-Shannon divergence show that KL divergence produces a less connected citation network but provides better convergence of the snowball iterations. Multiple runs of the sampling confirm the hypothesis that the set of seminal publications is stable with respect to variations of the seed papers. The modified main path analysis makes it possible to distinguish the seminal papers, including new publications that follow the mainstream of research. A comparison of different ranking criteria shows that Search Path Count provides better lists of seminal papers than citation index, PageRank and indegree.
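As a rough illustration of the three divergences compared in the abstract, the following Python sketch computes KL divergence, a symmetric KL divergence and Jensen-Shannon divergence between two discrete topic distributions. It is not the authors' implementation: the function names, the smoothing constant `eps`, the choice of the sum form for the symmetric variant, and the example distributions are all assumptions made here for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Assumption: zero entries of q are floored at eps to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetric KL, taken here as KL(p || q) + KL(q || p) (an assumed convention)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions of a seed paper and a candidate paper.
seed_topics = [0.5, 0.5]
candidate_topics = [0.9, 0.1]

print("KL(seed || candidate):", kl_divergence(seed_topics, candidate_topics))
print("KL(candidate || seed):", kl_divergence(candidate_topics, seed_topics))
print("symmetric KL:", symmetric_kl(seed_topics, candidate_topics))
print("Jensen-Shannon:", jensen_shannon(seed_topics, candidate_topics))
```

Plain KL divergence is asymmetric, so swapping the two papers changes its value, whereas the other two measures do not depend on the order of their arguments. That difference may be one reason the measures produce differently connected citation networks, although the sketch itself does not demonstrate this.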

Notes

  1. The presented work is an extended version of [12], which is publicly available online at http://ceur-ws.org/Vol-2105/10000179.pdf.

  2. Google Scholar, https://scholar.google.com.ua/.

  3. Microsoft Academic, https://academic.microsoft.com/.

  4. Semantic Scholar, https://www.semanticscholar.org/.

  5. NetworkX, https://networkx.github.io.

  6. https://github.com/gendobr/snowball.

References

  1. Ahad, A., Fayaz, M., Shah, A.S.: Navigation through citation network based on content similarity using cosine similarity algorithm. Int. J. Database Theory Appl. 9(5), 9–20 (2016)

  2. Akavipat, R., Wu, L.S., Menczer, F., Maguitman, A.G.: Emerging semantic communities in peer web search. In: Proceedings of the International Workshop on Information Retrieval in Peer-to-Peer Networks, pp. 1–8. ACM (2006)

  3. Baez, M., Mirylenka, D., Parra, C.: Understanding and supporting search for scholarly knowledge. In: Proceeding of the 7th European Computer Science Summit, pp. 1–8 (2011)

  4. Barabási, A.L.: Scale-free networks: a decade and beyond. Science 325(5939), 412–413 (2009)

  5. Barbosa, M.W., Costa, M.M., Almeida, J.M., Almeida, V.A.: Using locality of reference to improve performance of peer-to-peer applications. In: ACM SIGSOFT Software Engineering Notes, vol. 29, pp. 216–227. ACM (2004)

  6. Batagelj, V.: Efficient algorithms for citation network analysis. arXiv preprint cs/0309023 (2003)

  7. Batagelj, V., Mrvar, A.: Pajek-program for large network analysis. Connections 21(2), 47–57 (1998)

  8. Beel, J., Gipp, B., Langer, S., Breitinger, C.: Paper recommender systems: a literature survey. Int. J. Digit. Librar. 17(4), 305–338 (2016)

  9. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)

  10. Crespo, A., Garcia-Molina, H.: Routing indices for peer-to-peer systems. In: Proceedings 22nd International Conference on Distributed Computing Systems, pp. 23–32. IEEE (2002)

  11. De Bruijn, N.G.: Asymptotic Methods in Analysis, vol. 4. Courier Corporation, Chelmsford (1981)

  12. Dobrovolskyi, H., Keberle, N.: Collecting the seminal scientific abstracts with topic modelling, snowball sampling and citation analysis. In: Proceedings of the 14th International Conference on ICT in Education, Research and Industrial Applications. Integration, Harmonization and Knowledge Transfer. Volume I: Main Conference, vol. 2105, pp. 179–192. CEUR-WS (2018)

  13. Dobrovolskyi, H., Keberle, N., Todoriko, O.: Probabilistic topic modelling for controlled snowball sampling in citation network collection. In: Różewski, P., Lange, C. (eds.) KESW 2017. CCIS, vol. 786, pp. 85–100. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69548-8_7

  14. Dong, R., Tokarchuk, L., Ma, A.: Digging friendship: paper recommendation in social network. In: Proceedings of Networking and Electronic Commerce Research Conference, NAEC 2009, pp. 21–28 (2009)

  15. Doulamis, N.D., Karamolegkos, P.N., Doulamis, A., Nikolakopoulos, I.: Exploiting semantic proximities for content search over P2P networks. Comput. Commun. 32(5), 814–827 (2009)

  16. Endres, D.M., Schindelin, J.E.: A new metric for probability distributions. IEEE Trans. Inf. Theory (2003)

  17. Ermolayev, V., Batsakis, S., Keberle, N., Tatarintseva, O., Antoniou, G.: Ontologies of time: review and trends. Int. J. Comput. Sci. Appl. 11(3) (2014)

  18. Even, S.: Graph Algorithms. Cambridge University Press, Cambridge (2011)

  19. Golumbic, M.C.: Algorithmic Graph Theory and Perfect Graphs, vol. 57. Elsevier, Amsterdam (2004)

  20. Gori, M., Pucci, A.: Research paper recommender systems: a random-walk based approach. In: IEEE/WIC/ACM International Conference on Web Intelligence, WI 2006, pp. 778–781. IEEE (2006)

  21. Hamilton, D.P., et al.: Publishing by–and for?–the numbers. Science 250(4986), 1331–1332 (1990)

  22. Huang, Z., Chung, W., Ong, T.H., Chen, H.: A graph-based recommender system for digital library. In: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 65–73. ACM (2002)

  23. Küçüktunç, O., Saule, E., Kaya, K., Çatalyürek, Ü.V.: Recommendation on academic networks using direction aware citation analysis. arXiv preprint arXiv:1205.1143 (2012)

  24. Lao, N., Cohen, W.W.: Relational retrieval using a combination of path-constrained random walks. Mach. Learn. 81(1), 53–67 (2010)

  25. Lecy, J.D., Beatty, K.E.: Representative literature reviews using constrained snowball sampling and citation network analysis (2012)

  26. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)

  27. Liang, Y., Li, Q., Qian, T.: Finding relevant papers based on citation relations. In: Wang, H., Li, S., Oyama, S., Hu, X., Qian, T. (eds.) WAIM 2011. LNCS, vol. 6897, pp. 403–414. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23535-1_35

  28. Lops, P., de Gemmis, M., Semeraro, G.: Content-based recommender systems: state of the art and trends. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 73–105. Springer, Boston, MA (2011). https://doi.org/10.1007/978-0-387-85820-3_3

  29. Lucio-Arias, D., Leydesdorff, L.: Main-path analysis and path-dependent transitions in histcite™-based historiograms. J. Assoc. Inf. Sci. Technol. 59(12), 1948–1962 (2008)

  30. MacKay, D.J.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)

  31. Mendenhall, W.M., Sincich, T.L., Boudreau, N.S.: Statistics for Engineering and the Sciences, Student Solutions Manual. Chapman and Hall/CRC, Boca Raton (2016)

  32. Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence. Random Struct. Algorithms 6(2–3), 161–180 (1995)

  33. Moya-Anegón, F., Vargas-Quesada, B., Herrero-Solana, V., Chinchilla-Rodríguez, Z., Corera-Álvarez, E., Munoz-Fernández, F.: A new technique for building maps of large scientific domains based on the cocitation of classes and categories. Scientometrics 61(1), 129–145 (2004)

  34. Newman, M.E.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. 98(2), 404–409 (2001)

  35. Newman, M.E.: Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. Sci. 101(Suppl. 1), 5200–5205 (2004)

  36. Nicolini, A.L., Lorenzetti, C.M., Maguitman, A.G., Chesñevar, C.I.: Intelligent algorithms for improving communication patterns in thematic P2P search. Inf. Proces. Manag. 53(2), 388–404 (2017)

  37. Nikulin, M.S.: Hellinger distance. In: Encyclopedia of Mathematics, vol. 78 (2001)

  38. Osborne, F., Motta, E.: Klink-2: integrating multiple web sources to generate semantic topic networks. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 408–424. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25007-6_24

  39. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP, vol. 14, pp. 1532–1543 (2014)

  40. Petticrew, M., Gilbody, S.: Planning and conducting systematic reviews. Health Psychol. Pract. 150–179 (2004)

  41. Pohl, S., Radlinski, F., Joachims, T.: Recommending related papers based on digital library access records. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 417–418. ACM (2007)

  42. Ráez, A.M., López, L.A.U., Steinberger, R.: Adaptive selection of base classifiers in one-against-all learning for large multi-labeled collections. In: Vicedo, J.L., Martínez-Barco, P., Muńoz, R., Saiz Noeda, M. (eds.) EsTAL 2004. LNCS (LNAI), vol. 3230, pp. 1–12. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30228-5_1

  43. Ricci, F., Rokach, L., Shapira, B.: Recommender systems: introduction and challenges. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 1–34. Springer, Boston, MA (2015). https://doi.org/10.1007/978-1-4899-7637-6_1

  44. Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations using respondent-driven sampling. Sociol. Methodol. 34(1), 193–240 (2004)

  45. Small, H.: Co-citation in the scientific literature: a new measure of the relationship between two documents. J. Am. Soc. Inf. Sci. 24(4), 265–269 (1973)

  46. de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515 (1965)

  47. Tan, P.N., et al.: Introduction to Data Mining. Pearson Education India, Delhi (2007)

  48. Trudeau, R.J.: Introduction to Graph Theory. Courier Corporation, Chelmsford (2013)

  49. Valenzuela, M., Ha, V., Etzioni, O.: Identifying meaningful citations. In: AAAI Workshop: Scholarly Big Data (2015)

  50. Varela, A.R., et al.: Mapping the historical development of physical activity and health research: a structured literature review and citation network analysis. Prev. Med. 111, 466–472 (2018)

  51. Vellino, A.: Usage-based vs. citation-based methods for recommending scholarly research articles. arXiv preprint arXiv:1303.7149 (2013)

  52. Vorontsov, K., Potapenko, A.: Tutorial on probabilistic topic modeling: additive regularization for stochastic matrix factorization. In: Ignatov, D.I., Khachay, M.Y., Panchenko, A., Konstantinova, N., Yavorskiy, R.E. (eds.) AIST 2014. CCIS, vol. 436, pp. 29–46. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12580-0_3

  53. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440 (1998)

  54. Woodruff, A., Gossweiler, R., Pitkow, J., Chi, E.H., Card, S.K.: Enhancing a digital book with a reading recommender. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 153–160. ACM (2000)

  55. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)

  56. Zeinalipour-Yazti, D., Kalogeraki, V., Gunopulos, D.: Information retrieval techniques for peer-to-peer networks. Comput. Sci. Eng. 6(4), 20–26 (2004)

  57. Zeinalipour-Yazti, D., Kalogeraki, V., Gunopulos, D.: Exploiting locality for scalable information retrieval in peer-to-peer networks. Inf. Syst. 30(4), 277–298 (2005)

  58. Zhou, D., et al.: Learning multiple graphs for document recommendations. In: Proceedings of the 17th International Conference on World Wide Web, pp. 141–150. ACM (2008)

  59. Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution for short and imbalanced texts. Knowl. Inf. Syst. 48(2), 379–398 (2016)

Acknowledgements

The authors would like to express their gratitude to anonymous reviewers whose comments and suggestions helped improve the paper.

Author information

Corresponding author

Correspondence to Hennadii Dobrovolskyi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Dobrovolskyi, H., Keberle, N. (2019). On Convergence of Controlled Snowball Sampling for Scientific Abstracts Collection. In: Ermolayev, V., Suárez-Figueroa, M., Yakovyna, V., Mayr, H., Nikitchenko, M., Spivakovsky, A. (eds) Information and Communication Technologies in Education, Research, and Industrial Applications. ICTERI 2018. Communications in Computer and Information Science, vol 1007. Springer, Cham. https://doi.org/10.1007/978-3-030-13929-2_2

  • DOI: https://doi.org/10.1007/978-3-030-13929-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-13928-5

  • Online ISBN: 978-3-030-13929-2

  • eBook Packages: Computer Science, Computer Science (R0)
