Knowledge discovery through text-based similarity searches for astronomy literature

  • Wolfgang E. Kerzendorf


The growing number of researchers, together with the ease of publishing and distributing scientific papers that technological advances have brought, has resulted in a dramatic increase in astronomy literature. As a likely consequence, the body of literature has become too large for traditional human consumption, and related and crucial knowledge goes undiscovered by researchers. Alongside this increased production of astronomical literature, recent decades have also brought several advances in computational linguistics. In particular, machine-aided processing of the literature might make it possible to convert this stream of papers into a coherent knowledge set. In this paper, we present the application of computational linguistics techniques to astronomy literature. Specifically, we developed a tool that finds articles similar to an input paper based purely on text content. We find that our technique performs robustly in comparison with other tools that recommend articles given a reference paper (known as recommender systems). Our novel tool shows the power of combining computational linguistics with astronomy literature and suggests that additional research in this direction will likely produce even better tools to help researchers cope with the vast amounts of knowledge being produced.


Natural language processing methods: statistical 



The author was supported by an ESO Fellowship and the Excellence Cluster Universe, Technische Universität München, Boltzmannstrasse 2, D-85748 Garching, Germany. He would also like to thank Felix Stoehr and Jason Spyromilio for detailed discussions, encouragement and suggestions. The support from the library team (Uta Grothkopf, Dominic Bordelon and Silvia Meakins) was invaluable for gaining insight into the field of knowledge discovery. Christine Borgman and Bernie Randles (at UCLA) gave suggestions from an Information Sciences point of view, and the visit would not have been possible without the generosity of the UCLA Galactic Center Group (especially Tuan Do and Andrea Ghez). He also thanks Bruno Leibundgut, Katharina Immer and Ivan Cabrera-Ziri for testing the algorithm on some well-known papers. Finally, he would like to thank Hinrich Schütze for useful discussions of tools and techniques in the NLP field.



Copyright information

© Indian Academy of Sciences 2019

Authors and Affiliations

  1. Center for Cosmology and Particle Physics, New York University, New York, USA
  2. European Southern Observatory, Garching, Germany
