Document Analysis and Retrieval Tasks in Scientific Digital Libraries

  • Sujatha Das Gollapalli
  • Cornelia Caragea
  • Xiaoli Li
  • C. Lee Giles
Part of the Communications in Computer and Information Science book series (CCIS, volume 505)


Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive tasks related to documents such as categorization and entity extraction. Consequently, the application of machine learning techniques for various large-scale IR tasks has gathered significant research interest in both the ML and IR communities. This tutorial provides a reference summary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access, scientific digital libraries such as CiteSeer\(^x\), which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address these challenges.


Classification, Focused crawling, PageRank, Citations, Topic modeling, Information extraction



Copyright information

© Springer International Publishing Switzerland 2015

Open Access. This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License, which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Sujatha Das Gollapalli (1), Email author
  • Cornelia Caragea (2)
  • Xiaoli Li (1)
  • C. Lee Giles (3)

  1. Institute for Infocomm Research, Agency for Science and Technology Research, Singapore, Singapore
  2. Computer Science and Engineering, University of North Texas, Denton, USA
  3. Information Sciences and Technology, Computer Science and Engineering, The Pennsylvania State University, State College, USA
