A statistical approach for modeling inter-document semantic relationships in digital libraries

Journal of Intelligent Information Systems

Abstract

E-learning repositories and digital libraries are fast becoming important sources of information and learning material. Such systems must therefore provide services that support the learning needs of their users. When a retrieval system shows how its documents relate to each other semantically, users are free to choose among different materials and can direct their study in a focused manner. This calls for a model that identifies types of document relationships, each addressing a different aspect of learning. This article defines three such types and a unique statistical model that can automatically identify them in technical and scientific documents. The model defines measures that quantify the degree of relatedness based on distinct statistical patterns exhibited by the terms common to a pair of documents. The approach does not strictly require a knowledge base or hypertext to identify the characteristic relationship between two documents. Such a statistical model can be extended to further relatedness types and used alongside other techniques in digital library recommendation engines. Our experiments over a large number of technical documents show that our techniques effectively extract the different types of relationships between documents.
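To make the idea of statistics over common terms concrete, the following is a minimal Python sketch. It does not reproduce the measures defined in the article, which are not given in this abstract. The function names `term_counts` and `common_term_profile`, the regular-expression tokenizer, and the choice of Jaccard overlap and cosine similarity are all assumptions made here for illustration only.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count alphabetic tokens.

    Assumption: a crude stand-in for real term extraction.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def common_term_profile(doc_a, doc_b):
    """Return the shared terms of two documents and two simple statistics.

    Hypothetical helper: these are generic overlap measures,
    not the relatedness measures defined in the article.
    """
    counts_a, counts_b = term_counts(doc_a), term_counts(doc_b)
    common = sorted(set(counts_a) & set(counts_b))
    union = set(counts_a) | set(counts_b)
    # Jaccard overlap: share of the combined vocabulary used by both documents.
    jaccard = len(common) / len(union) if union else 0.0
    # Cosine similarity of raw term-frequency vectors; only common terms
    # contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in common)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return {"common_terms": common, "jaccard": jaccard, "cosine": cosine}


if __name__ == "__main__":
    profile = common_term_profile(
        "binary search tree insertion",
        "binary tree traversal",
    )
    print(profile["common_terms"], profile["jaccard"], profile["cosine"])
```

Scores of this kind only say how much two documents overlap. Telling different relationship types apart, as the article sets out to do, would require looking at how the common terms are distributed within each document, which this sketch does not attempt.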


Author information

Correspondence to Vidhya Balasubramanian.

About this article

Cite this article

Muralikumar, J., Seelan, S.A., Vijayakumar, N. et al. A statistical approach for modeling inter-document semantic relationships in digital libraries. J Intell Inf Syst 48, 477–498 (2017). https://doi.org/10.1007/s10844-016-0423-6

  • DOI: https://doi.org/10.1007/s10844-016-0423-6
