Practical maintenance of evolving metadata for digital preservation: algorithmic solution and system support

  • REGULAR PAPER
International Journal on Digital Libraries

Abstract

Metadata (i.e., data about data) of digital objects plays an important role in digital libraries and archives, and thus its quality needs to be maintained well. However, as digital objects evolve over time, their associated metadata evolves as well, causing consistency issues. Since many functionalities of applications that contain digital objects (e.g., digital libraries, public image repositories) are based on metadata, evolving metadata directly affects the quality of such applications. To make matters worse, modern data applications are often large-scale (having millions of digital objects) and are constructed by software agents or crawlers (thus often having automatically populated and erroneous metadata). In such an environment, it is challenging to quickly and accurately identify evolving metadata and fix it (if needed) while applications keep running. Despite the importance and implications of the problem, conventional solutions have been very limited. Most existing metadata-related approaches either focus on the model and semantics of metadata, or simply keep an authority file of some sort for evolving metadata, and never fully exploit its potential usage from the system point of view. In contrast, the question that we raise in this paper is: “when millions of digital objects and their metadata are given, (1) how can evolving metadata be identified quickly in various contexts, and (2) once the evolving metadata is identified, how can it be incorporated into the system?” The significance of this paper is that we investigate a scalable algorithmic solution for the identification of evolving metadata, emphasize the role of “systems” in maintenance, and argue that “systems” must keep track of metadata changes proactively and leverage the learned knowledge in their various services.

Author information

Corresponding author

Correspondence to Dongwon Lee.

About this article

Cite this article

Lee, D. Practical maintenance of evolving metadata for digital preservation: algorithmic solution and system support. Int J Digit Libr 6, 313–326 (2007). https://doi.org/10.1007/s00799-007-0014-9

  • DOI: https://doi.org/10.1007/s00799-007-0014-9
