Automated Template-Based Metadata Extraction Architecture

  • Conference paper
Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers (ICADL 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4822)

Abstract

This paper describes our efforts to develop a toolset and process for automated metadata extraction from large, diverse, and evolving document collections. A number of federal agencies, universities, laboratories, and companies are placing their collections online and making them searchable via metadata fields such as author, title, and publishing organization. Manually creating metadata for a large collection is extremely time-consuming, yet the task is difficult to automate, particularly for collections whose documents vary widely in layout and structure. Our automated process enables many more documents to be made available online than time and cost constraints would otherwise allow. We describe our architecture and implementation and illustrate the effectiveness of the toolset with experimental results on two major collections: DTIC (Defense Technical Information Center) and NASA (National Aeronautics and Space Administration).

Editor information

Dion Hoe-Lian Goh, Tru Hoang Cao, Ingeborg Torvik Sølvberg, Edie Rasmussen

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Flynn, P., Zhou, L., Maly, K., Zeil, S., Zubair, M. (2007). Automated Template-Based Metadata Extraction Architecture. In: Goh, D.HL., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds) Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. ICADL 2007. Lecture Notes in Computer Science, vol 4822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77094-7_42

  • DOI: https://doi.org/10.1007/978-3-540-77094-7_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77093-0

  • Online ISBN: 978-3-540-77094-7

  • eBook Packages: Computer Science, Computer Science (R0)
