Automated Template-Based Metadata Extraction Architecture

Flynn, Paul; Zhou, Li; Maly, Kurt; Zeil, Steven; Zubair, Mohammad

doi:10.1007/978-3-540-77094-7_42

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4822)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

This paper describes our efforts to develop a toolset and process for automated metadata extraction from large, diverse, and evolving document collections. A number of federal agencies, universities, laboratories, and companies are placing their collections online and making them searchable via metadata fields such as author, title, and publishing organization. Manually creating metadata for a large collection is extremely time-consuming, yet the task is difficult to automate, particularly for collections whose documents have diverse layouts and structures. Our automated process makes many more documents available online than time and cost constraints would otherwise allow. We describe our architecture and implementation and illustrate the effectiveness of the toolset with experimental results on two major collections: DTIC (Defense Technical Information Center) and NASA (National Aeronautics and Space Administration).

Author information

Authors and Affiliations

Department of Computer Science, Old Dominion University, Norfolk, VA 23529
Paul Flynn, Li Zhou, Kurt Maly, Steven Zeil & Mohammad Zubair

Editor information

Dion Hoe-Lian Goh, Tru Hoang Cao, Ingeborg Torvik Sølvberg, Edie Rasmussen

About this paper

Cite this paper

Flynn, P., Zhou, L., Maly, K., Zeil, S., Zubair, M. (2007). Automated Template-Based Metadata Extraction Architecture. In: Goh, D.HL., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds) Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. ICADL 2007. Lecture Notes in Computer Science, vol 4822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77094-7_42

DOI: https://doi.org/10.1007/978-3-540-77094-7_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77093-0
Online ISBN: 978-3-540-77094-7
eBook Packages: Computer Science, Computer Science (R0)
