Business Specific Online Information Extraction from German Websites

Lee, Yeong Su; Geierhos, Michaela

doi:10.1007/978-3-642-00382-0_30

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper presents a system that uses the domain name of a German business website to locate its information pages (e.g. company profile, contact page, imprint) and then identifies business-specific information on them. We concentrate on extracting characteristic vocabulary such as company names, addresses, contact details, and CEOs. Above all, we interpret the HTML structure of the documents and analyze contextual facts to transform the unstructured web pages into structured forms. Our approach is robust to variability in the DOM, is upgradeable, and keeps data up to date. The evaluation experiments show high efficiency of information access to the generated data. The technique can be adapted to non-German websites with slight language-specific modifications, and experimental results on real-life websites confirm the feasibility of the approach.
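
To make the two stages named in the abstract more concrete (locating information pages from a site's start page, then pulling a structured field out of their text), the following is a minimal sketch and not the authors' implementation. The keyword list, the postal-code pattern, the function names, and the example.de URL are all assumptions introduced for illustration; the paper's actual system interprets the DOM and contextual facts in ways this sketch does not attempt.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed keyword list: typical German anchor texts for information pages.
INFO_PAGE_KEYWORDS = ("impressum", "kontakt", "über uns", "unternehmen")


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_info_pages(base_url, html):
    """Return absolute URLs of links whose anchor text matches a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        urljoin(base_url, href)
        for href, text in parser.links
        if any(keyword in text.lower() for keyword in INFO_PAGE_KEYWORDS)
    ]


# Assumed pattern: a five-digit German postal code followed by one
# capitalized word. Multi-word city names are cut to their first word.
POSTCODE_CITY = re.compile(r"\b(\d{5})\s+([A-ZÄÖÜ][\w-]+)")


def extract_postcode_city(text):
    """Return (postal code, city) from free text, or None if absent."""
    match = POSTCODE_CITY.search(text)
    return (match.group(1), match.group(2)) if match else None
```

A call such as find_info_pages("http://www.example.de/", start_page_html) would yield candidate imprint and contact URLs, whose text could then be passed to extract_postcode_city. Matching on anchor text alone is a deliberate simplification; it misses image links and pages reachable only through scripts.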

Author information

Authors and Affiliations

CIS, University of Munich, Germany
Yeong Su Lee & Michaela Geierhos

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh

Cite this paper

Lee, Y.S., Geierhos, M. (2009). Business Specific Online Information Extraction from German Websites. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_30

DOI: https://doi.org/10.1007/978-3-642-00382-0_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)