Abstract
This paper presents a system that uses the domain name of a German business website to locate its information pages (e.g. company profile, contact page, imprint) and then identifies business-specific information on them. We concentrate on extracting characteristic vocabulary such as company names, addresses, contact details, and CEOs. Above all, we interpret the HTML structure of the documents and analyze contextual information to transform the unstructured web pages into structured forms. Our approach is robust to variability in the DOM, is upgradeable, and keeps the data up to date. The evaluation experiments show that the generated data can be accessed efficiently. The technique can be adapted to non-German websites with slight language-specific modifications, and experimental results on real-life websites confirm the feasibility of the approach.
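To illustrate the first step, locating the information pages of a site from its start page, the following is a minimal sketch and not the system described in the paper. It assumes that information pages can be recognized by keywords in the link URL or anchor text. The keyword list, the names `LinkCollector` and `find_info_pages`, and the example domain are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword cues per page type; the abstract does not state
# which cues the actual system uses.
INFO_PAGE_KEYWORDS = {
    "imprint": ("impressum", "imprint"),
    "contact": ("kontakt", "contact"),
    "profile": ("ueber-uns", "über uns", "unternehmen", "profil"),
}


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> element with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_info_pages(base_url, html):
    """Map each page type to the first link whose URL or text matches a keyword."""
    collector = LinkCollector()
    collector.feed(html)
    found = {}
    for href, text in collector.links:
        haystack = (href + " " + text).lower()
        for page_type, keywords in INFO_PAGE_KEYWORDS.items():
            if page_type not in found and any(k in haystack for k in keywords):
                found[page_type] = urljoin(base_url, href)
    return found


# Usage with an invented start page of a hypothetical domain.
sample = """
<html><body>
<a href="/produkte.html">Produkte</a>
<a href="/kontakt.html">Kontakt</a>
<a href="impressum.php">Impressum</a>
</body></html>
"""
pages = find_info_pages("http://www.example.de/", sample)
```

Here `pages` maps `"contact"` and `"imprint"` to absolute URLs, and `"profile"` is absent because no link matches. A real system would then fetch these pages and apply the HTML-structure and context analysis described above; that part is not sketched here.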
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lee, Y.S., Geierhos, M. (2009). Business Specific Online Information Extraction from German Websites. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_30
DOI: https://doi.org/10.1007/978-3-642-00382-0_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)