Business Specific Online Information Extraction from German Websites

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Abstract

This paper presents a system that uses the domain name of a German business website to locate its information pages (e.g. company profile, contact page, imprint) and then identifies business-specific information on them. We concentrate on extracting characteristic vocabulary such as company names, addresses, contact details, and CEOs. Above all, we interpret the HTML structure of the documents and analyze contextual facts to transform unstructured web pages into structured forms. Our approach is robust to variability in the DOM, is upgradeable, and keeps data up to date. The evaluation experiments show highly efficient information access to the generated data. The developed technique can be adapted to non-German websites with slight language-specific modifications, and experimental results on real-life websites confirm the feasibility of the approach.
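
The two steps named in the abstract (locating information pages from a site's start page, then turning their text into a structured record) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the keyword list `INFO_PAGE_KEYWORDS`, the helper names `find_info_pages` and `extract_fields`, the two regular expressions, and the `example.de` address are all hypothetical, and the patterns are far simpler than what a real extractor would need.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical anchor-text keywords for German information pages
# (imprint, contact, about us); not taken from the paper.
INFO_PAGE_KEYWORDS = ("impressum", "kontakt", "über uns")


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs while parsing an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_info_pages(base_url, html):
    """Return absolute URLs of links whose anchor text looks like an info page."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        urljoin(base_url, href)
        for href, text in parser.links
        if any(keyword in text.lower() for keyword in INFO_PAGE_KEYWORDS)
    ]


# Simplified patterns (assumptions): a five-digit German postal code followed
# by a capitalized city name on the same line, and a basic e-mail address.
POSTCODE_CITY = re.compile(r"\b(\d{5})\s+([A-ZÄÖÜ][\wäöüß.\- ]+)")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_fields(text):
    """Build a small structured record from the plain text of an info page."""
    record = {}
    match = POSTCODE_CITY.search(text)
    if match:
        record["postal_code"] = match.group(1)
        record["city"] = match.group(2).strip()
    match = EMAIL.search(text)
    if match:
        record["email"] = match.group(0)
    return record
```

For example, `find_info_pages("http://www.example.de/", start_page_html)` would keep only links labelled "Impressum", "Kontakt", or "Über uns", and `extract_fields` would then be applied to the text of each page found. Matching on anchor text rather than on fixed URL paths is one simple way to tolerate differences in site layout, though the abstract does not say how the actual system achieves its robustness.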

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lee, Y.S., Geierhos, M. (2009). Business Specific Online Information Extraction from German Websites. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_30

  • DOI: https://doi.org/10.1007/978-3-642-00382-0_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00381-3

  • Online ISBN: 978-3-642-00382-0

  • eBook Packages: Computer Science, Computer Science (R0)
