A Template-Based Information Extraction from Web Sites with Unstable Markup

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 475)

Abstract

This paper presents the results of work on crawling the CEUR Workshop Proceedings web site (http://ceur-ws.org) and converting it into a Linked Open Data (LOD) dataset, carried out within the ESWC 2014 Semantic Publishing Challenge (http://2014.eswc-conferences.org/semantic-publishing-challenge). Our approach is based on an extensible template-dependent crawler and uses DBpedia to link extracted entities, such as the names of universities and countries.
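As a rough illustration of what template-dependent extraction over unstable markup can look like, the sketch below tries a list of templates in order and then fuzzily matches an extracted name against candidate labels. It is not the authors' implementation: the regular expressions, the function names `extract_title` and `best_label`, and the 0.8 threshold are all assumptions made for this example. The similarity score comes from Python's `difflib.SequenceMatcher`, whose algorithm is similar to the gestalt pattern matching of Ratcliff and Metzener.

```python
import re
from difflib import SequenceMatcher

# Hypothetical templates: each regex targets one markup variant of a volume title.
TITLE_TEMPLATES = [
    re.compile(r'<span class="CEURFULLTITLE">(?P<title>.*?)</span>', re.S),
    re.compile(r"<h1>(?P<title>.*?)</h1>", re.S),
]


def extract_title(html):
    """Return the title from the first template that matches, else None."""
    for template in TITLE_TEMPLATES:
        match = template.search(html)
        if match:
            # Collapse runs of whitespace left over from the markup.
            return " ".join(match.group("title").split())
    return None


def best_label(name, labels, threshold=0.8):
    """Pick the label most similar to name, or None if below the threshold."""
    scored = [
        (SequenceMatcher(None, name.lower(), label.lower()).ratio(), label)
        for label in labels
    ]
    score, label = max(scored, default=(0.0, None))
    return label if score >= threshold else None
```

Supporting a new markup variant then means appending one more entry to `TITLE_TEMPLATES`, which is one way to read "extensible" here; a real crawler would more likely use an HTML parser than regular expressions.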


Notes

  1. ESWC 2014 Semantic Publishing Challenge, URL: http://2014.eswc-conferences.org/semantic-publishing-challenge.

  2. CEUR Workshop proceedings web site, URL: http://ceur-ws.org

  3. The source code and instructions, URL: https://github.com/ailabitmo/sempub challenge2014-task1.

  4. Grab framework, URL: http://grablib.org/.

  5. Semantic Web Conference Ontology, URL: http://data.semanticweb.org/ns/swc/ontology.

  6. Semantic Web for Research Communities, URL: http://ontoware.org/swrc/.

  7. The Bibliographic Ontology, URL: http://purl.org/ontology/bibo/.

  8. The Timeline Ontology, URL: http://purl.org/NET/c4dm/timeline.owl#.

  9. The Friend of a Friend (FOAF), URL: http://www.foaf-project.org/.

  10. Dublin Core, URL: http://purl.org/dc/elements/1.1/.

  11. DBpedia Ontology, URL: http://dbpedia.org/ontology/.

  12. RDF Schema, URL: http://www.w3.org/2000/01/rdf-schema#.

  13. PDFMiner, URL: http://www.unixuser.org/~euske/python/pdfminer/.

  14. DBLP, URL: http://www.informatik.uni-trier.de/~ley/db/.

  15. Semantic Web Dog Food, URL: http://data.semanticweb.org/.

References

  1. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Seman. Web J. (2014). http://www.semantic-web-journal.net/content/dbpedia-large-scale-multilingual-knowledge-base-extracted-wikipedia-0

  2. Ratcliff, J.W., Metzener, D.E.: Pattern matching: the gestalt approach. Dr. Dobb's J. (DDJ) 13(7), 1–46 (1988)

Acknowledgments

This work was partially financially supported by the Government of the Russian Federation, Grant #074-U01.

Corresponding author

Correspondence to Fedor Kozlov.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Kolchin, M., Kozlov, F. (2014). A Template-Based Information Extraction from Web Sites with Unstable Markup. In: Presutti, V., et al. Semantic Web Evaluation Challenge. SemWebEval 2014. Communications in Computer and Information Science, vol 475. Springer, Cham. https://doi.org/10.1007/978-3-319-12024-9_11

  • DOI: https://doi.org/10.1007/978-3-319-12024-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12023-2

  • Online ISBN: 978-3-319-12024-9

  • eBook Packages: Computer Science, Computer Science (R0)
