Self Training Wrapper Induction with Linked Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Abstract

This work explores the use of Linked Data for Web-scale Information Extraction, with a focus on the task of Wrapper Induction. We show how Linked Data can be used to generate training material automatically and to build a self-trained Wrapper Induction method. Experiments on a publicly available dataset demonstrate that, for covered domains, our method achieves an F-measure of 0.85, which is competitive with a supervised solution.
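
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, only one simple way such a self-trained wrapper could work under stated assumptions: each page is assumed to be already flattened into a mapping from XPath to text, the gazetteer is assumed to be a set of lower-cased labels obtained from a Linked Data source for the target attribute, and the names `induce_wrapper`, `apply_wrapper`, and `f_measure` are hypothetical. The F-measure is the usual harmonic mean of precision and recall.

```python
from collections import Counter


def induce_wrapper(pages, gazetteer):
    """Return the XPath whose values most often match the gazetteer.

    pages     -- list of dicts mapping XPath -> text (assumed representation)
    gazetteer -- set of lower-cased labels (assumed to come from Linked Data)
    """
    hits = Counter()
    for page in pages:
        for xpath, text in page.items():
            # A gazetteer match acts as an automatically generated annotation.
            if text.strip().lower() in gazetteer:
                hits[xpath] += 1
    if not hits:
        return None  # the gazetteer does not cover this website
    return hits.most_common(1)[0][0]


def apply_wrapper(pages, xpath):
    """Extract the value at the induced XPath from every page."""
    return [page.get(xpath) for page in pages]


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over aligned value lists."""
    true_pos = sum(1 for p, g in zip(predicted, gold) if p is not None and p == g)
    n_pred = sum(1 for p in predicted if p is not None)
    n_gold = sum(1 for g in gold if g is not None)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: three book pages from one hypothetical website.
pages = [
    {"/html/body/h1": "Dune", "/html/body/div[1]/span": "Frank Herbert"},
    {"/html/body/h1": "Emma", "/html/body/div[1]/span": "Jane Austen"},
    {"/html/body/h1": "Ulysses", "/html/body/div[1]/span": "James Joyce"},
]
# The gazetteer knows only two of the three titles.
title_gazetteer = {"dune", "emma"}

wrapper = induce_wrapper(pages, title_gazetteer)
extracted = apply_wrapper(pages, wrapper)
gold = ["Dune", "Emma", "Ulysses"]
score = f_measure(extracted, gold)
```

In this toy case the gazetteer covers only two titles, yet the induced XPath also extracts the third, which is the point of learning a wrapper rather than matching labels directly.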

Part of this research has been sponsored by the EPSRC-funded project LODIE: Linked Open Data for Information Extraction, EP/J019488/1.




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Gentile, A.L., Zhang, Z., Ciravegna, F. (2014). Self Training Wrapper Induction with Linked Data. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_35

  • DOI: https://doi.org/10.1007/978-3-319-10816-2_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10815-5

  • Online ISBN: 978-3-319-10816-2

  • eBook Packages: Computer Science, Computer Science (R0)
