Abstract
This work explores the use of Linked Data for Web-scale Information Extraction, with a focus on the task of Wrapper Induction. We show how to use Linked Data effectively to generate training material automatically and to build a self-trained Wrapper Induction method. Experiments on a publicly available dataset demonstrate that, for covered domains, our method can achieve an F-measure of 0.85, which is competitive with a supervised solution.
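The general idea of training a wrapper from Linked Data can be illustrated with a toy example. The sketch below is not the authors' method: it assumes a small hand-written gazetteer standing in for values retrieved from a Linked Data source, a handful of well-formed pages from one hypothetical site, and a simple voting scheme over element paths. All names (`GAZETTEER`, `induce_wrapper`, `apply_wrapper`) are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical book titles, standing in for values fetched from Linked Data.
GAZETTEER = {"Dune", "Emma", "Ulysses"}

# Hypothetical pages from a single site that share one template.
TRAINING_PAGES = [
    "<html><body><h1>Dune</h1><p>Frank Herbert</p></body></html>",
    "<html><body><h1>Emma</h1><p>Jane Austen</p></body></html>",
    # Noisy page: a gazetteer value appears outside the title slot.
    "<html><body><h1>Sale</h1><p>Ulysses</p></body></html>",
]
NEW_PAGE = "<html><body><h1>Solaris</h1><p>Stanislaw Lem</p></body></html>"


def text_nodes(element, path):
    """Yield (path, text) for every element that carries non-empty text."""
    text = (element.text or "").strip()
    if text:
        yield path, text
    seen = Counter()  # position of each child among same-tag siblings
    for child in element:
        seen[child.tag] += 1
        yield from text_nodes(child, f"{path}/{child.tag}[{seen[child.tag]}]")


def page_nodes(markup):
    """Parse one page and list its (path, text) pairs."""
    root = ET.fromstring(markup)
    return list(text_nodes(root, f"/{root.tag}"))


def induce_wrapper(pages, gazetteer):
    """Return the path whose text most often matches a gazetteer entry."""
    votes = Counter()
    for markup in pages:
        for path, text in page_nodes(markup):
            if text in gazetteer:  # automatic, possibly noisy, annotation
                votes[path] += 1
    return votes.most_common(1)[0][0] if votes else None


def apply_wrapper(markup, wrapper):
    """Extract the text found at the induced path on an unseen page."""
    return [text for path, text in page_nodes(markup) if path == wrapper]


wrapper = induce_wrapper(TRAINING_PAGES, GAZETTEER)
print(wrapper)                           # /html/body[1]/h1[1]
print(apply_wrapper(NEW_PAGE, wrapper))  # ['Solaris']
```

The induced path extracts a title that is absent from the gazetteer, which is the point of learning a wrapper rather than matching known values directly. The noisy third page shows why voting across pages matters: a single spurious match does not outweigh the consistent ones.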
Part of this research has been sponsored by the EPSRC-funded project LODIE: Linked Open Data for Information Extraction, EP/J019488/1.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Gentile, A.L., Zhang, Z., Ciravegna, F. (2014). Self Training Wrapper Induction with Linked Data. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_35
DOI: https://doi.org/10.1007/978-3-319-10816-2_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science, Computer Science (R0)