Abstract
Unsupervised information extraction has been studied extensively over the past decade. However, little attention has been paid to wrapper maintenance. In this paper, we study the wrapper construction and verification problem, given a schema and template induced by an unsupervised page-level wrapper induction system. We model the verification problem as a constraint satisfaction problem (CSP) for leaf node label assignment, subject to constraints specified by a finite state machine (FSM) that is constructed from the previously learned schema and template. If the CSP has no solution, i.e., no valid label sequence exists, we say the test page fails verification; otherwise, we rank all valid label sequences by measuring the fitness of each label sequence for extraction. We evaluate the FSM-based approach and XML validation by false positive rate and false negative rate, and measure extraction performance by extraction accuracy. The experimental results show that the proposed method effectively filters invalid pages (zero false positive rate) and ranks the correct label sequence highest with 96.5% accuracy.
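The verification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy FSM, the state names (`q0` to `q3`), the labels (`title`, `name`, `price`) and the function names are all hypothetical, chosen only to show how assigning labels to leaf nodes under FSM constraints can be treated as a search for valid label sequences. The fitness-based ranking is omitted because the abstract does not specify the fitness measure.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical FSM encoding: (current state, label) -> next state.
Transitions = Dict[Tuple[str, str], str]


def valid_label_sequences(
    candidates: List[Set[str]],
    transitions: Transitions,
    start: str,
    accepting: Set[str],
) -> List[List[str]]:
    """Enumerate every label sequence the FSM accepts.

    candidates[i] is the set of labels leaf node i may take,
    i.e. the domain of the i-th CSP variable.
    """
    results: List[List[str]] = []

    def extend(index: int, state: str, prefix: List[str]) -> None:
        if index == len(candidates):
            # All leaf nodes are labeled; keep the sequence only if
            # the FSM ends in an accepting state.
            if state in accepting:
                results.append(list(prefix))
            return
        for label in sorted(candidates[index]):
            next_state = transitions.get((state, label))
            if next_state is not None:  # the FSM allows this label here
                prefix.append(label)
                extend(index + 1, next_state, prefix)
                prefix.pop()  # backtrack

    extend(0, start, [])
    return results


def passes_verification(
    candidates: List[Set[str]],
    transitions: Transitions,
    start: str,
    accepting: Set[str],
) -> bool:
    """A page passes if at least one valid label sequence exists."""
    return bool(valid_label_sequences(candidates, transitions, start, accepting))


# Toy template (assumed): a title followed by one or more (name, price) records.
toy_fsm: Transitions = {
    ("q0", "title"): "q1",
    ("q1", "name"): "q2",
    ("q2", "price"): "q3",
    ("q3", "name"): "q2",
}

good_page = [{"title"}, {"name", "title"}, {"price"}, {"name"}, {"name", "price"}]
bad_page = [{"title"}, {"price"}]

print(valid_label_sequences(good_page, toy_fsm, "q0", {"q3"}))
print(passes_verification(bad_page, toy_fsm, "q0", {"q3"}))
```

In this sketch, a leaf node with several candidate labels widens the search, and the FSM transitions prune assignments that the template cannot produce. A page for which the search returns nothing corresponds to a page that fails verification.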
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, CH., Lin, YL., Lin, KC., Kayed, M. (2013). Page-Level Wrapper Verification for Unsupervised Web Data Extraction. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_38
DOI: https://doi.org/10.1007/978-3-642-41230-1_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41229-5
Online ISBN: 978-3-642-41230-1
eBook Packages: Computer Science (R0)