Abstract
Web content extraction aims to separate web content from the web pages that carry it, since that content is organized and presented through different HTML templates and is surrounded by various other information. Little is known about template structures and noise information before extraction, and page templates vary widely, which makes it very challenging to guarantee both extraction precision and extraction adaptability. This study proposes an effective web content extraction method for various web environments. To ensure extraction performance, we exploit three kinds of characteristics: visual text information, content semantics (instead of HTML tag semantics), and web page structures. These characteristics are then integrated into an extraction framework that makes extraction decisions for different websites. Comparative experiments on multiple websites against two popular extraction methods, CETR and CETD, show that the proposed method outperforms CETR on precision while keeping the same advantage on recall, and gains a 4% improvement over CETD on the average F1-score. In particular, our method provides better extraction performance than CETD on short content and shows better extraction adaptability.
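The precision, recall, and F1-score comparisons above can be made concrete with a minimal sketch. It assumes a token-level comparison between extracted text and a hand-labelled gold text; the abstract does not state the evaluation protocol actually used, so the function name `token_prf1` and the whitespace tokenization are illustrative assumptions only.

```python
from collections import Counter
from typing import Tuple


def token_prf1(extracted: str, gold: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over whitespace tokens.

    Hypothetical helper: the token-level protocol is an assumption,
    not the evaluation procedure described by the paper.
    """
    extracted_tokens = Counter(extracted.split())
    gold_tokens = Counter(gold.split())

    # Multiset intersection: each token counts at most as often as it
    # appears in both the extracted text and the gold text.
    overlap = sum((extracted_tokens & gold_tokens).values())

    n_extracted = sum(extracted_tokens.values())
    n_gold = sum(gold_tokens.values())

    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # The extractor returned the full gold text plus two noise tokens.
    p, r, f = token_prf1("Home Login breaking news today", "breaking news today")
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this sketch, an extractor that returns surrounding noise along with the content keeps full recall but loses precision, which is the trade-off the precision and recall comparison is about.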
Acknowledgments
This work is funded by the National Natural Science Foundation of China (No. 61363005, 61462017, U1501252), the Guangxi Natural Science Foundation of China (No. 2014GXNSFAA118353, 2014GXNSFAA118390), the Guangxi Key Laboratory of Automatic Detection Technology and Instrument Foundation (YQ15110), and the Guangxi Cooperative Innovation Center of Cloud Computing and Big Data.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wang, Q., Yang, Q., Zhang, J., Zhou, R., Zhang, Y. (2017). Extracting Web Content by Exploiting Multi-Category Characteristics. In: Bouguettaya, A., et al. Web Information Systems Engineering – WISE 2017. WISE 2017. Lecture Notes in Computer Science, vol 10570. Springer, Cham. https://doi.org/10.1007/978-3-319-68786-5_19
DOI: https://doi.org/10.1007/978-3-319-68786-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68785-8
Online ISBN: 978-3-319-68786-5
eBook Packages: Computer Science, Computer Science (R0)