Extracting Web Content by Exploiting Multi-Category Characteristics

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10570)

Abstract

Web content extraction aims to separate the main content of a web page from everything around it, since that content is organized and presented through different HTML templates and is surrounded by various kinds of noise. Little is known about template structures and noise before extraction, and page templates vary widely, which makes it challenging to guarantee both extraction precision and extraction adaptability. This study proposes an effective web content extraction method for varied web environments. To ensure extraction performance, we exploit three kinds of characteristics: visual text information, content semantics (instead of HTML tag semantics), and web page structure. These characteristics are integrated into an extraction framework that makes extraction decisions for different websites. Comparative experiments on multiple websites against two popular extraction methods, CETR and CETD, show that the proposed method outperforms CETR on precision while retaining the same advantage on recall, and gains a 4% improvement over CETD in average F1-score. In particular, our method extracts short content better than CETD and shows better extraction adaptability.
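
The abstract reports precision, recall and F1-score without saying how they are computed. As a rough illustration only, the sketch below scores an extraction against a hand-labelled reference by token overlap, which is one common convention in content-extraction evaluation. The function name `prf1`, the whitespace tokenization and the sample strings are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from typing import Tuple


def prf1(extracted: str, gold: str) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 of extracted text against reference text.

    Assumption: tokens are whitespace-separated words; a real evaluation may
    use a language-specific tokenizer instead.
    """
    ext_tokens = Counter(extracted.split())
    gold_tokens = Counter(gold.split())
    # Multiset intersection: a token counts as often as it occurs in both texts.
    overlap = sum((ext_tokens & gold_tokens).values())
    n_ext = sum(ext_tokens.values())
    n_gold = sum(gold_tokens.values())
    precision = overlap / n_ext if n_ext else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: the extractor kept two navigation words as noise
    # and missed one word of the real content.
    p, r, f = prf1("breaking news story text home login",
                   "breaking news story text continues")
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this convention, noise left in the output lowers precision, while missed content lowers recall. F1 is the harmonic mean of the two, so it penalizes an extractor that is strong on only one of them.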

References

  1. Gupta, S., Kaiser, G., Neistadt, D., Grimm, P.: DOM-based content extraction of HTML documents. In: Proceedings of the 12th International Conference on World Wide Web, WWW 2003, pp. 207–214. ACM, New York (2003)

  2. Gupta, S., Kaiser, G.E., Grimm, P., Chiang, M.F., Starren, J.: Automating content extraction of HTML documents. World Wide Web 8(2), 179–224 (2005)

  3. Zhang, J., Zhang, C., Qian, W., Zhou, A.: Automatic Extraction Rules Generation Based on XPath Pattern Learning. In: Chiu, D.K.W., Bellatreche, L., Sasaki, H., Leung, H., Cheung, S.-C., Hu, H., Shao, J. (eds.) WISE 2010. LNCS, vol. 6724, pp. 58–69. Springer, Heidelberg (2011). doi:10.1007/978-3-642-24396-7_6

  4. Alam, H., Rahman, A.F.R., Hartono, R.: Content extraction from HTML documents. In: Proceedings of 1st International Workshop on Web Document Analysis, WDA 2001 (2001)

  5. Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, VLDB 2001, pp. 109–118. Morgan Kaufmann Publishers Inc., San Francisco (2001)

  6. Furche, T., Guo, J., Maneth, S., Schallhart, C.: Robust and noise resistant wrapper induction. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD 2016, pp. 773–784. ACM, New York (2016)

  7. Reis, D.C., Golgher, P.B., Silva, A.S., Laender, A.F.: Automatic web news extraction using tree edit distance. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004, pp. 502–511. ACM, New York (2004)

  8. Wu, G., Li, L., Hu, X., Wu, X.: Web news extraction via path ratios. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM 2013, pp. 2059–2068. ACM, New York (2013)

  9. Wu, G.-Q., Li, L., Li, L., Wu, X.: Web news extraction via tag path feature fusion using DS theory. J. Comput. Sci. Technol. 31(4), 661–672 (2016)

  10. Cai, D., Yu, S., Wen, J.-R., Ma, W.-Y.: Extracting Content Structure for Web Pages Based on Visual Representation. In: Zhou, X., Orlowska, M.E., Zhang, Y. (eds.) APWeb 2003. LNCS, vol. 2642, pp. 406–417. Springer, Heidelberg (2003). doi:10.1007/3-540-36901-5_42

  11. Song, R., Liu, H., Wen, J.-R., Ma, W.-Y.: Learning block importance models for web pages. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004, pp. 203–211. ACM, New York (2004)

  12. Fernandes, D., de Moura, E.S., Ribeiro-Neto, B., da Silva, A.S., Gonçalves, M.A.: Computing block importance for searching on web sites. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, pp. 165–174. ACM, New York (2007)

  13. Sun, F., Song, D., Liao, L.: DOM based content extraction via text density. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 245–254. ACM, New York (2011)

  14. Qureshi, P.A.R., Memon, N.: Hybrid model of content extraction. J. Comput. Syst. Sci. 78(4), 1248–1257 (2012)

  15. Peters, M.E., Lecocq, D.: Content extraction using diverse feature sets. In: Proceedings of the 22nd International Conference on World Wide Web, WWW 2013 Companion, pp. 89–90. ACM, New York (2013)

  16. Ortona, S., Orsi, G., Buoncristiano, M., Furche, T.: Wadar: joint wrapper and data repair. Proc. VLDB Endow. 8(12), 1996–1999 (2015)

  17. Weninger, T., Hsu, W.H., Han, J.: CETR: content extraction via tag ratios. In: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, pp. 971–980. ACM, New York (2010)

  18. Uzun, E., Agun, H.V., Yerlikaya, T.: A hybrid approach for extracting informative content from web pages. Inf. Process. Manage. 49(4), 928–944 (2013)

  19. Weninger, T., Palacios, R., Crescenzi, V., Gottron, T., Merialdo, P.: Web content extraction: a metaanalysis of its past and thoughts on its future. SIGKDD Explor. Newsl. 17(2), 17–23 (2016)

  20. Jsoup. https://jsoup.org/

  21. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC 2002, pp. 380–388. ACM, New York (2002)

  22. Huaping, Z.: Nlpir. http://ictclas.nlpir.org/

Acknowledgments

This work is funded by the National Natural Science Foundation of China (No. 61363005, 61462017, U1501252), the Guangxi Natural Science Foundation of China (No. 2014GXNSFAA118353, 2014GXNSFAA118390), the Guangxi Key Laboratory of Automatic Detection Technology and Instrument Foundation (YQ15110), and the Guangxi Cooperative Innovation Center of Cloud Computing and Big Data.

Author information

Correspondence to Jingwei Zhang.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Wang, Q., Yang, Q., Zhang, J., Zhou, R., Zhang, Y. (2017). Extracting Web Content by Exploiting Multi-Category Characteristics. In: Bouguettaya, A., et al. Web Information Systems Engineering – WISE 2017. Lecture Notes in Computer Science, vol 10570. Springer, Cham. https://doi.org/10.1007/978-3-319-68786-5_19

  • DOI: https://doi.org/10.1007/978-3-319-68786-5_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68785-8

  • Online ISBN: 978-3-319-68786-5

  • eBook Packages: Computer Science (R0)
