
Web2Text: Deep Structured Boilerplate Removal

  • Thijs Vogels
  • Octavian-Eugen Ganea
  • Carsten Eickhoff (corresponding author)
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10772)

Abstract

Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To this end, we introduce a novel model that performs sequence labeling to collectively classify all text blocks in an HTML page as either boilerplate or main content. Our method uses a hidden Markov model on top of potentials derived from DOM tree features using convolutional neural networks. The proposed method sets a new state of the art for boilerplate removal on the CleanEval benchmark. As a component of information retrieval pipelines, it improves retrieval performance on the ClueWeb12 collection.
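
To illustrate what collective classification of text blocks can look like, the following is a minimal sketch of Viterbi decoding over a chain of blocks with two labels. It is not the authors' implementation. The log-scores in `unary` and `pairwise` are invented stand-ins for what the neural networks would produce, and sharing one `pairwise` table across all positions is a simplifying assumption made here.

```python
from typing import List, Sequence

LABELS = ("boilerplate", "content")  # the two states named in the abstract


def viterbi(unary: Sequence[Sequence[float]],
            pairwise: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a chain of text blocks.

    unary[t][k]    : log-score for assigning label k to block t
    pairwise[j][k] : log-score for label j on one block followed by label k
                     on the next (assumed position-independent in this sketch)
    """
    n = len(unary)
    if n == 0:
        return []
    k = len(unary[0])
    score = [list(unary[0])]  # score[t][s]: best score of any path ending in s at t
    back = []                 # back[t-1][s]: best predecessor of state s at block t
    for t in range(1, n):
        row, ptr = [], []
        for cur in range(k):
            best_prev = max(range(k),
                            key=lambda p: score[t - 1][p] + pairwise[p][cur])
            row.append(score[t - 1][best_prev] + pairwise[best_prev][cur]
                       + unary[t][cur])
            ptr.append(best_prev)
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(range(k), key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Made-up scores for five blocks: navigation, three article paragraphs, footer.
# The middle paragraph looks slightly like boilerplate when judged on its own.
unary = [[-0.1, -2.3], [-2.3, -0.1], [-0.6, -0.8], [-2.3, -0.1], [-0.1, -2.3]]
pairwise = [[-0.2, -1.6], [-1.6, -0.2]]  # neighbouring blocks tend to share a label

independent = [max(range(2), key=lambda s: u[s]) for u in unary]
joint = viterbi(unary, pairwise)
print([LABELS[i] for i in independent])
print([LABELS[i] for i in joint])
```

With these toy numbers, labeling each block independently marks the middle paragraph as boilerplate, while joint decoding keeps it as content because switching labels twice costs more than the weak unary preference gains.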


Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Thijs Vogels (1)
  • Octavian-Eugen Ganea (1)
  • Carsten Eickhoff (1)
  1. Department of Computer Science, ETH Zurich, Zürich, Switzerland
