Detecting Off-Topic Pages in Web Archives

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9316)

Abstract

Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy are preconditions for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of Web archive captures. In this paper, we address the problem of detecting off-topic pages in Web archive collections. We evaluate six different methods for detecting when a page has gone off-topic in subsequent captures. Pages predicted to be off-topic are presented to the collection's curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 with change in size (measured by word count) at threshold \(-0.85\) performs best, with accuracy = 0.987, \(F_{1}\) score = 0.906, and AUC = 0.968. We also evaluated the performance of the proposed method on several other Archive-It collections, where the average precision of detecting off-topic pages is 0.92.
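To make the best-performing combination concrete, the following is a minimal Python sketch and not the authors' implementation. It rests on several assumptions the abstract does not confirm: each later capture is compared with the first capture of the same page; cosine similarity is computed over raw term counts rather than any particular term weighting; the change in size is the relative difference in word count with respect to the first capture; and a capture is flagged when either measure falls below its threshold. The function names (`tokenize`, `cosine_similarity`, `word_count_change`, `is_off_topic`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists using raw term counts (assumed weighting)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_count_change(first_tokens, later_tokens):
    """Relative change in word count versus the first capture (assumed definition)."""
    if not first_tokens:
        return 0.0
    return (len(later_tokens) - len(first_tokens)) / len(first_tokens)


def is_off_topic(first_text, later_text, cos_threshold=0.10, wc_threshold=-0.85):
    """Flag a later capture when either measure drops below its threshold.

    The thresholds come from the abstract; combining them with a logical OR
    is an assumption made for this sketch.
    """
    first, later = tokenize(first_text), tokenize(later_text)
    low_similarity = cosine_similarity(first, later) < cos_threshold
    large_shrink = word_count_change(first, later) < wc_threshold
    return low_similarity or large_shrink


first_capture = "egyptian revolution protests in tahrir square cairo news and updates"
same_topic = "news and updates on the egyptian revolution protests in cairo"
parked_domain = "domain for sale buy this domain now"
nearly_empty = "news"

print(is_off_topic(first_capture, same_topic))
print(is_off_topic(first_capture, parked_domain))
print(is_off_topic(first_capture, nearly_empty))
```

In this toy example the parked-domain capture is caught by the similarity test, while the nearly empty capture still shares a term with the original and is caught only by the drop in word count, which illustrates why the two measures might complement each other.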

Notes

  1. https://archive-it.org/

  2. http://archive.org/web/researcher/cdx_file_format.php

Acknowledgments

This work was supported in part by the Andrew Mellon Foundation. We thank Kristine Hanna from the Internet Archive for facilitating access to the data set.

Author information

Corresponding author

Correspondence to Yasmin AlNoamany.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

AlNoamany, Y., Weigle, M.C., Nelson, M.L. (2015). Detecting Off-Topic Pages in Web Archives. In: Kapidakis, S., Mazurek, C., Werla, M. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2015. Lecture Notes in Computer Science, vol 9316. Springer, Cham. https://doi.org/10.1007/978-3-319-24592-8_17

  • DOI: https://doi.org/10.1007/978-3-319-24592-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24591-1

  • Online ISBN: 978-3-319-24592-8

  • eBook Packages: Computer Science, Computer Science (R0)
