Detecting Off-Topic Pages in Web Archives

AlNoamany, Yasmin; Weigle, Michele C.; Nelson, Michael L.

doi:10.1007/978-3-319-24592-8_17

Conference paper
First Online: 28 November 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9316)

Abstract

Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy are preconditions for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of Web archive captures. In this paper, we address the problem of detecting off-topic pages in Web archive collections. We evaluate six different methods for detecting when a page has gone off-topic in subsequent captures. Pages predicted to be off-topic are presented to the collection's curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 with change in size, measured by word count, at threshold \(-0.85\) performs best, with accuracy = 0.987, \(F_{1}\) score = 0.906, and AUC = 0.968. We also evaluated the performance of the proposed method on several Archive-It collections, where the average precision of detecting off-topic pages is 0.92.
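
The combined test described in the abstract can be illustrated with a minimal sketch. Only the two thresholds (0.10 and -0.85) come from the abstract. Everything else is an assumption made for illustration: the function names are hypothetical, each later capture is compared against the first capture, plain term frequency stands in for whatever term weighting the paper uses, and a capture is flagged when either measure falls below its threshold.

```python
import math
import re
from collections import Counter

COSINE_THRESHOLD = 0.10       # threshold reported in the abstract
WORD_COUNT_THRESHOLD = -0.85  # threshold reported in the abstract


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity of raw term-frequency vectors (assumed weighting)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_count_change(base_tokens, later_tokens):
    """Relative change in word count with respect to the base capture."""
    if not base_tokens:
        return 0.0
    return (len(later_tokens) - len(base_tokens)) / len(base_tokens)


def is_off_topic(base_text, later_text):
    """Flag a capture if either measure is below its threshold.

    Combining the two tests with a logical OR is an assumption of this
    sketch, not something stated in the abstract.
    """
    base, later = tokenize(base_text), tokenize(later_text)
    return (
        cosine_similarity(base, later) < COSINE_THRESHOLD
        or word_count_change(base, later) < WORD_COUNT_THRESHOLD
    )
```

For example, `is_off_topic(first_capture_text, later_capture_text)` would flag a capture whose text shares almost no vocabulary with the first capture, or one that has lost more than 85% of its words, as can happen when a site is replaced by a placeholder page.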

Acknowledgments

This work was supported in part by the Andrew Mellon Foundation. We thank Kristine Hanna from the Internet Archive for facilitating access to the data set.

Author information

Authors and Affiliations

Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, USA
Yasmin AlNoamany, Michele C. Weigle & Michael L. Nelson

Authors

Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson

Corresponding author

Correspondence to Yasmin AlNoamany.

Editor information

Editors and Affiliations

Ionian University, Corfu, Greece
Sarantos Kapidakis
Poznań Supercomputing and Networking Center, Poznań, Poland
Cezary Mazurek
Poznań Supercomputing and Networking Center, Poznań, Poland
Marcin Werla

About this paper

Cite this paper

AlNoamany, Y., Weigle, M.C., Nelson, M.L. (2015). Detecting Off-Topic Pages in Web Archives. In: Kapidakis, S., Mazurek, C., Werla, M. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2015. Lecture Notes in Computer Science, vol 9316. Springer, Cham. https://doi.org/10.1007/978-3-319-24592-8_17

DOI: https://doi.org/10.1007/978-3-319-24592-8_17
Published: 28 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24591-1
Online ISBN: 978-3-319-24592-8
eBook Packages: Computer Science, Computer Science (R0)