
Dark Web, pp. 45–69

Forum Spidering

  • Hsinchun Chen
Chapter
Part of the Integrated Series in Information Systems book series (ISIS, volume 30)

Abstract

The unprecedented growth of the Internet has fueled the expansion of the Dark Web, the problematic facet of the web associated with cybercrime, hate, and extremism. Despite the need for tools to collect and analyze Dark Web forums, the covert nature of this part of the Internet makes traditional web crawling techniques insufficient for capturing such content. In this chapter, we propose a novel crawling system designed to collect Dark Web forum content. The system uses a human-assisted accessibility approach to gain access to Dark Web forums. Several URL ordering features and techniques enable efficient extraction of forum postings. The system also includes an incremental crawler coupled with a recall improvement mechanism intended to facilitate enhanced retrieval and updating of collected content. Experiments conducted to evaluate the effectiveness of the human-assisted accessibility approach and the recall improvement–based incremental update procedure yielded favorable results. The human-assisted approach significantly improved access to Dark Web forums, and the incremental crawler with recall improvement outperformed standard periodic and incremental update approaches. Using the system, we were able to collect over 100 Dark Web forums from three regions.

Keywords

Internet Service Provider; Proxy Server; Indexable File; Anchor Text; Multimedia File
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

This research has been supported in part by the following grants: (1) NSF Digital Government Program, “COPLINK Center: Social Network Analysis and Identity Deception Detection for Law Enforcement and Homeland Security,” October 2004–September 2007; (2) NSF/CIA, Knowledge Discovery and Dissemination (KDD) Program, “Detecting Identity Concealment,” September 2005–August 2007; and (3) Library of Congress, “Capture of Open Source Web Based Multimedia Multilingual Terrorist Content,” February 2007–February 2008.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Management Information Systems, University of Arizona, Tucson, USA
