Semi-automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology

  • Saeed Sarencheh
  • Vidyasagar Potdar
  • Elham Afsari Yeganeh
  • Nazanin Firoozeh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6017)

Abstract

Forums (or discussion boards) hold a huge collection of information organized into boards, threads and posts. The basic information entity of a forum is the post, which records the author, the date and time of posting, the actual content, and so on. This information is valuable for applications such as gathering market intelligence and analyzing customer perceptions. However, automatically extracting it from a forum is an extremely challenging task. Several customized parsers exist for extracting information from a particular forum platform with a specific template (e.g. SMF or phpBB), but because they depend on both the platform and the template used, they are unrealistic to use in practical situations. Hence, in this paper we propose a semi-automatic, rule-based solution that extracts forum post information and inserts it into a database for analysis. The key challenge with this solution is identifying the extraction rules, which are normally specific to the forum platform and template. We therefore analyzed 72 forums to derive these rules and to test the performance of the algorithm. The results indicate that we were able to extract all the required information from the SMF and phpBB forum platforms, which represent the majority of forums on the web.
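
The general idea of rule-based extraction followed by database insertion can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the rule format (one regular expression per field), the markup patterns, the platform labels `phpbb_like` and `smf_like`, and the table layout are all hypothetical simplifications chosen for illustration. The "semi-automatic" part is modeled only loosely, by having the caller state which rule set applies.

```python
import re
import sqlite3

# Hypothetical per-platform extraction rules: each field maps to a regex with
# one capture group. The markup is a simplified assumption, not the actual
# SMF or phpBB templates.
RULES = {
    "phpbb_like": {
        "author": re.compile(r'<span class="author">(.*?)</span>', re.S),
        "posted_at": re.compile(r'<span class="date">(.*?)</span>', re.S),
        "content": re.compile(r'<div class="content">(.*?)</div>', re.S),
    },
    "smf_like": {
        "author": re.compile(r'<h4 class="poster">(.*?)</h4>', re.S),
        "posted_at": re.compile(r'<em class="when">(.*?)</em>', re.S),
        "content": re.compile(r'<div class="post">(.*?)</div>', re.S),
    },
}


def extract_posts(html, platform):
    """Apply the rule set chosen by the caller and return a list of post dicts."""
    rules = RULES[platform]
    columns = {field: rx.findall(html) for field, rx in rules.items()}
    # Pair the n-th author, date and content; this assumes every post
    # carries all three fields in the same order.
    return [
        {"author": a.strip(), "posted_at": d.strip(), "content": c.strip()}
        for a, d, c in zip(columns["author"], columns["posted_at"], columns["content"])
    ]


def store_posts(conn, posts):
    """Insert extracted posts into a (hypothetical) `posts` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (author TEXT, posted_at TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (:author, :posted_at, :content)", posts
    )
    conn.commit()


# Made-up sample page in the assumed "phpbb_like" markup.
SAMPLE = (
    '<span class="author">alice</span><span class="date">2010-01-05 10:00</span>'
    '<div class="content">First post</div>'
    '<span class="author">bob</span><span class="date">2010-01-05 10:07</span>'
    '<div class="content">A reply</div>'
)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, extract_posts(SAMPLE, "phpbb_like"))
    for row in conn.execute("SELECT author, posted_at, content FROM posts"):
        print(row)
```

Keeping the rules in a table separate from the extraction logic means that supporting another template only requires adding a rule set, which is the property a platform-specific parser lacks. Regular expressions are used here for brevity; real forum pages with nested markup would need an HTML parser.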

Keywords

Information extraction, Discussion Forums, Anti-Spam, phpBB, SMF

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Saeed Sarencheh (2)
  • Vidyasagar Potdar (1)
  • Elham Afsari Yeganeh (2)
  • Nazanin Firoozeh (2)

  1. Anti-Spam Research Lab (ASRL), Digital Ecosystems and Business Intelligence Institute, Curtin University, Perth, Australia
  2. Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran