Hazardous Document Detection Based on Dependency Relations and Thesaurus

Ikeda, Kazushi; Yanagihara, Tadashi; Hattori, Gen; Matsumoto, Kazunori; Takisima, Yasuhiro

doi:10.1007/978-3-642-17432-2_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6464)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

In this paper, we propose algorithms that increase the accuracy of hazardous Web page detection by correcting the detection errors of typical keyword-based algorithms, using the dependency relations between hazardous keywords and their neighboring segments. Most typical text-based filtering systems ignore the context in which hazardous keywords appear. Our algorithms automatically obtain segment pairs that are in dependency relations and appear to characterize hazardous documents. We also propose a practical approach to expanding segment pairs with a thesaurus. Experiments with a large number of Web pages show that our algorithms increase the detection F value by 7.3% compared to conventional algorithms.
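The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy English data, the `THESAURUS` mapping, all function names, and the simple count-based rule for selecting segment pairs are assumptions made for illustration only (the paper works on Web pages and a real thesaurus, and its selection criterion is not given in the abstract). Documents are assumed to be already parsed into (modifier, head) segment pairs.

```python
from collections import Counter

# Hypothetical toy setup; none of these names or values come from the paper.
HAZARDOUS_KEYWORDS = {"kill"}
THESAURUS = {  # word -> broader concept (stand-in for a real thesaurus)
    "teacher": "PERSON", "neighbor": "PERSON",
    "bacteria": "MICROBE", "germs": "MICROBE",
}

def has_keyword(pair):
    return any(seg in HAZARDOUS_KEYWORDS for seg in pair)

def keyword_detect(doc):
    """Baseline: flag a document if any segment is a hazardous keyword."""
    return any(has_keyword(pair) for pair in doc)

def abstract_pair(pair):
    """Replace each non-keyword segment by its thesaurus concept, if known."""
    return tuple(seg if seg in HAZARDOUS_KEYWORDS else THESAURUS.get(seg, seg)
                 for seg in pair)

def learn_pairs(docs, labels, expand=True):
    """Split keyword-bearing pairs into those seen more often in hazardous
    training documents and those seen more often in harmless ones.
    (Assumed selection rule: a plain count comparison.)"""
    haz, safe = Counter(), Counter()
    for doc, label in zip(docs, labels):
        for pair in filter(has_keyword, doc):
            key = abstract_pair(pair) if expand else pair
            (haz if label else safe)[key] += 1
    hazardous = {p for p in haz if haz[p] > safe[p]}
    harmless = {p for p in safe if safe[p] > haz[p]}
    return hazardous, harmless

def detect(doc, hazardous, harmless, expand=True):
    """Keyword detection, corrected by the context of each keyword."""
    if not keyword_detect(doc):
        return False
    keys = {abstract_pair(p) if expand else p for p in doc if has_keyword(p)}
    if keys & hazardous:
        return True
    if keys <= harmless:
        return False  # every keyword occurs in a known harmless context
    return True       # unknown context: keep the keyword-based decision

def f_value(predicted, actual):
    """F value: harmonic mean of precision and recall."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

TRAIN_DOCS = [[("teacher", "kill")], [("bacteria", "kill")]]
TRAIN_LABELS = [True, False]  # True = hazardous
TEST_DOCS = [[("neighbor", "kill")], [("germs", "kill")], [("weather", "nice")]]
TEST_LABELS = [True, False, False]
```

On this toy data the keyword baseline flags both documents that contain the keyword, while the corrected detector releases the one whose keyword appears in a harmless context. With `expand=False` the test pairs were never seen in training, so the detector falls back to the keyword decision; this is the gap that thesaurus expansion is meant to close in the sketch.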

Author information

Authors and Affiliations

KDDI R&D Laboratories Inc., 2–1–15 Ohara, Fujimino, Saitama, 356–8502, Japan
Kazushi Ikeda, Tadashi Yanagihara, Gen Hattori, Kazunori Matsumoto & Yasuhiro Takisima

Editor information

Editors and Affiliations

School of Computer and Information Science, University of South Australia, 5095, Mawson Lakes, SA, Australia
Jiuyong Li

About this paper

Cite this paper

Ikeda, K., Yanagihara, T., Hattori, G., Matsumoto, K., Takisima, Y. (2010). Hazardous Document Detection Based on Dependency Relations and Thesaurus. In: Li, J. (eds) AI 2010: Advances in Artificial Intelligence. AI 2010. Lecture Notes in Computer Science, vol 6464. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17432-2_46

DOI: https://doi.org/10.1007/978-3-642-17432-2_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17431-5
Online ISBN: 978-3-642-17432-2
eBook Packages: Computer Science, Computer Science (R0)
