Improving Cross-Document Knowledge Discovery Through Content and Link Analysis of Wikipedia Knowledge

Yan, Peng; Jin, Wei

doi:10.1007/978-3-662-47804-2_8

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9260)

Abstract

The Vector Space Model (VSM) has been widely used in Natural Language Processing (NLP) for representing text documents as a Bag of Words (BOW). However, it records only document-level statistical information (e.g., document frequency, inverse document frequency) and cannot capture word semantics. Understanding the meaning of words in text remains a challenging task, and sufficient background knowledge may need to be incorporated to provide a better semantic representation of texts. In this paper, we present a text mining model that can automatically discover semantic relationships between concepts across multiple documents (where traditional search paradigms, such as search engines, offer little help) and effectively integrate various kinds of evidence mined from Wikipedia knowledge. We propose that this integration may effectively complement the existing information contained in the text corpus and facilitate the construction of a more comprehensive representation and retrieval framework. The experimental results demonstrate that search performance is significantly improved over two competitive baselines.
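The limitation of the bag-of-words representation noted in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the helper names `tfidf_vectors` and `cosine`, the toy sentences, and the plain `tf * log(N / df)` weighting are all assumptions chosen for illustration. Two sentences that use synonyms ("car" and "automobile") end up with zero similarity, while a sentence that shares surface words scores higher.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents.

    Assumed weighting for this sketch: tf * log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: documents 0 and 1 describe the same event with
# different words; documents 0 and 2 share words but describe different events.
docs = [
    "the car was repaired at the garage".split(),
    "the automobile was fixed by a mechanic".split(),
    "the car was sold at the garage".split(),
]
vecs = tfidf_vectors(docs)
sim_synonym = cosine(vecs[0], vecs[1])  # only zero-weight words are shared
sim_overlap = cosine(vecs[0], vecs[2])  # shares "car", "at", "garage"
print(sim_synonym, sim_overlap)
```

Because the only words shared by the first two sentences occur in every document and so receive zero weight, the model treats them as unrelated. This is the kind of gap that background knowledge such as Wikipedia is meant to fill.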

This submission is an extended version of the paper published in DaWaK’12, which was selected by the DaWaK’12 program committee for possible publication in the LNCS Transactions on Large-Scale Data- and Knowledge-Centered Systems.

Author information

Authors and Affiliations

Department of Computer Science, North Dakota State University, 1340 Administration Ave., Fargo, ND, 58102, USA
Peng Yan & Wei Jin

Corresponding author

Correspondence to Peng Yan.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
ICAR-CNR and University of Calabria, Rende, Italy
Alfredo Cuzzocrea
Hewlett-Packard Laboratories, Palo Alto, California, USA
Umeshwar Dayal

About this chapter

Cite this chapter

Yan, P., Jin, W. (2015). Improving Cross-Document Knowledge Discovery Through Content and Link Analysis of Wikipedia Knowledge. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXI. Lecture Notes in Computer Science, vol 9260. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47804-2_8

DOI: https://doi.org/10.1007/978-3-662-47804-2_8
Published: 17 July 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47803-5
Online ISBN: 978-3-662-47804-2
eBook Packages: Computer Science, Computer Science (R0)
