Abstract
The Vector Space Model (VSM) has been widely used in Natural Language Processing (NLP) for representing text documents as a Bag of Words (BOW). However, only document-level statistical information is recorded (e.g., document frequency, inverse document frequency) and word semantics cannot be captured. Improvement towards understanding the meaning of words in texts is a challenging task and sufficient background knowledge may need to be incorporated to provide a better semantic representation of texts. In this paper, we present a text mining model that can automatically discover semantic relationships between concepts across multiple documents (where the traditional search paradigm such as search engines cannot help much) and effectively integrate various evidences mined from Wikipedia knowledge. We propose this integration may effectively complement existing information contained in text corpus and facilitate the construction of a more comprehensive representation and retrieval framework. The experimental results demonstrate the search performance has been significantly enhanced against two competitive baselines.
This submission is an extended version of the paper published in DaWaK’12, which was selected by the DaWaK’12 program committee for possible publication in the LNCS Transactions on Large-Scale Data- and Knowledge-Centered Systems.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Jin, W., Srihari, R.K.: Knowledge discovery across documents through concept chain queries. In: Sixth IEEE International Conference on Data Mining Workshops, ICDM Workshops 2006, pp. 448–452. IEEE, December 2006
Srinivasan, P.: Text mining: generating hypotheses from MEDLINE. J. Am. Soc. Inform. Sci. Technol. 55(5), 396–413 (2004)
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: IJCAI, vol. 7, pp. 1606–1611, January 2007
Swanson, D.R., Smalheiser, N.R.: Implicit text linkages between Medline records: using Arrowsmith as an aid to scientific discovery. Libr. Trends 48(1), 48–59 (1999)
Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In: AAAI, vol. 6, pp. 1301–1306, July 2006
Hotho, A., Staab, S., Stumme, G.: Wordnet improves text document clustering. In: Proceedings of the Semantic Web Workshop at SIGIR 2003, November 2003
Gibson, D., Kleinberg, J., Raghavan, P.: Inferring web communities from link topology. In: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space—Structure in Hypermedia Systems: Links, Objects, Time and Space—Structure in Hypermedia Systems, pp. 225–234. ACM, May 1998
Sorg, P., Cimiano, P.: Cross-lingual information retrieval with explicit semantic analysis. In: CLEF Workshop 2008 (2008)
Scott, S., Matwin, S.: Text classification using WordNet hypernyms. In: Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference, pp. 38–44, August 1998
Srihari, R.K., Li, W., Niu, C., Cornell, T.: Infoxtract: a customizable intermediate level information extraction engine. In: Proceedings of the HLT-NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems, vol. 8, pp. 51–58. Association for Computational Linguistics, May 2003
Faloutsos, C., McCurley, K.S., Tomkins, A.: Fast discovery of connection subgraphs. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 118–127. ACM, August 2004
MWDumper. Software available at http://www.mediawiki.org/wiki/Manual:MWDumper
Yan, P., Jin, W.: Improving cross-document knowledge discovery using explicit semantic analysis. In: Cuzzocrea, A., Dayal, U. (eds.) DaWaK 2012. LNCS, vol. 7448, pp. 378–389. Springer, Heidelberg (2012)
Jin, W., Srihari, R., Singh, A.: Generating hypotheses from the web. In: Proceedings of the 17th International Conference on World Wide Web, pp. 1211–1212. ACM, April 2008
Luo, G., Tang, C., Tian, Y.L.: Answering relationship queries on the web. In: Proceedings of the 16th International Conference on World Wide Web, pp. 561–570. ACM, May 2007
Radev, D.R., Libner, K., Fan, W.: Getting answers to natural language questions on the Web. J. Am. Soc. Inform. Sci. Technol. 53(5), 359–364 (2002)
Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring semantic similarity between words using web search engines. In: WWW 2007, pp. 757–766 (2007)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
Gonzalo, J., Verdejo, F., Chugur, I., Cigarran, J.: Indexing with WordNet synsets can improve text retrieval. arXiv preprint cmp-lg/9808002 (1998)
Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International Conference on World Wide Web, pp. 519–528. ACM, May 2003
Jing, L., Zhou, L., Ng, M.K., Huang, J.Z.: Ontology-based distance measure for text clustering. In: Proceedings of the Text Mining Workshop, SIAM International Conference on Data Mining (2006)
Budanitsky, A., Hirst, G.: Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist. 32(1), 13–47 (2006)
Rodríguez, M.D.B., Hidalgo, J.M.G., Agudo, B.D.: Using WordNet to complement training information in text categorization. arXiv preprint cmp-lg/9709007 (1997)
Gurevych, I., Müller, C., Zesch, T.: What to be?-electronic career guidance based on semantic relatedness. In: Annual Meeting-Association for Computational Linguistics, vol. 45(1), p. 1032, June 2007
Müller, C., Gurevych, I.: Using wikipedia and wiktionary in domain-specific information retrieval. In: Peters, C., et al. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 219–226. Springer, Heidelberg (2009)
Jin, W., Srihari, R.K., Ho, H.H., Wu, X.: Improving knowledge discovery in document collections through combining text retrieval and link analysis techniques. In: Seventh IEEE International Conference on Data Mining, ICDM 2007, pp. 193–202. IEEE, October 2007
Yan, P., Jin, W.: Mining semantic relationships between concepts across documents incorporating wikipedia knowledge. In: Perner, P. (ed.) ICDM 2013. LNCS, vol. 7987, pp. 70–84. Springer, Heidelberg (2013)
Bonifati, A., Cuzzocrea, A.: Efficient fragmentation of large XML documents. In: Wagner, R., Revell, N., Pernul, G. (eds.) DEXA 2007. LNCS, vol. 4653, pp. 539–550. Springer, Heidelberg (2007)
Cuzzocrea, A., Darmont, J., Mahboubi, H.: Fragmenting very large xml data warehouses via k-means clustering algorithm. Int. J. Bus. Intell. Data Min. 4(3), 301–328 (2009)
Cuzzocrea, A., Bertino, E.: A secure multiparty computation privacy preserving OLAP framework over distributed XML data. In: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1666–1673. ACM (2010)
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37(1), 141–188 (2010)
Deerwester, S.: Improving information retrieval with latent semantic indexing. In: Proceedings of the 51st Annual Meeting of the American Society for Information Science, pp. 36–40 (1988)
Meng, L., Huang, R., Gu, J.: A review of semantic similarity measures in wordnet. Int. J. Hybrid Inform. Technol. 6(1), 1–12 (2013)
Rinaldi, A.M.: An ontology-driven approach for semantic information retrieval on the web. ACM Trans. Internet Technol. (TOIT) 9(3), 10 (2009)
Wu, H., Gunopulos, D.: Evaluating the utility of statistical phrases and latent semantic indexing for text classification. In: Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM 2003, pp. 713–716. IEEE (2002)
Liu, T., Chen, Z., Zhang, B., Ma, W.Y., Wu, G.: Improving text classification using local latent semantic indexing. In: Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 162–169. IEEE, November 2004
Salahli, M.A.: An approach for measuring semantic relatedness between words via related terms. Math. Comput. Appl. 14(1), 55 (2009)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Yan, P., Jin, W. (2015). Improving Cross-Document Knowledge Discovery Through Content and Link Analysis of Wikipedia Knowledge. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXI. Lecture Notes in Computer Science(), vol 9260. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47804-2_8
Download citation
DOI: https://doi.org/10.1007/978-3-662-47804-2_8
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-47803-5
Online ISBN: 978-3-662-47804-2
eBook Packages: Computer ScienceComputer Science (R0)