Journal of Computer Science and Technology, Volume 23, Issue 4, pp 557–572

New Information Distance Measure and Its Application in Question Answering System

  • Xian Zhang
  • Yu Hao
  • Xiao-Yan Zhu
  • Ming Li
Regular Paper

Abstract

In a question answering (QA) system, the fundamental problem is how to measure the distance between a question and an answer, and hence how to rank different answers. We demonstrate that such a distance can be defined precisely and mathematically. Not only is such a definition possible, it is provably better than any other feasible definition. Moreover, it can be conveniently and fruitfully applied to construct a QA system. We have built such a system, QUANTA. Extensive experiments are conducted to justify the new theory.
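
The abstract does not spell out the distance itself. Purely as an illustration of the general idea of ranking candidate answers by a distance to the question, the sketch below uses the normalized compression distance (NCD), a standard computable stand-in for normalized information distance that replaces Kolmogorov complexity with compressed length. This is an assumption made for illustration: it is not the new measure proposed in the paper, nor a description of how QUANTA works, and the function names (`compressed_size`, `ncd`, `rank_answers`) are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a crude proxy for complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed length. Smaller values mean "closer".
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_answers(question: str, candidates: list[str]) -> list[str]:
    """Return the candidates ordered from smallest to largest distance (illustrative only)."""
    return sorted(candidates, key=lambda answer: ncd(question, answer))
```

On short strings a general-purpose compressor gives only a rough signal, so this sketch is meant to show the shape of distance-based ranking rather than a usable QA scorer.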

Keywords

information distance, normalized information distance, question answering system

Copyright information

© Springer 2008

Authors and Affiliations

  1. Department of Computer Science and Technology, Tsinghua University, Beijing, China
  2. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada