Text Preparation and Similarity Computation



Text data is often found in highly unstructured environments and is frequently created by human participants. In many cases, the text is embedded within Web documents and is contaminated with elements such as HyperText Markup Language (HTML) tags, misspellings, and ambiguous words. Furthermore, a single Web page may contain multiple blocks, most of which might be advertisements or other unrelated content.
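As a rough illustration of the first kind of contamination, the sketch below strips HTML tags from a page and keeps only the visible text. It is a minimal example that uses Python's standard `html.parser` module; the names `TextExtractor` and `strip_html` are hypothetical helpers introduced here, not part of the chapter. It does not attempt to separate the main content block from advertisements, and it does nothing about misspellings or ambiguous words.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a non-visible element: start ignoring character data.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Leaving a non-visible element: resume collecting character data.
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces and collapse runs of whitespace.
        # Assumption: a tag boundary is treated as a word boundary, so a word
        # split by inline markup would come out as two tokens.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML fragment (illustrative helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


page = (
    "<html><body><h1>Title</h1><script>var x = 1;</script>"
    "<p>Hello &amp; <b>world</b></p></body></html>"
)
print(strip_html(page))
```

Real Web pages usually need a more tolerant parser and an explicit step for identifying the main content block, so this should be read only as a starting point.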


Keywords: Hypertext Markup Language (HTML); Inverse Document Frequency Normalization; Multidimensional Representation; Vector Space Representation; Anchor Text



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Yorktown Heights, USA
