
Text Preparation and Similarity Computation

Machine Learning for Text

Abstract

Text data is often found in highly unstructured environments, and is frequently created by human participants. In many cases, text is embedded within Web documents, which are contaminated with elements such as HyperText Markup Language (HTML) tags, misspellings, ambiguous words, and so on. Furthermore, a single Web page may contain multiple blocks, most of which might be advertisements or other unrelated content.


Notes

  1. Small differences are caused by regularization effects. Without regularization, the same results will be obtained in a method like linear regression, no matter how one scales the attributes.

  2. When a user queries for “eat”, documents containing “eating” are also useful. The main issue here is that a set of query keywords is an extremely small document, and stemming helps in reducing the effect of sparsity.
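
The claim in the first note can be checked numerically. The following is a minimal sketch, assuming a single attribute, an unpenalized intercept, and an L2 penalty on the slope only. The helper `fit_predict`, the toy data, the scale factor, and the penalty value `lam=5.0` are all hypothetical choices made for illustration and do not come from the chapter. Without the penalty, rescaling the attribute leaves the prediction unchanged; with the penalty, the prediction shifts.

```python
# Hypothetical toy example; the data and parameter values are illustrative only.

def fit_predict(xs, ys, x_new, lam=0.0):
    """Fit y = a + b*x by least squares with an L2 penalty lam * b**2 on the
    slope (the intercept is not penalized) and return the prediction at x_new."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # closed-form ridge slope; lam=0 gives ordinary least squares
    return y_mean + slope * (x_new - x_mean)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]
scale = 10.0                      # rescale the attribute, e.g. a change of units
xs_scaled = [scale * x for x in xs]
x_new = 6.0

# Unregularized fit: the prediction is the same on raw and rescaled attributes.
plain_raw = fit_predict(xs, ys, x_new)
plain_scaled = fit_predict(xs_scaled, ys, scale * x_new)

# Regularized fit: the penalty acts differently on the rescaled slope.
ridge_raw = fit_predict(xs, ys, x_new, lam=5.0)
ridge_scaled = fit_predict(xs_scaled, ys, scale * x_new, lam=5.0)

print(abs(plain_raw - plain_scaled) < 1e-9)   # True
print(abs(ridge_raw - ridge_scaled) > 1e-3)   # True
```

The reason is visible in the slope formula: multiplying the attribute by a constant c multiplies `sxy` by c and `sxx` by c squared, so the unpenalized slope shrinks by exactly 1/c and the prediction is unaffected, whereas a fixed `lam` in the denominator breaks that cancellation.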


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Aggarwal, C.C. (2018). Text Preparation and Similarity Computation. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_2


  • DOI: https://doi.org/10.1007/978-3-319-73531-3_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3

  • eBook Packages: Computer Science, Computer Science (R0)
