Text Preparation and Similarity Computation

Aggarwal, Charu C.

doi:10.1007/978-3-319-73531-3_2

Abstract

Text data is often found in highly unstructured environments and is frequently created by human participants. In many cases, text is embedded within Web documents and is contaminated with elements such as HyperText Markup Language (HTML) tags, misspellings, and ambiguous words. Furthermore, a single Web page may contain multiple blocks, most of which might be advertisements or other unrelated content.
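
As a rough illustration of the HTML contamination mentioned in the abstract, the following sketch strips tags from a fragment using only Python's standard-library `html.parser`. The names `TextExtractor` and `strip_html` are hypothetical, chosen for this example; the chapter itself may describe a different extraction procedure, and real pages usually also need block-level filtering to discard advertisements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with a space, then collapse runs of whitespace.
        # Caveat: a word split by inline tags (e.g. <b>H</b>ello) is broken up.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of an HTML fragment (illustrative helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

A dedicated HTML library would handle malformed markup more robustly; the standard-library parser is used here only to keep the sketch self-contained.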

Notes

1. Small differences are caused by regularization effects. Without regularization, the same results will be obtained in a method like linear regression, no matter how one scales the attributes.
2. When a user queries for “eat”, documents containing “eating” are also useful. The main issue here is that a set of query keywords is an extremely small document, and stemming helps in reducing the effect of sparsity.
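
The claim in note 1 can be checked on a toy problem. The sketch below assumes a one-attribute linear model without an intercept, for which the least-squares slope has the closed form sum(x*y) / sum(x*x), and an L2 (ridge) penalty simply adds a constant to the denominator. The data values and the helper names `fit_slope` and `predict` are made up for illustration.

```python
def fit_slope(xs, ys, ridge=0.0):
    """Closed-form slope for y ~ w * x (no intercept), with optional L2 penalty."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + ridge)


def predict(xs, ys, x_new, scale=1.0, ridge=0.0):
    """Rescale the attribute by `scale`, fit the slope, and predict at x_new."""
    scaled = [scale * x for x in xs]
    w = fit_slope(scaled, ys, ridge)
    return w * (scale * x_new)


# Made-up training data for the illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Without regularization, rescaling the attribute leaves the prediction unchanged.
plain_a = predict(xs, ys, 5.0, scale=1.0)
plain_b = predict(xs, ys, 5.0, scale=100.0)

# With an L2 penalty, the same rescaling changes the prediction.
ridge_a = predict(xs, ys, 5.0, scale=1.0, ridge=1.0)
ridge_b = predict(xs, ys, 5.0, scale=100.0, ridge=1.0)

print(plain_a, plain_b)
print(ridge_a, ridge_b)
```

Without the penalty, scaling the attribute by c divides the fitted slope by c, so the product of slope and scaled input is unchanged. With the penalty, the added constant does not scale with c, which is where the differences arise.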
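
The “eat” versus “eating” example in note 2 can be sketched with a deliberately crude suffix stripper. The `toy_stem` rule set below is an assumption made for illustration and is far simpler than a real stemmer such as Porter's; `matches` is likewise a hypothetical helper that tests only for token overlap.

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative only, not the Porter rule set


def toy_stem(word):
    """Strip one common suffix, provided at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, document, stem=True):
    """True if the query and the document share at least one (stemmed) token."""
    normalize = toy_stem if stem else str.lower
    query_terms = {normalize(t) for t in query.split()}
    doc_terms = {normalize(t) for t in document.split()}
    return bool(query_terms & doc_terms)


doc = "She was eating lunch"
print(matches("eat", doc, stem=False))  # exact tokens only
print(matches("eat", doc, stem=True))   # tokens compared after stemming
```

Because the query is a one-word document, an exact-token comparison finds no overlap with the text, whereas comparing stems lets “eat” and “eating” coincide.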

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Charu C. Aggarwal

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Text Preparation and Similarity Computation. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_2

DOI: https://doi.org/10.1007/978-3-319-73531-3_2
Published: 20 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science, Computer Science (R0)
