Abstract
Text data is often found in highly unstructured environments, and is frequently created by human participants. In many cases, text is embedded within Web documents and is contaminated with elements such as HyperText Markup Language (HTML) tags, misspellings, ambiguous words, and so on. Furthermore, a single Web page may contain multiple blocks, most of which might be advertisements or other unrelated content.
Notes
1. Small differences are caused by regularization effects. Without regularization, the same results will be obtained in a method like linear regression, no matter how one scales the attributes.
2. When a user queries for “eat”, documents containing “eating” are also useful. The main issue here is that a set of query keywords is an extremely small document, and stemming helps in reducing the effect of sparsity.
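The first note's claim about scaling can be checked numerically. The following is a minimal sketch, assuming a single attribute, no intercept, and an L2 (ridge) penalty as the regularizer; the helper names `fit_predict` and `max_gap`, the toy data, and the penalty value are illustrative assumptions and do not come from the chapter.

```python
def fit_predict(xs, ys, lam=0.0):
    """One-attribute least squares through the origin with an L2 penalty lam.

    Returns the fitted values on the training points.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Minimizer of sum((y - w*x)^2) + lam * w^2
    w = sxy / (sxx + lam)
    return [w * x for x in xs]


def max_gap(a, b):
    """Largest absolute difference between two prediction lists."""
    return max(abs(p - q) for p, q in zip(a, b))


# Hypothetical toy data; the second copy rescales the attribute by 100.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
scaled = [100.0 * x for x in xs]

# Without regularization the fitted values agree up to rounding error.
ols_gap = max_gap(fit_predict(xs, ys), fit_predict(scaled, ys))

# With the penalty, rescaling the attribute changes the fitted values.
ridge_gap = max_gap(fit_predict(xs, ys, lam=5.0),
                    fit_predict(scaled, ys, lam=5.0))

print(ols_gap, ridge_gap)
```

In the unregularized case the weight simply absorbs the scale factor, so the predictions are unchanged. The penalty acts on the weight itself, so it shrinks a small-scale attribute (which needs a large weight) more than a large-scale one.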
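The second note's point about short queries can be illustrated with a toy matcher. This is only a sketch: `toy_stem` is a hypothetical suffix stripper written for this example, not the Porter stemmer or any library routine, and the sample document is made up.

```python
def toy_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "was" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, document):
    """True if any stemmed query term appears among the stemmed document terms."""
    doc_stems = {toy_stem(t) for t in document.split()}
    return any(toy_stem(t) in doc_stems for t in query.split())


doc = "She was eating lunch"

# Exact token matching misses the document because "eat" != "eating".
exact = "eat" in doc.lower().split()

# After stemming, both the query term and the document term reduce to "eat".
stemmed = matches("eat", doc)

print(exact, stemmed)
```

A one-word query gives the matcher only one chance to hit the document, which is why collapsing word variants matters more for queries than for long documents.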
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Aggarwal, C.C. (2018). Text Preparation and Similarity Computation. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_2
DOI: https://doi.org/10.1007/978-3-319-73531-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3
eBook Packages: Computer Science (R0)