Abstract
This paper describes new machine learning approaches for predicting the correct homepage in response to a user’s homepage-finding query. The method has two phases. In the first phase, a decision tree is generated to predict whether a URL is a homepage URL. The decision tree is then used to filter non-homepages out of the web pages returned by a standard vector space information retrieval system. In the second phase, logistic regression analysis combines multiple sources of evidence over the homepages that remain after the first phase to predict which homepage is most relevant to the user’s query. We train the logistic regression model on 100 queries and evaluate the derived model on a separate set of 145 test queries. For about 84% of the test queries, the correct homepage is returned within the top 10 pages, compared with only 59% when no machine learning is applied, which indicates that the approaches are effective.
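The two-phase pipeline and the top-10 evaluation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hand-written URL rule in `looks_like_homepage` merely stands in for the learned decision tree, and the feature names (`content_sim`, `anchor_sim`), weights, and bias are hypothetical values chosen for illustration, since the abstract does not state which sources of evidence or coefficients were used.

```python
import math
from urllib.parse import urlparse


def looks_like_homepage(url):
    """Phase 1 stand-in: a hand-written URL rule in place of the learned decision tree."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return True  # a bare site root is treated as a homepage
    last = segments[-1].lower()
    # Assumption: shallow paths ending in a directory or an index-like file count as homepages.
    is_index_file = last.split(".")[0] in {"index", "default", "home"}
    return len(segments) <= 2 and (path.endswith("/") or is_index_file)


def relevance_probability(features, weights, bias):
    """Phase 2: logistic model combining several evidence scores into one probability."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_homepages(candidates, weights, bias):
    """Filter out non-homepages, then sort the rest by predicted relevance (highest first)."""
    kept = [c for c in candidates if looks_like_homepage(c["url"])]
    return sorted(
        kept,
        key=lambda c: relevance_probability(c["features"], weights, bias),
        reverse=True,
    )


def success_at_k(ranked_urls, correct_url, k=10):
    """True if the correct homepage appears within the top k results."""
    return correct_url in ranked_urls[:k]


# Hypothetical retrieval output for one query; all scores are made up.
candidates = [
    {"url": "http://example.org/dept/papers/2001/report.html",
     "features": {"content_sim": 0.9, "anchor_sim": 0.1}},
    {"url": "http://example.org/dept/",
     "features": {"content_sim": 0.4, "anchor_sim": 0.8}},
    {"url": "http://example.org/",
     "features": {"content_sim": 0.3, "anchor_sim": 0.2}},
]
weights = {"content_sim": 1.5, "anchor_sim": 2.0}  # hypothetical coefficients
bias = -1.0  # hypothetical intercept

ranked = [c["url"] for c in rank_homepages(candidates, weights, bias)]
print(ranked)
print(success_at_k(ranked, "http://example.org/dept/"))
```

In this sketch the deep report page is discarded in the first phase even though its content score is the highest, which is the intended effect of filtering before ranking. Averaging `success_at_k` over a set of test queries gives the kind of top-10 success rate reported in the abstract.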
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xi, W., Fox, E.A., Tan, R.P., Shu, J. (2002). Machine Learning Approach for Homepage Finding Task. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_14
DOI: https://doi.org/10.1007/3-540-45735-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44158-8
Online ISBN: 978-3-540-45735-0
eBook Packages: Springer Book Archive