Machine Learning Approach for Homepage Finding Task

  • Conference paper
String Processing and Information Retrieval (SPIRE 2002)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2476))

Abstract

This paper describes new machine learning approaches for predicting the correct homepage in response to a user’s homepage-finding query. The method has two phases. In the first phase, a decision tree is generated to predict whether a URL is a homepage URL. The decision tree is then used to filter out non-homepages from the web pages returned by a standard vector space information retrieval system. In the second phase, logistic regression analysis is applied to the homepages that remain after the first phase, combining multiple sources of evidence to predict which homepage is most relevant to the user’s query. One hundred queries are used to train the logistic regression model, and another 145 testing queries are used to evaluate it. Our results show that about 84% of the testing queries had the correct homepage returned within the top 10 pages. Without any machine learning, only 59% of the testing queries had their correct answers returned within the top 10 hits, which indicates that the approaches are effective.
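The two-phase pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the URL features (`depth`, `ends_like_dir`), the hand-written tree in `looks_like_homepage` (a stand-in for a tree learned from data), the two evidence scores per page, the toy training rows, and the `example.org` URLs. The abstract does not say which features or evidence sources were used. The helper `success_at_k` mirrors the "correct homepage within the top 10" measure.

```python
import math
from urllib.parse import urlparse


def url_features(url):
    """Hypothetical URL features; the abstract does not list the real ones."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    last = segments[-1].lower() if segments else ""
    is_index = last.startswith(("index.", "home.", "default."))
    depth = len(segments) - (1 if is_index else 0)
    ends_like_dir = path in ("", "/") or path.endswith("/") or is_index
    return depth, ends_like_dir


def looks_like_homepage(url):
    """Phase 1 stand-in: a hand-written two-level tree, not a learned one."""
    depth, ends_like_dir = url_features(url)
    if depth == 0:
        return True
    if depth <= 2:
        return ends_like_dir
    return False


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Logistic model: w[0] is the bias, w[1:] weight the evidence scores."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Phase 2 training: batch gradient descent on the log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def rank(candidates, w):
    """Filter (url, evidence) pairs with the tree, then sort by probability."""
    kept = [(u, x) for u, x in candidates if looks_like_homepage(u)]
    scored = [(predict(w, x), u) for u, x in kept]
    return [u for _, u in sorted(scored, reverse=True)]


def success_at_k(rankings, answers, k=10):
    """Fraction of queries whose correct homepage appears in the top k."""
    hits = sum(1 for q, ranked in rankings.items() if answers[q] in ranked[:k])
    return hits / len(rankings)


# Toy training data (invented): (content_score, anchor_score) -> relevant?
train_x = [(0.9, 0.8), (0.7, 0.9), (0.2, 0.1), (0.4, 0.2), (0.8, 0.3), (0.1, 0.6)]
train_y = [1, 1, 0, 0, 1, 0]
weights = fit_logistic(train_x, train_y)

# Invented retrieval results for one query, with evidence scores per page.
candidates = [
    ("http://example.org/docs/guide/chapter1.html", (0.95, 0.95)),
    ("http://example.org/~lab/", (0.3, 0.2)),
    ("http://example.org/", (0.8, 0.9)),
    ("http://example.org/news/item.html", (0.9, 0.9)),
]
ranked = rank(candidates, weights)
print(ranked)
print(success_at_k({"q1": ranked}, {"q1": "http://example.org/"}))
```

In this sketch the two deep content pages are discarded by the filter even though their evidence scores are the highest, which is the point of running the homepage filter before the logistic regression ranking.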

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xi, W., Fox, E.A., Tan, R.P., Shu, J. (2002). Machine Learning Approach for Homepage Finding Task. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_14

  • DOI: https://doi.org/10.1007/3-540-45735-6_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44158-8

  • Online ISBN: 978-3-540-45735-0

  • eBook Packages: Springer Book Archive
