On the Construction of a Large Scale Chinese Web Test Collection

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection from millions of Chinese web pages: the Chinese Web Test collection with 100 gigabytes of data (CWT100g). It is the largest Chinese web test collection as of this writing, has been used by several dozen research groups, and was adopted in the evaluation of the SEWM-2004 Chinese Web Track [1] and the HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection like the CWT100g. Further, we found that: 1) the distribution of the number of pages within sites obeys a Zipf-like law rather than the power law proposed by Adamic and Huberman [3, 4]; and 2) an appropriate method of filtering host aliases saves about 25% of resources while crawling pages. The Zipf-like law and the host alias filtering method proposed in this paper will help both in modeling the Web and in improving search engines. Finally, we report on the results of the SEWM-2004 Chinese Web Track.
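
As a rough illustration of the first finding, a Zipf-like law says that the number of pages on the site of rank r is approximately C / r^alpha. The sketch below estimates alpha by a least-squares fit in log-log space. It is not the authors' procedure, which the abstract does not describe; the function name `fit_zipf_like`, the synthetic page counts, and the exponent 0.8 used to generate them are all assumptions made for the example.

```python
import math


def fit_zipf_like(page_counts):
    """Fit log(count) = log(C) - alpha * log(rank) by ordinary least squares.

    page_counts holds the number of pages per site, in any order.
    Returns (alpha, C). Hypothetical helper, not taken from the paper.
    """
    # Rank sites by size, largest first; sites with no pages carry no signal.
    sizes = sorted((c for c in page_counts if c > 0), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two sites with pages")

    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # slope of the log-log line
    alpha = -slope                         # Zipf-like exponent
    scale = math.exp(mean_y - slope * mean_x)  # C, the fitted size at rank 1
    return alpha, scale


# Synthetic page counts (an assumption, not CWT100g data): 500 sites
# whose sizes follow C / rank**0.8 with C = 10000, rounded to integers.
counts = [round(10000 / rank ** 0.8) for rank in range(1, 501)]
alpha, scale = fit_zipf_like(counts)
print(f"alpha = {alpha:.2f}, C = {scale:.0f}")
```

On real crawl data the page counts would come from grouping crawled URLs by site, and the quality of the fit should be checked before reading anything into the exponent.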

References

  1. CWT100g, Chinese Web test collection (2004)

  2. HTRDPE. HTRDP Chinese Information Processing and Intelligent Human-Machine Interface Technology Evaluation (2004)

  3. Huberman, B.A., Adamic, L.A.: Growth dynamics of the World-Wide Web. Nature 401, 131 (1999)

  4. Adamic, L.A., Huberman, B.A.: Zipf’s law and the Internet. Glottometrics 3, 143–150 (2002)

  5. CSIRO, TREC Web Tracks Homepage (2004)

  6. NTCIR, NTCIR (NII-NACSIS Test Collection for IR Systems) Project (2004)

  7. Cleverdon, C.W.: The significance of the Cranfield tests on index languages. In: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, Chicago, Illinois, United States (1991)

  8. Kennedy, G.: An Introduction to Corpus Linguistics, vol. 280. Longman, London (1998)

  9. Huang, C., Li, J.: Linguistic corpse: Business publisher (2002)

  10. Jones, K.S., Rijsbergen, C.v.: Report on the need for and provision of an ’ideal’ information retrieval test collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge (1975)

  11. Tianwang, Tianwang Search Engine (2004)

  12. Craswell, N., et al.: Overview of the TREC-2003 Web Track. In: Proceedings of TREC 2003, Gaithersburg, Maryland USA (2003)

  13. Hawking, D., Craswell, N.: Very Large Scale Retrieval and Web Search (Preprint version) (2004)

  14. Lawrence, S., Giles, C.L.: Accessibility of information on the web. Nature 400(6740), 107–109 (1999)

  15. Broder, A., et al.: Graph structure in the web: experiments and models. In: Proceedings of the 9th World-Wide Web Conference, Amsterdam (2000)

  16. Yan, H.F., Li, X.: On the Structure of Chinese Web 2002. Journal of Computer Research and Development 39(8), 958–967 (2002)

  17. Meng, T., Yan, H.F., Li, X.: An Evaluation Model on Information Coverage of Search Engines. Acta Electronica Sinica 31(8), 1168–1172 (2003)

  18. Adamic, L.A.: Zipf, power-laws, and pareto - a ranking tutorial, Tech. Rep., Xerox Palo Alto Research Center (2000)

  19. Breslau, L., et al.: Web Caching and Zipf-like Distributions: Evidence and Implications. Proc. IEEE Infocom 99, 126–134 (1999)

  20. Yan, H.F., et al.: A New Data Storage and Service Model of China Web InfoMall. In: The 4th International Web Archiving Workshop (IWAW 2004) of the 8th European Conference on Research and Advanced Technologies for Digital Libraries (ECDL 2004), Bath, UK (2004)

  21. TSE, Homepage of Tiny Search Engine (2004)

  22. Bailey, P., Craswell, N., Hawking, D.: Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing & Management 39(6), 853–871 (2003)

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yan, H., Chen, C., Peng, B., Li, X. (2008). On the Construction of a Large Scale Chinese Web Test Collection. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_12

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
