Abstract
The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection from millions of Chinese web pages: the Chinese Web Test collection with 100 gigabytes of data (CWT100g). It is the largest Chinese web test collection as of this writing, has been used by several dozen research groups, and was adopted in the evaluations of the SEWM-2004 Chinese Web Track [1] and the HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection like the CWT100g. Further, we found that: 1) the distribution of the number of pages within sites obeys a Zipf-like law instead of the power law proposed by Adamic and Huberman [3, 4]; and 2) an appropriate method of filtering host aliases saves about 25% of resources while crawling pages. The Zipf-like law and the host alias filtering method proposed in the paper will help both in modeling the Web and in improving search engines. Finally, we report on the results of the SEWM-2004 Chinese Web Track.
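The abstract does not spell out what "Zipf-like" means. Assuming the common reading (as in Breslau et al., listed in the references), in which the site at rank r holds a number of pages proportional to 1/r^α with α not necessarily equal to 1, the exponent can be estimated by a least-squares line in log-log space. The following is only an illustrative sketch under that assumption, not the fitting procedure used in the paper; the function name `fit_zipf_like` and the synthetic page counts are hypothetical.

```python
import math


def fit_zipf_like(page_counts):
    """Fit count(rank) ~ C / rank**alpha by least squares on log-log data.

    page_counts: iterable of pages-per-site values (hypothetical input).
    Returns (alpha, C). This is an assumed, simplified estimator.
    """
    # Rank sites from largest to smallest, ignoring empty sites.
    sizes = sorted((c for c in page_counts if c > 0), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two sites with pages")

    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                 # slope of log(count) against log(rank)
    alpha = -slope                    # Zipf-like exponent
    c = math.exp(mean_y - slope * mean_x)
    return alpha, c


# Synthetic data for illustration only: 200 sites generated with alpha = 0.8.
counts = [round(10000 / rank ** 0.8) for rank in range(1, 201)]
alpha, c = fit_zipf_like(counts)
print(f"estimated alpha = {alpha:.3f}, C = {c:.1f}")
```

An estimated α clearly different from 1 on real crawl data would be consistent with a Zipf-like rather than a strict Zipf distribution; a plain log-log regression is a rough diagnostic and is sensitive to the noisy tail of small sites.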
References
CWT100g, Chinese Web test collection (2004)
HTRDPE. HTRDP Chinese Information Processing and Intelligent Human-Machine Interface Technology Evaluation (2004)
Huberman, B.A., Adamic, L.A.: Growth dynamics of the World-Wide Web. Nature 401, 131 (1999)
Adamic, L.A., Huberman, B.A.: Zipf’s law and the Internet. Glottometrics 3, 143–150 (2002)
CSIRO, TREC Web Tracks Homepage (2004)
NTCIR, NTCIR (NII-NACSIS Test Collection for IR Systems) Project (2004)
Cleverdon, C.W.: The significance of the Cranfield tests on index languages. In: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, Chicago, Illinois, United States (1991)
Kennedy, G.: An Introduction to Corpus Linguistics, vol. 280. Longman, London (1998)
Huang, C., Li, J.: Linguistic corpus. Business publisher (2002)
Jones, K.S., Rijsbergen, C.v.: Report on the need for and provision of an 'ideal' information retrieval test collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge (1975)
Tianwang, Tianwang Search Engine (2004)
Craswell, N., et al.: Overview of the TREC-2003 Web Track. In: Proceedings of TREC 2003, Gaithersburg, Maryland USA (2003)
Hawking, D., Craswell, N.: Very Large Scale Retrieval and Web Search (Preprint version) (2004)
Lawrence, S., Giles, C.L.: Accessibility of information on the web. Nature 400(6740), 107–109 (1999)
Broder, A., et al.: Graph structure in the web: experiments and models. In: Proceedings of the 9th World-Wide Web Conference, Amsterdam (2000)
Yan, H.F., Li, X.: On the Structure of Chinese Web 2002. Journal of Computer Research and Development 39(8), 958–967 (2002)
Meng, T., Yan, H.F., Li, X.: An Evaluation Model on Information Coverage of Search Engines. Acta Electronica Sinica 31(8), 1168–1172 (2003)
Adamic, L.A.: Zipf, power-laws, and Pareto - a ranking tutorial, Tech. Rep., Xerox Palo Alto Research Center (2000)
Breslau, L., et al.: Web Caching and Zipf-like Distributions: Evidence and Implications. Proc. IEEE Infocom 99, 126–134 (1999)
Yan, H.F., et al.: A New Data Storage and Service Model of China Web InfoMall. In: the 4th International Web Archiving Workshop (IWAW 2004) of the 8th European Conference on Research and Advanced Technologies for Digital Libraries (ECDL 2004), Bath, UK (2004)
TSE, Homepage of Tiny Search Engine (2004)
Bailey, P., Craswell, N., Hawking, D.: Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing & Management 39(6), 853–871 (2003)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yan, H., Chen, C., Peng, B., Li, X. (2008). On the Construction of a Large Scale Chinese Web Test Collection. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_12
DOI: https://doi.org/10.1007/978-3-540-68636-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science (R0)