Abstract
Many news sites now publish articles online, and many Web news portals have emerged that cluster these articles into categories so that users can browse related reports and understand news events in depth. However, to the best of our knowledge, most Web news portals provide only monolingual news clustering services. In this paper, we study the cross-lingual Web news taxonomy integration problem, in which news articles that report the same event in different languages are to be integrated into one category. Building on prior results in cross-lingual classification and on the cross-training concept, we construct SVM-based classifiers for cross-lingual Web news taxonomy integration. We conducted several experiments using news articles from Google News as the data sets. The experimental results show that the proposed cross-training classifiers outperform traditional SVM classifiers across the board. We believe that the proposed framework can be applied to different bilingual environments.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yang, CZ., Chen, CM., Chen, IX. (2006). A Cross-Lingual Framework for Web News Taxonomy Integration. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_21
DOI: https://doi.org/10.1007/11880592_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science (R0)