Chinese Categorization and Novelty Mining

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6635)

Abstract

The categorization and novelty mining of chronologically ordered documents is an important data mining problem. This paper covers the entire process of Chinese novelty mining, which has rarely been studied, from preprocessing and categorization to the actual detection of novel information. First, preprocessing techniques for detecting novel Chinese text are discussed and compared. Next, we compare categorization and novelty mining performance on English and Chinese sentences, and discuss how novelty mining performance depends on the retrieval results. Moreover, we propose new novelty mining evaluation measures: Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity, the last of which measures how sensitive the novelty mining system is to incorrectly classified sentences. The results indicate that Chinese novelty mining at the sentence level performs similarly to English when the sentences are perfectly categorized. With the new evaluation measures, we can assess more fairly how the performance of novelty mining is influenced by the retrieval results.
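
The abstract does not spell out the detection method or the formulas behind the proposed measures, so the following is only a minimal sketch of sentence-level novelty mining over chronologically ordered sentences. It assumes bag-of-words cosine similarity against all earlier sentences, a hypothetical threshold of 0.5, and ordinary precision, recall, and F score computed over the novel class. The function names are illustrative, and the scoring function is not the paper's Novelty-Precision, Novelty-Recall, or Novelty-F Score. Chinese input would first need word segmentation; the sketch takes already tokenized sentences.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_novel(sentences, threshold=0.5):
    """Return indices of sentences judged novel.

    sentences: token lists in chronological order (already segmented).
    A sentence counts as novel when its similarity to every earlier
    sentence stays below the threshold (an assumed, untuned value).
    """
    history, novel = [], []
    for i, tokens in enumerate(sentences):
        vec = Counter(tokens)
        if all(cosine(vec, old) < threshold for old in history):
            novel.append(i)
        history.append(vec)
    return novel


def precision_recall_f(predicted, truth):
    """Standard precision, recall, and F1 over the novel class."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy input: the second sentence repeats the first, the third is new.
sentences = [
    ["stocks", "rise", "today"],
    ["stocks", "rise", "today"],
    ["rain", "expected", "tomorrow"],
]
novel = detect_novel(sentences)
scores = precision_recall_f(novel, truth=[0])
```

In this toy run the first and third sentences are flagged as novel and the repeated second sentence is not. Scoring the flagged set against a hand-made ground truth shows how a sentence that is wrongly passed to the detector, or wrongly flagged by it, lowers precision while leaving recall unchanged.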

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tsai, F.S., Zhang, Y. (2011). Chinese Categorization and Novelty Mining. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20847-8_24

  • DOI: https://doi.org/10.1007/978-3-642-20847-8_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20846-1

  • Online ISBN: 978-3-642-20847-8

  • eBook Packages: Computer Science, Computer Science (R0)
