CNME: A System for Chinese News Meta-Data Extraction

  • Junbo Xia
  • Fei Xie
  • Mengdi Zhang
  • Yu Su
  • Huanbo Luan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9544)


News mining has gained increasing attention because of the overwhelming volume of news produced every day. Many news portals, such as Sina and Chinanews, develop tools to manage billions of news articles and provide services to meet all kinds of needs. News analysis applications mine this news to reveal valuable information. What they all need is news meta-data, the fundamental element that supports news analysis. Extracting and maintaining news meta-data has therefore become an important and challenging task. In this paper, we present a system specialized for Chinese news meta-data extraction. It can identify 28 kinds of meta-data and provides not only a pipeline to extract them but also a systematic way to manage them. It facilitates organizing and conducting news mining processes and improves efficiency by avoiding duplicated work. More specifically, it introduces an innovative way to categorize news based on how well words represent a category. It also adapts existing methods to extract keywords, entities and event elements. Integrating our system into news mining applications has shown its value for news analysis work.


Keywords: News analysis, Meta-data extraction, Keyword extraction, Entity linking



This work is supported by the 973 Program (No. 2014CB340504), NSFC-ANR (No. 61261130588), the Tsinghua University Initiative Scientific Research Program (No. 20131089256), the THU-NUS NExT Co-Lab and the National Natural Science Foundation of China (No. 61303075).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Junbo Xia (1, 2; corresponding author)
  • Fei Xie (1, 2)
  • Mengdi Zhang (1, 2)
  • Yu Su (1, 2)
  • Huanbo Luan (1, 2)

  1. Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, Beijing, People’s Republic of China
  2. Communication Technology Bureau, Xinhua News Agency, Beijing, China
