Research and Improvement of TF-IDF Algorithm Based on Information Theory

  • Long Cheng
  • Yang Yang
  • Kang Zhao
  • Zhipeng Gao
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 905)


With the growth of information technology and the increasing richness of online information, people can search for and obtain the information they need from the network more and more easily. However, quickly locating the required information within this massive volume of data remains a key problem. Information retrieval technology has emerged in response, and keyword extraction is one of its important supporting technologies. Currently, the most widely used keyword extraction technique is the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. Its basic principle is to weight each word by how often it occurs in a document (term frequency) and by how rare it is across the document collection (inverse document frequency), then rank the words by weight and select the top few as keywords. The TF-IDF algorithm is simple and highly reliable, but it also has deficiencies. This paper analyzes the shortcomings of an improved TF-IDF algorithm and optimizes it from the perspective of information theory. It adds information entropy and relative entropy as calculation factors to that improved TF-IDF algorithm to optimize its performance, and verifies the result through simulation experiments.
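The quantities named in the abstract can be illustrated with a minimal Python sketch. It shows only the textbook definitions of TF-IDF, Shannon entropy, and relative entropy (Kullback-Leibler divergence); it is not the paper's improved algorithm, and the abstract does not say how the entropy terms are combined with the TF-IDF weight. The function names, the unsmoothed `log(N / df)` form of IDF, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF = relative frequency in this document; IDF = log(N / df), unsmoothed.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Rank terms by weight and keep the k highest as keywords."""
    return sorted(doc_weights, key=doc_weights.get, reverse=True)[:k]


def entropy(p):
    """Shannon entropy H(p) = -sum(p_i * log2(p_i)), in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def relative_entropy(p, q):
    """D(p || q) = sum(p_i * log2(p_i / q_i)); assumes q_i > 0 wherever p_i > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "information entropy measures uncertainty".split(),
]
weights = tf_idf(docs)
print(top_keywords(weights[0], k=3))
print(entropy([0.5, 0.5]))                       # 1.0 bit for a fair coin
print(relative_entropy([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
```

With this unsmoothed IDF, a word that appears in every document gets weight zero regardless of how often it occurs, which is one reason practical implementations often use a smoothed variant.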


Word frequency, Information theory, Information entropy, Relative entropy



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. State Key Laboratory of Networking and Switching Technology, Beijing, China
