Combining Contents and Citations for Scientific Document Classification

  • Minh Duc Cao
  • Xiaoying Gao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3809)


This paper introduces a classification system that exploits both content information and citation structure to classify scientific papers. The system first applies a content-based statistical classification method similar to those used in general text classification. We investigate several classification methods, including K-nearest neighbours, nearest centroid, naive Bayes and decision trees. Of these, K-nearest neighbours outperforms the others, which perform comparably to one another. Using phrases in addition to words, together with a good feature selection strategy such as information gain, can improve system accuracy and reduce training time compared with using words only. To incorporate citation links, the system uses an iterative method that updates the labellings of classified instances based on those links. Our results show that combining contents and citations significantly improves system performance.
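
The abstract does not spell out the iterative update rule, so the following is only a minimal sketch of one way such a scheme could work, not the authors' method. It assumes each paper already has per-class scores from a content-based classifier, and it repeatedly blends those scores with the share of citation neighbours currently assigned to each class. The function name `relabel_with_citations`, the mixing weight `alpha` and the iteration cap `max_iters` are all hypothetical.

```python
from collections import Counter


def relabel_with_citations(content_scores, citations, alpha=0.5, max_iters=10):
    """Iteratively revise labels using citation neighbours (illustrative only).

    content_scores: dict mapping paper id -> {label: score} from a
        content-based classifier (assumed to be given).
    citations: dict mapping paper id -> iterable of linked paper ids.
    alpha: hypothetical weight given to the citation evidence.
    """
    # Start from the content-only labelling: the highest-scoring class.
    labels = {doc: max(s, key=s.get) for doc, s in content_scores.items()}

    for _ in range(max_iters):
        new_labels = {}
        for doc, scores in content_scores.items():
            # Count the current labels of the papers linked to this one.
            neighbours = [n for n in citations.get(doc, ()) if n in labels]
            votes = Counter(labels[n] for n in neighbours)
            total = sum(votes.values())

            combined = {}
            for label, score in scores.items():
                link_score = votes[label] / total if total else 0.0
                combined[label] = (1 - alpha) * score + alpha * link_score
            new_labels[doc] = max(combined, key=combined.get)

        # Stop once a full pass changes nothing.
        if new_labels == labels:
            break
        labels = new_labels

    return labels
```

In this sketch a paper with a borderline content score can be pulled towards the class of the papers it is linked to, while setting `alpha=0` reduces the procedure to the content-only labelling.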


Feature Selection; Information Gain; Feature Selection Method; Inductive Logic Programming; Citation Link
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Minh Duc Cao (1)
  • Xiaoying Gao (1)
  1. School of Mathematics, Statistics & Computer Science, Victoria University of Wellington, Wellington, New Zealand
