A New Approach for Semi-supervised Online News Classification

  • Hon-Man Ko
  • Wai Lam
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3597)


Due to the dramatic increase of information on the Web, text categorization has become a useful tool for organizing information. The traditional text categorization problem uses a training set from online sources with pre-defined class labels for text documents. Typically, a large number of online training news documents must be provided in order to learn a satisfactory categorization scheme. We investigate an innovative way to alleviate this problem. For each category, only a small number of positive training examples for a set of the major concepts associated with the category are needed. We develop a technique that makes use of unlabeled documents, since such documents, for example online news from the Web, can be easily collected. Our technique exploits the inherent structure in the set of positive training documents, guided by the provided concepts of the category. An algorithm for training document adaptation is developed for automatically seeking representative training examples from the unlabeled data collected from the new online source. Some preliminary experiments on a real-world news collection have been conducted to demonstrate the effectiveness of our approach.
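The abstract does not spell out the adaptation algorithm, so the following is only a generic illustration of the underlying idea: start from a few positive documents and pull similar documents out of an unlabeled pool. It is a minimal centroid-based self-training sketch, not the authors' method. The function names, the whitespace tokenizer, the cosine threshold of 0.3, and the round limit are all assumptions made for this example.

```python
import math
from collections import Counter

def vectorize(text):
    """Term-frequency vector from lowercased whitespace tokens (assumed tokenizer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Sum of the vectors; cosine is scale-invariant, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def adapt_training_set(positive_docs, unlabeled_docs, threshold=0.3, max_rounds=3):
    """Iteratively move unlabeled documents that resemble the current
    positive set into it. Returns only the newly selected documents.
    threshold and max_rounds are illustrative values, not from the paper."""
    selected = list(positive_docs)
    remaining = list(unlabeled_docs)
    for _ in range(max_rounds):
        prototype = centroid([vectorize(d) for d in selected])
        newly = [d for d in remaining if cosine(vectorize(d), prototype) >= threshold]
        if not newly:
            break  # nothing similar enough is left in the pool
        selected.extend(newly)
        remaining = [d for d in remaining if d not in newly]
    return selected[len(positive_docs):]

# Toy data, invented for this example only.
positive = ["election vote turnout poll", "vote ballot election candidate"]
unlabeled = [
    "candidate wins election after vote count",
    "football match ends in draw",
    "stock market falls sharply",
]
picked = adapt_training_set(positive, unlabeled)
print(picked)
```

On this toy pool only the election-related document shares vocabulary with the seed documents, so it is the only one selected. A realistic system would at least need proper tokenization, term weighting, and a safeguard against drifting away from the original category as more documents are added.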


Text Categorization; Relevance Feedback; Unlabeled Data; Voter Turnout; Positive Training
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Hon-Man Ko (1)
  • Wai Lam (1)
  1. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
